Memory usage and parallel processing

Overview

The latest versions of rdeval are designed to be very memory efficient. Internally, rdeval uses an MPMC (multi-producer / multi-consumer) queue implemented as a bounded ring buffer. This design gives strict, predictable memory bounds while still running at, or very close to, full theoretical throughput on modern CPUs.

Default parallelism

By default, rdeval uses the following values, all of which can be changed on the command line:

parallel-files = 4
decompression-threads = 4
compression-threads = 6

This means:

- Up to 4 input files are processed in parallel. If you provide more than 4 input files, rdeval processes them in batches of 4 (first 4, then the next 4, and so on).
- For BAM/CRAM input, each file can use up to 4 decompression threads.
- These are threads, not physical cores, so it is usually fine to have more threads than hardware cores; the scheduler will multiplex them.
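
The batching behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not rdeval's actual code: the function names, the file names and the assumption that every file in a batch uses its full decompression-thread allowance are all hypothetical.

```python
from itertools import islice


def batches(paths, parallel_files=4):
    """Yield successive groups of at most `parallel_files` paths."""
    it = iter(paths)
    while True:
        group = list(islice(it, parallel_files))
        if not group:
            return
        yield group


def max_decompression_threads(parallel_files=4, decompression_threads=4):
    """Upper bound on concurrent decompression threads, assuming every
    file in the current batch uses its full per-file allowance."""
    return parallel_files * decompression_threads


# Hypothetical input: 10 files processed with the default parallel-files = 4.
files = [f"sample_{i}.bam" for i in range(1, 11)]
print([len(group) for group in batches(files)])  # [4, 4, 2]
print(max_decompression_threads())               # 16
```

With the defaults, ten inputs are therefore handled as two full batches followed by a batch of two, and BAM/CRAM decompression could use up to 16 threads at once under the stated assumption.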

Input-side ring buffer

The main input pipeline uses a bounded ring buffer shared between producer and consumer threads.
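
To make the idea of a bounded producer/consumer buffer concrete, here is a minimal Python sketch. It is not rdeval's implementation: it uses the standard library's lock-based `queue.Queue` rather than a ring buffer, and the function name, chunk contents and parameter values are illustrative assumptions. What it does show is the property that matters for memory: once `capacity` chunks are queued, producers block until a consumer frees a slot.

```python
import queue
import threading


def run_pipeline(inputs, consumers_n=3, capacity=8):
    """Push every chunk of every input through a bounded queue and
    return the total number of bases the consumers saw."""
    q = queue.Queue(maxsize=capacity)  # put() blocks when full: the memory bound
    totals = [0] * consumers_n         # one slot per consumer, so no lock is needed
    stop = None                        # sentinel telling a consumer to exit

    def producer(chunks):
        for chunk in chunks:
            q.put(chunk)

    def consumer(idx):
        while True:
            chunk = q.get()
            if chunk is stop:
                break
            totals[idx] += len(chunk)

    producers = [threading.Thread(target=producer, args=(c,)) for c in inputs]
    consumers = [threading.Thread(target=consumer, args=(i,)) for i in range(consumers_n)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:                # one sentinel per consumer
        q.put(stop)
    for t in consumers:
        t.join()
    return sum(totals)


# Two hypothetical "files", each a list of sequence chunks.
inputs = [["ACGT"] * 100, ["GG"] * 50]
print(run_pipeline(inputs))  # 500
```

At no point can more than `capacity` chunks sit in the queue, regardless of how fast the producers read.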

Let:
- producersN be the number of producer threads (i.e. input file readers)
- consumersN be the number of consumer threads (typically the total worker threads in the pipeline)

The number of buffers in the ring is:

buffersN = (consumersN + producersN) * 2

Each individual buffer is by default ~1,000,000 bp, i.e. ~1 MB of sequence data.

Because consumersN corresponds to the total number of active threads in the pipeline, buffersN grows linearly with the thread count, and so does the memory allocated for this ring buffer, which is roughly:

total_input_buffer_memory  ≈  buffersN × 1 MB
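
The two formulas above are easy to evaluate for a given configuration. The helper below is a small sketch for doing so; the function names are hypothetical, and the 1 MB buffer size is the approximate default quoted above, not an exact figure.

```python
BUFFER_MB = 1  # approximate size of one buffer (~1,000,000 bp)


def input_buffers(consumers_n, producers_n):
    """buffersN = (consumersN + producersN) * 2"""
    return (consumers_n + producers_n) * 2


def input_buffer_memory_mb(consumers_n, producers_n, buffer_mb=BUFFER_MB):
    """Approximate memory held by the input ring buffer, in MB."""
    return input_buffers(consumers_n, producers_n) * buffer_mb


# Hypothetical configuration: 8 worker threads reading from 2 files.
print(input_buffers(8, 2))           # 20
print(input_buffer_memory_mb(8, 2))  # 20
```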

Example: single compressed FASTQ file

If your machine has, for instance, 32 threads and you allow rdeval to use all of them for a single input file, the upper bound for the input-side buffering is approximately:

consumersN  ≈ 32
producersN  ≈ 1
buffersN    = (32 + 1) * 2  = 66

total_input_buffer_memory  ≈ 66 MB

In internal benchmarks, the actual memory footprint stays small, substantially below the <4 GB reported in the original manuscript for older configurations. If you observe significantly higher memory usage under typical workloads, please let us know so we can investigate.

Output-side ring buffer (writing to disk)

When the user requests writing data to disk (e.g. generating .rd files or converting FASTQ → BAM), rdeval uses a second bounded ring buffer for the output stage. This is governed by:

outBuffersN = consumersN * 4 + 1

where consumersN is again the total number of worker threads in the pipeline. Each output buffer is of comparable size (on the order of 1 MB), so the additional memory required for the output ring is approximately:

total_output_buffer_memory  ≈  outBuffersN × 1 MB

Thus, when writing to disk, a good upper bound for total buffer memory is:

total_buffer_memory  ≈  (buffersN + outBuffersN) × 1 MB
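
As a worked example, the sketch below combines the input and output formulas into a single estimate. The function name and the `writing` flag are hypothetical, and the result is only as accurate as the ~1 MB per-buffer figure.

```python
def total_buffer_memory_mb(consumers_n, producers_n, writing=True, buffer_mb=1):
    """Approximate total ring-buffer memory in MB.

    The input ring is always present; the output ring is only counted
    when data is written to disk.
    """
    buffers_n = (consumers_n + producers_n) * 2               # input ring
    out_buffers_n = consumers_n * 4 + 1 if writing else 0     # output ring
    return (buffers_n + out_buffers_n) * buffer_mb


# 32 worker threads, one input file:
print(total_buffer_memory_mb(32, 1, writing=False))  # 66
print(total_buffer_memory_mb(32, 1, writing=True))   # 195
```

For the 32-thread, single-file case, writing to disk adds 32 * 4 + 1 = 129 output buffers to the 66 input buffers, for roughly 195 MB in total.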

In most practical use cases this remains modest compared to available RAM on typical analysis nodes.

Practical recommendations on clusters

Even though rdeval can in principle process many files on a single node with a large number of CPUs without significant performance degradation, in cluster environments it is often more convenient and robust to:

- Generate individual .rd files for each input file (or logical group of inputs).
- Combine the resulting .rd files at the end of the pipeline.

The final combination step is extremely fast and is described in detail in the rdeval usage documentation.

This approach simplifies resource management (especially memory and I/O) on multi-user clusters while preserving parallelism and throughput. It also enables the visualization of multiple .rd results in a single rdeval report.

Notes and feedback

The current configuration (defaults, buffer sizes and thread scheduling) has been tuned for typical high-throughput use cases, but has not yet been exhaustively benchmarked across all possible hardware and input types.

Different combinations of the following options may yield better performance or lower memory usage on specific systems, especially for highly compressed BAM/CRAM inputs, where I/O and decompression behave differently from FASTQ:

parallel-files
decompression-threads
compression-threads
max-memory

We welcome feedback and real-world benchmarks from users to further refine these defaults and improve both performance and memory efficiency.