Memory usage and parallel processing
Overview
The latest versions of rdeval are designed to be very memory efficient.
Internally, rdeval uses an MPMC (multi-producer / multi-consumer) queue
implemented as a bounded ring buffer. This design provides strict, predictable
memory bounds while still running at (or very close to) theoretical full
throughput on modern CPUs.
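The memory-bounding idea can be illustrated with a minimal sketch. This is not rdeval's implementation: it uses Python's standard `queue.Queue` with a fixed `maxsize` as a stand-in for the bounded ring buffer, and the function name `run_pipeline` and its parameters are hypothetical. The point it shows is that producers block once the queue is full, so the number of in-flight items never exceeds the configured capacity.

```python
import queue
import threading

def run_pipeline(n_producers, n_consumers, items_per_producer, capacity):
    """Push items through a bounded queue; return (items consumed, peak queue depth)."""
    q = queue.Queue(maxsize=capacity)    # put() blocks while the queue is full
    lock = threading.Lock()
    stats = {"consumed": 0, "peak": 0}

    def producer():
        for i in range(items_per_producer):
            q.put(i)                      # waits if `capacity` items are already queued
            with lock:
                stats["peak"] = max(stats["peak"], q.qsize())

    def consumer():
        while True:
            item = q.get()
            if item is None:              # sentinel: no more work
                break
            with lock:
                stats["consumed"] += 1

    producers = [threading.Thread(target=producer) for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        q.put(None)                       # one sentinel per consumer
    for t in consumers:
        t.join()
    return stats["consumed"], stats["peak"]

consumed, peak = run_pipeline(n_producers=2, n_consumers=3,
                              items_per_producer=50, capacity=4)
print(consumed, peak <= 4)
```

Because the capacity is fixed up front, the worst-case memory held by the queue is simply capacity times the size of one item, regardless of how fast the producers run.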
Default parallelism
By default (values can be changed in the CLI), rdeval uses:
parallel-files = 4
decompression-threads = 4
compression-threads = 6
This means:
Up to 4 input files are processed in parallel. If you provide more than 4 input files, rdeval processes them in batches of 4 (first 4, then the next 4, and so on).
For BAM/CRAM input, each file can use up to 4 decompression threads.
These are threads, not physical cores, so it is usually fine to have more threads than hardware cores; the scheduler will multiplex them.
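As a rough illustration of the defaults above, the sketch below splits a list of inputs into batches of `parallel-files` and computes the resulting upper bound on concurrent decompression threads for BAM/CRAM input. The helper names and the file names are hypothetical, and the batching shown is only a model of the behavior described here, not rdeval's scheduling code.

```python
def plan_batches(files, parallel_files=4):
    """Split the input list into consecutive groups of at most `parallel_files`."""
    return [files[i:i + parallel_files] for i in range(0, len(files), parallel_files)]

def max_decompression_threads(parallel_files=4, decompression_threads=4):
    """Upper bound on concurrent decompression threads (files in flight x threads per file)."""
    return parallel_files * decompression_threads

files = [f"sample_{i}.bam" for i in range(10)]   # hypothetical file names
for n, batch in enumerate(plan_batches(files), start=1):
    print(f"batch {n}: {len(batch)} files")
print("max decompression threads:", max_decompression_threads())
```

With 10 inputs and the default settings this gives batches of 4, 4 and 2 files, and at most 4 × 4 = 16 decompression threads at any one time.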
Input-side ring buffer
The main input pipeline uses a bounded ring buffer shared between producer and consumer threads.
Let:
producersN be the number of producer threads (i.e. input file readers)
consumersN be the number of consumer threads (typically the total worker threads in the pipeline)
The number of buffers in the ring is:
buffersN = (consumersN + producersN) * 2
Each individual buffer is by default ~1,000,000 bp, i.e. ~1 MB of sequence data.
Since consumersN corresponds to the total number of active threads in the
pipeline, the total memory allocated for this ring buffer is roughly:
total_input_buffer_memory ≈ buffersN × 1 MB
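The estimate above is easy to script when planning a job. The following is a small sketch of that arithmetic only; the function names are illustrative, and the 1 MB per-buffer figure is the approximate default quoted in this section.

```python
BUFFER_MB = 1  # approximate default: ~1,000,000 bp per buffer

def input_ring_buffers(consumers_n, producers_n):
    """buffersN = (consumersN + producersN) * 2"""
    return (consumers_n + producers_n) * 2

def input_ring_memory_mb(consumers_n, producers_n, buffer_mb=BUFFER_MB):
    """Approximate memory held by the input ring buffer, in MB."""
    return input_ring_buffers(consumers_n, producers_n) * buffer_mb

# 8 worker threads reading 4 files in parallel: (8 + 4) * 2 = 24 buffers
print(input_ring_memory_mb(consumers_n=8, producers_n=4), "MB")
```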
Example: single compressed FASTQ file
If your machine has, for instance, 32 threads and you allow rdeval to use
all of them for a single input file, the upper bound for the input-side
buffering is approximately:
consumersN ≈ 32
producersN ≈ 1
buffersN = (32 + 1) * 2 = 66
total_input_buffer_memory ≈ 66 MB
In internal benchmarks, the actual memory footprint stays small in practice and is substantially lower than the <4 GB reported in the original manuscript for older configurations. If you observe significantly larger memory usage under typical workloads, please let us know so we can investigate.
Output-side ring buffer (writing to disk)
When the user requests writing data to disk (e.g. generating .rd files or
converting FASTQ → BAM), rdeval uses a second bounded ring buffer for
the output stage. This is governed by:
outBuffersN = consumersN * 4 + 1
where consumersN is again the total number of worker threads in the
pipeline. Each output buffer is of comparable size (on the order of ~1 MB),
so the additional memory required for the output ring is approximately:
total_output_buffer_memory ≈ outBuffersN × 1 MB
Thus, when writing to disk, a good upper bound for total buffer memory is:
total_buffer_memory ≈ (buffersN + outBuffersN) × 1 MB
In most practical use cases this remains modest compared to available RAM on typical analysis nodes.
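To make the combined bound concrete, here is a short sketch that evaluates both ring-buffer formulas from this document together. The function names are illustrative and the ~1 MB buffer size is the approximate figure quoted in the text, so the result should be read as an estimate rather than a measured value.

```python
def input_buffers(consumers_n, producers_n):
    """buffersN = (consumersN + producersN) * 2"""
    return (consumers_n + producers_n) * 2

def output_buffers(consumers_n):
    """outBuffersN = consumersN * 4 + 1"""
    return consumers_n * 4 + 1

def total_buffer_memory_mb(consumers_n, producers_n, writing=True, buffer_mb=1):
    """Approximate buffer memory in MB; the output ring only counts when writing to disk."""
    n = input_buffers(consumers_n, producers_n)
    if writing:
        n += output_buffers(consumers_n)
    return n * buffer_mb

# 32 worker threads, 1 input file, writing to disk: 66 + 129 buffers
print(total_buffer_memory_mb(32, 1), "MB")
# Same job without writing to disk: input ring only
print(total_buffer_memory_mb(32, 1, writing=False), "MB")
```

For 32 worker threads and a single input file this gives 66 + 129 = 195 buffers, i.e. roughly 195 MB when writing to disk.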
Practical recommendations on clusters
Even though rdeval can in principle process many files on a single node
with a large number of CPUs without significant performance degradation, in
cluster environments it is often more convenient and robust to:
Generate individual .rd files for each input file (or logical group of inputs).
Combine the resulting .rd files at the end of the pipeline.
The final combination step is extremely fast and is described in detail in the rdeval usage documentation.
This approach simplifies resource management (especially memory and I/O) on
multi-user clusters while preserving parallelism and throughput. It also enables
the visualization of multiple .rd results in a single rdeval report.
Notes and feedback
The current configuration (defaults, buffer sizes and thread scheduling) has been tuned for typical high-throughput use cases, but has not yet been exhaustively benchmarked across all possible hardware and input types.
Different combinations of options:
parallel-files
decompression-threads
compression-threads
max-memory
may yield better performance or memory usage on specific systems, especially for highly compressed BAM/CRAM inputs where I/O and decompression behave differently from FASTQ.
We welcome feedback and real-world benchmarks from users to further refine these defaults and improve both performance and memory efficiency.