Rdeval snapshot format specification
Introduction
Rdeval can optionally generate read ‘snapshots’, i.e. highly condensed representations of relevant read features. Snapshots are stored in rdeval’s own highly compressed .rd files, whose format is described below.
.rd file format specification
An .rd file is currently made of a binary header followed by a gzip-compressed binary data section.
Header
The header lists the files that went into the generation of the snapshot file, along with their md5sum values. The first 4 bytes are a uint32 holding the number of files. They are followed by the structure below, repeated once for each file:
| entry | type | # bytes |
|---|---|---|
| filename length | uint16 | 2 |
| filename | char | 1*char |
| md5 length | uint16 | 2 |
| md5 | char | 1*char |
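
To make the header layout concrete, here is a minimal Python sketch of a header reader. It is not rdeval's own code: the function name `read_rd_header` is hypothetical, and little-endian byte order and UTF-8/ASCII text decoding are assumptions, since the specification does not state them.

```python
import io
import struct


def read_rd_header(stream):
    """Parse the header: a uint32 file count, then filename/md5 strings per file."""

    def read_exact(n):
        # Fail loudly on a short read instead of silently returning less data.
        buf = stream.read(n)
        if len(buf) != n:
            raise ValueError("truncated .rd header")
        return buf

    # Assumption: all integers are little-endian.
    (n_files,) = struct.unpack("<I", read_exact(4))
    entries = []
    for _ in range(n_files):
        (name_len,) = struct.unpack("<H", read_exact(2))
        filename = read_exact(name_len).decode("utf-8")  # assumed encoding
        (md5_len,) = struct.unpack("<H", read_exact(2))
        md5 = read_exact(md5_len).decode("ascii")  # assumed to be a hex string
        entries.append((filename, md5))
    return entries
```

Called on a file opened in binary mode, the function returns a list of `(filename, md5)` tuples and leaves the stream positioned at the first byte after the header.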
Data
The header is followed by a binary uint64 with the size of the uncompressed data. Data follows as gzip-compressed binary data structured as:
| entry | type | # bytes |
|---|---|---|
| A count | uint64 | 8 |
| C count | uint64 | 8 |
| G count | uint64 | 8 |
| T count | uint64 | 8 |
| N count | uint64 | 8 |
| len8 count | uint64 | 8 |
| len16 count | uint64 | 8 |
| len64 count | uint64 | 8 |
| read8 | uint8+float | len8 count*8 |
| read16 | uint16+float | len16 count*8 |
| read64 | uint64+float | len64 count*16 |
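
As a rough illustration of the data layout in the table above, the following Python sketch decodes the data section. It rests on several assumptions that the specification does not confirm: little-endian byte order; records padded like C structs, which would explain why the table lists 8, 8 and 16 bytes per record rather than the packed 5, 6 and 12; and a compressed stream wrapped as either gzip or zlib. The function name `read_rd_data` is hypothetical.

```python
import struct
import zlib

# Order of the eight uint64 counters at the start of the decompressed data.
COUNT_FIELDS = ("A", "C", "G", "T", "N", "len8", "len16", "len64")

# Assumed record layouts: little-endian with C-style padding ("x" = pad byte),
# giving the 8, 8 and 16 bytes per record listed in the table.
RECORD_FORMATS = {"len8": "<B3xf", "len16": "<H2xf", "len64": "<Qf4x"}


def read_rd_data(stream):
    """Decode the data section from a stream positioned just after the header."""
    (raw_size,) = struct.unpack("<Q", stream.read(8))
    # wbits=47 makes zlib accept either a gzip or a zlib wrapper.
    payload = zlib.decompress(stream.read(), wbits=47)
    if len(payload) != raw_size:
        raise ValueError("uncompressed size does not match the stored uint64")

    counts = dict(zip(COUNT_FIELDS, struct.unpack_from("<8Q", payload, 0)))
    offset = struct.calcsize("<8Q")  # the 64 bytes of counters
    reads = {}
    for name, fmt in RECORD_FORMATS.items():
        nbytes = counts[name] * struct.calcsize(fmt)
        block = payload[offset:offset + nbytes]
        if len(block) != nbytes:
            raise ValueError("truncated " + name + " records")
        # Each record unpacks to a (read length, read quality) tuple.
        reads[name] = list(struct.iter_unpack(fmt, block))
        offset += nbytes
    return counts, reads
```

The function returns the counters as a dictionary and the reads as three lists of `(length, quality)` tuples keyed by `len8`, `len16` and `len64`. If the padding assumption is wrong for a given file, the decoded lengths and qualities will look implausible, which is a quick way to check it.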
The len8, len16 and len64 counts give the number of reads stored in each of three specialized data structures (read8, read16 and read64), which help compression under different read length distributions. Each structure holds contiguous pairs, one per read, of a uint8/uint16/uint64 (read length) and a float (read quality). Reads are pre-sorted in descending order by size and then by quality.
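
The sorting and the split into three structures can be sketched as follows. This is an illustration under two assumptions that are not stated above: each read goes into the structure with the smallest length type that can hold its length (up to 255 for uint8, up to 65535 for uint16, uint64 otherwise), and the sort is descending for both length and quality. The function name `bin_reads` is hypothetical.

```python
def bin_reads(reads):
    """Sort (length, quality) pairs and split them by the smallest fitting length type."""
    # Tuples compare by length first, then by quality; reverse=True gives
    # descending order on both keys (assumed sort direction for quality).
    ordered = sorted(reads, reverse=True)
    bins = {"len8": [], "len16": [], "len64": []}
    for length, quality in ordered:
        if length <= 0xFF:  # fits in a uint8 (assumed threshold)
            bins["len8"].append((length, quality))
        elif length <= 0xFFFF:  # fits in a uint16 (assumed threshold)
            bins["len16"].append((length, quality))
        else:
            bins["len64"].append((length, quality))
    return bins
```

Under these assumptions, a short-read dataset would fall almost entirely into `len8`, so each read costs far fewer meaningful bytes than a uint64 length would, while long-read datasets spread across `len16` and `len64`.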
Examples of how to read .rd files are provided for C/C++ and R here and here.