Rdeval snapshot format specification

Introduction

Rdeval can optionally generate read ‘snapshots’, i.e. highly condensed representations of relevant read features, which are stored in rdeval’s own highly compressed .rd files. The format of these files is described below.

.rd file format specification

An .rd file is currently made of a binary header followed by a gzip-compressed binary data section.

Header

The header lists the files that went into the generation of the snapshot file, along with their md5sum values. The first 4 bytes are a uint32 holding the number of files. They are followed by one entry per file with the following structure:

Header specification
entry	type	# bytes
filename length	uint16	2
filename	char	1*filename length
md5 length	uint16	2
md5	char	1*md5 length
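
The header layout above can be walked with a few lines of Python. This is a minimal sketch rather than rdeval's own reader: the function name `parse_rd_header` is hypothetical, and it assumes little-endian integers and UTF-8 text, neither of which is stated in this specification.

```python
import struct


def parse_rd_header(buf, offset=0):
    """Return ([(filename, md5), ...], offset of the first byte after the header).

    Assumes little-endian integers and UTF-8 text (not stated in the spec).
    """
    (n_files,) = struct.unpack_from("<I", buf, offset)  # uint32: number of files
    offset += 4
    entries = []
    for _ in range(n_files):
        fields = []
        for _ in range(2):  # filename, then md5: both are length-prefixed
            (length,) = struct.unpack_from("<H", buf, offset)  # uint16 length
            offset += 2
            raw = buf[offset:offset + length]
            if len(raw) != length:
                raise ValueError("truncated .rd header")
            fields.append(raw.decode("utf-8"))
            offset += length
        entries.append(tuple(fields))
    return entries, offset
```

The function takes the raw bytes of an .rd file and returns the offset at which the header ends, so that a caller can continue parsing from that position.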

Data

The header is followed by a uint64 holding the size of the uncompressed data. The data section itself follows as gzip-compressed binary data which, once decompressed, is structured as:

Data specification
entry	type	# bytes
A count	uint64	8
C count	uint64	8
G count	uint64	8
T count	uint64	8
N count	uint64	8
len8 count	uint64	8
len16 count	uint64	8
len64 count	uint64	8
read8	uint8+float	len8 count*8
read16	uint16+float	len16 count*8
read64	uint64+float	len64 count*16

The len8, len16 and len64 counts give the number of records in the read8, read16 and read64 arrays, respectively. Splitting reads across three arrays with different length types helps compression under different read length distributions. Each array stores contiguous read length/quality pairs, one per read: a uint8, uint16 or uint64 (read length) followed by a float (read quality). Reads are pre-sorted in descending order by size and then by quality.
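
The data section can be decoded along the same lines. The sketch below is illustrative only and rests on several assumptions that this specification does not confirm: integers are little-endian, the compressed stream runs to the end of the file, and each record is a padded struct (the 8, 8 and 16 bytes per record in the table exceed the sum of the field sizes, which suggests alignment padding after the uint8 and uint16 lengths and after the float in the uint64 records). The names `parse_rd_data`, `COUNT_FIELDS` and `RECORD_LAYOUTS` are hypothetical.

```python
import struct
import zlib

# Order of the eight uint64 counters at the start of the decompressed data.
COUNT_FIELDS = ("A", "C", "G", "T", "N", "len8", "len16", "len64")

# (array name, counter name, record layout). The pad bytes ("x") are an
# assumption inferred from the 8/8/16-byte record sizes in the table.
RECORD_LAYOUTS = (
    ("read8", "len8", "<B3xf"),
    ("read16", "len16", "<H2xf"),
    ("read64", "len64", "<Qf4x"),
)


def parse_rd_data(buf, offset=0):
    """Decode the data section starting at `offset` (first byte after the header).

    Returns (counts, reads): a dict of the eight counters and a dict mapping
    each array name to a list of (read length, read quality) tuples.
    """
    (raw_size,) = struct.unpack_from("<Q", buf, offset)  # uncompressed size
    # MAX_WBITS | 32 lets zlib auto-detect a gzip or a zlib wrapper.
    payload = zlib.decompress(buf[offset + 8:], wbits=zlib.MAX_WBITS | 32)
    if len(payload) != raw_size:
        raise ValueError("uncompressed size does not match the stored value")

    counts = dict(zip(COUNT_FIELDS, struct.unpack_from("<8Q", payload, 0)))
    pos = 8 * len(COUNT_FIELDS)
    reads = {}
    for array_name, count_name, layout in RECORD_LAYOUTS:
        size = struct.calcsize(layout)
        records = []
        for _ in range(counts[count_name]):
            records.append(struct.unpack_from(layout, payload, pos))
            pos += size
        reads[array_name] = records
    return counts, reads
```

The three arrays are returned separately rather than merged, since the specification does not say how the overall sort order relates to the order of the arrays in the file.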

Examples of how to read .rd files are provided for C/C++ and R here and here.