Rdeval snapshot format specification

Introduction

Rdeval can optionally generate read ‘snapshots’, i.e. highly condensed representations of relevant read features, which can then be stored in rdeval’s own highly compressed .rd files. The format of these files is described below.

.rd file format specification

An .rd file is currently made of a binary header followed by a gzip-compressed binary data section.

Data

The header is followed by a binary uint64 holding the size of the uncompressed data. The data itself follows as a gzip-compressed binary block structured as:

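Data specification

| entry       | type         | # bytes        |
|-------------|--------------|----------------|
| A count     | uint64       | 8              |
| C count     | uint64       | 8              |
| G count     | uint64       | 8              |
| T count     | uint64       | 8              |
| N count     | uint64       | 8              |
| len8 count  | uint64       | 8              |
| len16 count | uint64       | 8              |
| len64 count | uint64       | 8              |
| read8       | uint8+float  | len8 count*8   |
| read16      | uint16+float | len16 count*8  |
| read64      | uint64+float | len64 count*16 |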

The len8, len16 and len64 counts help compression under different read length distributions. Each gives the number of entries in the corresponding specialized data structure (read8, read16, read64) that stores the individual read length/quality pairs. These are contiguous pairs, one per read, of a uint8/uint16/uint64 (read length) and a float (read quality). Reads are pre-sorted descending by size and then by quality.
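As an illustration of the layout described above, here is a minimal Python sketch that decompresses and parses the data section. It is not rdeval's own reader and rests on several assumptions that the specification does not state: little-endian byte order, a 4-byte float, and C-style struct padding, which is what the per-record sizes of 8, 8 and 16 bytes suggest. The function names `decompress_data` and `parse_data_section` are hypothetical, and the caller is expected to have already skipped the binary header, whose layout is not covered here.

```python
import struct
import zlib

# Assumed layouts (little-endian, C-style padding); not confirmed by the spec.
COUNTS = struct.Struct("<8Q")    # A, C, G, T, N, len8, len16, len64 counts
READ8 = struct.Struct("<B3xf")   # uint8 length, 3 pad bytes, float quality (8 bytes)
READ16 = struct.Struct("<H2xf")  # uint16 length, 2 pad bytes, float quality (8 bytes)
READ64 = struct.Struct("<Qf4x")  # uint64 length, float quality, 4 pad bytes (16 bytes)


def decompress_data(payload):
    """Decompress everything that follows the header.

    `payload` starts with the uint64 uncompressed size, followed by the
    compressed block. wbits = MAX_WBITS | 32 accepts gzip or zlib framing.
    """
    (expected_size,) = struct.unpack_from("<Q", payload, 0)
    data = zlib.decompress(payload[8:], wbits=zlib.MAX_WBITS | 32)
    if len(data) != expected_size:
        raise ValueError("uncompressed size does not match the stored size")
    return data


def parse_data_section(buf):
    """Return (base counts, list of (length, quality) pairs) from decompressed data."""
    a, c, g, t, n, n8, n16, n64 = COUNTS.unpack_from(buf, 0)
    offset = COUNTS.size
    reads = []
    # The read8, read16 and read64 blocks follow the counts, in that order.
    for record, count in ((READ8, n8), (READ16, n16), (READ64, n64)):
        block = buf[offset:offset + record.size * count]
        reads.extend(record.iter_unpack(block))
        offset += record.size * count
    return {"A": a, "C": c, "G": g, "T": t, "N": n}, reads
```

With these two helpers, `parse_data_section(decompress_data(payload))` would yield the base counts and the per-read length/quality pairs, provided the assumptions above hold for the file being read.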

Examples of how to read .rd files are provided for C/C++ and R here and here.