File Formats

The program is compatible with standard BED, BAM and VCF formats (VCFv4.x).

ReadCountKeeper (.rck)

RCK is a tabular format that efficiently stores per-strand read counts (ForWard and ReVerse) for reads that support the REFerence allele, ALTernate alleles, INSertions or DELetions at each CHRomosome and POSition. RCK files can be further compressed with bgzip and indexed with tabix for storage, portability and faster random access. Positions are 1-based.

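Tabular format structure (columns, in order: chromosome, position, total coverage, then forward and reverse read counts for REF, ALT, INS and DEL):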

13     1     23         0        0        11       12       0        0        0        0
13     2     35         18       15       1        1        0        0        0        0
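
As an illustration, a row like the ones above can be parsed into a typed record. This is a minimal sketch, not the program's own parser: it assumes the 11 columns are chromosome, position, total coverage, and then forward/reverse counts for REF, ALT, INS and DEL. The names `RckRecord` and `parse_rck_line` are hypothetical.

```python
from typing import NamedTuple


class RckRecord(NamedTuple):
    """One RCK row. Field names are illustrative, not part of the format."""
    chrom: str
    pos: int       # 1-based position
    coverage: int  # assumed to be the total read count at this position
    ref_fw: int
    ref_rv: int
    alt_fw: int
    alt_rv: int
    ins_fw: int
    ins_rv: int
    del_fw: int
    del_rv: int


def parse_rck_line(line: str) -> RckRecord:
    """Split a whitespace-delimited RCK row and convert the numeric columns to int."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    chrom, *numbers = fields
    return RckRecord(chrom, *(int(value) for value in numbers))


record = parse_rck_line("13     2     35         18       15       1        1        0        0        0        0")
# Reads supporting the alternate allele on either strand
alt_reads = record.alt_fw + record.alt_rv
```

Splitting on any whitespace keeps the sketch working whether the file is tab- or space-delimited.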

Commands to compress and index files:

    bgzip PATH/TO/FILE
    tabix -b 2 -s 1 -e 0 -c "#" PATH/TO/FILE.gz
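
Because bgzip output is gzip-compatible, a compressed RCK file can also be read sequentially with Python's standard `gzip` module. The sketch below is only an illustration: the function name `iter_rck_rows` is hypothetical, and it does not use the tabix index (random access by region would need a tabix-aware library).

```python
import gzip
from typing import Iterator, List


def iter_rck_rows(path: str) -> Iterator[List[str]]:
    """Yield the columns of each data row, skipping '#' comment lines and blank lines."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            yield line.split()
```

Skipping lines that start with `#` mirrors the `-c "#"` option passed to tabix above.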

BinaryIndexGenome (.big)

BIG is an HDF5-based binary format that stores boolean values for each genomic position as bit arrays. Each position is represented in three complementary arrays that account for SNVs (Single-Nucleotide Variants), insertions and deletions, respectively. Positions are 1-based.

HDF5 format structure:

chr1_snv: array(bool)
chr1_ins: array(bool)
chr1_del: array(bool)
chr2_snv: array(bool)
chrM_del: array(bool)

Note: HDF5 keys are built from the chromosome name as it appears in the reference (e.g. chr1) plus a suffix specifying whether the array represents SNVs (_snv), insertions (_ins) or deletions (_del).
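
The key convention can be sketched in a few lines of Python. This is a hedged illustration only: it uses a plain dictionary of lists as a stand-in for the HDF5 datasets, the helper names `big_key` and `is_flagged` are hypothetical, and it assumes that a 1-based position indexes the array directly (index 0 unused), which the format description does not state explicitly.

```python
from typing import Mapping, Sequence

SUFFIXES = ("snv", "ins", "del")


def big_key(chrom: str, kind: str) -> str:
    """Build a key such as 'chr1_snv' from a chromosome name and a variant kind."""
    if kind not in SUFFIXES:
        raise ValueError(f"kind must be one of {SUFFIXES}, got {kind!r}")
    return f"{chrom}_{kind}"


def is_flagged(arrays: Mapping[str, Sequence[bool]], chrom: str, pos: int, kind: str) -> bool:
    """Return the boolean stored for a 1-based position.

    Assumption: array index equals the 1-based position, so index 0 is padding.
    Missing keys and out-of-range positions are reported as False.
    """
    array = arrays.get(big_key(chrom, kind))
    if array is None or not 1 <= pos < len(array):
        return False
    return bool(array[pos])


# Toy in-memory stand-in for a BIG file: position 2 on chr1 is flagged as an SNV.
toy = {"chr1_snv": [False, False, True, False]}
flagged = is_flagged(toy, "chr1", 2, "snv")
```

With a real file, the same lookup would go through an HDF5 reader instead of a dictionary.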

Pedigree in JSON format

When the program requires pedigree information, the expected format is as follows:

    "individual": "NA12877",
    "sample_name": "NA12877_sample",
    "gender": "M",
    "parents": []
    "individual": "NA12878",
    "sample_name": "NA12878_sample",
    "gender": "F",
    "parents": []
    "individual": "NA12879",
    "sample_name": "NA12879_sample",
    "gender": "F",
    "parents": ["NA12878", "NA12877"]

where individual is the unique identifier of the member within the pedigree, sample_name is the corresponding sample ID in the VCF file, gender is the member's gender (M or F in the example above), and parents is the list of unique identifiers of the member's parents, if any.
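
A pedigree like this can be loaded and sanity-checked with Python's standard `json` module. The sketch below is an illustration under stated assumptions, not the program's own loader: it assumes the file is a JSON array with one object per member, and that every listed parent must itself be a member of the pedigree. The name `load_pedigree` is hypothetical.

```python
import json
from typing import Dict

PEDIGREE_JSON = """
[
  {"individual": "NA12877", "sample_name": "NA12877_sample", "gender": "M", "parents": []},
  {"individual": "NA12878", "sample_name": "NA12878_sample", "gender": "F", "parents": []},
  {"individual": "NA12879", "sample_name": "NA12879_sample", "gender": "F",
   "parents": ["NA12878", "NA12877"]}
]
"""


def load_pedigree(text: str) -> Dict[str, dict]:
    """Parse the pedigree and index members by their 'individual' identifier."""
    members = json.loads(text)
    by_id: Dict[str, dict] = {}
    for member in members:
        ident = member["individual"]
        if ident in by_id:
            raise ValueError(f"duplicate individual: {ident}")
        by_id[ident] = member
    # Assumption: parents must refer to individuals defined in the same pedigree.
    for member in members:
        for parent in member["parents"]:
            if parent not in by_id:
                raise ValueError(f"unknown parent {parent!r} for {member['individual']}")
    return by_id


pedigree = load_pedigree(PEDIGREE_JSON)
# Map the child's parents to the sample IDs used in the VCF file
parent_samples = [pedigree[p]["sample_name"] for p in pedigree["NA12879"]["parents"]]
```

Indexing by `individual` makes it straightforward to go from a parent identifier to the matching VCF sample name.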