File Formats

The program supports the standard BED, BAM and VCF (VCFv4.x) formats.

ReadCountKeeper (.rck)

RCK is a tabular format that efficiently stores read counts by strand (ForWard, ReVerse) for reads that support the REFerence allele, ALTernate alleles, INSertions or DELetions at each CHRomosome and POSition. Positions are 1-based. RCK files can be further compressed with bgzip and indexed with tabix for storage, portability and faster random access.

Tabular format structure:

#CHR   POS   COVERAGE   REF_FW   REF_RV   ALT_FW   ALT_RV   INS_FW   INS_RV   DEL_FW   DEL_RV
13     1     23         0        0        11       12       0        0        0        0
13     2     35         18       15       1        1        0        0        0        0
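
As an illustration of the column layout above, the following is a minimal Python sketch that turns one RCK data line into a dictionary keyed by column name. The names `RCK_COLUMNS` and `parse_rck_line` are hypothetical and not part of the program. The sketch assumes fields are separated by whitespace (tabs or spaces) and that every field after CHR is an integer.

```python
# Column names taken from the header line of the RCK example.
RCK_COLUMNS = [
    "CHR", "POS", "COVERAGE",
    "REF_FW", "REF_RV", "ALT_FW", "ALT_RV",
    "INS_FW", "INS_RV", "DEL_FW", "DEL_RV",
]


def parse_rck_line(line):
    """Parse one RCK data line; return None for header or blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Assumption: whitespace-delimited fields (handles tabs and spaces).
    fields = line.split()
    if len(fields) != len(RCK_COLUMNS):
        raise ValueError(f"expected {len(RCK_COLUMNS)} fields, got {len(fields)}")
    record = {"CHR": fields[0]}  # chromosome stays a string
    for name, value in zip(RCK_COLUMNS[1:], fields[1:]):
        record[name] = int(value)  # POS (1-based) and all counts are integers
    return record


example = """\
#CHR   POS   COVERAGE   REF_FW   REF_RV   ALT_FW   ALT_RV   INS_FW   INS_RV   DEL_FW   DEL_RV
13     1     23         0        0        11       12       0        0        0        0
13     2     35         18       15       1        1        0        0        0        0
"""

records = [r for r in map(parse_rck_line, example.splitlines()) if r is not None]
```

Here `records` holds one dictionary per data row, so the forward-strand alternate count at the first position is available as `records[0]["ALT_FW"]`.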

Commands to compress and index files:

    bgzip PATH/TO/FILE
    tabix -b 2 -s 1 -e 0 -c "#" PATH/TO/FILE.gz
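
Because bgzip output is also a valid gzip stream, a compressed RCK file can still be read sequentially with Python's standard `gzip` module. The sketch below is only an illustration: `open_rck` and `count_data_lines` are hypothetical helpers, and the `.gz` suffix check is an assumption about file naming. Random access by region still requires the tabix index and a tool that understands it.

```python
import gzip


def open_rck(path):
    """Open a plain or bgzip/gzip-compressed RCK file in text mode."""
    # Assumption: compressed files are recognised by their .gz suffix.
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_data_lines(path):
    """Count non-empty lines that are not '#' header lines."""
    with open_rck(path) as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))
```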

BinaryIndexGenome (.big)

BIG is an HDF5-based binary format that stores a boolean value for each genomic position in bit arrays. Each position is represented in three complementary arrays that account for SNVs (Single-Nucleotide Variants), insertions and deletions, respectively. Positions are 1-based.

HDF5 format structure:

e.g.
chr1_snv: array(bool)
chr1_ins: array(bool)
chr1_del: array(bool)
chr2_snv: array(bool)
...
...
chrM_del: array(bool)

Note: each HDF5 key is built from the chromosome name as given in the reference (e.g. chr1) plus a suffix specifying whether the array represents SNVs (_snv), insertions (_ins) or deletions (_del).
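
The key naming rule can be sketched in a few lines of Python. `BIG_SUFFIXES`, `big_key` and `big_keys` are hypothetical names used only for illustration. The sketch builds key strings and does not open an HDF5 file or make any assumption about how a 1-based position maps to an array index.

```python
# The three array types described for the BIG format.
BIG_SUFFIXES = ("snv", "ins", "del")


def big_key(chrom, kind):
    """Build the HDF5 key for a chromosome and array type, e.g. 'chr1_snv'."""
    if kind not in BIG_SUFFIXES:
        raise ValueError(f"kind must be one of {BIG_SUFFIXES}, got {kind!r}")
    return f"{chrom}_{kind}"


def big_keys(chroms):
    """List the keys expected for the given chromosome names, three per chromosome."""
    return [big_key(chrom, kind) for chrom in chroms for kind in BIG_SUFFIXES]
```

For the chromosome names `["chr1", "chr2"]`, `big_keys` yields `chr1_snv`, `chr1_ins`, `chr1_del`, `chr2_snv`, and so on, matching the structure listed above.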

Pedigree in JSON format

When the program requires pedigree information, the expected format is as follows:

[
  {
    "individual": "NA12877",
    "sample_name": "NA12877_sample",
    "gender": "M",
    "parents": []
  },
  {
    "individual": "NA12878",
    "sample_name": "NA12878_sample",
    "gender": "F",
    "parents": []
  },
  {
    "individual": "NA12879",
    "sample_name": "NA12879_sample",
    "gender": "F",
    "parents": ["NA12878", "NA12877"]
  }
]

where individual is the unique identifier for a member of the pedigree, sample_name is the corresponding sample ID in the VCF file, gender is the member's gender ("M" or "F" in the example above), and parents is the list of unique identifiers for the member's parents, if any.
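
To show how these fields relate to each other, here is a minimal Python sketch that loads a pedigree and resolves the VCF sample names of a member's parents. `load_pedigree` and `parent_sample_names` are hypothetical helpers. The checks for duplicate identifiers and for parents missing from the list are assumptions made for illustration, not documented requirements of the program.

```python
import json

pedigree_text = """
[
  {"individual": "NA12877", "sample_name": "NA12877_sample", "gender": "M", "parents": []},
  {"individual": "NA12878", "sample_name": "NA12878_sample", "gender": "F", "parents": []},
  {"individual": "NA12879", "sample_name": "NA12879_sample", "gender": "F",
   "parents": ["NA12878", "NA12877"]}
]
"""


def load_pedigree(text):
    """Parse pedigree JSON and index members by their 'individual' identifier."""
    members = json.loads(text)
    by_id = {}
    for member in members:
        ident = member["individual"]
        if ident in by_id:
            raise ValueError(f"duplicate individual: {ident}")
        by_id[ident] = member
    # Assumption: every listed parent should itself appear as a member.
    for member in members:
        for parent in member.get("parents", []):
            if parent not in by_id:
                raise ValueError(f"unknown parent {parent} for {member['individual']}")
    return by_id


def parent_sample_names(by_id, individual):
    """Return the VCF sample names of the given member's parents, in listed order."""
    return [by_id[p]["sample_name"] for p in by_id[individual]["parents"]]


pedigree = load_pedigree(pedigree_text)
proband_parents = parent_sample_names(pedigree, "NA12879")
```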