Variant Filtering

filterByTag

filterByTag filters variants from an input VCF file by INFO field tags and thresholds. Users can define one or more filters on numeric, string, or boolean annotations. Each filter specifies the tag or field name, a value, an operator, a type, and the logic used to aggregate results within the tag. Multiple filters can be combined with a global across-tag logic (any or all).

Arguments

    usage: granite filterByTag [-h] -i INPUTFILE -o OUTPUTFILE
                               [-l {any, all}] -t TAG_FILTER [TAG_FILTER ...]
                               [--separator SEP] [-v]

    optional arguments:
      -i INPUTFILE, --inputfile INPUTFILE
                            input VCF file
      -o OUTPUTFILE, --outputfile OUTPUTFILE
                            output file to write results as VCF, use .vcf as extension
      -l {any, all}, --logic {any, all}
                            across-tag logic (combine multiple tag filters). 
                            Accept "any" or "all" [any]
      -t TAG_FILTER [TAG_FILTER ...], --tag TAG_FILTER [TAG_FILTER ...]
                            one or more tag filters. Quote each TAG_FILTER to protect
                            special characters

                            format:
                              'name/value/operator/type/logic[/entry=sep][/field=sep][/value=sep]'

                            components:
                              name       tag name (e.g. DP, CSQ) or
                                         field name (e.g. IMPACT, Consequence for VEP annotations)
                              value      threshold or string to compare against. For
                                         bool use placeholder "-"
                              operator   one of:
                                           ==      equal to
                                           !=      not equal to
                                           <       less than (int, float)
                                           >       greater than (int, float)
                                           <=      less than or equal to (int, float)
                                           >=      greater than or equal to (int, float)
                                           ~       substring contains (str)
                                           !~      substring does not contain (str)
                                           true    flag is set (bool)
                                           false   flag is unset (bool)
                              type       str | int | float | bool
                              logic      any | all (within-tag aggregation across entries)
                              entry=sep  entry separator within a tag, if tag has
                                         multiple entries (e.g. VEP transcripts)
                              field=sep  field separator within a tag, if tag/entry has
                                         embedded fields (e.g. VEP annotations)
                              value=sep  value separator within a field, if tag/entry
                                         has multiple values per field (e.g. VEP Consequence)

                            notes:
                              - if a numeric tag or embedded numeric value is missing in the
                                VCF INFO field, it is treated as 0
                              - if a string tag or embedded string value is missing in the
                                VCF INFO field, it is treated as empty string
                              - all string comparisons and tag matching are case-sensitive

      --separator SEP       tag separator within INFO field [;]
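
To make the TAG_FILTER format concrete, here is a minimal Python sketch that splits a filter string into the components listed above. It is an illustration only and not granite's actual parser; the names `TagFilter` and `parse_tag_filter` are hypothetical, and the sketch assumes that neither the value nor any separator contains a "/" character.

```python
from dataclasses import dataclass, field

OPERATORS = {"==", "!=", "<", ">", "<=", ">=", "~", "!~", "true", "false"}
TYPES = {"str", "int", "float", "bool"}
LOGICS = {"any", "all"}
SEPARATOR_KEYS = {"entry", "field", "value"}


@dataclass
class TagFilter:
    """Hypothetical container for one parsed TAG_FILTER string."""
    name: str
    value: str
    operator: str
    type: str
    logic: str
    separators: dict = field(default_factory=dict)


def parse_tag_filter(spec: str) -> TagFilter:
    """Split 'name/value/operator/type/logic[/key=sep ...]' into its parts."""
    parts = spec.split("/")
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 components, got {len(parts)}")
    name, value, operator, type_, logic = parts[:5]
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator!r}")
    if type_ not in TYPES:
        raise ValueError(f"unknown type: {type_!r}")
    if logic not in LOGICS:
        raise ValueError(f"unknown logic: {logic!r}")
    separators = {}
    for extra in parts[5:]:
        # Optional components look like 'entry=,' or 'field=|'
        key, _, sep = extra.partition("=")
        if key not in SEPARATOR_KEYS or not sep:
            raise ValueError(f"bad separator component: {extra!r}")
        separators[key] = sep
    return TagFilter(name, value, operator, type_, logic, separators)
```

For example, `parse_tag_filter('IMPACT/HIGH/==/str/any/field=|/entry=,')` yields a filter named `IMPACT` with separators `{'field': '|', 'entry': ','}`.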

For complex annotations with multiple fields or multiple entries (e.g. transcript-level annotations from VEP), the program expects a VEP-like structure with proper field and entry definitions in the VCF header.

Format definition (example from VEP):

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature|...|gnomADg_AF|...">
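
The sketch below shows how a field name such as IMPACT can be resolved to a position using the `Format:` string in a header line like the one above. It is a simplified illustration and not granite's code; `vep_field_index` and `field_values` are hypothetical names, and the separators "," (entries) and "|" (fields) are assumed.

```python
import re


def vep_field_index(header_line: str) -> dict:
    """Map each field name in the 'Format: a|b|c' description to its position."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' definition found in header line")
    names = match.group(1).strip().split("|")
    return {name: i for i, name in enumerate(names)}


def field_values(csq: str, index: dict, name: str) -> list:
    """Return the value of one field for every entry (transcript) in a CSQ string."""
    i = index[name]
    return [entry.split("|")[i] for entry in csq.split(",")]


# Shortened header used only for this example
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">')
index = vep_field_index(header)
csq = "A|missense_variant|MODERATE|BRCA2,A|intron_variant|MODIFIER|BRCA2"
impacts = field_values(csq, index, "IMPACT")
```

Here `impacts` holds one IMPACT value per transcript, in the order the entries appear in the CSQ string.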

Examples

Filter variants with depth >= 10.

granite filterByTag -i file.vcf -o file.out.vcf -t 'DP/10/>=/int/any'

Filter variants with gnomAD genome allele frequency (“gnomADg_AF”) <= 0.01, evaluating all entries (transcripts) from VEP annotations.

granite filterByTag -i file.vcf -o file.out.vcf -t 'gnomADg_AF/0.01/<=/float/all/field=|/entry=,/value=&'
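
The within-tag logic decides how per-entry results are aggregated. The following sketch illustrates the difference between "any" and "all" for a numeric field, applying the documented rule that a missing numeric value is treated as 0. It is not granite's implementation; `passes_numeric` is a hypothetical helper and the per-field `value=` separator is ignored for brevity.

```python
import operator

COMPARE = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def passes_numeric(entries, threshold, op, logic):
    """Evaluate one numeric filter across all entries of a tag.

    entries: raw string values, one per entry (transcript); "" means missing.
    """
    # Missing numeric values are treated as 0, as stated in the notes above
    values = [float(v) if v != "" else 0.0 for v in entries]
    results = [COMPARE[op](v, threshold) for v in values]
    return all(results) if logic == "all" else any(results)


# Three transcripts: one rare, one with no frequency, one common
frequencies = ["0.002", "", "0.5"]
with_any = passes_numeric(frequencies, 0.01, "<=", "any")
with_all = passes_numeric(frequencies, 0.01, "<=", "all")
```

With "any" the filter is satisfied because at least one transcript is at or below 0.01; with "all" it is not, because the third transcript exceeds the threshold. Note that the missing value counts as 0 and therefore satisfies `<=`.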

Filter variants where a boolean PON (panel of normals) flag is not set.

granite filterByTag -i file.vcf -o file.out.vcf -t 'PON/-/false/bool/any'

Filter variants with an “IMPACT” value equal to “HIGH” in any entry (transcript) from VEP annotations.

granite filterByTag -i file.vcf -o file.out.vcf -t 'IMPACT/HIGH/==/str/any/field=|/entry=,'

Combine filters with global across-tag logic to require all filters to be true.

granite filterByTag -i file.vcf -o file.out.vcf -l all \
    -t 'DP/10/>=/int/any' \
       'gnomADg_AF/0.01/<=/float/all/field=|/entry=,/value=&' \
       'PON/-/false/bool/any' \
       'IMPACT/HIGH/==/str/any/field=|/entry=,'
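
The two levels of logic can be confusing: each filter first aggregates within its own tag, and -l then combines the per-filter results. A minimal sketch of the second step, assuming each filter has already been reduced to a single boolean (`variant_passes` is a hypothetical name, not part of granite):

```python
def variant_passes(tag_results, logic="any"):
    """Combine per-filter outcomes with the across-tag logic.

    tag_results: mapping of filter name to a boolean that was already
    aggregated within its tag (the fifth component of each TAG_FILTER).
    """
    if logic == "all":
        return all(tag_results.values())
    if logic == "any":
        return any(tag_results.values())
    raise ValueError(f"unknown logic: {logic!r}")


# Outcomes of the four filters above for one illustrative variant
results = {"DP": True, "gnomADg_AF": True, "PON": False, "IMPACT": True}
passes_all = variant_passes(results, "all")
passes_any = variant_passes(results, "any")
```

In this example the variant fails under `-l all` because the PON filter is not satisfied, but it would pass under the default `any`.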

whiteList

whiteList selects (filters in) a subset of variants from an input VCF file based on specified annotations and positions. The software can use provided VEP, ClinVar, or SpliceAI annotations. Positions can also be specified as a BED format file.

Arguments

    usage: granite whiteList [-h] -i INPUTFILE -o OUTPUTFILE [--SpliceAI SPLICEAI]
                             [--SpliceAItag SPLICEAITAG] [--CLINVAR]
                             [--CLINVARonly CLINVARONLY [CLINVARONLY ...]]
                             [--CLINVARtag CLINVARTAG] [--VEP] [--VEPtag VEPTAG]
                             [--VEPrescue VEPRESCUE [VEPRESCUE ...]]
                             [--VEPremove VEPREMOVE [VEPREMOVE ...]]
                             [--VEPsep VEPSEP] [--BEDfile BEDFILE]

    optional arguments:
      -i INPUTFILE, --inputfile INPUTFILE
                            input VCF file
      -o OUTPUTFILE, --outputfile OUTPUTFILE
                            output file to write results as VCF, use .vcf as
                            extension
      --SpliceAI SPLICEAI   threshold to whitelist variants by SpliceAI delta
                            scores value (>=)
      --SpliceAItag SPLICEAITAG
                            by default the program will search for SpliceAI delta
                            scores (DS_AG, DS_AL, DS_DG, DS_DL) to calculate the
                            max delta score for the variant. If a max value is
                            already defined, use this parameter to specify the TAG
                            | TAG field to be used
      --CLINVAR             flag to whitelist all variants with a ClinVar entry
                            [ALLELEID]
      --CLINVARonly CLINVARONLY [CLINVARONLY ...]
                            ClinVar "CLNSIG" terms or keywords to be saved. Sets
                            for whitelist only ClinVar variants with specified
                            terms or keywords
      --CLINVARtag CLINVARTAG
                            by default the program will search for ClinVar
                            "ALLELEID" TAG, use this parameter to specify a
                            different TAG to be used
      --VEP                 use VEP "Consequence" annotations to whitelist exonic
                            and relevant variants (removed by default variants in
                            intronic, intergenic, or regulatory regions)
      --VEPtag VEPTAG       by default the program will search for "CSQ" TAG
                            (CSQ=<values>), use this parameter to specify a
                            different TAG to be used (e.g. VEP)
      --VEPrescue VEPRESCUE [VEPRESCUE ...]
                            additional terms to overrule removed flags to rescue
                            and whitelist variants
      --VEPremove VEPREMOVE [VEPREMOVE ...]
                            additional terms to be removed
      --VEPsep VEPSEP       by default the program expects "&" as separator for
                            subfields in VEP (e.g.
                            intron_variant&splice_region_variant), use this
                            parameter to specify a different separator to be used
      --BEDfile BEDFILE     BED format file with positions to whitelist

Examples

Whitelists variants with a ClinVar entry. The ClinVar annotation, where available, must be provided in the INFO column.

granite whiteList -i file.vcf -o file.out.vcf --CLINVAR

Whitelists only “Pathogenic” and “Likely_pathogenic” variants with a ClinVar entry. The ClinVar “CLNSIG” annotation must be provided in the INFO column.

granite whiteList -i file.vcf -o file.out.vcf --CLINVAR --CLINVARonly Pathogenic Likely_pathogenic

Whitelists variants based on SpliceAI annotations. This filters in variants with a SpliceAI score greater than or equal to --SpliceAI. The SpliceAI annotation, where available, must be provided in the INFO column.

granite whiteList -i file.vcf -o file.out.vcf --SpliceAI <float>
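
As described for --SpliceAItag, the score compared against the threshold is the maximum of the four delta scores. The sketch below illustrates that calculation under the assumption that DS_AG, DS_AL, DS_DG, and DS_DL are stored as separate single-valued INFO tags; `parse_info` and `max_delta_score` are hypothetical helpers, not granite functions.

```python
DELTA_TAGS = ("DS_AG", "DS_AL", "DS_DG", "DS_DL")


def parse_info(info, sep=";"):
    """Turn an INFO string into a dict; flags without '=' map to True."""
    tags = {}
    for item in info.split(sep):
        if not item:
            continue
        key, eq, value = item.partition("=")
        tags[key] = value if eq else True
    return tags


def max_delta_score(info):
    """Return the largest SpliceAI delta score, or None if none is present."""
    tags = parse_info(info)
    scores = [float(tags[name]) for name in DELTA_TAGS if name in tags]
    return max(scores) if scores else None


info = "DP=30;DS_AG=0.01;DS_AL=0.00;DS_DG=0.85;DS_DL=0.02"
score = max_delta_score(info)
whitelisted = score is not None and score >= 0.8  # 0.8 is an arbitrary example threshold
```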

Whitelists variants based on VEP “Consequence” annotations. This whitelists exonic and functionally relevant variants by removing variants flagged as “intron_variant”, “intergenic_variant”, “downstream_gene_variant”, “upstream_gene_variant”, “regulatory_region_”, “non_coding_transcript_”. It is possible to specify additional terms to remove using --VEPremove and terms to rescue using --VEPrescue. To use VEP, the annotation must be provided for each variant in the INFO column.

granite whiteList -i file.vcf -o file.out.vcf --VEP
granite whiteList -i file.vcf -o file.out.vcf --VEP --VEPremove <str> <str>
granite whiteList -i file.vcf -o file.out.vcf --VEP --VEPrescue <str> <str>
granite whiteList -i file.vcf -o file.out.vcf --VEP --VEPrescue <str> <str> --VEPremove <str>
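
The interaction between removed and rescued terms can be sketched as follows. This is one plausible reading of the description above, not a transcription of granite's code: it assumes a rescued term overrules any removed term in the same "Consequence" value, and that removed terms are matched as prefixes (which would explain entries such as “regulatory_region_”). `keep_consequence` is a hypothetical name.

```python
DEFAULT_REMOVE = (
    "intron_variant",
    "intergenic_variant",
    "downstream_gene_variant",
    "upstream_gene_variant",
    "regulatory_region_",
    "non_coding_transcript_",
)


def keep_consequence(consequence, rescue=(), remove=(), sep="&"):
    """Decide whether one VEP 'Consequence' value survives the filter."""
    terms = consequence.split(sep)
    # Assumption: a rescued term overrules every removed term
    if any(term in rescue for term in terms):
        return True
    removed = DEFAULT_REMOVE + tuple(remove)
    # Assumption: removed terms match as prefixes of the consequence term
    return not any(term.startswith(prefix) for term in terms for prefix in removed)


value = "intron_variant&splice_region_variant"
dropped_by_default = not keep_consequence(value)
rescued = keep_consequence(value, rescue=("splice_region_variant",))
```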

Whitelists variants based on positions specified as a BED format file.

granite whiteList -i file.vcf -o file.out.vcf --BEDfile file.bed
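
When matching VCF positions against a BED file, keep in mind that BED intervals are 0-based and half-open while VCF POS is 1-based. The sketch below shows the coordinate conversion in isolation; how granite handles this internally is not stated here, and `load_bed` and `in_bed` are hypothetical helpers.

```python
def load_bed(lines):
    """Collect (start, end) intervals per chromosome from BED lines."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_bed(regions, chrom, pos):
    """True if the 1-based VCF position falls inside any BED interval.

    A BED interval [start, end) in 0-based coordinates covers the
    1-based positions start + 1 through end.
    """
    return any(start < pos <= end for start, end in regions.get(chrom, ()))


regions = load_bed(["chr1\t100\t200"])
first_covered = in_bed(regions, "chr1", 101)
just_before = in_bed(regions, "chr1", 100)
```

The linear scan is enough for illustration; a sorted or indexed structure would be preferable for large BED files.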

Combine the above filters.

granite whiteList -i file.vcf -o file.out.vcf --BEDfile file.bed --VEP --VEPrescue <str> <str> --CLINVAR --SpliceAI <float>

blackList

blackList filters out variants from an input VCF file based on positions set in a BIG format file and/or a provided population allele frequency. Positions can also be specified as a BED format file.

Arguments

    usage: granite blackList [-h] -i INPUTFILE -o OUTPUTFILE [-b BIGFILE]
                             [--aftag AFTAG] [--afthr AFTHR] [--BEDfile BEDFILE]

    optional arguments:
      -i INPUTFILE, --inputfile INPUTFILE
                            input VCF file
      -o OUTPUTFILE, --outputfile OUTPUTFILE
                            output file to write results as VCF, use .vcf as
                            extension
      -b BIGFILE, --bigfile BIGFILE
                            BIG format file with positions set for blacklist
      --aftag AFTAG         TAG (TAG=<float>) or TAG field to be used to filter by
                            population allele frequency
      --afthr AFTHR         threshold to filter by population allele frequency
                            (<=) [1]
      --BEDfile BEDFILE     BED format file with positions to blacklist

Examples

Blacklist variants based on positions set to True in the BIG format file.

granite blackList -i file.vcf -o file.out.vcf -b file.big

Blacklist variants based on population allele frequency. This filters out variants with allele frequency higher than --afthr. Allele frequency must be provided for each variant in INFO column.

granite blackList -i file.vcf -o file.out.vcf --afthr <float> --aftag tag
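
The frequency rule can be sketched as below for the simple case where the frequency is a direct INFO tag (TAG=<float>). This is an illustration rather than granite's code: `is_blacklisted_by_af` is a hypothetical name, and keeping variants that lack the tag is an assumption, since the behavior for missing values is not described here.

```python
def is_blacklisted_by_af(info, aftag, afthr=1.0):
    """True if the allele frequency stored under `aftag` exceeds `afthr`."""
    for item in info.split(";"):
        key, eq, value = item.partition("=")
        if key == aftag and eq:
            return float(value) > afthr
    # Assumption: a variant without the tag is not blacklisted
    return False


# The tag name gnomAD_AF is only an example
common = is_blacklisted_by_af("DP=20;gnomAD_AF=0.2", "gnomAD_AF", 0.01)
rare = is_blacklisted_by_af("DP=20;gnomAD_AF=0.001", "gnomAD_AF", 0.01)
```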

Combine the two filters.

granite blackList -i file.vcf -o file.out.vcf --afthr <float> --aftag tag -b file.big

cleanVCF

cleanVCF cleans the INFO field of an input VCF file. The software can remove a list of TAGs from the INFO field, or can be used to clean VEP annotations.

Arguments

    usage: granite cleanVCF [-h] -i INPUTFILE -o OUTPUTFILE [-t TAG] [--VEP]
                            [--VEPtag VEPTAG]
                            [--VEPrescue VEPRESCUE [VEPRESCUE ...]]
                            [--VEPremove VEPREMOVE [VEPREMOVE ...]]
                            [--VEPsep VEPSEP] [--SpliceAI SPLICEAI]
                            [--SpliceAItag SPLICEAITAG] [--filter_VEP]

    optional arguments:
      -i INPUTFILE, --inputfile INPUTFILE
                            input VCF file
      -o OUTPUTFILE, --outputfile OUTPUTFILE
                            output file to write results as VCF, use .vcf as
                            extension
      -t TAG, --tag TAG     TAG to be removed from INFO field. Specify multiple
                            TAGs as: "-t TAG -t TAG -t ..."
      --VEP                 clean VEP "Consequence" annotations (removed by
                            default terms for intronic, intergenic, or regulatory
                            regions from annotations)
      --VEPtag VEPTAG       by default the program will search for "CSQ" TAG
                            (CSQ=<values>), use this parameter to specify a
                            different TAG to be used (e.g. VEP)
      --VEPrescue VEPRESCUE [VEPRESCUE ...]
                            additional terms to overrule removed flags to rescue
                            annotations
      --VEPremove VEPREMOVE [VEPREMOVE ...]
                            additional terms to be removed from annotations
      --VEPsep VEPSEP       by default the program expects "&" as separator for
                            subfields in VEP (e.g.
                            intron_variant&splice_region_variant), use this
                            parameter to specify a different separator to be used
      --SpliceAI SPLICEAI   threshold to save intronic annotations, from VEP
                            "Consequence", for variants by SpliceAI delta scores
                            value (>=)
      --SpliceAItag SPLICEAITAG
                            by default the program will search for SpliceAI delta
                            scores (DS_AG, DS_AL, DS_DG, DS_DL) to calculate the
                            max delta score for the variant. If a max value is
                            already defined, use this parameter to specify the TAG
                            | TAG field to be used
      --filter_VEP          by default the program returns all variants in the input VCF file.
                            This flag will drop the variants with no VEP annotations after the
                            cleaning

Examples

Remove tag from INFO field.

granite cleanVCF -i file.vcf -o file.out.vcf -t tag
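
Conceptually, removing tags amounts to dropping the matching key=value items from the INFO string. A minimal sketch, assuming ";" as the tag separator (`remove_tags` is a hypothetical helper, not a granite function):

```python
def remove_tags(info, tags, sep=";"):
    """Drop the listed tags from an INFO string.

    Both 'TAG=value' items and bare flags are matched by name.
    """
    kept = [item for item in info.split(sep) if item.split("=", 1)[0] not in tags]
    # VCF uses '.' to denote an empty INFO column
    return sep.join(kept) if kept else "."


cleaned = remove_tags("DP=10;AF=0.5;DB", {"AF"})
emptied = remove_tags("AF=0.5", {"AF"})
```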

Clean VEP based on VEP “Consequence” annotations. This removes annotations flagged as “intron_variant”, “intergenic_variant”, “downstream_gene_variant”, “upstream_gene_variant”, “regulatory_region_”, “non_coding_transcript_”. It is possible to specify additional terms to remove using --VEPremove and terms to rescue using --VEPrescue. VEP annotation must be provided for each variant in INFO column.

granite cleanVCF -i file.vcf -o file.out.vcf --VEP
granite cleanVCF -i file.vcf -o file.out.vcf --VEP --VEPremove <str> <str>
granite cleanVCF -i file.vcf -o file.out.vcf --VEP --VEPrescue <str> <str>
granite cleanVCF -i file.vcf -o file.out.vcf --VEP --VEPrescue <str> <str> --VEPremove <str>

The program also accepts a SpliceAI threshold that will rescue annotations for “intron_variant” by SpliceAI. SpliceAI annotation must be provided in INFO column.

granite cleanVCF -i file.vcf -o file.out.vcf --VEP --SpliceAI <float>

Combine the above filters.

granite cleanVCF -i file.vcf -o file.out.vcf -t tag --VEP --VEPrescue <str> <str> --SpliceAI <float>

geneList

geneList filters VEP annotations in an input VCF file using a list of genes. If a transcript does not map to any of the genes in the list, the transcript is removed from the VEP annotation in the INFO field. If all transcripts are removed, the VEP tag is removed from the INFO field for the variant.

Arguments

    usage: granite geneList [-h] -i INPUTFILE -o OUTPUTFILE -g GENESLIST
                            [--VEPtag VEPTAG]

    optional arguments:
      -i INPUTFILE, --inputfile INPUTFILE
                            input VCF file
      -o OUTPUTFILE, --outputfile OUTPUTFILE
                            output file to write results as VCF, use .vcf as
                            extension
      -g GENESLIST, --geneslist GENESLIST
                            text file listing ensembl gene (ENSG) IDs for all
                            genes to save annotations for, IDs must be listed as a
                            column
      --VEPtag VEPTAG       by default the program will search for "CSQ" TAG
                            (CSQ=<values>), use this parameter to specify a
                            different TAG to be used (e.g. VEP)
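
The transcript filtering described for geneList can be sketched as follows. This is a simplified illustration, not granite's implementation: `load_gene_ids` and `filter_transcripts` are hypothetical names, the position of the gene ID within each entry is passed in explicitly (in practice it would come from the header's Format definition), and the entries use made-up transcript and gene placeholders apart from the BRCA2 gene ID.

```python
def load_gene_ids(lines):
    """Read one Ensembl gene ID per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_transcripts(csq, genes, gene_index, entry_sep=",", field_sep="|"):
    """Keep only the entries whose gene ID is in `genes`.

    Returns None when no entry survives, signalling that the whole
    VEP tag should be dropped from the INFO field.
    """
    kept = [
        entry
        for entry in csq.split(entry_sep)
        if entry.split(field_sep)[gene_index] in genes
    ]
    return entry_sep.join(kept) if kept else None


# Fields: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature (Gene is index 4)
csq = (
    "A|missense_variant|MODERATE|BRCA2|ENSG00000139618|TRANSCRIPT_1,"
    "A|intron_variant|MODIFIER|OTHER|ENSG_PLACEHOLDER|TRANSCRIPT_2"
)
genes = load_gene_ids(["ENSG00000139618\n", "\n"])
kept = filter_transcripts(csq, genes, gene_index=4)
none_left = filter_transcripts(csq, {"ENSG_NOT_PRESENT"}, gene_index=4)
```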