## VCF Parser
The granite library can be used directly to access and manipulate information in VCF format.

### Import the library

    from granite.lib import vcf_parser

### Usage
The library implements the objects [*Vcf*](#vcf), [*Header*](#header) and [*Variant*](#variant).

#### Vcf
This is the main object and has methods to read and write VCF format.

##### Initialize the object

    vcf_obj = vcf_parser.Vcf('inputfile.vcf')

This will automatically read the file header into a *Header* object.

##### Read and access variants
The method *parse_variants()* reads the file and returns a generator of *Variant* objects that store variant information.

    for vnt_obj in vcf_obj.parse_variants():
        ...

##### Write to file
The method *write_header(fo)* writes header definitions and columns to the specified buffer (fo).

    with open('outputfile.vcf', 'w') as fo:
        vcf_obj.write_header(fo)

It is possible to write only definitions or only columns with the methods *write_definitions(fo)* and *write_columns(fo)*, respectively.

    with open('outputfile.vcf', 'w') as fo:
        vcf_obj.write_definitions(fo)
        vcf_obj.write_columns(fo)

The method *write_variant(fo, Variant_obj)* writes information from a *Variant* object to the specified buffer (fo).

    with open('outputfile.vcf', 'w') as fo:
        vcf_obj.write_variant(fo, vnt_obj)

#### Header
This is the object used to store information for the header in VCF format. Methods are available to extract and modify information in the header.

##### Attributes

###### definitions
Stores the full header information minus the last line where columns are defined.

    vcf_obj.header.definitions

###### columns
Stores the last header line where columns are defined.

    # Columns example
    # #CHROM POS ID REF ALT ...
    vcf_obj.header.columns

###### IDs_genotypes
Stores sample ID(s) available in the VCF as a list. If there are multiple samples, the order from the VCF is maintained.


    vcf_obj.header.IDs_genotypes

##### Add or remove definitions
The method *add_tag_definition(tag_definition, tag_type='INFO')* adds tag_definition to the header at the top of the block specified by tag_type (e.g. FORMAT, INFO).

    tag_definition = '##INFO='
    vcf_obj.header.add_tag_definition(tag_definition)

The method *remove_tag_definition(tag, tag_type='INFO')* removes the tag definition from the header block specified by tag_type (e.g. FORMAT, INFO).

    tag = 'CSQ'
    vcf_obj.header.remove_tag_definition(tag)

##### Extract information
The method *get_tag_field_idx(tag, field, tag_type='INFO', sep='|')* returns the index of the value *field* within *tag*, as declared in the tag definition in the header block specified by tag_type (e.g. FORMAT, INFO). *sep* is the fields separator used in the tag definition.

    # Return the index corresponding to 'Consequence' field
    # from CSQ definition (VEP) in the header INFO block
    # ##INFO=
    tag, field = 'CSQ', 'Consequence'
    idx = vcf_obj.header.get_tag_field_idx(tag, field)

The method *check_tag_definition(tag, tag_type='INFO', sep='|')* checks whether a tag is in the header and whether it is standalone or a field of another leading tag. It returns the leading tag and the index of the corresponding field, if any, needed to access the tag. *sep* is the fields separator used in the tag definition.

    # Return the leading tag and index corresponding to 'Consequence' field
    # from CSQ definition (VEP) in the header INFO block
    # ##INFO=
    tag = 'Consequence'
    lead_tag, idx = vcf_obj.header.check_tag_definition(tag)

*note: tag and field are case sensitive.*

#### Variant
This is the object used to store information for variants in VCF format.

##### Attributes

###### CHROM
Stores chromosome name (e.g. 1, chr1), as in the VCF file.

    vnt_obj.CHROM

###### POS
Stores variant position.

    vnt_obj.POS

###### ID
Stores variant ID(s), as in the VCF file.

    vnt_obj.ID

###### REF
Stores reference allele at position.

    vnt_obj.REF

###### ALT
Stores alternate allele(s) at position.


    vnt_obj.ALT

###### QUAL
Stores phred-scaled quality score for the assertion made in ALT.

    vnt_obj.QUAL

###### FILTER
Stores filter status.

    vnt_obj.FILTER

###### INFO
Stores additional information for the variant.

    vnt_obj.INFO

###### FORMAT
Stores specification for the genotype column(s) structure.

    vnt_obj.FORMAT

###### IDs_genotypes
Stores sample ID(s) available in the VCF as a list. If there are multiple samples, the order from the VCF is maintained.

    vnt_obj.IDs_genotypes

###### GENOTYPES
Stores a dictionary linking genotype(s) for the variant to corresponding sample ID(s).

    # {ID_genotype: genotype, ...}
    vnt_obj.GENOTYPES

##### Format variants
The method *to_string()* returns the variant representation in VCF format.

    vnt_vcf = vnt_obj.to_string()

The method *repr()* returns the variant representation in the form *CHROM:POSREF>ALT*.

    vnt_repr = vnt_obj.repr()

##### Manipulate genotype(s)
The method *remove_tag_genotype(tag, sep=':')* removes a tag from FORMAT and GENOTYPES. *sep* is the tags separator used in format definition and genotype(s).

    tag = 'AD'
    vnt_obj.remove_tag_genotype(tag)

The method *complete_genotype(sep=':')* fills in the trailing fields that are missing and by default dropped in GENOTYPES. *sep* is the tags separator used in format definition and genotype(s).

    vnt_obj.complete_genotype()

The method *empty_genotype(sep=':')* returns an empty genotype based on the FORMAT structure. *sep* is the tags separator used in format definition and genotype(s).

    empty = vnt_obj.empty_genotype()

The method *add_tag_format(tag, sep=':')* adds a tag at the end of the FORMAT structure. *sep* is the tags separator used in format definition and genotype(s).

    tag = 'RSTR'
    vnt_obj.add_tag_format(tag)

The method *add_values_genotype(ID_genotype, values, sep=':')* adds values at the end of the genotype specified by the corresponding ID. *sep* is the tags separator used in format definition and genotype(s).


    vnt_obj.add_values_genotype(ID_genotype, values)

The method *get_genotype_value(ID_genotype, tag, complete_genotype=False, sep=':')* returns the value for tag from the genotype specified by the corresponding ID. *sep* is the tags separator used in format definition and genotype(s). If *complete_genotype=True*, it returns '.' when the tag is missing. If *complete_genotype=False* (default), it raises an exception for the missing tag.

    tag_val = vnt_obj.get_genotype_value(ID_genotype, tag)

##### Manipulate INFO
The method *remove_tag_info(tag, sep=';')* removes a tag or a flag from INFO. *sep* is the tags separator used in INFO.

    vnt_obj.remove_tag_info(tag)

The method *add_tag_info(tag_value, sep=';')* adds a tag and its value, or a flag, at the end of INFO. *sep* is the tags separator used in INFO.

    tag_value = 'tag=value'
    vnt_obj.add_tag_info(tag_value)

    tag_value = 'flag'
    vnt_obj.add_tag_info(tag_value)

The method *get_tag_value(tag, is_flag=False, sep=';')* returns the value of tag in INFO. *sep* is the tags separator used in INFO. If the tag is a flag, set *is_flag=True*; the method will return True or False instead.

    tag_val = vnt_obj.get_tag_value(tag)
    tag_val = vnt_obj.get_tag_value(tag, is_flag=True)

*note: tag and ID are case sensitive.*

### Custom error classes
*MissingTag* describes a missing tag or tag value.

*MissingTagDefinition* describes a missing tag definition.

*TagDefinitionError* describes a format error for a tag definition.

*TagFormatError* describes a format error for a tag.

*MissingIdentifier* describes a missing genotype identifier in the VCF file.

*VcfFormatError* describes an error in the VCF format.
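
##### Handling errors, a sketch
A caller will typically catch these errors to skip or report records that lack a tag. The example below is a minimal, self-contained sketch that does not import granite: the `MissingTag` class and the `get_info_value` helper defined here are hypothetical stand-ins that only mimic the lookup behavior described for *get_tag_value*. It assumes that a missing non-flag tag raises an error of this kind. With the library itself, the assumption would be that the error classes listed above are available from the `vcf_parser` module.

```python
class MissingTag(Exception):
    """Hypothetical stand-in for an error describing a missing tag or tag value."""


def get_info_value(info, tag, is_flag=False, sep=';'):
    """Look up tag in an INFO string (illustrative helper, not granite code)."""
    for entry in info.split(sep):
        if is_flag:
            # A flag is stored as a bare name without a value
            if entry == tag:
                return True
        elif entry.startswith(tag + '='):
            # A regular tag is stored as tag=value
            return entry.split('=', 1)[1]
    if is_flag:
        return False
    raise MissingTag('tag {0} not found in INFO'.format(tag))


# Example INFO field with two tags and one flag
info = 'DP=42;CSQ=missense_variant;DB'

for tag in ('DP', 'AF'):
    try:
        print(tag, get_info_value(info, tag))
    except MissingTag as error:
        # Report the missing tag and move on instead of stopping
        print(tag, 'skipped:', error)
```

Catching the specific error class, rather than a bare `Exception`, keeps unrelated failures such as a malformed file visible to the caller.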