Plugin split-vep

The plugin extracts fields from structured annotations such as INFO/CSQ, created by bcftools/csq or VEP.

Assuming the tag added by VEP is the INFO/CSQ field, let's start by printing the list of available subfields:

bcftools +split-vep test/split-vep.vcf -l | head

0   Allele
1   Consequence
2   IMPACT
3   SYMBOL
4   Gene
5   Feature_type
6   Feature
7   BIOTYPE
8   EXON
9   INTRON

The default tag can be changed using the -a, --annotation option. For example, if the tag is named XXX, add the -a XXX option.
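The subfield names listed above come from the VCF header rather than from the records: VEP conventionally ends the Description of the INFO/CSQ header line with "Format:" followed by the "|"-separated subfield names. The following is a minimal shell sketch of that idea, not the plugin's implementation. The header line is a hypothetical one shortened to four subfields, and list_subfields is an illustrative helper name.

```shell
# Hypothetical INFO/CSQ header line, shortened to four subfields.
header='##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">'

list_subfields() {
    # Keep the text after "Format: ", drop the closing quote and bracket,
    # then print one subfield per line together with its 0-based index.
    printf '%s\n' "$1" |
        sed -e 's/.*Format: //' -e 's/">$//' |
        tr '|' '\n' |
        awk '{ printf "%d\t%s\n", NR - 1, $0 }'
}

list_subfields "$header"
```

For this header the sketch prints four tab-separated lines, from "0 Allele" to "3 SYMBOL", in the same layout as the -l output.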

Extract the Consequence field, formatting the output in the style of bcftools query:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence\n'

1:14464 non_coding_transcript_exon_variant,non_coding_transcript_exon_variant,...

To print each consequence on a separate line, rather than as a comma-separated string on a single line, use the -d, --duplicate option:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence\n' -d

1:14464 non_coding_transcript_exon_variant
1:14464 downstream_gene_variant
1:14464 regulatory_region_variant
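Because -d puts one consequence on each line, the output is easy to summarize with standard tools. The sketch below tallies how often each consequence term occurs. It assumes the two-column layout produced by the command above; count_consequences is a hypothetical helper name and the sample lines are made up for illustration.

```shell
# Count how often each consequence term occurs in "-d" style text output
# (column 1: CHROM:POS, column 2: Consequence), most frequent first.
count_consequences() {
    awk '{ n[$2]++ } END { for (c in n) printf "%d\t%s\n", n[c], c }' |
        sort -k1,1nr -k2,2
}

# Hypothetical sample lines standing in for real plugin output.
printf '%s\n' \
    '1:14464 non_coding_transcript_exon_variant' \
    '1:14464 downstream_gene_variant' \
    '1:14464 non_coding_transcript_exon_variant' |
    count_consequences
```

On the sample lines this prints a count of 2 for non_coding_transcript_exon_variant followed by a count of 1 for downstream_gene_variant. With real data, the output of the bcftools +split-vep command above can be piped into count_consequences in place of printf. Terms joined with "&" are counted as a single combined term in this sketch.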

Any number of subfields can be printed this way. To print all subfields without writing a very long formatting expression, use the -A option, which takes the output field delimiter (here tab):

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %CSQ\n' -d -A tab

14464   T   non_coding_transcript_exon_variant  MODIFIER    WASH7P  ENSG00000227232 ...
14464   T   regulatory_region_variant   MODIFIER    .   .   RegulatoryFeature   ENSR00000000002 ...

This printed the consequence predictions for all transcripts. Now limit the output to the transcript with the most severe consequence, also printing the coordinates and the name of the affected gene:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %SYMBOL %Consequence\n' -s worst

1:14464 . regulatory_region_variant
1:14574 WASH7P non_coding_transcript_exon_variant

Limit the output to missense or more severe variants:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Gene %Consequence\n' -s worst:missense+

1:16856 ENSG00000227232 splice_donor_variant&non_coding_transcript_variant
1:17365 ENSG00000227232 splice_acceptor_variant&non_coding_transcript_variant
1:69428 ENSG00000186092 missense_variant
1:69469 ENSG00000186092 frameshift_variant

It is also possible to use the standard -i/-e expressions to query the fields. For example, select the consequences that PolyPhen predicts to be damaging:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence %PolyPhen\n' -d -i 'PolyPhen~"damaging"'

1:69428 missense_variant probably_damaging(0.984)
1:69559 missense_variant possibly_damaging(0.675)
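The expression above matches the PolyPhen subfield as a string, so it does not directly apply a numeric cut-off to the score in parentheses. One possible workaround is to post-process the text output. The sketch below assumes the three-column layout produced by the command above and a PolyPhen value of the form prediction(score); filter_polyphen is a hypothetical helper name, not part of bcftools.

```shell
# Keep lines whose PolyPhen score (third column, "prediction(score)") is at
# least the threshold given as the first argument. Lines without a score,
# such as a missing value ".", are skipped.
filter_polyphen() {
    awk -v min="$1" '$3 ~ /\(/ {
        score = $3
        sub(/^[^(]*\(/, "", score)   # strip the leading "prediction("
        sub(/\)$/, "", score)        # strip the trailing ")"
        if (score + 0 >= min + 0) print
    }'
}

# The two lines printed by the command above, used here as sample input.
printf '%s\n' \
    '1:69428 missense_variant probably_damaging(0.984)' \
    '1:69559 missense_variant possibly_damaging(0.675)' |
    filter_polyphen 0.9
```

With a threshold of 0.9 only the 1:69428 line is kept, since 0.675 falls below the cut-off.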

Rather than printing formatted text, create two INFO fields in the VCF output: MySYMBOL, holding the gene name, and MyConsequence, holding the consequence:

bcftools +split-vep test/split-vep.vcf -c SYMBOL,Consequence -s worst:missense+ -p My | less

The following examples show the difference between VCF/BCF output and custom text output:

# Output VCF adding the INFO/Consequence column to every site. We pipe through `query` in order
# to visualize the differences.
bcftools +split-vep test/split-vep.5.vcf -c Consequence | bcftools query -f'%POS\t%Consequence\n' | head -2
69270    synonymous_variant,regulatory_region_variant
69428    missense_variant,regulatory_region_variant

# Output VCF adding the INFO/Consequence column only to missense sites. All sites are still output.
bcftools +split-vep test/split-vep.5.vcf -c Consequence -s :missense | bcftools query -f'%POS\t%Consequence\n' | head -2
69270    .
69428    missense_variant,regulatory_region_variant

# Output VCF adding INFO/Consequence column to missense sites. By adding the -x option we only output the missense sites.
bcftools +split-vep test/split-vep.5.vcf -c Consequence -s :missense -x | bcftools query -f'%POS\t%Consequence\n' | head -2
69428    missense_variant
69496    missense_variant

# Create a custom text output, print all sites.
bcftools +split-vep test/split-vep.5.vcf -f'%POS\t%Consequence\n' | head -2
69270    synonymous_variant,regulatory_region_variant
69428    missense_variant,regulatory_region_variant

# Create a custom text output, print only missense consequences.
bcftools +split-vep test/split-vep.5.vcf -f'%POS\t%Consequence\n' -s :missense | head -2
69428    missense_variant
69496    missense_variant

The complete usage page obtained by bcftools +split-vep -h:

About: Query structured annotations such as INFO/CSQ created by bcftools/csq or VEP. For more
   information and pointers see http://samtools.github.io/bcftools/howtos/plugin.split-vep.html
Usage: bcftools +split-vep [Plugin Options]
Plugin options:
   -a, --annotation STR        INFO annotation to parse [CSQ]
   -A, --all-fields DELIM      Output all fields replacing the -a tag ("%CSQ" by default) in the -f
                                 formatting expression using the output field delimiter DELIM. This can be
                                 "tab", "space" or an arbitrary string.
   -c, --columns LIST[:type]   Extract the fields listed either as indexes or names. The default type
                                 of the new annotation is String but can be also Integer/Int or Float/Real.
   -d, --duplicate             Output per transcript/allele consequences on a new line rather than
                                 as comma-separated fields on a single line
   -f, --format <string>       Formatting expression for non-VCF/BCF output, same as `bcftools query -f`
   -l, --list                  Parse the VCF header and list the annotation fields
   -p, --annot-prefix STR      Prefix of INFO annotations to be created after splitting the CSQ string
   -s, --select TR:CSQ         Select transcripts to extract by type and/or consequence. (See also the -x switch.)
                                 TR, transcript:   worst,primary(*),all        [all]
                                 CSQ, consequence: any,missense,missense+,etc  [any]
                                 (*) Primary transcripts have the field "CANONICAL" set to "YES"
   -S, --severity -|FILE       Pass "-" to print the default severity scale or FILE to override
                                 the default scale
   -x, --drop-sites            Drop sites with none of the consequences matching the severity specified by -s.
                                  This switch is intended for use with VCF/BCF output (i.e. -f not given).
Common options:
   -e, --exclude EXPR          Exclude sites and samples for which the expression is true
   -i, --include EXPR          Include sites and samples for which the expression is true
   -o, --output FILE           Output file name [stdout]
   -O, --output-type b|u|z|v   b: compressed BCF, u: uncompressed BCF, z: compressed VCF or text, v: uncompressed VCF or text [v]
   -r, --regions REG           Restrict to comma-separated list of regions
   -R, --regions-file FILE     Restrict to regions listed in a file
   -t, --targets REG           Similar to -r but streams rather than index-jumps
   -T, --targets-file FILE     Similar to -R but streams rather than index-jumps

Examples:
   # List available fields of the INFO/CSQ annotation
   bcftools +split-vep -l file.vcf.gz

   # List the default severity scale
   bcftools +split-vep -S -

   # Extract Consequence, IMPACT and gene SYMBOL of the most severe consequence into
   # INFO annotations starting with the prefix "vep". For brevity, the columns can
   # be given also as 0-based indexes
   bcftools +split-vep -c Consequence,IMPACT,SYMBOL -s worst -p vep file.vcf.gz
   bcftools +split-vep -c 1-3 -s worst -p vep file.vcf.gz

   # Same as above but use the text output of the "bcftools query" format
   bcftools +split-vep -s worst -f '%CHROM %POS %Consequence %IMPACT %SYMBOL\n' file.vcf.gz

   # Print all subfields (tab-delimited) in place of %CSQ, each consequence on a new line
   bcftools +split-vep -f '%CHROM %POS %CSQ\n' -d -A tab file.vcf.gz

   # Extract gnomAD_AF subfield into a new INFO/gnomAD_AF annotation of Type=Float so that
   # numeric filtering can be used.
   bcftools +split-vep -c gnomAD_AF:Float file.vcf.gz -i'gnomAD_AF<0.001'

   # Similar to above, but add the annotation only if the consequence severity is missense
   # or equivalent. In order to drop sites with different consequences completely, we add
   # the -x switch. See the online documentation referenced above for more examples.
   bcftools +split-vep -c gnomAD_AF:Float -s :missense    file.vcf.gz
   bcftools +split-vep -c gnomAD_AF:Float -s :missense -x file.vcf.gz

Feedback

We welcome your feedback. Please help us improve this page by opening an issue on GitHub or by editing it directly and sending a pull request.