Plugin split-vep

The plugin allows to extract fields from structured annotations such as INFO/CSQ created by bcftools/csq or VEP.

Assuming the tag added by VEP is the INFO/CSQ field, let’s start with printing the list of available subfields:

bcftools +split-vep test/split-vep.vcf -l | head

0   Allele
1   Consequence
2   IMPACT
3   SYMBOL
4   Gene
5   Feature_type
6   Feature
7   BIOTYPE
8   EXON
9   INTRON

The default tag can be changed using the -a, -annotation option. For example, if the tag is named XXX, add the -a XXX option.

Extract the Consequence field using a bcftools query like output.

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence\n'

1:14464 non_coding_transcript_exon_variant,non_coding_transcript_exon_variant,...
Note
Unlike bcftools query -f, the plugin bcftools +split-vep -f drops lines with all of the queried VEP fields empty. To print also lines with all values absent, add the option -X, --keep-sites.

To print each consequence on a separate line, rather than as a comma-separated string on a single line, use the -d, --duplicate option:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence\n' -d

1:14464 non_coding_transcript_exon_variant
1:14464 downstream_gene_variant
1:14464 regulatory_region_variant

Any number of subfields can be printed this way. If you want to print all subfields, use the -A option to avoid creating a very lengthy formatting expression:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %CSQ\n' -d -A tab

14464   T   non_coding_transcript_exon_variant  MODIFIER    WASH7P  ENSG00000227232 ...
14464   T   regulatory_region_variant   MODIFIER    .   .   RegulatoryFeature   ENSR00000000002 ...

This printed the consequence prediction for all transcripts. Now limit the output to transcript with the most severe consequence printing also the coordinates and the name of the affected gene:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %SYMBOL %Consequence\n' -s worst

1:14464 . regulatory_region_variant
1:14574 WASH7P non_coding_transcript_exon_variant

Limit the output to missense or more severe variants:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Gene %Consequence\n' -s worst:missense+

1:16856 ENSG00000227232 splice_donor_variant&non_coding_transcript_variant
1:17365 ENSG00000227232 splice_acceptor_variant&non_coding_transcript_variant
1:69428 ENSG00000186092 missense_variant
1:69469 ENSG00000186092 frameshift_variant

It is also possible to use the standard -i/-e expressions to query the fields. For example, filter consequences predicted to be damaging by PolyPhen:

bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence %PolyPhen\n' -d -i 'PolyPhen~"damaging"'

1:69428 missense_variant probably_damaging(0.984)
1:69559 missense_variant possibly_damaging(0.675)

Rather than printing a formatted text, create two INFO fields MySYMBOL with the gene name and MyConsequence:

bcftools +split-vep test/split-vep.vcf -c SYMBOL,Consequence -s worst:missense+ -p My | less

In this example we show the distinction between the VCF/BCF and custom text output:

# Output VCF adding the INFO/Consequence column to every site. We pipe through `query` in order
# to visualize the differences.
bcftools +split-vep test/split-vep.5.vcf -c Consequence | bcftools query -f'%POS\t%Consequence\n' | head -2
69270    synonymous_variant,regulatory_region_variant
69428    missense_variant,regulatory_region_variant

# Output VCF adding the INFO/Consequence column only to missense sites. However, outputs all sites.
bcftools +split-vep test/split-vep.5.vcf -c Consequence -s :missense | bcftools query -f'%POS\t%Consequence\n' | head -2
69270    .
69428    missense_variant,regulatory_region_variant

# Output VCF adding INFO/Consequence column to missense sites. By adding the -x option we only output the missense sites.
bcftools +split-vep test/split-vep.5.vcf -c Consequence -s :missense -x | bcftools query -f'%POS\t%Consequence\n' | head -2
69428    missense_variant
69496    missense_variant

# Create a custom text output, print all sites.
bcftools +split-vep test/split-vep.5.vcf -f'%POS\t%Consequence\n' | head -2
69270    synonymous_variant,regulatory_region_variant
69428    missense_variant,regulatory_region_variant

# Create a custom text output, print only missense consequences.
bcftools +split-vep test/split-vep.5.vcf -f'%POS\t%Consequence\n' -s :missense | head -2
69428    missense_variant
69496    missense_variant

The complete usage page obtained by bcftools +split-vep -h:

About: Query structured annotations such INFO/BCSQ or CSQ created by bcftools/csq or VEP. For more
   more information and pointers see http://samtools.github.io/bcftools/howtos/plugin.split-vep.html
Usage: bcftools +split-vep [Plugin Options]
Plugin options:
   -a, --annotation STR            INFO annotation to parse, CSQ by default or BCSQ when not present [CSQ]
   -A, --all-fields DELIM          Output all fields replacing the -a tag ("%CSQ" by default) in the -f
                                     filtering expression using the output field delimiter DELIM. This can be
                                     "tab", "space" or an arbitrary string.
   -c, --columns [LIST|-][:TYPE]   Extract the fields listed either as 0-based indexes or names, "-" to extract all
                                     fields. See --columns-types for the defaults. Supported types are String/Str,
                                     Integer/Int and Float/Real. Unlisted fields are set to String. Existing header
                                     definitions will not be overwritten, remove first with `bcftools annotate -x`
       --columns-types -|FILE      Pass "-" to print the default -c types or FILE to override the presets
   -d, --duplicate                 Output per transcript/allele consequences on a new line rather rather than
                                     as comma-separated fields on a single line
   -f, --format STR                Create non-VCF output; similar to `bcftools query -f` but drops lines w/o consequence
   -g, --gene-list [+]FILE         Consider only features listed in FILE, or prioritize if FILE is prefixed with "+"
       --gene-list-fields LIST     Fields to match against by the -g list, by default gene names [SYMBOL,Gene,gene]
   -H, --print-header              Print header
   -l, --list                      Parse the VCF header and list the annotation fields
   -p, --annot-prefix STR          Before doing anything else, prepend STR to all CSQ fields to avoid tag name conflicts
   -s, --select TR:CSQ             Select transcripts to extract by type and/or consequence severity. (See also -S and -x.)
                                     TR, transcript:   all,worst,primary,pick,mane,EXPRESSION [all]
                                     CSQ, consequence: any,missense,missense+,etc [any]
                                   Where the transcript EXPRESSION is of the form <FIELD><OPERATOR><VALUE>
                                     FIELD:    field name (e.g. "CANONICAL")
                                     OPERATOR: string comparison (=,!=), regex matching (~,!~)
                                     VALUE:    required string value (e.g. "YES")
                                   The TR presets are defined as follows
                                     primary:  CANONICAL=YES
                                     pick:     PICK=1
                                     mane:     MANE_SELECT!=""
   -S, --severity -|FILE           Pass "-" to print the default severity scale or FILE to override
                                     the default scale
   -u, --allow-undef-tags          Print "." for undefined tags
   -x, --drop-sites                Drop sites without consequences after -s is applied (the default with -f)
   -X, --keep-sites                Do not drop sites without consequences (the default without -f)
Common options:
   -e, --exclude EXPR              Exclude sites and samples for which the expression is true
   -i, --include EXPR              Include sites and samples for which the expression is true
       --no-version                Do not append version and command line to the header
   -o, --output FILE               Output file name [stdout]
   -O, --output-type u|b|v|z[0-9]  u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
   -r, --regions REG               Restrict to comma-separated list of regions
   -R, --regions-file FILE         Restrict to regions listed in a file
       --regions-overlap 0|1|2     Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
   -t, --targets REG               Similar to -r but streams rather than index-jumps
   -T, --targets-file FILE         Similar to -R but streams rather than index-jumps
       --targets-overlap 0|1|2     Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
       --write-index               Automatically index the output files [off]

Examples:
   # List available fields of the INFO/CSQ or INFO/BCSQ annotation
   bcftools +split-vep -l file.vcf.gz

   # List available fields of INFO/BCSQ when both INFO/CSQ and INFO/BCSQ are present
   bcftools +split-vep -l file.vcf.gz -a BCSQ

   # List the default severity scale
   bcftools +split-vep -S -

   # Extract Consequence, IMPACT and gene SYMBOL of the most severe consequence into
   # INFO annotations starting with the prefix "vep". For brevity, the columns can
   # be given also as 0-based indexes
   bcftools +split-vep -c Consequence,IMPACT,SYMBOL -s worst -p vep file.vcf.gz
   bcftools +split-vep -c 1-3 -s worst -p vep file.vcf.gz

   # Same as above but use the text output of the "bcftools query" format
   bcftools +split-vep -s worst -f '%CHROM %POS %Consequence %IMPACT %SYMBOL\n' file.vcf.gz

   # Print all subfields (tab-delimited) in place of %CSQ, each consequence on a new line
   bcftools +split-vep -f '%CHROM %POS %CSQ\n' -d -A tab file.vcf.gz

   # Extract gnomAD_AF subfield into a new INFO/gnomAD_AF annotation of Type=Float so that
   # numeric filtering can be used.
   bcftools +split-vep -c gnomAD_AF:Float file.vcf.gz -i'gnomAD_AF<0.001'

   # Similar to above, but add the annotation only if the consequence severity is missense
   # or equivalent. In order to drop sites with different consequences completely, provide
   # the -x option. See the online documentation referenced above for more examples.
   bcftools +split-vep -c gnomAD_AF:Float -s :missense    file.vcf.gz
   bcftools +split-vep -c gnomAD_AF:Float -s :missense -x file.vcf.gz

Feedback

We welcome your feedback, please help us improve this page by either opening an issue on github or editing it directly and sending a pull request.