Plugin split-vep
The plugin allows to extract fields from structured annotations such as INFO/CSQ created by bcftools/csq or VEP.
Assuming the tag added by VEP is the INFO/CSQ field, let’s start with printing the list of available subfields:
bcftools +split-vep test/split-vep.vcf -l | head 0 Allele 1 Consequence 2 IMPACT 3 SYMBOL 4 Gene 5 Feature_type 6 Feature 7 BIOTYPE 8 EXON 9 INTRON
The default tag can be changed using the -a, -annotation
option. For example, if the tag is named XXX
, add
the -a XXX
option.
Extract the Consequence
field using a bcftools query
like output.
bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence\n' 1:14464 non_coding_transcript_exon_variant,non_coding_transcript_exon_variant,...
Note
|
Unlike bcftools query -f , the plugin bcftools +split-vep -f drops lines with all of the queried
VEP fields empty. To print also lines with all values absent, add the option -X, --keep-sites .
|
To print each consequence on a separate line, rather than as a comma-separated string on a
single line, use the -d, --duplicate
option:
bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence\n' -d 1:14464 non_coding_transcript_exon_variant 1:14464 downstream_gene_variant 1:14464 regulatory_region_variant
Any number of subfields can be printed this way. If you want to print all
subfields, use the -A
option to avoid creating a very lengthy formatting
expression:
bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %CSQ\n' -d -A tab 14464 T non_coding_transcript_exon_variant MODIFIER WASH7P ENSG00000227232 ... 14464 T regulatory_region_variant MODIFIER . . RegulatoryFeature ENSR00000000002 ...
This printed the consequence prediction for all transcripts. Now limit the output to transcript with the most severe consequence printing also the coordinates and the name of the affected gene:
bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %SYMBOL %Consequence\n' -s worst 1:14464 . regulatory_region_variant 1:14574 WASH7P non_coding_transcript_exon_variant
Limit the output to missense or more severe variants:
bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Gene %Consequence\n' -s worst:missense+ 1:16856 ENSG00000227232 splice_donor_variant&non_coding_transcript_variant 1:17365 ENSG00000227232 splice_acceptor_variant&non_coding_transcript_variant 1:69428 ENSG00000186092 missense_variant 1:69469 ENSG00000186092 frameshift_variant
It is also possible to use the standard -i/-e
expressions to query the fields. For
example, filter consequences predicted to be damaging by PolyPhen:
bcftools +split-vep test/split-vep.vcf -f '%CHROM:%POS %Consequence %PolyPhen\n' -d -i 'PolyPhen~"damaging"' 1:69428 missense_variant probably_damaging(0.984) 1:69559 missense_variant possibly_damaging(0.675)
Rather than printing a formatted text, create two INFO fields MySYMBOL with the gene name and MyConsequence:
bcftools +split-vep test/split-vep.vcf -c SYMBOL,Consequence -s worst:missense+ -p My | less
In this example we show the distinction between the VCF/BCF and custom text output:
# Output VCF adding the INFO/Consequence column to every site. We pipe through `query` in order # to visualize the differences. bcftools +split-vep test/split-vep.5.vcf -c Consequence | bcftools query -f'%POS\t%Consequence\n' | head -2 69270 synonymous_variant,regulatory_region_variant 69428 missense_variant,regulatory_region_variant # Output VCF adding the INFO/Consequence column only to missense sites. However, outputs all sites. bcftools +split-vep test/split-vep.5.vcf -c Consequence -s :missense | bcftools query -f'%POS\t%Consequence\n' | head -2 69270 . 69428 missense_variant,regulatory_region_variant # Output VCF adding INFO/Consequence column to missense sites. By adding the -x option we only output the missense sites. bcftools +split-vep test/split-vep.5.vcf -c Consequence -s :missense -x | bcftools query -f'%POS\t%Consequence\n' | head -2 69428 missense_variant 69496 missense_variant # Create a custom text output, print all sites. bcftools +split-vep test/split-vep.5.vcf -f'%POS\t%Consequence\n' | head -2 69270 synonymous_variant,regulatory_region_variant 69428 missense_variant,regulatory_region_variant # Create a custom text output, print only missense consequences. bcftools +split-vep test/split-vep.5.vcf -f'%POS\t%Consequence\n' -s :missense | head -2 69428 missense_variant 69496 missense_variant
The complete usage page obtained by bcftools +split-vep -h
:
About: Query structured annotations such INFO/BCSQ or CSQ created by bcftools/csq or VEP. For more more information and pointers see http://samtools.github.io/bcftools/howtos/plugin.split-vep.html Usage: bcftools +split-vep [Plugin Options] Plugin options: -a, --annotation STR INFO annotation to parse, CSQ by default or BCSQ when not present [CSQ] -A, --all-fields DELIM Output all fields replacing the -a tag ("%CSQ" by default) in the -f filtering expression using the output field delimiter DELIM. This can be "tab", "space" or an arbitrary string. -c, --columns [LIST|-][:TYPE] Extract the fields listed either as 0-based indexes or names, "-" to extract all fields. See --columns-types for the defaults. Supported types are String/Str, Integer/Int and Float/Real. Unlisted fields are set to String. Existing header definitions will not be overwritten, remove first with `bcftools annotate -x` --columns-types -|FILE Pass "-" to print the default -c types or FILE to override the presets -d, --duplicate Output per transcript/allele consequences on a new line rather rather than as comma-separated fields on a single line -f, --format STR Create non-VCF output; similar to `bcftools query -f` but drops lines w/o consequence -g, --gene-list [+]FILE Consider only features listed in FILE, or prioritize if FILE is prefixed with "+" --gene-list-fields LIST Fields to match against by the -g list, by default gene names [SYMBOL,Gene,gene] -H, --print-header Print header, -HH to omit column indices -l, --list Parse the VCF header and list the annotation fields -p, --annot-prefix STR Before doing anything else, prepend STR to all CSQ fields to avoid tag name conflicts -s, --select TR:CSQ Select transcripts to extract by type and/or consequence severity. (See also -S and -x.) TR, transcript: all,worst,primary,pick,mane,EXPRESSION [all] CSQ, consequence: any,missense,missense+,etc [any] Where the transcript EXPRESSION is of the form <FIELD><OPERATOR><VALUE> FIELD: field name (e.g. "CANONICAL") OPERATOR: string comparison (=,!=), regex matching (~,!~) VALUE: required string value (e.g. "YES") The TR presets are defined as follows primary: CANONICAL=YES pick: PICK=1 mane: MANE_SELECT!="" -S, --severity -|FILE Pass "-" to print the default severity scale or FILE to override the default scale -u, --allow-undef-tags Print "." for undefined tags -x, --drop-sites Drop sites without consequences after -s is applied (the default with -f) -X, --keep-sites Do not drop sites without consequences (the default without -f) Common options: -e, --exclude EXPR Exclude sites and samples for which the expression is true -i, --include EXPR Include sites and samples for which the expression is true --no-version Do not append version and command line to the header -o, --output FILE Output file name [stdout] -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v] -r, --regions REG Restrict to comma-separated list of regions -R, --regions-file FILE Restrict to regions listed in a file --regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1] -t, --targets REG Similar to -r but streams rather than index-jumps -T, --targets-file FILE Similar to -R but streams rather than index-jumps --targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0] --write-index Automatically index the output files [off] Examples: # List available fields of the INFO/CSQ or INFO/BCSQ annotation bcftools +split-vep -l file.vcf.gz # List available fields of INFO/BCSQ when both INFO/CSQ and INFO/BCSQ are present bcftools +split-vep -l file.vcf.gz -a BCSQ # List the default severity scale bcftools +split-vep -S - # Extract Consequence, IMPACT and gene SYMBOL of the most severe consequence into # INFO annotations starting with the prefix "vep". For brevity, the columns can # be given also as 0-based indexes bcftools +split-vep -c Consequence,IMPACT,SYMBOL -s worst -p vep file.vcf.gz bcftools +split-vep -c 1-3 -s worst -p vep file.vcf.gz # Same as above but use the text output of the "bcftools query" format bcftools +split-vep -s worst -f '%CHROM %POS %Consequence %IMPACT %SYMBOL\n' file.vcf.gz # Print all subfields (tab-delimited) in place of %CSQ, each consequence on a new line bcftools +split-vep -f '%CHROM %POS %CSQ\n' -d -A tab file.vcf.gz # Extract gnomAD_AF subfield into a new INFO/gnomAD_AF annotation of Type=Float so that # numeric filtering can be used. bcftools +split-vep -c gnomAD_AF:Float file.vcf.gz -i'gnomAD_AF<0.001' # Similar to above, but add the annotation only if the consequence severity is missense # or equivalent. In order to drop sites with different consequences completely, provide # the -x option. See the online documentation referenced above for more examples. bcftools +split-vep -c gnomAD_AF:Float -s :missense file.vcf.gz bcftools +split-vep -c gnomAD_AF:Float -s :missense -x file.vcf.gz
Feedback
We welcome your feedback, please help us improve this page by either opening an issue on github or editing it directly and sending a pull request.