Filtering
Most BCFtools commands accept the -i, --include and -e, --exclude options, which allow advanced filtering. In the examples below, we demonstrate the usage on the query command because it allows us to show the output in a very compact form using the -f formatting option. (For details about the format, see the Extracting information page.)
Fixed columns such as QUAL, FILTER, INFO are straightforward to filter. In this example, we use the -e 'FILTER="."' expression to exclude sites where FILTER is not set:
$ bcftools query -e'FILTER="."' -f'%CHROM %POS %FILTER\n' file.bcf | head -2
1 3000150 PASS
1 3000151 LowQual
In this example, we use the -i 'QUAL>20 && DP>10' expression to include only sites with sufficiently high quality and depth:
$ bcftools query -i'QUAL>20 && DP>10' -f'%CHROM %POS %QUAL %DP\n' file.bcf | head -2
1 14930 31.2757 13
1 17538 37.9458 12
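When filtering on FORMAT tags with multiple samples, OR logic is applied across samples: a site passes as soon as the expression is true for at least one sample. For example, if we want to remove sites with an uncalled genotype in any sample, the expression -i 'GT!="mis"' is not going to work, because it keeps every site where at least one genotype is called: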
$ bcftools view -i'GT!="mis"' file.vcf | bcftools query -f'%CHROM %POS [ %GT]\n' | head -2
1 3000150 ./. ./. 0/1 1/1
1 3000151 ./. ./. 0|0 0|0
Instead, the reverse logic -e 'GT="mis"' must be applied:
$ bcftools view -e'GT="mis"' file.vcf | bcftools query -f'%CHROM %POS [ %GT]\n' | head -2
1 3062915 0/1 0/1 0/1 0/1
Note we used bcftools view .. | bcftools query … to filter first by site, then format the output. Using the same filtering expression directly with the query command would not work, because query excludes samples that do not satisfy the filtering expression:
$ bcftools query -i'GT!="mis"' -f'%CHROM %POS [ %GT]\n' file.vcf | head -2
1 3000150 0/1 1/1
1 3000151 0|0 0|0
$ bcftools query -e'GT="mis"' -f'%CHROM %POS [ %GT]\n' file.vcf | head -2
1 3000150 0/1 1/1
1 3000151 0|0 0|0
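The any-sample (OR) logic behind -i 'GT!="mis"' versus -e 'GT="mis"' can be emulated on a toy table with awk. This is only an illustrative sketch, not how bcftools evaluates expressions: the function names are hypothetical, the rows are copied from the examples above, and for simplicity the toy treats only ./. as a missing genotype.

```shell
# Toy genotype table: CHROM POS followed by one GT per sample
# (rows copied from the example outputs above).
toy_genotypes() {
  printf '%s\n' \
    '1 3000150 ./. ./. 0/1 1/1' \
    '1 3000151 ./. ./. 0|0 0|0' \
    '1 3062915 0/1 0/1 0/1 0/1'
}

# Mimics -i 'GT!="mis"': keep the site if at least one sample is called.
# Assumption: only "./." counts as missing in this toy.
include_if_any_called() {
  toy_genotypes | awk '{ for (i = 3; i <= NF; i++) if ($i != "./.") { print $2; next } }'
}

# Mimics -e 'GT="mis"': drop the site if at least one sample is uncalled.
exclude_if_any_missing() {
  toy_genotypes | awk '{ for (i = 3; i <= NF; i++) if ($i == "./.") next; print $2 }'
}

include_if_any_called    # prints all three positions
exclude_if_any_missing   # prints 3062915 only
```

The include form keeps partially called sites because a single called sample is enough to satisfy it, whereas the exclude form rejects a site as soon as one sample is uncalled.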
&& vs & and || vs |
Say our VCF contains the per-sample depth and genotype quality annotations and we want to include only sites where one or more samples have sufficient coverage (DP>10) and genotype quality (GQ>20). The expression -i 'FMT/DP>10 & FMT/GQ>20' selects sites where both conditions are satisfied within the same sample:
$ bcftools query -i'FMT/DP>10 & FMT/GQ>20' -f'%POS[\t%SAMPLE:DP=%DP GQ=%GQ]\n' file.bcf
49979 SampleA:DP=10 GQ=50 SampleB:DP=20 GQ=40
On the other hand, if we need to include sites where both conditions are met, but not necessarily in the same sample, we use the && operator rather than &:
$ bcftools query -i'FMT/DP>10 && FMT/GQ>20' -f'%POS[\t%SAMPLE:DP=%DP GQ=%GQ]\n' file.bcf
31771 SampleA:DP=10 GQ=50 SampleB:DP=40 GQ=20
49979 SampleA:DP=10 GQ=50 SampleB:DP=20 GQ=40
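To make the distinction concrete, the two operators can be mimicked with awk on a toy per-sample table built from the values shown above. This is a sketch under stated assumptions (hypothetical function names, one row per site and sample) and only models the site-selection logic, not the bcftools implementation.

```shell
# Toy per-sample table: POS SAMPLE DP GQ (values from the examples above).
toy_table() {
  printf '%s\n' \
    '31771 SampleA 10 50' \
    '31771 SampleB 40 20' \
    '49979 SampleA 10 50' \
    '49979 SampleB 20 40'
}

# Mimics '&': both conditions must hold within the same sample.
same_sample() {
  toy_table | awk '$3 > 10 && $4 > 20 { print $1 }' | sort -u
}

# Mimics '&&': each condition may be satisfied by a different sample.
any_sample() {
  toy_table | awk '
    $3 > 10 { dp[$1] = 1 }   # some sample at this site has DP>10
    $4 > 20 { gq[$1] = 1 }   # some sample at this site has GQ>20
    END { for (p in dp) if (p in gq) print p }' | sort -u
}

same_sample   # prints 49979
any_sample    # prints 31771 and 49979
```

Site 31771 passes only the second test: SampleB supplies the depth and SampleA the genotype quality, but no single sample supplies both.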
Similarly, the | operator can select just the matching samples:
$ bcftools query -f'[%POS %SAMPLE %DP\n]\n' -i'FMT/DP=19 | FMT/DP="."' test/view.filter.vcf
3162006 A 19

3162007 A .
3162007 B .
or the whole record when || is used:
$ bcftools query -f'[%POS %SAMPLE %DP\n]\n' -i'FMT/DP=19 || FMT/DP="."' test/view.filter.vcf
3162006 A 19
3162006 B 1

3162007 A .
3162007 B .
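The same contrast can be reproduced without a VCF file by emulating the two operators with awk. The sketch below assumes a toy table with one row per site and sample, taken from the output above; the function names are hypothetical, and the code only imitates which samples are printed, not how bcftools works internally.

```shell
# Toy per-sample table: POS SAMPLE DP (rows from the '||' output above).
toy_depths() {
  printf '%s\n' \
    '3162006 A 19' \
    '3162006 B 1' \
    '3162007 A .' \
    '3162007 B .'
}

# Mimics '|': print only the samples that match one of the conditions.
matching_samples() {
  toy_depths | awk '$3 == 19 || $3 == "." { print }'
}

# Mimics '||': print every sample of a site where any sample matches.
matching_sites() {
  toy_depths | awk '
    { rows[NR] = $0; pos[NR] = $1 }           # remember all rows in input order
    $3 == 19 || $3 == "." { hit[$1] = 1 }     # mark sites with a matching sample
    END { for (i = 1; i <= NR; i++) if (pos[i] in hit) print rows[i] }'
}

matching_samples   # omits sample B at 3162006
matching_sites     # prints all four rows
```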