Frequently Asked Questions
The VCF specification recommends that all INFO and FORMAT tags that appear throughout the file body are defined in the VCF header.
Fix the header using the reheader command
# Write out the header to be modified
bcftools view -h old.vcf > header.txt

# Edit the header using your favorite text editor and add the missing definition, e.g.
#   ##INFO=<ID=XYZ,Number=1,Type=Integer,Description="Describe the tag">
vi header.txt

# Reheader the file
bcftools reheader -h header.txt -o new.vcf old.vcf
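When the edit has to be scripted rather than done in an interactive editor, the middle step can be replaced by a small text transformation. The sketch below is only an illustration. It works on a toy header created in place (a stand-in for the output of bcftools view -h), and the tag XYZ and the file names are placeholders. It writes the missing definition just before the #CHROM line.

```shell
# Toy header standing in for: bcftools view -h old.vcf > header.txt
workdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$workdir/header.txt"

# Placeholder definition of the missing tag (XYZ is a made-up tag name)
new_def='##INFO=<ID=XYZ,Number=1,Type=Integer,Description="Describe the tag">'

{
  grep -v '^#CHROM' "$workdir/header.txt"   # all ## meta lines, unchanged
  printf '%s\n' "$new_def"                  # the added definition
  grep '^#CHROM' "$workdir/header.txt"      # the column header line stays last
} > "$workdir/header.new.txt"

cat "$workdir/header.new.txt"
```

The resulting file would then be passed to bcftools reheader -h in place of the hand-edited header.txt.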
Why do you have to do it? Although the VCF specification allows undefined tags, HTSlib and BCFtools internally treat VCF as BCF, where all tags must be defined in the header. This is because of the way BCF is designed: the tags throughout the BCF file are represented as pointers to the dictionary of tags stored in the header. We work around this problem by adding missing definitions on the fly. Note that this works for read-only operations, but it still leads to problems when writing the file out as BCF: even though the reader updated its internal structures with a dummy definition and continued reading, the writer was not aware of the new tag when the header was written.
This error is triggered when the number of values in the data line does not match its definition in the header. For example, one may see an error like
[W::bcf_calc_ac] Incorrect number of AC fields at Chrxx:xxxxx. (This message is printed only once.)
In this example, the VCF specification defines the tag AC as Number=A
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
and expects a value for each ALT allele, for example
chr1 64334 . A C,T . . AC=1,1 GT 0/1 0/1
The error above is printed when a different number of values is encountered, for example AC=1 or AC=1,1,1 in the example above.
Other such definitions are Number=R (there must be as many values as there are REF+ALT alleles in total) and Number=G (one value per possible genotype; this is more complicated, see section 1.4.2 of the VCF specification).
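As a worked example, the expected number of values for the record shown above (ALT is C,T) can be computed with plain shell arithmetic. This is a sketch for illustration; the variable names are made up, and the Number=G line assumes diploid samples.

```shell
# ALT column of the example record above (two ALT alleles)
alt="C,T"

# Count the comma-separated ALT alleles
n_alt=$(printf '%s\n' "$alt" | awk -F, '{ print NF }')

# Number=A: one value per ALT allele
expect_A=$n_alt
# Number=R: one value per allele, REF included
expect_R=$((n_alt + 1))
# Number=G: one value per possible genotype; assuming ploidy 2 this is
# k*(k+1)/2 with k = number of alleles including REF
expect_G=$(( expect_R * (expect_R + 1) / 2 ))

echo "A=$expect_A R=$expect_R G=$expect_G"   # prints: A=2 R=3 G=6
```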
How to verify:
Look up the tag definition in the header (bcftools view -h file.vcf.gz | grep TAG) to check the expected number of values, then check the number of alleles and values in the data line (bcftools view -H file.vcf.gz -r chr1:1234567).
Note that the program only works with ploidy 1 or 2. If a tag is defined as Number=G and the ploidy is higher, the program does not support such records.
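To find every affected record rather than inspect one position at a time, the same comparison can be scripted. The following is a rough sketch, not a BCFtools feature. It runs on a toy, header-less, tab-separated file created in place, checks only the INFO tag AC against the number of ALT alleles (the Number=A case), and does not treat a missing ALT (".") specially.

```shell
workdir=$(mktemp -d)
# Toy records: the first is consistent, the other two are not
printf 'chr1\t64334\t.\tA\tC,T\t.\t.\tAC=1,1\n'         > "$workdir/toy.vcf"
printf 'chr1\t64400\t.\tA\tC,T\t.\t.\tAC=1\n'          >> "$workdir/toy.vcf"
printf 'chr1\t64500\t.\tG\tA,C\t.\t.\tDP=9;AC=1,1,1\n' >> "$workdir/toy.vcf"

mismatches=$(awk -F'\t' '
  /^#/ { next }                                   # skip header lines, if any
  {
    n_alt = split($5, alt, ",")                   # number of ALT alleles
    n_ac = 0
    n_info = split($8, info, ";")                 # INFO key=value pairs
    for (i = 1; i <= n_info; i++)
      if (info[i] ~ /^AC=/)
        n_ac = split(substr(info[i], 4), val, ",")  # number of AC values
    if (n_ac > 0 && n_ac != n_alt)
      print $1 ":" $2 " ALT=" n_alt " AC=" n_ac
  }' "$workdir/toy.vcf")

printf '%s\n' "$mismatches"
```

On the toy input this reports the records at chr1:64400 and chr1:64500. For a real file, the data lines would come from something like bcftools view -H file.vcf.gz instead.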
How to fix:
If the tag is not important for your analysis, a quick and dirty workaround is to remove the tag from the VCF completely:
bcftools annotate -x TAG
If the tag must remain in the VCF, change the definition of the tag in the header to Number=. (a dot, meaning the number of values is unknown or variable):

bcftools view -h old.vcf > hdr.txt
# edit hdr.txt and change the tag definition to Number=.
bcftools reheader -h hdr.txt old.vcf > new.vcf
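The manual edit can also be expressed as a one-line substitution. The sketch below is an illustration on a toy header created in place. It assumes the tag is the INFO tag AC and that ID and Number are the first two keys of the definition, as in the example header line earlier on this page.

```shell
workdir=$(mktemp -d)
# Toy header standing in for: bcftools view -h old.vcf > hdr.txt
printf '##fileformat=VCFv4.2\n' > "$workdir/hdr.txt"
printf '%s\n' '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">' >> "$workdir/hdr.txt"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$workdir/hdr.txt"

# For the INFO tag AC only, replace whatever follows "Number=" (up to the next comma) with "."
sed 's/^\(##INFO=<ID=AC,Number=\)[^,]*/\1./' "$workdir/hdr.txt" > "$workdir/hdr.new.txt"

# Show the rewritten definition
grep '^##INFO=<ID=AC,' "$workdir/hdr.new.txt"
```

The edited file would then be given to bcftools reheader -h as in the commands above.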
As described in the manual page, the -R option takes into account overlapping records. If a strict subset by position is required, add (or replace with) the -T option.
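The distinction can be illustrated without BCFtools. The sketch below is a conceptual model only, not BCFtools code. It takes a toy deletion record whose REF allele spans positions 100 to 104 and a hypothetical region 102-103, and evaluates an overlap test next to a strict start-position test. How the real -R and -T options treat such records can depend on the BCFtools version and options, so consult the manual page for the exact behavior.

```shell
# Toy record: POS=100, REF=ACGTA (5 bases, so the record spans 100-104)
pos=100
ref="ACGTA"
# Hypothetical region of interest
start=102
end=103

rec_end=$(( pos + ${#ref} - 1 ))   # last reference position covered by the record

# Overlap test: the span of the record intersects the region
if [ "$pos" -le "$end" ] && [ "$rec_end" -ge "$start" ]; then overlap=yes; else overlap=no; fi

# Strict test: the start position of the record lies inside the region
if [ "$pos" -ge "$start" ] && [ "$pos" -le "$end" ]; then strict=yes; else strict=no; fi

echo "overlap=$overlap strict=$strict"   # prints: overlap=yes strict=no
```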
view and query. What is going on?
Say you want to print a list of samples with non-reference genotypes. This can be done using the following command
$ bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i 'GT="alt"' file.vcf
1:67893 sample3 0/1
However, you may also want to print genotypes of ALL samples at variant sites with at least one non-reference genotype. In order for this to work, first select the desired rows with the view command, then let query format the output
$ bcftools view -i 'GT="alt"' file.vcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
1:67893 sample1 0/0
1:67893 sample2 0/0
1:67893 sample3 0/1
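The difference between the two results is whether the condition is applied per sample or per site. The sketch below imitates both behaviors with awk on a toy VCF created in place. It is a model for illustration, not how BCFtools evaluates filters, and it assumes that FORMAT contains only GT, so a genotype counts as non-reference when it contains a non-zero allele index.

```shell
workdir=$(mktemp -d)
# Toy VCF body (tab-separated): one variant site and one all-reference site
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3\n' > "$workdir/toy.vcf"
printf '1\t67893\t.\tA\tG\t.\t.\t.\tGT\t0/0\t0/0\t0/1\n' >> "$workdir/toy.vcf"
printf '1\t70000\t.\tC\tT\t.\t.\t.\tGT\t0/0\t0/0\t0/0\n' >> "$workdir/toy.vcf"

# Per-sample condition: print only the samples carrying a non-reference allele
per_sample=$(awk -F'\t' '
  /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
  /^#/ { next }
  { for (i = 10; i <= NF; i++) if ($i ~ /[1-9]/) print $1 ":" $2, name[i], $i }
' "$workdir/toy.vcf")

# Per-site condition: keep the site if any sample is non-reference, then print all samples
per_site=$(awk -F'\t' '
  /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
  /^#/ { next }
  {
    keep = 0
    for (i = 10; i <= NF; i++) if ($i ~ /[1-9]/) keep = 1
    if (keep) for (i = 10; i <= NF; i++) print $1 ":" $2, name[i], $i
  }
' "$workdir/toy.vcf")

printf 'per-sample:\n%s\nper-site:\n%s\n' "$per_sample" "$per_site"
```

On the toy input, the per-sample pass yields the single sample3 line and the per-site pass yields all three samples at 1:67893, matching the two outputs shown above.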