Frequently Asked Questions

'XYZ' is not defined in the header, assuming Type=String

The VCF specification recommends that all INFO and FORMAT tags that appear throughout the file body are defined in the VCF header.

Fix the header using the reheader command

# Write out the header to be modified
bcftools view -h old.vcf > header.txt

# Edit the header using your favorite text editor and add the missing definition, e.g.
#   ##INFO=<ID=XYZ,Number=1,Type=Integer,Description="Describe the tag">
vi header.txt

# Reheader the file
bcftools reheader -h header.txt -o new.vcf old.vcf

Why do you have to do it? Although the VCF specification allows undefined tags, HTSlib and BCFtools internally treat VCF as BCF, where all tags must be defined in the header. This follows from the design of BCF: tags in the body of a BCF file are stored as pointers into the dictionary of tags kept in the header. We work around the problem by adding missing definitions on the fly. Note that this can work for read-only operations but will still lead to problems when writing the file out as BCF: although the reader updated its internal structures with a dummy definition and continued reading, the writer was not aware of the new tag when the header was written.
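
To see which tags need a definition before editing the header, the INFO column can be compared against the ##INFO header lines. The following is a minimal sketch and not part of BCFtools: the function name undefined_info_tags is hypothetical, it assumes an uncompressed text VCF, and it checks INFO tags only (FORMAT tags in column 9 would need the same treatment against ##FORMAT lines).

```shell
# Print each INFO tag that is used in the body but has no ##INFO header line.
# Hypothetical helper; assumes a plain-text, tab-delimited VCF file as $1.
undefined_info_tags() {
    awk -F'\t' '
        # Collect the IDs of all ##INFO definitions
        /^##INFO=<ID=/ {
            id = $0
            sub(/^##INFO=<ID=/, "", id)
            sub(/[,>].*/, "", id)
            defined[id] = 1
            next
        }
        # Skip all other header lines
        /^#/ { next }
        # Data lines: split the INFO column (8) into key[=value] items
        $8 != "." {
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++) {
                tag = kv[i]
                sub(/=.*/, "", tag)
                if (!(tag in defined) && !(tag in seen)) {
                    seen[tag] = 1
                    print tag
                }
            }
        }
    ' "$1"
}
```

Call it as undefined_info_tags old.vcf. A compressed file can be unpacked first, for example with gzip -dc old.vcf.gz > old.vcf.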

Incorrect number of fields at chr1:1234567

This error is triggered when the number of values in the data line does not match its definition in the header. For example, one may see an error like

[W::bcf_calc_ac] Incorrect number of AC fields at Chrxx:xxxxx. (This message is printed only once.)

In this example, the VCF specification defines the tag AC as Number=A

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">

and expects a value for each ALT allele, for example

chr1  64334  .  A  C,T  .  .  AC=1,1  GT  0/1  0/1

The error above is printed when a different number of values is encountered, for example AC=1 or AC=1,1,1 in the example above.

Other such definitions are Number=R (there must be as many values as there are REF+ALT alleles in total) and Number=G (this is more complicated, see section 1.4.2 of the VCF specification).

How to verify:
Look up the tag definition in the header (bcftools view -h file.vcf.gz | grep TAG) to check the expected number of values, then check the number of alleles and values in the data line (bcftools view -H file.vcf.gz -r chr1:1234567). Note that the program only works with ploidy 1 or 2, so a tag defined as Number=G is not handled when the ploidy is higher.
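
The expected number of values for each Number class can be worked out from the number of ALT alleles and the ploidy. Below is a small sketch, assuming the genotype count given in the VCF specification, C(alleles + ploidy - 1, ploidy). The function name expected_count is hypothetical and is not a BCFtools command.

```shell
# Usage: expected_count CLASS N_ALT [PLOIDY]
# CLASS is A, R or G; PLOIDY defaults to 2 and only matters for G.
expected_count() {
    class=$1 n_alt=$2 ploidy=${3:-2}
    case $class in
        A) echo "$n_alt" ;;                 # one value per ALT allele
        R) echo $((n_alt + 1)) ;;           # one value per allele, REF included
        G)
            # Number of genotypes: C(n_alleles + ploidy - 1, ploidy),
            # computed incrementally so every intermediate value is an integer
            n=$((n_alt + 1 + ploidy - 1))
            result=1
            i=1
            while [ "$i" -le "$ploidy" ]; do
                result=$((result * (n - ploidy + i) / i))
                i=$((i + 1))
            done
            echo "$result"
            ;;
        *) echo "unsupported Number class: $class" >&2; return 1 ;;
    esac
}
```

For the data line shown earlier, with two ALT alleles and diploid genotypes, expected_count A 2 prints 2, expected_count R 2 prints 3 and expected_count G 2 2 prints 6.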

How to fix:
If the tag is not important for your analysis, a quick and dirty workaround is to remove the tag from the VCF completely

# Use FORMAT/TAG instead of INFO/TAG if the tag is a per-sample field
bcftools annotate -x INFO/TAG -o new.vcf old.vcf

If the tag must remain in the VCF, change the definition of the tag in the header to Number=.

bcftools view -h old.vcf > hdr.txt
# edit hdr.txt and change the tag definition to Number=.
bcftools reheader -h hdr.txt old.vcf > new.vcf

The -R option pulls in sites from outside of the regions file

As described in the manual page, the -R option takes into account overlapping records. If a strict subset by position is required, add (or replace with) the -T option.
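
The difference can be pictured with a record whose REF allele spans several bases, such as a deletion. This is a conceptual sketch only, with hypothetical function names, assuming that an overlap test considers the whole span from POS to POS + length(REF) - 1 while a strict positional test looks at POS alone. It does not reproduce the BCFtools implementation.

```shell
# Does the record span [pos, pos + reflen - 1] overlap the region [beg, end]?
record_overlaps() {
    pos=$1 reflen=$2 beg=$3 end=$4
    rec_end=$((pos + reflen - 1))
    [ "$pos" -le "$end" ] && [ "$rec_end" -ge "$beg" ]
}

# Does POS itself fall inside the region [beg, end]?
pos_inside() {
    pos=$1 beg=$2 end=$3
    [ "$pos" -ge "$beg" ] && [ "$pos" -le "$end" ]
}

# A 5-base REF allele starting at 98 versus the region 100-200:
# the record reaches position 102, so it overlaps, but POS is outside
record_overlaps 98 5 100 200 && echo "overlap test: selected"
pos_inside 98 100 200 || echo "position test: not selected"
```

Under these assumptions the record is reported by an overlap-based selection even though its POS lies before the region, which is the situation the question describes.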

Filtering expressions seem to work differently for view and query. What is going on?

Say you want to print a list of samples with non-reference genotypes. This can be done using the following command

$ bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i 'GT="alt"' file.vcf
1:67893 sample3 0/1

However, you may also want to print genotypes of ALL samples at variant sites with at least one non-reference genotype. In order for this to work, first select the desired rows with the view command, then let query format the output

$ bcftools view -i 'GT="alt"' file.vcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
1:67893 sample1 0/0
1:67893 sample2 0/0
1:67893 sample3 0/1

How to cite BCFtools?

Please see here.

Feedback

We welcome your feedback. Please help us improve this page by either opening an issue on GitHub or editing it directly and sending a pull request.