Plugin fixref

Warning
Do not use the program blindly; make an effort to understand what strand convention your data uses! Make sure the reason for mismatching REF alleles is not a different reference build. Also, do NOT use bcftools norm --check-ref s for this purpose, as it will result in nonsense genotypes!

This tool helps to determine and fix strand orientation. Currently it can collect and print numbers useful in determining the strand convention (the stats mode), swap REF/ALT alleles based on the SNP reference ID (the id mode), flip or swap non-ambiguous SNPs (the flip mode), or convert from the Illumina TOP strand convention to the forward strand (the top mode).
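The difference between flipping and swapping, and why some SNPs are called ambiguous, can be illustrated with a few lines of shell. This is only a sketch of the concepts as the terms are commonly used, not the plugin's implementation; the function names complement, flip, swap and is_ambiguous are hypothetical and exist only for this illustration.

```shell
# Illustration only: hypothetical helpers, not part of bcftools.

# complement BASE: the base on the opposite strand (A<->T, C<->G).
complement() {
    printf '%s' "$1" | tr 'ACGTacgt' 'TGCAtgca'
}

# flip REF ALT: the same two alleles read from the opposite strand.
flip() {
    printf '%s %s\n' "$(complement "$1")" "$(complement "$2")"
}

# swap REF ALT: same strand, but REF and ALT exchange roles.
swap() {
    printf '%s %s\n' "$2" "$1"
}

# is_ambiguous REF ALT: succeeds when the alleles are complementary
# (A/T or C/G), i.e. a flip cannot be told apart from a swap.
is_ambiguous() {
    [ "$(complement "$1")" = "$2" ]
}

flip A G    # prints "T C"
swap A G    # prints "G A"
```

For an A/T site, flip A T and swap A T both print "T A", which is why the alleles alone cannot reveal the strand of such a SNP. Note also that in a real VCF a swap requires recoding the genotypes, which this sketch does not attempt.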

Run the stats mode (the default) to learn the number of REF allele mismatches and the number of non-biallelic sites:

bcftools +fixref test.bcf -- -f ref.fa

Reference allele mismatches can also be checked with bcftools norm, which exits with an error at the first mismatch it finds:

bcftools norm --check-ref e -f /path/to/reference.fasta input.vcf.gz -Ou -o /dev/null

If there are no REF mismatches and the number of multi-allelic sites is small, we are done. If the output shows that the VCF is TOP-compatible, the following command can be used to fix the strand:

bcftools +fixref test.bcf -Ob -o output.bcf -- -f ref.fa -m top

If the file contains dbSNP reference identifiers (rsXXX in the ID column), the following commands can be used to swap the reference and alternate alleles:

# Get the dbSNP annotation file. Make sure the correct reference build is used (e.g. b37)
#   https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/All_20151104.vcf.gz
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/All_20151104.vcf.gz.tbi

# Swap the alleles
bcftools +fixref broken.bcf -Ob -o fixref.bcf -- -d -f /path/to/reference.fasta -i All_20151104.vcf.gz

# The above command might have changed the coordinates, so the VCF must be sorted.
bcftools sort fixref.bcf -Ob -o fixref.sorted.bcf
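
After fixing and sorting, it is worth confirming that the REF alleles now agree with the reference. The following is a minimal sketch that simply repeats the two checks shown earlier on this page; the helper name verify_fixed is hypothetical, and the file names in the usage note are the ones used in the example above.

```shell
# verify_fixed VCF REF_FASTA
# Hypothetical helper: re-runs the stats mode and the norm check on a fixed file.
verify_fixed() {
    vcf=$1
    ref=$2
    # Stats mode (no -m given): reports REF mismatches and non-biallelic sites.
    bcftools +fixref "$vcf" -- -f "$ref" || return 1
    # Returns a non-zero status if a REF mismatch is still present.
    bcftools norm --check-ref e -f "$ref" "$vcf" -Ou -o /dev/null
}
```

For example, verify_fixed fixref.sorted.bcf /path/to/reference.fasta should finish with a zero exit status if no mismatches remain.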

In the most extreme case, when nothing else works, one can simply force the unambiguous alleles onto the forward strand and drop the ambiguous genotypes:

bcftools +fixref test.bcf -Ob -o output.bcf -- -f ref.fa -m flip -d

Note that this is an extremely unsafe operation and will most likely result in nonsense genotypes. If you decide to use it anyway, make sure to check the sanity of the result with the af-dist plugin!
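
Before resorting to this, it can help to know how many biallelic SNPs are strand-ambiguous (A/T or C/G), since their strand cannot be determined from the alleles alone. The sketch below counts them from tab-separated REF and ALT columns on standard input; count_ambiguous is a hypothetical helper, not a bcftools command, and the count is only a rough indication of how many sites are at risk.

```shell
# count_ambiguous: read "REF<TAB>ALT" lines and print the number of
# strand-ambiguous biallelic SNPs (A/T, T/A, C/G, G/C).
# Hypothetical helper for illustration; indels and multi-allelic
# records are not counted.
count_ambiguous() {
    awk -F'\t' '
        { pair = toupper($1) "/" toupper($2) }
        pair == "A/T" || pair == "T/A" || pair == "C/G" || pair == "G/C" { n++ }
        END { print n + 0 }
    '
}
```

It could be fed from the input file with, for example, bcftools query -f '%REF\t%ALT\n' test.bcf | count_ambiguous.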
