Plugin af-dist

This plugin helps detect possible strand issues by comparing genotype frequencies with population allele frequencies.

If working with human data, first download and index the 1000 Genomes allele frequency annotations:

wget -O af.vcf.gz ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz

bcftools index af.vcf.gz

Then annotate your data file and stream the result through the af-dist plugin to create the genotype probability distribution:

bcftools annotate -c INFO/AF -a af.vcf.gz data.vcf.gz | bcftools +af-dist | grep ^PROB > data.dist.txt

The output should look something like this:

PROB_DIST   0.000000    0.100000    100618
PROB_DIST   0.100000    0.200000    144103
PROB_DIST   0.200000    0.300000    214923
PROB_DIST   0.300000    0.400000    320721
PROB_DIST   0.400000    0.500000    817965
PROB_DIST   0.500000    0.600000    84027
PROB_DIST   0.600000    0.700000    86531
PROB_DIST   0.700000    0.800000    97986
PROB_DIST   0.800000    0.900000    108776
PROB_DIST   0.900000    1.000000    176755

Finally, plot the distribution to check that there are only a few unlikely genotypes.
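Before plotting, a quick text summary can give a first impression of the distribution. The following is a minimal sketch, not part of bcftools: the helper name `summarize_dist` and the default cutoff of 0.2 are hypothetical choices made for illustration, and the sketch assumes the four-column `PROB_DIST` layout shown above (label, bin start, bin end, count).

```shell
# Hypothetical helper (not part of bcftools): print one text bar per
# PROB_DIST bin and the fraction of genotypes in bins whose upper edge
# is at or below a cutoff.
#   $1: file with PROB_DIST lines, e.g. data.dist.txt
#   $2: cutoff on the bin upper edge (default 0.2, an arbitrary choice)
summarize_dist() {
    awk -v cutoff="${2:-0.2}" '
        $1 == "PROB_DIST" {
            lo[n] = $2; hi[n] = $3; cnt[n] = $4 + 0
            total += cnt[n]
            # small tolerance so a bin ending exactly at the cutoff is included
            if ($3 + 0 <= cutoff + 1e-9) low += cnt[n]
            if (cnt[n] > max) max = cnt[n]
            n++
        }
        END {
            if (total == 0) { print "no PROB_DIST lines found"; exit 1 }
            for (i = 0; i < n; i++) {
                # scale bars so the largest bin is 50 characters wide
                len = int(50 * cnt[i] / max)
                bar = ""
                for (j = 0; j < len; j++) bar = bar "#"
                printf "%s-%s\t%d\t%s\n", lo[i], hi[i], cnt[i], bar
            }
            printf "fraction_below_%s\t%.4f\n", cutoff, low / total
        }
    ' "$1"
}
```

Running `summarize_dist data.dist.txt` prints one bar per bin followed by the low-probability fraction. The cutoff is only a convenience for comparing files with one another; no particular value is implied to be a pass or fail threshold.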

Example of two af-dist outputs combined into one plot. In this figure, one of the VCFs (blue line) has many low-probability genotypes (given the 1000 Genomes allele frequencies and assuming Hardy-Weinberg equilibrium, HWE), which suggests that many of its REF and ALT alleles are on the incorrect strand.
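For reference, the standard Hardy-Weinberg expectations for a biallelic site with ALT allele frequency p are given below. This is the textbook formulation; that the plugin bins exactly these quantities is an assumption not confirmed by this page.

```latex
\begin{aligned}
P(\text{REF/REF}) &= (1-p)^2 \\
P(\text{REF/ALT}) &= 2p(1-p) \\
P(\text{ALT/ALT}) &= p^2
\end{aligned}
```

Under these expectations, a site with p = 0.05 gives P(ALT/ALT) = 0.0025. If the alleles were reported on the wrong strand so that REF and ALT are effectively swapped, many genotypes would appear as ALT/ALT at such sites and fall into the lowest probability bins.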

The method is reliable and robust even for non-European populations, as shown below.

The plot shows af-dist results for 52 samples from 26 populations of the 1000 Genomes Project (two samples randomly selected from each population), restricted to sites on the MEGA array, which includes population-specific variants.

Feedback

We welcome your feedback. Please help us improve this page by either opening an issue on GitHub or editing it directly and sending a pull request.