Detecting runs of homozygosity (RoH)
The BCFtools/RoH command detects regions of autozygosity in sequencing data, including exome data, using a hidden Markov model.
Preparing input data
The roh
command takes on input VCF with FORMAT columns containing either
genotype likelihoods (PL) or genotypes (GT). By default, genotype likelihoods
are expected unless the -G
option is provided.
The second required information is the estimate of the alternate allele
frequencies in the population for each site. There are two ways to obtain this
information. One is to supply a VCF with many samples from the same population
(preferably >100, though often it works down to just 20 individuals or
so). This allows the frequencies to be estimated from the samples provided.
Otherwise, the allele frequency information for your specific population can be
provided from other datasets: from a custom VCF annotation (--AF-tag
),
from a tab-delimited file (--AF-file
) or a fixed default can be used
if none of the previous is available (--AF-dflt
).
Finally, a genetic map can be provided (--genetic-map
) so that the model
can account for recombination hotspots. If not available, a constant
recombination rate per-base can be given (--rec-rate
).
Note that also both options can be given. In that case, the --rec-rate
parameter
will be interpreted differently by the program - as a fold increase of transition probabilities. This
allows the model to become more sensitive yet still account for recombination hotspots.
Of course, also the range of the values given to --rec-rate
must be different in both cases:
it will be typically in the range (1e-9,1e-3) without the --genetic-map
option
and in the range (10,1000) with --genetic-map
. In both cases bigger values result in more sensitive calls.
The program considers only biallelic sites, sites with non-zero AF, and sites with non-missing PL/GT.
Example 1
In this example, the FORMAT/PL annotation is not present, therefore we must use FORMAT/GT, see the -G option. Also, the VCF does not contain allele frequency information and there is just one sample so it cannot be estimated on the fly. Hence we use a default AF value 0.4.
bcftools roh -G30 --AF-dflt 0.4 file.vcf
Please check this usage example for details and some test data to experiment with.
Example 2
In the second example we use the allele frequencies from the 1000 Genomes Project and the genetic map distributed with Impute2 (1000GP_Phase3.tgz). Because these files are big and require some (although easy) pre-processing, a packaged version can be downloaded from here: allele frequencies and genetic map.
# Annotate VCF with the AFs and run ROH bcftools annotate -c CHROM,POS,REF,ALT,AF1KG -h AFs.tab.gz.hdr -a AFs.tab.gz in.bcf | \ bcftools roh --AF-tag AF1KG -M 100 -m genetic-map/genetic_map_chr{CHROM}_combined_b37.txt -o roh.txt
In case a different version is needed, the following commands show how the pre-packaged genetic map and allele frequencies were created:
# Get the genetic map curl -O https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz tar --wildcards -xzf 1000GP_Phase3.tgz 1000GP_Phase3/genetic_map_chr\* mv 1000GP_Phase3 genetic-map rm -f 1000GP_Phase3.tgz # Get alleles frequencies from the 1000 Genomes Project curl -Lo AFs.vcf.gz https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz bcftools query -f'%CHROM\t%POS\t%REF,%ALT\t%AF\n' AFs.vcf.gz | bgzip -c > AFs.tab.gz tabix -s1 -b2 -e2 AFs.tab.gz echo '##INFO=<ID=AF1KG,Number=1,Type=Float,Description="Allele Frequency">' > AFs.tab.gz.hdr rm -f AFs.vcf.gz
Example 3
Sometimes multiple samples need to be analyzed and compared. There is the convenience script run-roh.pl which takes multiple VCF/BCF files and locates regions private to one sample or shared across multiple samples. On input it expects a directory with .vcf, .vcf.gz or .bcf files, a file with allele frequencies and a genetic map (optional). The calls can be then plotted or visualized interactively using the plot-roh.py script
run-roh.pl -i input-dir -o output-dir plot-roh.py -i output-dir
An example of the output is shown below. In the first window, all chromosomes from three samples
are shown. Each dot corresponds to one marker with either the alternate homozygous genotype (top) or the
heterozygous genotype (bottom). The grey and red rectangles highlight regions predicted by bcftools roh
as homozygous - they should contain few heterozygous genotypes. The grey rectangles highlight homozygous
regions private to one sample, while the red rectangles show homozygous regions shared across multiple samples.
In the second window is chromosome 17, zoomed-in using the button with the magnifying glass icon.
Troubleshooting
If the results look unexpected, check how many sites have been ignored by the program. For example in this run many sites were filtered:
Number of lines: total/processed: 599218/37730
If the number of the processed sites is too low, check what was the reason for excluding them. This command should give the number of sites that were processed:
bcftools view -i'type="SNP" && AF>0' -m2 -M2 -s Indiv1 -Ou file.vcf | bcftools +counts -i'GT!="./." && GT!="."
References
Please cite this paper if you find our software useful: http://www.ncbi.nlm.nih.gov/pubmed/26826718
Feedback
We welcome your feedback, please help us improve this page by either opening an issue on github or editing it directly and sending a pull request.