Consensus sequence

Sometimes there is the need to create a consensus sequence for an individual where the sequence incorporates variants typed for this individual. This is possible using the consensus command.

We need the reference sequence reference.fa in the fasta format and an indexed VCF with the variants calls.bcf. The command is:

cat reference.fa | bcftools consensus calls.vcf.gz > consensus.fa

This assumes we have already made the calls, normalized indels and filtered. There is another page which goes deeper and is devoted just to this, but in brief, the variant calling command in its simplest form is:

# call variants
bcftools mpileup -Ou -f reference.fa alignments.bam | bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz

# normalize indels
bcftools norm -f reference.fa calls.vcf.gz -Ob -o calls.norm.bcf

# filter adjacent indels within 5bp
bcftools filter --IndelGap 5 calls.norm.bcf -Ob -o calls.norm.flt-indels.bcf

# apply variants to create consensus sequence
cat reference.fa | bcftools consensus calls.vcf.gz > consensus.fa

# output IUPAC ambiguity codes based on REF+ALT columns (regardless of genotype)
cat reference.fa | bcftools consensus --iupac-codes calls.vcf.gz > consensus.fa

# output IUPAC ambiguity codes based on sample genotypes
cat reference.fa | bcftools consensus --haplotype I calls.vcf.gz > consensus.fa

Note: in order to keep the program relatively simple, consensus merely applies variants from the VCF regardless of AF or other properties. If more more advanced logic is required, the VCF has to be preprocessed with other tools, such as the +setGT plugin (see an example here).

Feedback

We welcome your feedback, please help us improve this page by either opening an issue on github or editing it directly and sending a pull request.