VCF and the Genome Analysis Toolbox

March 14, 2012

INSIDE THE BOX by Brian Osborne 

We tend to take the extraordinary for granted. Just over ten years ago we saw the first human genome sequence, at a cost of roughly $3 billion. Now a person can have their genome sequenced in a few days for a few thousand dollars, turn around, and in a few more days compute how that sequence differs from any public sequence. The analysis itself might cost just a few dollars more to rent the server. Let’s consider how one version of the bioinformatic part of this exercise might work. 

The heart of this analytic task is the open-source SAMtools and VCFtools packages, built to create and analyze the SAM/BAM (alignment) and VCF (Variant Call Format) formats. In bioinformatics we spend a lot of time puzzling over formats and reformatting before we get to the science. The “science” in this case is what’s in the VCF file. 

VCF is designed to describe sequence variation, specifically the difference between some reference sequence and closely related sequences. The designers of VCF come from centers and labs studying the human genome, so it’s built to handle diploidy but can accommodate haploids and polyploids as well. A prominent use case is cancer, where aneuploidy of all types is found. As you’d expect, VCF can also express the phase or parental origin of some set of alleles.  

VCF looks like a spreadsheet with a handful of columns and a header section. The data section never uses terms like “replace”; VCF simply states what the reference and analyte sequences are in a given interval. For example, “14370 rs6054257 G A” means that the reference G at position 14370 is replaced by an A in the analyte (a known SNP with an rs identifier). The same approach handles a complex rearrangement: show it in its entirety. 
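To make the column layout concrete, here is a minimal sketch in the shell: a hypothetical tab-separated data line (borrowing the position and RS id from the example above; the other values are invented for illustration) pulled apart with awk.

```shell
# A VCF data line is tab-separated; the first eight columns are
# CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
record=$(printf '20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14')

# Print the substitution the record describes: reference allele -> analyte allele.
printf '%s\n' "$record" | awk -F'\t' '{ print $1 ":" $2 "  " $4 " -> " $5 }'
```

This prints `20:14370  G -> A`, the substitution the record asserts.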

This illustrates a key design decision: VCF does not specify rules on how to describe rearrangements; the VCF writer makes those choices. For example, if the analyte contains a SNP inside a large inversion, the VCF may describe only the inversion, and it is up to the reading application to detect the SNP. Given the possible complexities, it is easy to see why no rules of this sort are specified. 

The intention of this format is to provide the minimum required information (chromosome, position, identifier, variation, quality score) in a small number of columns while allowing the variety of detail that different researchers may wish to provide. This is done by allowing the definition of custom, additional keys and their value types in the VCF header. For example, something like 

##INFO=&lt;ID=CH,Number=0,Type=Flag,Description="Detected by chip-based method"&gt; 

could mean that “CH” in the INFO column of the data section signifies that this variant was also detected by some chip-based method. The underlying requirement is that this format, which could be used for clinical sequencing, be flexible. The good news is that the added annotations must be defined in the header, so they can be validated. 
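On the reading side, an application would split the INFO column on semicolons and test for the bare flag. A minimal shell sketch (the INFO string and the “CH” key are hypothetical, following the example above):

```shell
# INFO is a semicolon-separated list of key=value pairs and bare flags.
info='NS=3;DP=14;CH;AF=0.5'

# Split on semicolons; grep -x requires the whole line to equal "CH".
if printf '%s\n' "$info" | tr ';' '\n' | grep -qx 'CH'; then
  echo 'variant also detected by chip-based method'
fi
```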

Initial versions of the VCF format, originally developed for the 1000 Genomes Project, focused primarily on SNPs and small indels. This led to alternative format proposals to capture larger structural variation (SV), including copy number variants. However, the current version of VCF (4.1) incorporates explicit support for SV. This version also includes tags for describing imprecise locations, since the breakpoints of an SV may not be known exactly. In describing an SV, common practice is to use symbolic labels like DEL or CNV, which tell the reading application what kind of variant it is looking at without further analysis. There are also ways to specify common mobile elements (e.g. INS:ME:ALU for insertion of an Alu repeat); an annotation like this obviates the need to show the Alu sequence itself. 
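To give a feel for what this looks like, here is a sketch of two hypothetical SV records (coordinates and values invented, in the style of the examples in the VCF 4.1 specification): a deletion with imprecise breakpoints, and an Alu insertion expressed symbolically.

```
#CHROM  POS      ID  REF  ALT           QUAL  FILTER  INFO
2       321682   .   T    <DEL>         6     PASS    SVTYPE=DEL;END=321887;IMPRECISE;CIPOS=-56,20;CIEND=-10,62
3       9425916  .   C    <INS:ME:ALU>  23    PASS    SVTYPE=INS;END=9425916
```

The IMPRECISE flag and the CIPOS/CIEND tags are the confidence-interval annotations mentioned above for breakpoints that are not known exactly.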

Say you are ready to analyze a genome sequence in SAM format (plus a corresponding fasta file). Your personal computer is probably not the best platform for compiling and computing; you might prefer some cloud resource with sufficient RAM. For the sake of simplicity, let’s stipulate that StarCluster is a straightforward way to configure and then use Amazon Web Services. Copy your files there, then proceed to SourceForge to download, compile, and install SAMtools and VCFtools. Your commands will look something like this, once per chromosome: 

>samtools faidx my.fa 

>samtools view -bt my.fa.fai my.sam > my.1.bam 

>samtools sort my.1.bam my.2 

>samtools rmdup my.2.bam my.3.bam 

>samtools index my.3.bam 

>samtools mpileup -uf my.fa my.3.bam | bcftools view -bvcg - > my.raw.bcf  

>bcftools view my.raw.bcf | vcfutils.pl varFilter -D100 > my.filtered.vcf 

The first five commands index the fasta, create a BAM file, position-sort the reads, remove duplicates, and index the result (generally speaking, all files are indexed for rapid retrieval). The key commands are the final two: first we create a binary VCF (BCF) containing the probabilities of each possible genotype, then the readable, filtered version containing the actual variant calls. 

The apparent simplicity of these commands conceals an extensive set of options and sophisticated functionality. For example, the “mpileup” command usually performs detailed local re-alignment to recover SNPs from indels that may have been created as alignment artifacts. And you would probably create these last files using options that specify depth and quality thresholds at the very least, depending on the details of the sequencing. 
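As a sketch of what those threshold options might look like (the numeric cutoffs here are arbitrary illustrations, not recommendations; check the documentation of your SAMtools release for the exact flags), -q and -Q set minimum mapping and base qualities for mpileup, while varFilter’s -d and -D set minimum and maximum read depth:

```
>samtools mpileup -q 20 -Q 20 -uf my.fa my.3.bam | bcftools view -bvcg - > my.raw.bcf

>bcftools view my.raw.bcf | vcfutils.pl varFilter -d 10 -D 100 > my.filtered.vcf
```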

The “mpileup” example above uses the fasta file corresponding to the BAM file, but the comparison we really want is to a reference genome. Substitute as needed. You can get a quick look at the results from the command-line using tview from SAMtools, but you’ll get a richer and more familiar view using the Broad Institute’s IGV graphic viewer, which superimposes human RefSeq genes by default. 

With a VCF file in hand you can perform deeper analysis with VCFtools. You might run these analyses in conjunction with other BAM files from the 1000 Genomes Project (downloaded from NCBI or EBI). You could examine SNP density, look for runs of homozygosity, perform pairwise comparisons, or examine specific regions, SNPs, or groups of SNPs. This is also the tool you’d use to export data to genotype analysis programs like PLINK. 
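For instance, a couple of hedged sketches (option names as documented for VCFtools; file names carried over from the pipeline above):

```
>vcftools --vcf my.filtered.vcf --SNPdensity 10000 --out my.density

>vcftools --vcf my.filtered.vcf --plink --out my.plink
```

The first reports SNP density per 10 kb window; the second exports PED/MAP files for PLINK.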

My intention here has been to describe one way to perform genotyping “under the hood” at the command line, but there are excellent alternatives. The Genome Analysis Toolkit (GATK, from the Broad Institute) has built-in pipelines like this one plus a GUI, and various public Galaxy installations can run similar routines. One advantage of these alternatives is dynamic access to meaningful annotation of known rs SNPs. These fully featured tools may also use existing HapMap knowledge to question or verify novel SNPs or phasing. 

The technical details of this exercise put it beyond all but the highly trained. On the other hand, there are companies that will take samples and return a fully annotated, user-friendly account of a genome. But where is the middle ground, the application that an average citizen might use for free? 

There’s heated debate in some circles about whether this citizen should even see such knowledge without a doctor present, but the genie is already out of the bottle: the tools are here. In the past decade, specialists have gone from bits and pieces (individual genes) to whole genomes at the price of a laptop. The genome app, presented in a way that all can understand, will appear in the next decade. In fact, it’s probably being written as you read. 

Brian Osborne is a principal investigator with The BioTeam. He can be reached at briano@bioteam.net.