New Study Reveals 1 Million Human Genome Sequence Errors Across Two NGS Platforms

By Kevin Davies

April 1, 2011 | “What does it mean to have a ‘healthy’ genome?” That was the question that University of Utah geneticist Mark Yandell and colleagues set out to address in an important recent paper in the journal Genetics in Medicine.* Among the key conclusions: there are 1.1 million discrepancies when the identical human genome sample is sequenced using two popular next-generation sequencing (NGS) platforms.

As Yandell and coworkers point out in the paper’s introduction, neither J. Craig Venter’s nor James Watson’s genomes were found to contain any strongly deleterious gene variants likely to cause or strongly predispose them to genetic illness, prompting some commentators to express skepticism regarding the prognostic value of personal genome sequences.

“To date,” the authors write, “the standard reply to the skeptic has been that healthy adults have healthy genomes. Although reasonable, this rebuttal presumes that we know what a healthy genome is. No doubt, a clean bill of genomic health will be the most common clinical scenario in genomic medicine. However just what does a healthy genome look like? What is the impact of sequencing technology on prognostic accuracy? What role will ethnicity play in prognosis? Finally, how useful will existing resources, such as OMIM, be for categorizing personal genome variants as deleterious? The answers to these questions are of immediate importance for the future of genomic medicine.”

Yandell recently spoke to Bio-IT World about his team’s results, including the release of the 10Gen set of personal genome variant data. While his own group works on tools for genome annotation and functional genomics, he is increasingly interested in developing tools for personal genome analysis. Late last year, Yandell, together with Martin Reese and colleagues at San Francisco-based software firm Omicia, Karen Eilbeck (University of Utah), Gabor Marth (Boston College), Paul Flicek (EBI) and Lincoln Stein (Ontario Institute for Cancer Research), published a paper in Genome Biology describing a standardized file format called GVF (Genome Variation Format) for exchanging and comparing personal genome sequences.

In the new paper, the collaboration presents an analysis of the first ten publicly available human genome sequences, including the genomes of Watson, Venter, Steve Quake, two Asian individuals and four HapMap individuals, one of whom has been sequenced on two platforms. A major goal is to explore ways to interpret personal genome sequences for clinical diagnostic purposes, says Yandell, rather than from a population genetics viewpoint.

A Million Variations

Yandell’s team looked at the first ten human genomes sequenced and publicly released, which were generated on six different platforms (Sanger, Illumina, Life Tech, Complete, Roche/454, Helicos), and found that the platform differences were not sufficient to obscure the ethnic relationships between the genomes. Even so, there was a striking result from the side-by-side comparison of two published sequence datasets on the same HapMap sample. The sample, from an anonymous African individual (NA18507), was sequenced independently by David Bentley’s team at Illumina (published in Nature in 2008) and by Kevin McKernan’s group at Life Technologies on the SOLiD platform (published in Genome Research in 2009).

Although the two sequences shared some 77% of the total variants, Yandell and colleagues found that they differ at more than 1.1 million positions. (The Life Technologies and Illumina versions of the NA18507 genome had 575,099 and 526,836 unique positions, respectively.)
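The 1.1 million figure is simply the sum of the two platform-specific counts, and the comparison behind it amounts to set operations on variant calls. The Python sketch below is illustrative only: the function name compare_callsets, the Comparison type and the toy (chromosome, position, allele) tuples are assumptions made for this example, not the authors' pipeline.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical representation of a call: (chromosome, position, variant allele).
Variant = Tuple[str, int, str]


class Comparison(NamedTuple):
    shared: int   # calls present in both sets
    only_a: int   # calls unique to the first set
    only_b: int   # calls unique to the second set

    @property
    def discordant(self) -> int:
        """Positions where the two call sets disagree."""
        return self.only_a + self.only_b

    @property
    def shared_fraction(self) -> float:
        """Shared calls as a fraction of all distinct calls."""
        total = self.shared + self.discordant
        return self.shared / total if total else 0.0


def compare_callsets(a: Set[Variant], b: Set[Variant]) -> Comparison:
    """Count shared and platform-specific calls between two call sets."""
    return Comparison(len(a & b), len(a - b), len(b - a))


# Platform-specific counts reported in the article for NA18507.
life_tech_only = 575_099
illumina_only = 526_836
total_discordant = life_tech_only + illumina_only  # 1,101,935

# Toy call sets (made-up data) to show the comparison itself.
calls_a = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")}
calls_b = {("chr1", 100, "A"), ("chr2", 50, "G"), ("chr3", 75, "C")}
result = compare_callsets(calls_a, calls_b)
print(total_discordant, result, result.shared_fraction)
```

How "shared" is defined (same position only, or same position and same allele) changes the counts, which is one reason different analyses of the same data can report different levels of agreement.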

“Most people are quite shocked,” says Yandell. “But is it glass half full or half empty? From the standpoint of whole genomes consisting of 3 billion bases, there is actually very good congruence. If you’re trying to do population genetics, it’s pretty good to do platform cross comparisons.”

The view is less rosy from a diagnostics point of view, however. “Congruence is better within the coding regions of genes but it’s still a long way from perfect. We find 99% congruence within coding regions, but even then, if you’re trying to do diagnostics, taking into effect platform considerations is something that has to be done.”

Yandell stresses that sequence discrepancies are not simply a matter of which NGS platform is selected. “It’s also the variant calling procedures,” he says. “Depending upon which tool you use, you can see pretty big differences between even the same genome called with different tools—nearly as big as the two Life Tech/Illumina genomes.”

It also depends on the parameters used with the software tools, an issue that is not as broadly recognized in the NGS community as it should be, says Yandell. “There’s still a bit of black art in variant calling. It’s not so much the accuracy of the sequencing platforms, it’s also how you’re post-processing the data and calling the variants. Right now, there’s no right answer, but a lot of smart people are working very hard on this.”

On average, each personal genome contains between 20,000 and 25,000 single nucleotide variants in protein-coding genes compared to the reference genome. In the collaboration with the Omicia group, Yandell also found that focusing on the OMIM (Online Mendelian Inheritance in Man) collection of disease genes provides the same result as whole genome sequences in defining ethnicity with 80% certainty. “The magnitude of that signal struck us as interesting,” says Yandell. “There’s a long-term bias towards disease studies in particular ethnic groups.”

Another result was that the African genomes are typically homozygous for many more OMIM variants than the Caucasian genomes. “That’s probably due to what we might call background effects,” says Yandell. “You’ve got alleles that do you no harm as an African or African-American, but in a Caucasian or Asian background, they are legitimately disease predisposing.”

“That has implications for diagnostic medicine,” Yandell continues. “It can’t be ethnically blind. The right decision will depend upon the ethnicity of the individual. That’s a touchy subject in the field, because people get concerned when you mention ethnicity. There are already [some areas of medicine] that take ethnicity into account. We will likely have to do that in the diagnostics domain as well.”

File Formats

Following the development of a standardized file format called GVF for personal genome sequences, Yandell needed a trial set of personal genomes to use for software development, both for his own group and the broader community—the 10Gen set. (Those data are available from the Sequence Ontology website.)
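For readers who have not seen the format, GVF builds on the tab-delimited, nine-column GFF3 layout, with variant details carried in the final attributes column. The sketch below parses one such line in Python. It is a minimal illustration under that assumption: the GvfRecord type, the parse_gvf_line helper and the example line (including the source name "example_caller") are hypothetical, and real files also carry header pragmas and escaping rules that are not handled here.

```python
from typing import Dict, NamedTuple


class GvfRecord(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    attributes: Dict[str, str]


def parse_gvf_line(line: str) -> GvfRecord:
    """Parse one tab-delimited, GFF3-style feature line into a record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, feature_type, start, end, _score, _strand, _phase, attrs = fields

    # The ninth column holds semicolon-separated key=value pairs.
    attributes: Dict[str, str] = {}
    for pair in attrs.strip(";").split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key.strip()] = value.strip()

    return GvfRecord(seqid, source, feature_type, int(start), int(end), attributes)


# Made-up example line: a heterozygous single nucleotide variant.
example = (
    "chr1\texample_caller\tSNV\t12345\t12345\t.\t+\t.\t"
    "ID=var_0001;Variant_seq=A,G;Reference_seq=G"
)
record = parse_gvf_line(example)
print(record.feature_type, record.start, record.attributes["Variant_seq"])
```

Once every genome is in one agreed layout like this, comparisons across platforms and individuals reduce to ordinary record handling.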

Yandell’s next goal is to establish methods to automatically analyze newly resequenced genomes. A priority is to provide what he calls “clinical decision support”: relating individual DNA variants to known disease-causing variants. The goal here, primarily on the Omicia side of the collaboration, is to mine a personal genome sequence, identify all known alleles associated with ill health, and relate them to known variants in a form suited to rapid reporting.

Another focus is developing an ontology to classify disease genes for even broader clinical decision support. “The idea is you’re not just asking if someone has a nasty allele in the cystic fibrosis (CF) or BRCA1 gene, but looking at sets of genes, e.g. all genes in cardiovascular health or cancer. Does this individual have an especially unlucky combination of slightly deleterious alleles spread among several genes all involved in the same disease, which might give them a red light for cardiovascular health, even though there’s no one bad allele for that disease?”
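The gene-set idea can be made concrete with a toy calculation. In the Python sketch below, every name and number is hypothetical (the gene labels, the per-gene scores, both thresholds and the gene_set_burden function), and a plain sum is only one of many ways such a score might be combined. It is not the method Yandell or Omicia use.

```python
from typing import Dict, Iterable


def gene_set_burden(variant_scores: Dict[str, float], gene_set: Iterable[str]) -> float:
    """Sum per-gene deleteriousness scores over the genes in one set.

    Genes with no scored variant contribute zero.
    """
    return sum(variant_scores.get(gene, 0.0) for gene in gene_set)


# Hypothetical per-gene scores for one individual (0 = benign, 1 = severe).
scores = {"GENE_A": 0.25, "GENE_B": 0.5, "GENE_C": 0.25, "GENE_X": 0.75}

# Hypothetical set of genes involved in the same disease area.
cardio_set = ["GENE_A", "GENE_B", "GENE_C"]

PER_GENE_THRESHOLD = 0.6   # assumed cutoff for a single "bad allele"
SET_THRESHOLD = 0.9        # assumed cutoff for the combined burden

burden = gene_set_burden(scores, cardio_set)
worst_single = max(scores.get(gene, 0.0) for gene in cardio_set)

# No single gene in the set crosses the per-gene cutoff,
# yet the combined burden crosses the set-level cutoff.
flag_by_gene = worst_single >= PER_GENE_THRESHOLD
flag_by_set = burden >= SET_THRESHOLD
print(burden, flag_by_gene, flag_by_set)
```

In this made-up case a gene-by-gene check raises no flag, while the set-level check does, which is the situation Yandell describes.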

The flip side of this clinical decision support is what to do with ‘private’ variants: novel variants that look potentially problematic. “What does it mean when you sequence someone and they have a stop codon smack in the middle of a growth factor receptor?” says Yandell. “What do you do then? How do you know if you have a problem?”

That aspect of analyzing novel variants—of which every individual has hundreds—has prompted the Yandell lab in collaboration with Omicia to develop software called VAAST (Variant Annotation and Selection Tool). “It’s a tool to automatically identify damaged genes and disease-causing variants, even if they’re completely novel and never been seen before,” says Yandell, who thinks it could have a big impact. (A manuscript describing the software has been submitted, and the software will be made publicly available for academic use—and commercially through Omicia—once that paper is published.)

VAAST Potential

But Yandell has already demonstrated the potential of the VAAST tool, testing it on the same dataset used in a 2010 study identifying the Mendelian gene mutation for Miller syndrome. An earlier analysis of the genomes of the two affected siblings and their parents using a popular tool called SIFT, which predicts the phenotypic severity of amino acid changes, resulted in hundreds of variants flagged as highly deleterious, which then had to be sorted through by hand. VAAST, by comparison, identifies the disease-causing alleles automatically.

The Yandell group set out to come up with a probabilistic tool that considers not only the severity of the DNA variant but also frequency information. “If everyone in the case dataset is homozygous for a stop codon in some particular gene, but 75% of humans are homozygous for that allele, you can say this is unlikely to be deleterious,” says Yandell. “Those probabilistic arguments are what things like SIFT don’t do… We wanted to develop a tool that would deal with all those frequencies in a truly probabilistic fashion, so you could identify disease-causing genes with greater accuracy.”
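Yandell’s stop codon example can be turned into a toy calculation. The Python sketch below weights a severity score by the log ratio of how common an allele is in cases versus the background population. It is emphatically not VAAST’s statistic, which had not been published at the time of writing: the function name, the 0-to-1 severity scale and the frequency floor are all assumptions made for illustration.

```python
import math


def frequency_aware_score(
    severity: float,
    case_freq: float,
    background_freq: float,
    floor: float = 1e-4,
) -> float:
    """Toy score: severity weighted by log(case frequency / background frequency).

    `floor` (an assumed value) keeps the ratio finite when a frequency is zero.
    """
    case = max(case_freq, floor)
    background = max(background_freq, floor)
    return severity * math.log(case / background)


# A stop codon (severity assumed to be 1.0) carried by every case.
# If 75% of the background population carries it too, the score stays low.
common_stop = frequency_aware_score(severity=1.0, case_freq=1.0, background_freq=0.75)

# The same variant, but seen in only 0.1% of the background population.
rare_stop = frequency_aware_score(severity=1.0, case_freq=1.0, background_freq=0.001)

print(round(common_stop, 3), round(rare_stop, 3))
```

A severity-only predictor would rank both variants identically. Adding the frequency term is what separates the common, probably harmless allele from the rare one.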

Importantly, says Yandell, it’s fast. “You can process the genome in just a few minutes, which really cuts down on the cost of analysis,” he says. Yandell says he also has unpublished data in which VAAST identified a mystery X-linked gene mutation in a large Utah family in a matter of 15 minutes.

“I think this is huge,” says Yandell. “I was skeptical at first. I wasn’t like a member of ‘the personal genomes cult,’ if you will. I just started playing with these data, and wow: there really are prognostic and diagnostic answers to be found in them. Now I’m truly a believer.”

*Moore B. et al. “Global analysis of disease-related DNA sequence variation in 10 healthy individuals: Implications for whole genome-based clinical diagnostics.” Genetics in Medicine 13, 210-7 (2011).