The Return of Finished Genomes: Hybrid Sequencing Strategy Boosts Pacific Biosciences Accuracy, Assembly

By Kevin Davies

July 2, 2012 | Two new papers in the July issue of the journal Nature Biotechnology – one from Michael Schatz and colleagues at Cold Spring Harbor Laboratory, the other from a team at Pacific Biosciences – demonstrate the value of combining long but relatively error-prone reads from Pacific Biosciences’ “3rd-generation” instrument with the high-throughput, shorter “2nd-generation” reads from Illumina.

The hybrid error correction approach shows its potential in the assembly of a variety of microbial and more complex genomes, including that of the parrot, and could have long-term ramifications in comparative genome analyses, microbial genome analyses, and the study of structural variations.

It could also provide a timely scientific and commercial boost for PacBio’s single-molecule sequencing platform, which is facing stiff competition and striving to allay the perception that its technology lacks the accuracy of its more established rivals.

“Many people hear this 85-87% raw accuracy figure [for PacBio data] and say, ‘How can I do anything with that?’ The Cold Spring Harbor Lab paper shows you can do some pretty impressive stuff,” says PacBio’s Jonathan Bingham, product manager, informatics. “This error correction approach allows the return of the era of finished genomes.”

“High-error long reads can be efficiently assembled in combination with complementary short reads to produce assemblies not previously possible,” says Michael Schatz, a senior author on the CSHL paper.

When PacBio launched its single-molecule sequencing platform to great fanfare in 2010, the sky was the limit. The company’s founders and executives promised the “15-minute genome” by 2013, with super-long DNA read lengths and a dazzling interplay of advanced physics and nanotechnology.

Not everything went according to plan, however. As with Helicos, an earlier single-molecule sequencing platform, the market resisted forking out big bucks for a machine weighing close to a metric ton. More significantly, the stochastic nature of single-molecule sequencing meant that – by PacBio’s own admission over the past 18-24 months – the accuracy of any single DNA read hovered unsatisfactorily around the 85 percent mark.

On the business front, the company had to ride out an economic downturn, at one point shedding about one third of its workforce, while founding CEO Hugh Martin battled multiple myeloma. Martin was succeeded last year by sequencing veteran Michael Hunkapiller, who was tasked with lifting the company’s sagging stock price as the competitive threat of nanopore technologies loomed on the horizon.

But PacBio did show its stripes last year, identifying the bacterial strains behind the cholera outbreak in Haiti and the E. coli food poisoning crisis in Germany. And a recent software release paves the way for the PacBio platform to provide direct detection – and eventually identification – of epigenetic modifications in double-stranded DNA.

Hybrid Vigor

The idea of marrying PacBio’s long reads with the shorter, more abundant reads produced on an Illumina or other 2nd-gen platform has been mooted for some time. Explains Bingham:

“With next-gen sequencing platforms such as Illumina, SOLiD and 454, it was really cheap to get a draft sequence but very expensive to get a finished genome. With 2nd-gen systems, you can get a draft genome quickly and relatively inexpensively. But to close the genome, you have to go back in and do Sanger sequencing, PCR, and so on. It’s a very painstaking process. The cost compared to generating an Illumina draft [genome] was about 10:1. You might spend one lane of Illumina obtaining a rough draft of a bacterial genome, but to get a finished genome, you had to do $30,000-worth of Sanger sequencing.”

Although using slightly different approaches, the two new papers clearly demonstrate the benefit of sequencing the same template DNA on two platforms and marrying their respective strengths – the longer read-lengths of PacBio with the higher integral sequencing accuracy of the 2nd-gen Illumina platform.

The CSHL “hybrid error correction” approach was developed by Adam Phillippy and Sergey Koren (National Biodefense Analysis and Countermeasures Center, Maryland) along with CSHL colleagues including Schatz and Dick McCombie.

This mathematical “fix” – released as open-source code – produced “a finished genome for much less money than a Sanger finishing approach,” says Bingham. He adds that the ability to produce a composite assembly with accuracy around 99.9% suggests that high error rates associated with long reads need not be a barrier to genome assembly.

The CSHL PBcR (PacBio corrected Reads) strategy aligns the shorter Illumina reads against the longer PacBio reads, then trims and corrects each long read by taking the consensus of that alignment (typically above 99% accurate) before building a finished assembly. (About 60% of the PacBio reads are retained for the final assembly, although that figure should increase with newer chemistries.)
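
To make the consensus idea concrete, here is a minimal Python sketch, not the PBcR code itself. It assumes the short reads have already been placed on the long read at known offsets and that errors are substitutions only. Real PacBio errors are mostly insertions and deletions and need a true alignment step. The function name `correct_long_read`, the `min_coverage` parameter, and the toy sequences are all illustrative inventions.

```python
from collections import Counter

def correct_long_read(long_read, placed_short_reads, min_coverage=2):
    """Replace each sufficiently covered base with the short-read majority base.

    placed_short_reads is a list of (offset, sequence) pairs; the offsets are
    assumed to be known already (a real pipeline would compute them by alignment).
    """
    votes = [Counter() for _ in long_read]
    for offset, short in placed_short_reads:
        for i, base in enumerate(short):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    corrected = []
    for original, column in zip(long_read, votes):
        if sum(column.values()) >= min_coverage:
            corrected.append(column.most_common(1)[0][0])  # majority vote
        else:
            corrected.append(original)  # too little evidence: keep the long-read base
    return "".join(corrected)

# Toy data (hypothetical): an error-free reference, a long read with two
# substitution errors, and accurate 6-base short reads tiled across it.
truth = "ACGTACGTTAGC"
noisy_long_read = "ACGAACGTTCGC"  # wrong bases at index 3 and index 9
short_reads = [(i, truth[i:i + 6]) for i in range(7)]

corrected_read = correct_long_read(noisy_long_read, short_reads)
print(corrected_read == truth)
```

In this sketch, bases with too little short-read coverage are simply left alone; according to the description above, the real pipeline trims reads as well as correcting them, which is one reason not every PacBio read survives to the final assembly.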

Two general approaches have gained traction for de novo genome assembly: the overlap-layout-consensus (OLC) paradigm, in which a graph is constructed from overlapping sequencing reads, and the de Bruijn graph formulation, in which the graph is constructed from substrings of the reads. The authors found that the OLC assembly approach became more powerful as read lengths increased, whereas the de Bruijn method reached a plateau, and so built their approach around the former.
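
The difference between the two graph types can be seen in a toy Python sketch. This is an illustration under simplifying assumptions (error-free reads, exact suffix-prefix matches), not how Celera Assembler or any production assembler is implemented; the function names and the three example reads are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def overlap_graph(reads, min_overlap):
    """OLC-style graph: nodes are whole reads; an edge a -> b records the
    longest suffix of a (at least min_overlap long) that is a prefix of b."""
    edges = {}
    for a, b in permutations(reads, 2):
        for length in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
            if a.endswith(b[:length]):
                edges[(a, b)] = length
                break
    return edges

def de_bruijn_graph(reads, k):
    """de Bruijn-style graph: nodes are (k-1)-mers; every k-mer found in a
    read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Three overlapping reads drawn from the toy sequence ACGTACGGTT,
# in which the triplet ACG occurs twice.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]

olc_edges = overlap_graph(reads, min_overlap=3)
dbg = de_bruijn_graph(reads, k=4)

print(olc_edges)   # a simple chain of reads
print(dbg["ACG"])  # two successors: the repeat creates a branch
```

With whole reads as nodes, the overlap graph is an unambiguous chain. Chopping the same reads into 4-mers makes the repeated ACG a branch point, because a k-mer cannot see past a repeat longer than itself. That is one intuition for why longer reads help the OLC approach more.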

The best results were obtained using the Celera Assembler, developed by Eugene Myers and colleagues for the Human Genome Project, which was favored over ALLPATHS-LG and ALLORA. The resulting contigs had a median size at least double that obtainable with 2nd-gen sequencers alone, and in some cases a five-fold increase.

The average length of PacBio reads with the latest C2 chemistry (released earlier this year) is significantly higher than that of any other available platform, says Bingham. “On a typical run, you can get average read lengths of 3,000 bases on an exponential distribution. As you go to the tail, you get reads at 6,000, 8,000 bases. The longest reads obtained by the CSHL group were 10,000 bases.”
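
As a back-of-the-envelope check on those figures, the short Python sketch below computes how large the tail would be if read lengths followed an idealized exponential distribution with a mean of 3,000 bases. That is an assumption made for illustration; real runs only approximate this shape, and the function name is hypothetical.

```python
import math

MEAN_READ_LENGTH = 3000  # bases, the average Bingham quotes for C2 chemistry

def fraction_longer_than(length, mean_length=MEAN_READ_LENGTH):
    """Tail probability P(L > length) for an exponential distribution
    with the given mean: exp(-length / mean)."""
    return math.exp(-length / mean_length)

for threshold in (6000, 8000, 10000):
    share = fraction_longer_than(threshold)
    print(f"reads longer than {threshold:>6,} bases: {share:.1%}")
```

Under that idealized model, roughly 13.5% of reads would exceed 6,000 bases, about 7% would exceed 8,000, and under 4% would exceed 10,000, which fits Bingham’s description of the longest reads sitting far out in the tail.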

While the chief application of PacBio has largely been in the microbial world, the new paper also demonstrates success in assembling the parrot genome as part of the latest Assemblathon competition. Bingham says the hybrid approach provided a better assembly than any 2nd-gen system alone. The CSHL group also looked at the corn transcriptome – PacBio reads were long enough to span entire RNA sequences.

“The CSHL group did something really amazing – they took the Celera Assembler and were able to add on this error correction piece, so once you have it set up and configured, it’s easy to take your Illumina raw data and your PacBio long reads, put them into an automated pipeline, and get a high-quality finished genome at the end,” says Bingham.

Indeed, the CSHL authors predict that the hybrid approach should be able to deliver what Bingham calls “a perfectly structural genome – one contig/one chromosome.”

The PacBio in-house strategy described in the accompanying paper by chief science officer Eric Schadt and colleagues begins with an assembly based on the 2nd-gen data, and then uses the PacBio data in conjunction with that draft assembly. The method was tested in further analysis of the cholera genome.

The value of these new hybrid approaches extends beyond microbes. “This isn’t just for bacteria – it scales up to larger genomes,” says Bingham. For example, the new parrot genome assembly is “far superior to that of any previously sequenced bird genome,” Schatz says.

Bingham says the two groups have had friendly conversations about which is the better approach for doing hybrid de novo genome assembly. He concedes, however, that the CSHL approach will attract interest because it lets users avoid locking in mis-assemblies, “which could happen if you were to first assemble with Illumina and then add PacBio. In our approach, if the Illumina data had mis-assemblies and misjoined contigs etc., then we’d not be able to undo the error,” he says.

“The error correction step itself doesn’t require any black magic – it’s relatively straightforward,” says Bingham, although he notes that there are chimeric reads that need to be caught, which requires a little skill. The approach is not just limited to PacBio and Illumina data. For example, it could be applied to PacBio circular consensus reads and the longer reads.

Bingham is also excited about the potential to combine this approach with PacBio’s evolving software for epigenetic detection. “This combined with the base modification software provides something really compelling – the ability to get a finished genome with a complete methylome at the same time. Currently there’s no other way to do that,” he says.

“The PacBio long reads are emerging as the gold standard for finishing genomes, giving us something we’d given up.”

While agreeing that error correction of PacBio reads is important, UC Davis bioinformatician Ian Korf, who helps manage the Assemblathon competition for de novo genome assembly, argues the real game changer is kilobase-sized read lengths. “One of the better mammalian genomes was the dog, done years ago with 6-7x Sanger sequencing,” says Korf. “Those [reads] are only ~1kb and yet the assembly was very good. You could do 100x short reads and not get that good an assembly. Short read assembly is a hard problem. Once read lengths are 5 kb or more, assembly will become a completely different problem.”