Get SMRT: Pacific Biosciences Unveils Software Suite with Commercial Launch
By Kevin Davies
April 29, 2011 | Third-generation sequencing company Pacific Biosciences (PacBio) began commercial shipment of its PacBio RS single-molecule sequencer this week. The instrument has been in beta testing at 11 institutions in North America and elsewhere for the past year. A notable success was the recent sequencing and identification of the cholera strain sweeping Haiti after the devastating 2010 earthquake.
In a briefing with Bio-IT World, PacBio staffers Kevin Corcoran, Jon Sorenson and Edwin Hauw previewed the new suite of software tools on the RS sequencer. The SMRT (single molecule/real time) Analysis software suite features web-based software, an analysis pipeline framework, and algorithms for sequence alignment and de novo assembly.
“We’re accelerating the development of software with the community,” says Kevin Corcoran. “A key feature of third-generation sequencing is that [the technology] doesn’t match up with what’s out there now. The key features of the PacBio system include fast time to result, high granularity, long read lengths, and new sequencing modes, including a circular mode and strobe sequencing.”
PacBio’s single-molecule sequencing system offers significantly longer read lengths (1,000 bases on average) than its second-generation sequencing rivals, and faster run times. That said, the total sequence throughput per run is currently lower than that of other commercial platforms. The single-read accuracy hovers in the 85-90% range.
A revelatory feature of the SMRT software portal is that it captures kinetic information – the time for each registered nucleotide to be captured and incorporated into the growing DNA strand. “This is the first time you can watch DNA polymerase in real time, so that kinetic information will provide additional applications that have never been enabled before,” says Corcoran.
The genome browser is called SMRT View. “This takes advantage of our longer reads and kinetic information,” says Sorenson. It includes strobe and consensus sequence modes, allowing the user to visualize and interact with secondary analysis sequence data. PacBio says SMRT View, with its interactive graphical representations of variants, quality values, and other metrics, is the first data visualization application that can display the kinetic and structural information unique to PacBio’s SMRT technology.
Sorenson demonstrated this by displaying sequence from a yeast genome: A color-coded “heat map” shows the time between successive pulses, a pattern that appears highly reproducible across multiple reads. Independent runs across a given stretch of sequence can be stacked on top of each other to compare the kinetics of incorporation.
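To make the idea of "time between successive pulses" concrete, here is a minimal Python sketch of how such gaps might be computed and averaged across stacked reads. It is an illustration only, not PacBio's implementation: the function names and the pulse timings are hypothetical, and it assumes each read has already been aligned so that position i in one read corresponds to position i in the other.

```python
from statistics import mean


def interpulse_durations(pulse_starts, pulse_ends):
    """Gap (in seconds) between the end of one pulse and the start of the next."""
    return [start - prev_end
            for prev_end, start in zip(pulse_ends, pulse_starts[1:])]


def mean_gap_per_position(stacked_reads):
    """Average the gap at each position across reads covering the same stretch."""
    return [mean(column) for column in zip(*stacked_reads)]


# Hypothetical pulse timings (seconds) for two reads over the same four bases.
read_a_starts = [0.0, 0.5, 1.5, 2.0]
read_a_ends = [0.25, 0.75, 1.75, 2.25]
read_b_starts = [0.0, 0.5, 2.0, 2.5]
read_b_ends = [0.25, 1.0, 2.25, 2.75]

gaps_a = interpulse_durations(read_a_starts, read_a_ends)
gaps_b = interpulse_durations(read_b_starts, read_b_ends)
stacked = mean_gap_per_position([gaps_a, gaps_b])
```

In this toy data both reads show a longer pause before the third base, which is the kind of position-specific, reproducible signal a heat map would make visible.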
“There is sequence context to the kinetics. Much of that is short range but it is reproducible,” says Sorenson, citing data presented in a recent PacBio paper in Nature Methods. In time, such information could in principle allow PacBio to increase sequence accuracy and to study epigenetic effects by detecting modified nucleotides.
Real Player
Sorenson explains that, because results can typically be generated in runs of less than one hour, PacBio needed to develop new scalable algorithms to interpret the data in real time.
The SMRT Portal is an open-source, browser-based application that supports standard sequence formats. That means that next-generation sequencing (NGS) data from other platforms, such as Illumina, Life Technologies, Ion Torrent, and Roche, can be integrated with PacBio data. Users can align reads to a reference sequence or assemble reads de novo.
PacBio exports sequence alignments in the SAM/BAM format. It also uses VCF, a variant-calling format adopted by the 1000 Genomes Project. SMRT Portal enables third-party software analysis and collaboration, facilitated by SMRT Pipe, a Python-based framework for secondary analysis functions.
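Because SAM is a tab-delimited text format with 11 mandatory fields per alignment line, third-party tools can read such exports with very little code. The Python sketch below parses one alignment line; it is a generic illustration of the SAM layout, not part of SMRT Pipe, and the record type, function names, and example line are all hypothetical.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """A subset of the 11 mandatory SAM fields, enough for this illustration."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence


def parse_sam_line(line):
    """Parse one non-header SAM alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        seq=fields[9],
    )


def is_unmapped(record):
    """Bit 0x4 of the SAM flag marks an unmapped read."""
    return bool(record.flag & 0x4)


# Hypothetical alignment line: an 8-base read mapped at position 101 of chr1.
example = "read_1\t0\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(example)
```

BAM is the compressed binary form of the same records, so in practice a dedicated library would be used to read it.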
“We embrace openness,” says Sorenson about PacBio’s decision to make the entire secondary analysis software open source. “We have a DevNet web site, a developer-based site for getting data and information. Our APIs are very modular in how we approach the system. We want to work with ISVs and academic collaborators to either plug in their own tools or, vice versa, to promote connectivity and cooperation.”
Among the algorithms on offer is BLASR (Basic Local Alignment with Successive Refinement), which conducts sequence alignments against the reference human genome. “It’s based on widely used strategies, but more a synthesis of several different strategies,” says Sorenson. “It can align more than several hundred megabases in an hour on a multi-core machine. It’s very fast.”
BLASR finds the highest scoring local alignment (or set of local alignments) between a read and the reference. The initial set of candidate alignments is found by querying a pre-computed index of the reference, and then refined until only high scoring alignments are retained.
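The index-then-refine strategy described above can be sketched in a few lines of Python. This is a toy illustration of the general idea, not BLASR itself: the article does not say what kind of index or refinement BLASR uses, so a simple k-mer hash table and an ungapped match/mismatch score stand in for both, and all names, parameters, and sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Pre-compute the positions of every k-mer in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_offsets(read, index, k):
    """Count k-mer hits per implied reference offset of the read's first base."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            votes[i - j] += 1
    return votes


def score_ungapped(read, reference, offset, match=1, mismatch=-1):
    """Refinement stand-in: score the read base by base at a fixed offset."""
    score = 0
    for j, base in enumerate(read):
        pos = offset + j
        in_range = 0 <= pos < len(reference)
        score += match if in_range and reference[pos] == base else mismatch
    return score


def align(read, reference, k=4, min_score=0):
    """Return (score, offset) pairs for candidates that survive refinement."""
    index = build_kmer_index(reference, k)
    votes = candidate_offsets(read, index, k)
    scored = [(score_ungapped(read, reference, off), off) for off in votes]
    return sorted((s, off) for s, off in scored if s >= min_score)[::-1]


# Hypothetical data: the read matches reference[5:17] with one substitution.
reference = "ACGTTGCATGCCGATAGGCTTACG"
read = "GCATGCAGATAG"
hits = align(read, reference)
```

Here the index proposes two offsets (5 and 1), and the scoring step discards the spurious one, leaving a single alignment at offset 5 with 11 matches and 1 mismatch. A real long-read aligner must also handle insertions and deletions, which this ungapped sketch ignores.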
ALLORA (A Long Read Assembler) is a de novo assembler, based on an open-source package called AMOS. “We used parts of AMOS and other parts customized to our particular read types,” says Sorenson. The EviCons tool is used for consensus calling, matching sequence calls to the reference and assessing whether a given call is an error or a polymorphism. Using conditional probabilities and a likelihood ratio test, the algorithm demarcates multiple sequence alignments into regions of certainty or uncertainty.
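A likelihood ratio test of the kind mentioned for consensus calling can be illustrated with a simplified Python sketch. It is not the EviCons algorithm, whose details the article does not give. The sketch assumes a single alignment column, independent reads, and a per-base error rate of 0.15 (taken from the 85-90% single-read accuracy quoted earlier); it also assumes every erroneous read shows the other allele. The threshold and function names are hypothetical.

```python
import math


def log_likelihood_ratio(n_alt, n_total, error_rate=0.15):
    """log P(reads | variant is real) minus log P(reads | reference is correct)."""
    n_ref = n_total - n_alt
    log_correct = math.log(1 - error_rate)
    log_wrong = math.log(error_rate)
    log_variant = n_alt * log_correct + n_ref * log_wrong
    log_reference = n_ref * log_correct + n_alt * log_wrong
    return log_variant - log_reference


def classify_column(n_alt, n_total, error_rate=0.15, threshold=3.0):
    """Label a column as a polymorphism, an error, or an uncertain region."""
    llr = log_likelihood_ratio(n_alt, n_total, error_rate)
    if llr >= threshold:
        return "polymorphism"
    if llr <= -threshold:
        return "error"
    return "uncertain"


# Hypothetical columns, each covered by 10 reads.
strong_variant = classify_column(n_alt=9, n_total=10)
likely_error = classify_column(n_alt=1, n_total=10)
ambiguous = classify_column(n_alt=5, n_total=10)
```

With nine of ten reads disagreeing with the reference the column is called a polymorphism, with one of ten it is called an error, and an even split stays in the uncertain region.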
Several notable NGS software providers have signed up to be ISV partners with PacBio, including CLCbio, Geospiza, Genologics, GenomeQuest, DNAStar, and BioTeam, not to mention Amazon Web Services.
Sorenson says the recently published cholera study, conducted with researchers at Harvard Medical School, showed the value of real-time resequencing in identifying strain-specific structural variations. But PacBio executives are also excited about the promise of combining their own long reads with more voluminous short-read data from the likes of Illumina and Life Technologies, as well as Roche/454. This hybrid approach has been used, says Sorenson, to close difficult gaps in some microbial assemblies. “We haven’t done whole genome human hybrids yet, but we’re moving towards that,” he says.
Shipping
The first commercial PacBio RS units are being shipped to sites including biotechnology companies, service providers, and government and academic organizations. The National Biodefense Analysis and Countermeasures Center (part of the Department of Homeland Security) will use its new instrument for characterizing microbial pathogens.
One of the early access sites, the Wellcome Trust Sanger Institute (UK), was recently upgraded to the commercial hardware specifications. Harold Swerdlow, director of sequencing technology at the Sanger, says he intends to use the instrument “to improve pathogen de novo assemblies, to increase the coverage of sequence information from organisms like the malaria parasite or Mycobacterium tuberculosis at the extremes of AT/GC representation, and in the future, to explore epigenetics via direct detection of methylated sites.”
Richard McCombie’s group at Cold Spring Harbor Laboratory is also taking delivery of a commercial system, and plans to use it to study disease-related structural variation in the human genome and for de novo sequencing of plant genomes.