PacBio Generates Accurate Long Read Sequences Through Circular Consensus Sequencing
By Benjamin Ross
February 25, 2019 | PacBio has developed a protocol based on single-molecule, circular consensus sequencing (CCS) to generate accurate long read sequences. The company hopes this approach will provide an alternative to the limited read lengths of short read sequencing and the limited read accuracy of long read sequencing.
The company’s research was recently posted on bioRxiv (DOI: https://doi.org/10.1101/519025).
The study was conducted in mid-2018, Aaron Wenger, a principal scientist in bioinformatics at PacBio and lead author of the study, told Bio-IT World. The initial thought was to develop a way to improve reads, but Wenger and his colleagues didn’t know exactly how they would integrate the disparate uses of long and short read sequencing.
“We initially had this idea where we’d just make the long reads accurate, and then they’ll be just like short reads, so that you can take the software people wrote for short reads and use it on these longer, accurate reads,” Wenger said. That was true to an extent, but not as much as Wenger expected.
“Although the error rates of the long, accurate reads and the short reads were similar, the types of errors that were remaining were different enough between the two datatypes that the software had to be aware of it,” said Wenger.
The errors in short reads tend to be substitutions, where a letter in the DNA is called wrong (an A mistaken for a T, for example), while the errors in long, accurate reads tend to be insertions or deletions, where a letter is added or missed entirely.
“The way people have viewed sequencing in the market today is that there’re short read sequencing instruments that give observations of a small segment of DNA but that are extremely accurate, and then there’s long read sequencing which gives you observations that are tens of thousands of base pairs, but there’s a mistake in one of every ten bases,” said Wenger. Wenger and his colleagues managed to produce reads that are both long and accurate.
This was accomplished by using CCS, a form of sequencing developed at PacBio over fifteen years ago that makes DNA topologically circular, meaning that researchers can sequence the DNA multiple times in order to create a consensus for what the correct sequence is.
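The consensus idea can be illustrated with a toy sketch. This is not PacBio's CCS algorithm; it assumes the repeated passes over the same molecule have already been aligned to equal length, with "-" marking a gap, and it takes a simple majority vote at each position. The function name `consensus` and the example passes are hypothetical.

```python
from collections import Counter


def consensus(aligned_passes):
    """Majority vote per column over pre-aligned passes ('-' marks a gap)."""
    if not aligned_passes:
        return ""
    length = len(aligned_passes[0])
    if any(len(p) != length for p in aligned_passes):
        raise ValueError("passes must be aligned to the same length")

    called = []
    for column in zip(*aligned_passes):
        # Pick the most frequent symbol observed at this position.
        base, _ = Counter(column).most_common(1)[0]
        # A winning gap means most passes saw no base here, so emit nothing.
        if base != "-":
            called.append(base)
    return "".join(called)


# Five made-up passes over the same circular molecule.
passes = [
    "ACGTTAGC",
    "ACGATAGC",  # substitution error in one pass
    "ACG-TAGC",  # deletion error in another pass
    "ACGTTAGC",
    "ACGTTAGC",
]
print(consensus(passes))
```

Because each pass makes its mistakes in different places, the errors are outvoted by the passes that read that position correctly.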
Until now the technique had not been used on long fragments because of limits on how much data it could process, Wenger said. “Traditionally, [CCS has] been limited to relatively shorter fragments of DNA because in order to observe something like a 15,000 base pair piece of DNA ten times like we do here means you have to read 150,000 bases of raw DNA.”
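The arithmetic in Wenger's example is simple to spell out. The sketch below assumes each pass covers the whole fragment and ignores any adapter sequence between passes; the function names are hypothetical.

```python
def raw_bases_needed(fragment_length, passes):
    """Raw bases that must be read to observe a fragment `passes` times."""
    return fragment_length * passes


def passes_possible(raw_read_length, fragment_length):
    """Complete passes over a fragment that fit in one raw read."""
    return raw_read_length // fragment_length


# The figures quoted in the article: a 15,000 bp fragment observed ten times.
print(raw_bases_needed(15_000, 10))
print(passes_possible(150_000, 15_000))
```

The same relationship shows why longer raw reads matter: for a fixed raw read length, a longer fragment leaves room for fewer passes.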
However, research advances in PacBio chemistry in 2018 provided very long reads, allowing the team to get multiple observations from relatively long pieces of DNA.
The results were long reads with 99.8% accuracy. PacBio applied the technology to reference genomes from the Genome in a Bottle Consortium, and Wenger said the new data pointed to an estimated two thousand or more correctable mistakes in the consortium’s standard.
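To put the accuracy figures in perspective, the sketch below compares the expected number of errors in a 15,000-base read at the two rates mentioned in the article: one mistake in every ten bases, and 99.8% accuracy. It assumes errors are spread uniformly along the read, which is a simplification, and the function name is hypothetical.

```python
def expected_errors(read_length, accuracy):
    """Expected number of erroneous bases, assuming a uniform per-base error rate."""
    return read_length * (1.0 - accuracy)


read_length = 15_000
raw_long_read = expected_errors(read_length, 0.90)   # one mistake in every ten bases
ccs_read = expected_errors(read_length, 0.998)       # accuracy reported in the study

print(f"raw long read: about {raw_long_read:.0f} errors")
print(f"consensus read: about {ccs_read:.0f} errors")
```

Under that assumption, the same read goes from roughly 1,500 expected errors to roughly 30.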
“The standard had been built using short read sequencing,” Wenger said. “It was a pleasant surprise how many mistakes were in the standard that we were able to correct with this new data type.”
Collaborative Effort
Shortly after collecting initial data, PacBio shared its results with Google in order to apply Google’s DeepVariant software to call variants from the long read data. The team had first run the data through the Broad Institute’s GATK software. Wenger says the results with GATK were good, but not as strong as they would be for short reads.
The Google software was able to adapt straightaway to the long, accurate read data, whereas GATK was hand-coded to work with short reads, said Wenger. “We were able to take [Google]’s machine learning approach and be able to look at our data and be able to realize whether certain errors were insertion and deletion errors or substitution errors.”
After Google handled the data, other institutes were brought in to analyze the results and use them to improve workflows, including Johns Hopkins, the National Human Genome Research Institute, and the Dana-Farber Cancer Institute.
Wenger says there’s plenty of work to be done with the data. Currently, PacBio is working to produce this data in an easier, less expensive way.
“There’s some algorithmic work that could be done to take advantage of the data and how accurate it is,” he said. “We now think you might be able to get even more accurate, complete assemblies if you design approaches going in knowing how accurate the reads can be.”