Dovetail Genomics Launches Genome Assembly Service with Twist on Hi-C Method
By Aaron Krol
October 20, 2015 | This morning, Dovetail Genomics of Santa Cruz, Calif., launched its first service, performing de novo whole genome assemblies on demand. The company, founded by paleogenomics specialist Richard Edward Green of the University of California, Santa Cruz, has created a new way to capture long-range information on the arrangement of DNA in the genome. The technology makes it possible to build nearly complete genomes from scratch, while still taking advantage of ultra-cheap short-read sequencers like those produced by Illumina, whose DNA data is normally too fragmented to resolve the large-scale structure of chromosomes.
“The ultimate goal is to reconstruct the complete, accurate genome sequence of any organism,” Green tells Bio-IT World. Dovetail’s technology also has potential uses in haplotyping ― distinguishing between DNA inherited from mother and father ― and for studying large duplications, deletions, and rearrangements in the genome.
Green created Dovetail in 2013, based on an idea he had while teaching a method called Hi-C in one of his classes at UCSC. As chromosomes are folded tightly inside the cell nucleus, very distant regions of the genome may come into contact. Hi-C is a method for studying this 3D arrangement of the genome. It involves treating the dense layer of chromatin around the chromosome so that sequences of DNA pressed together in the cell form bonds with each other, providing a glimpse of where folds occur.
Trying to test his students’ understanding of Hi-C, Green started asking questions about what kinds of data the method might produce in extreme situations ― like during mitosis, when chromosomes are scrunched especially tight. He quickly realized that Hi-C was accidentally catching a lot of data that would be useful in genome assembly. Many of the bonds formed during Hi-C are not between distant pieces of DNA brought together by folding, but between relatively close sequences, on the order of several thousand bases apart. That’s too short to help study the 3D architecture of the genome, but it’s a much greater distance than reads on an Illumina sequencer can span.
“The idea of using this data, sort of off-label, for genome assembly was something that several groups had independently,” Green says. While Green was creating a prototype piece of software to use Hi-C data in genome assembly, others like Jay Shendure and Jan Korbel were writing papers to demonstrate the technique in the scientific literature.
But Green’s idea was different. In a method he named Chicago (or “Cell-free Hi-C for Assembly and Genome Organization”), Green extracted DNA from cells and fragmented it into long pieces, roughly 150 kilobases. Then he added a synthetic layer of chromatin and performed Hi-C on these fragments, which no longer contained the elaborate loops of whole chromosomes folded in the cell.
The result is a series of conjoined sequences from defined 150-kilobase regions of the genome. This can be combined with an entire Illumina shotgun sequence, which splits the genome into many small contigs without enough information to orient or merge them. When those contigs overlap with Chicago reads, new information is gained that can be used to scaffold them together.
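To make the scaffolding idea concrete, here is a minimal Python sketch of how linked read pairs might be used to group contigs. It is an illustration only, not Dovetail's software: the function names, the `min_links` threshold, and the input format (each read pair reduced to the two contig names its ends map to) are all assumptions. It also ignores the ordering and orientation of contigs, which a real scaffolder has to resolve.

```python
from collections import Counter


def link_counts(read_pairs):
    """Count read pairs whose two ends map to different contigs."""
    counts = Counter()
    for a, b in read_pairs:
        if a != b:  # pairs within one contig carry no joining information
            counts[tuple(sorted((a, b)))] += 1
    return counts


def group_contigs(read_pairs, min_links=2):
    """Group contigs connected by at least `min_links` read pairs.

    Hypothetical simplification: uses union-find to cluster contigs and
    does not attempt to order or orient them within a scaffold.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Register every contig, including ones with no usable links.
    for a, b in read_pairs:
        find(a)
        find(b)

    # Merge contigs that share enough supporting read pairs.
    for (a, b), n in link_counts(read_pairs).items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for contig in list(parent):
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(sorted(g) for g in groups.values())


pairs = [
    ("c1", "c2"), ("c1", "c2"),
    ("c2", "c3"), ("c2", "c3"),
    ("c3", "c4"),               # a single link, below the threshold
    ("c5", "c5"),               # both ends in the same contig
]
print(group_contigs(pairs))
```

Requiring more than one supporting pair before joining two contigs is a simple guard against spurious links; the threshold used here is arbitrary.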
“We spent more or less the last two and a half years perfecting this, and writing the software necessary to use this unique data type to do assembly,” says Green. “For the vast majority of genomes that we have encountered, getting connectivity information out at 150kb provides a huge win for assembly.”
Early Results
In-house, Dovetail has produced new assemblies of several genomes, including human, chimpanzee, and alligator. (A pre-peer-review article published to arXiv describes the alligator assembly.) It’s hard to know exactly how accurate these assemblies are, because there are few existing highly contiguous genomes to compare against. “We take advantage of any kind of legacy data set that can be made available to us,” says Green.
The best measure of Dovetail’s accuracy is the company’s assembly of NA12878, a commonly used human cell line whose genome has been sequenced to a gold standard. Dovetail scaffolded this genome to a reported N50 (a common measure of genome contiguity) of 13 megabases, which would be among the highest figures ever achieved. Dovetail executives said agreement with past NA12878 assemblies was in the “high 90 percents.”
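For readers unfamiliar with N50, the short Python sketch below shows the standard definition: sort the scaffolds from longest to shortest, and N50 is the length of the scaffold at which the running total first reaches half of the assembly's total length. The function name and the toy lengths are illustrative and are not drawn from Dovetail's data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings us to half the total.
        if 2 * running >= total:
            return length


# Toy assembly: total length 300, so the halfway point is 150.
# 80 + 70 = 150, so the N50 is 70.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```

Because N50 only measures contiguity, a high value says nothing by itself about whether the scaffolds were joined correctly, which is why agreement with earlier NA12878 assemblies is reported alongside it.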
Meanwhile, early access customers have been enrolled in a beta program for the past six months, resulting in more than 50 completed projects with both academic and commercial users. One customer is Axel Meyer of the University of Konstanz, whose lab studies speciation and the changes to genomes as species diverge and evolve.
“We asked Dovetail to provide us with a new genome assembly of Midas cichlid fish from Nicaragua,” Meyer told Bio-IT World. These fish, which have radiated into several species in parallel in isolated crater lakes, are a useful model for rapid speciation. Meyer’s lab had previously used methods like long-read PacBio sequencing to generate whole genome assemblies of these cichlids, which help to trace large duplications and rearrangements in the genome that have occurred as species diverged.
PacBio sequencing is expensive, however, and while its reads in the tens of kilobases are much more cohesive than Illumina’s and provide a basis for de novo assembly, they cannot routinely span the 150-kilobase distances achieved by Dovetail.
“Dovetail [provided] a huge improvement in our N50 lengths and all other measures of genome quality,” says Meyer. “We had been working on this for years before, and now got a better genome within weeks at a fraction of the prior costs.”
Dovetail, like alternative methods, still falls short of delivering absolutely complete genomes. For Meyer’s lab, the company has provided cichlid genomes in a few thousand scaffolds; to achieve greater contiguity requires combining these assemblies with other approaches like linkage mapping. Nonetheless, says Meyer, “this will be a great advance and a boost to many labs.”
“Every Base Pair in the Genome”
While long-range genomic data has been somewhat neglected in the era of short-read sequencing, there is growing demand for more cohesive genomes and better resolution on large structural variants. PacBio’s newest sequencer, the Sequel System, could bring long-read sequencing to a much greater number of labs, and there is a small but expanding field of companies like Dovetail trying to supplement short-read data with new chemistry and computational approaches.
Earlier this year, for instance, 10X Genomics released an instrument that labels DNA in long fragments before preparing it for sequencing, creating a signal that can be used to computationally reconstruct large scaffolds of continuous DNA. Like Dovetail, 10X has conceived its business as an adjunct to Illumina sequencing; Illumina executives have told Bio-IT World that they welcome these bolt-on services.
“Obviously, they’re the market leader, and the accuracy and quality of their data is at the top of the pile, but they recognize that they’re missing the long-range information,” says Todd Dickinson, who until 2011 served as Illumina’s Director of Product Development. Today, Dickinson is CEO of Dovetail.
Dickinson predicts that his new company will have an advantage over competitors because it presents no upfront capital costs. Dovetail is launching with a pure service model, creating and sequencing Chicago libraries in its own laboratory space in Santa Cruz. “For many people who don’t have the capital equipment budgets for PacBio [or] 10X, this is a really nice option,” says Dickinson. “They can come to us with either nothing, or a shotgun assembly, and we can take it from there.”
He adds that Dovetail does plan to eventually sell kits for use in customers’ own labs. Chicago libraries can be created with standard lab equipment and sequenced like any other DNA libraries, so even this option won’t involve purchasing an instrument like 10X’s GemCode Platform.
Still, Dovetail is not unambiguously the cheapest option for de novo assembly. The all-in estimate for an assembly can run as high as $40,000 for complex genomes like those of plants, says Dickinson, including sequencing and all the computational work of scaffolding. That’s comparable to the cost of a high-coverage human genome on a long-read sequencer like the PacBio RS II, making it an expensive choice for the kind of small lab that would shy away from buying its own instruments. On the other hand, Dovetail offers an a la carte service model, and prices can come down dramatically for customers who make their own arrangements for sequencing: Dovetail charges around $10,000 or less to produce Chicago data and use that data to scaffold a customer’s own shotgun sequence.
Even if it has trouble getting early traction, Dovetail will be an interesting addition to the market. In the difficult fields of structural variation and genome assembly, any new types of data are welcome ― and the early data from Dovetail suggests it may be achieving better contiguity than its competitors.
The technology’s twist on Hi-C also opens up new paths for development. “Genome assembly is just the first step on our roadmap,” says Dickinson. “Our teams are already diving into some other very exciting applications.” Among them is metagenomics, sorting out the messy data that comes from sequencing mixed samples of microbial DNA, as in environmental studies or research on the human microbiome. In a research capacity, multiple groups have reported using Hi-C to separate out genomes of unknown species of bacteria and fungi. A dedicated workflow for this approach would be extremely welcome.
For Green, it’s exciting to have so much competition in whole genome assembly, something he might not have expected when he founded Dovetail two and a half years ago. For a long time, the allure of cheap data led most geneticists away from the large-scale structure of the genome, a blind spot that he believes is finally getting much-deserved attention.
“We want full chromosomes, every base pair in the genome,” Green says. “Beyond that, it kind of doesn’t matter how we get there... Who will get to the finish line first is hard to say.”