Clone-Free, Single-Molecule Genome Assembly Illuminates Structural Variation

By Aaron Krol

June 30, 2015 | A large team of scientists has published one of the most detailed explorations to date of complex structural variation in a human genome. The work, centered at the Icahn Institute of Mount Sinai in New York and including contributions from researchers at several other academic and commercial groups, appeared this week in Nature Methods. It provides the most complete and accurate whole human genome ever sequenced directly from a DNA sample, without first cloning the DNA into bacteria.

Two technologies provided the genomic information for this project: Pacific Biosciences’ long-read SMRT sequencers, and the Irys instrument developed by BioNano of San Diego. PacBio users have previously demonstrated that SMRT sequencing can be used to assemble a whole human genome without guidance by a reference genome, thanks to long read lengths that cross structurally complex regions normally too difficult to resolve by shotgun sequencing. By adding data from the Irys, however, the Icahn Institute was able to assemble a genome in far fewer and larger pieces than in previous efforts. The Irys does not read DNA at single-base resolution, but by tagging DNA molecules with fluorescent labels, it provides information on the arrangement of large structural elements, showing where regions of the genome have been expanded, duplicated, moved, or inverted. (See, “Bio-Nano Teases Out the Genome’s Structural Quirks.”)

Both of these technologies produce unusually long-range information on the genome. Single reads from a PacBio sequencer can span 10,000 bases or more, while the Irys creates optical maps of regions hundreds of thousands of bases long. Traditionally, keeping such long stretches of the genome intact has involved cloning DNA into libraries of fosmids or BACs, two types of cloning vectors that can be maintained in bacteria for later sequencing. However, that process is expensive and time-consuming, and it can introduce biases.

“There wasn’t really a public solution for doing this type of analysis,” says Ali Bashir, a member of the Icahn Institute and the senior author of the Nature Methods paper. While Bashir’s team used public assembly algorithms to process its PacBio data, and BioNano’s algorithms to create Irys genome maps, merging the two types of data required custom scripts, made available as supplementary materials with the publication.

The combined data yielded far longer contiguous DNA sequences than either technology could produce alone. The PacBio data split the genome into over 20,000 contigs with an N50 length of 900 kilobases, and the Irys optical maps were roughly five times as long. Combining these fragments, however, produced a whole genome in just over 200 scaffolds, with an N50 length approaching 30 megabases and the longest scaffolds reaching 80 megabases. Those figures mark the genome produced with these technologies as one of the most contiguous ever assembled.
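For readers unfamiliar with the N50 statistic quoted above, the sketch below shows one common way to compute it: sort the pieces from longest to shortest and report the length of the piece at which the running total first reaches half of the assembly. The function name `n50` and the toy lengths are illustrative assumptions, not values or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that pieces of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy lengths in arbitrary units (hypothetical, not data from the study).
contigs = [80, 70, 50, 40, 30, 20, 10]
# The same total sequence joined into fewer, longer scaffolds.
scaffolds = [150, 90, 60]

print("contig N50:", n50(contigs))
print("scaffold N50:", n50(scaffolds))
```

Because N50 depends only on how the total length is divided up, joining the same sequence into fewer and longer pieces raises it, which is why scaffolding PacBio contigs against optical maps can lift the figure so sharply.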

Because both PacBio and BioNano chemistries work directly with native DNA ― rather than DNA copies produced by polymerase chain reaction ― the new genome assembly was also constructed entirely through single-molecule analysis.

“Beautiful Substructures”

One of the main aims of this project was to get new information on the most structurally complex regions of the genome, where long stretches of tandem repeats, or bizarre combinations of structural events, make it very difficult for most methods of DNA analysis to make sense of the sequence. “Structural variations tend to be buried within very complex regions,” says Erik Holmlin, CEO of BioNano, which generated the project’s Irys data in-house. “Scientists have just been trained to stay away, like it’s a bad neighborhood.”

The cell line sequenced for this project, NA12878, has perhaps the world’s best-described human genome; among other things, it provides the standard-setting reference sequence maintained by the Genome in a Bottle Consortium. (See, “Genome in a Bottle Uncapped.”) Nonetheless, Bashir and his colleagues discovered several structural events that had never before been captured, some of which bridged gaps left in the human reference genome, a global resource curated by the Genome Reference Consortium. They also determined that the human reference genome systematically undercounts the expansion of short tandem repeats.

“What we’re seeing now is there’s this beautiful substructure to expansions,” says Bashir. “You can imagine a single SNP happens in one of those tandem repeat periods, and then that gets expanded and you get these substructures within.” His team’s new genome assembly, with its detailed sequence in these tandem repeat regions, shows this kind of substructure directly, as a repeat sequence suddenly gives way to a slight variation, then returns, sometimes after hundreds of repetitions, to the original motif.
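The kind of substructure Bashir describes can be illustrated with a small sketch that cuts a tandem repeat into fixed-size units and collapses consecutive identical units into runs. This is only a toy under strong assumptions: the repeat period is known in advance, the sequence starts exactly on a unit boundary, and the example sequence is invented rather than taken from the assembly. The function name `repeat_runs` is hypothetical.

```python
from itertools import groupby


def repeat_runs(sequence, period):
    """Split `sequence` into units of `period` bases and return a list of
    (unit, count) pairs for each run of consecutive identical units.

    Any trailing bases that do not fill a whole unit are ignored.
    """
    units = [
        sequence[i:i + period]
        for i in range(0, len(sequence) - period + 1, period)
    ]
    return [(unit, len(list(group))) for unit, group in groupby(units)]


# Invented example: a motif, a run carrying a single-base change, then
# a return to the original motif.
toy = "CAG" * 5 + "CAA" * 3 + "CAG" * 4

for unit, count in repeat_runs(toy, period=3):
    print(unit, "x", count)
```

On real data the period and phase of a repeat would have to be inferred rather than assumed, and sequencing errors would need to be separated from genuine variant units.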

The new assembly also reveals, in exquisite detail, that many structural events are born from several compounded mutations. In particular, more than half of the inversions in the assembly co-occur with at least one other event, such as an insertion, deletion, or duplication, creating elaborate remixes of long DNA structures.

“Walking in the Dark”

All of these results highlight just how much genomic variation is missed when working exclusively with short-read sequencing technologies. As Bashir observes, that almost certainly has large implications for our understanding of the genome’s function.

“The chance that a structural variant has a functional impact is so much higher than with a SNP,” he says, yet most geneticists can at best only make predictions about what structural variants are likely to exist in the human genome. “A lot of the literature has been focused on inferring variation, by looking at break point signals or read patterns. What we hope moving forward is that, when you can do a deeper dive, you’ll be directly observing what your genome looks like.”

In the time since this project began roughly two years ago, PacBio has improved its read lengths, pricing, and throughput, and BioNano has upgraded the Irys to work with whole human genomes at once. All of these trends will make similar assemblies much easier for future groups. Bashir predicts that a mid-size sequencing center using the computational pipelines developed at the Icahn Institute could repeat this work with a new genome in just a few months, something he is already working on himself.

“New genome projects, we hope, are going to have far less of an activation energy,” he says. “1000 Genomes, Genome in a Bottle ― those groups are going to start using this.” Bashir also hopes that groups working in specific disease areas where structural variants play a large role, including cancer and genetic disorders like Huntington’s, will start performing this kind of assembly on key regions. Even a few well-chosen assemblies of this quality ― to provide specialized reference genomes in different ethnic groups, or to explore complex cancer mutations ― could have a big impact on research.

Nevertheless, highly contiguous genome assemblies like this one remain difficult and expensive to produce, compared to the short-read, reference-guided solutions that dominate genomics. It’s also very difficult to confirm new discoveries in these kinds of projects, because few if any parallel methods of reading the genome can give the same resolution on complex structural events.

“We’re walking around in the dark with a flashlight, and we’re seeing things that look really interesting,” says Holmlin. “What we’re doing with collaborators at the Icahn, and folks throughout the community, is bringing other people with flashlights to continue to investigate these findings.”