Sequencing the Strange Communities: Taking on Metagenomics
June 24, 2014
Earlier this month, Bio-IT World spoke with Maitreya Dunham, part of a team at the University of Washington that described a new method for understanding the variety of species that live in communities of microbes. Dunham published a paper on the technique in G3 this May. However, we neglected to fully credit another group at UC Davis that had hit upon an almost identical technique, and whose own paper in PeerJ appeared as a preprint in February. Two members of the UC Davis group — Jonathan Eisen and lead author Chris Beitel — were gracious enough to agree to a follow-up interview. — The Editors
By Aaron Krol
Metagenomes offer a reasonable approximation of what types of organisms can be found in a given environment. But without knowing how the DNA from a sample is distributed between different species, it’s impossible to tell how these microbes interact with each other, and what unique role each one plays in its microscopic ecosystem.
“Communities are made up of organisms that interact,” says Jonathan Eisen, a professor at the School of Medicine and the College of Biological Sciences at the University of California, Davis. “They’re not made up of short reads of Illumina sequences. And we need to stitch those together into organisms in order to make useful predictions, and interpretation of experimental data.”
Eisen’s lab, along with a team at the University of Washington profiled in May, hit upon a technique to take a metagenome — a patchwork of DNA fragments created by sequencing a mixed microbial sample — and split it into the separate species involved.
The method repurposes a process called Hi-C, which links together DNA from different regions of the genome, then carves out the linked fragments for sequencing. Hi-C was first used to infer the 3-D structure of chromosomes by showing which bits of DNA could be caught close together. However, Hi-C can also be used to demonstrate that two DNA fragments came from the same cell — a hugely valuable piece of information when you’re trying to figure out which pieces of a metagenome belong to the same species.
Strange Communities
“Microbial communities are of exponentially growing interest, in a lot of different environments,” says Eisen. “Whether it’s the human microbiome, or communities associated with various plants, or free-living communities out in the environment, there’s a massive increase in interest.” Eisen’s lab has been at the crest of this trend, trying to understand how new traits originate in bacterial communities, and how these traits are shared between distantly related groups. His team started working with metagenomes almost fifteen years ago, a time when gene sequencing was far more difficult and costly, but lately he’s seen a big uptick in the use of these community-wide genetic maps.
“That’s because we’re finally able to get some data about many of these communities,” he says. “It’s mostly coming from high-throughput sequencing, and it’s been revolutionary in that we’re getting amazing first-pass samples. Unfortunately, most of the studies that have been done so far have not gotten to the promise of metagenomics.”
It was Chris Beitel, a PhD student in Eisen’s lab, who suggested that Hi-C might provide the necessary signal to connect different reads from the same organisms. He had been working with Lutz Froenicke, a post-doc in a different UC Davis lab, who had realized that Hi-C linkage data could be useful in putting together individual genomes. “After that insight, we started using Hi-C for assembly, Hi-C for haplotype phasing, and now Hi-C for clustering metagenomes, which is going to lead to a whole bunch of other cool projects,” says Beitel.
While the University of Washington team wrote a new program, MetaPhase, to split up a metagenome into species clusters using Hi-C linkages, Beitel and his colleagues tinkered with an existing tool. They used the Markov Cluster Algorithm (MCL), a graph-clustering method widely used in bioinformatics, and made a few customizations.
“It did fantastically well on this problem,” says Beitel. “I’m still trying to wrap my head around exactly why it does so well.” In a sample with four different bacterial species, the method was easily able to cluster sequencing reads from each species separately, with well over 99% of Hi-C links correctly placing two reads from the same species together, rather than falsely connecting reads from different species.
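For readers curious about what Markov clustering does with linkage data, here is a minimal sketch of the generic, textbook MCL loop in Python. It is not the UC Davis group's customized version, which this article does not detail. The function names, the toy link counts in TOY_LINKS, and the inflation value of 2.0 are all assumptions made for illustration.

```python
# Minimal, generic Markov clustering (MCL) on a toy Hi-C link graph.
# Illustrative only: names, weights and parameters are hypothetical.

def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (a column-stochastic matrix)."""
    n = len(matrix)
    out = [row[:] for row in matrix]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = matrix[i][j] / total if total else 0.0
    return out


def multiply(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(weights, inflation=2.0, max_iterations=50, tol=1e-9):
    """Cluster the nodes of a symmetric weighted graph with basic MCL."""
    n = len(weights)
    # Add self-loops, then turn link counts into flow probabilities.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iterations):
        expanded = multiply(m, m)                      # expansion: let flow spread
        inflated = normalize_columns(                  # inflation: favor strong flow
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows that still hold flow belong to "attractor" nodes; the columns
    # with flow into an attractor row form one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical link counts among six DNA fragments: fragments 0-2 share many
# Hi-C links, as do fragments 3-5, and one spurious link joins 2 and 3.
TOY_LINKS = [
    [0, 20, 20, 0, 0, 0],
    [20, 0, 20, 0, 0, 0],
    [20, 20, 0, 1, 0, 0],
    [0, 0, 1, 0, 20, 20],
    [0, 0, 0, 20, 0, 20],
    [0, 0, 0, 20, 20, 0],
]

print(markov_cluster(TOY_LINKS))
```

The two alternating steps explain the behavior: expansion spreads flow along links, while inflation rewards heavily linked neighbors and starves weak connections, so a lone spurious link between two dense groups tends to fade away.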
Although the UC Davis group’s test sample was much smaller than the one used by the University of Washington team — which included 18 different species, representing bacteria, yeasts, and one species of archaea — their analysis did shine some extra light on the Hi-C method’s capabilities. The group deliberately mixed in organisms they thought would be challenging to cluster, including two different strains of E. coli, a bacterium with two chromosomes (Burkholderia thailandensis), and a bacterium with two plasmids (Lactobacillus brevis). Even when dealing with the two strains of E. coli, the clustering tools successfully connected reads from the same strain 96% of the time. The University of Washington team, meanwhile, failed to distinguish between two strains of the same yeast species. (It’s worth noting that with bacteria, more genetically distinct organisms can be considered two strains of the same species, whereas eukaryotes like yeast are more likely to be split into separate species if their genomes vary too noticeably.)
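The percentages quoted above are fractions of Hi-C links that join two reads from the same organism. As a rough sketch of how such a figure could be computed when reference genomes are available, the following Python snippet counts same-species links in a tiny invented data set. The function name, read identifiers, and species labels are hypothetical and are not taken from either group's software.

```python
# Fraction of Hi-C links whose two reads map to the same species.
# Illustrative only: identifiers and labels are invented for the example.

def same_species_link_fraction(links, species_of):
    """links: iterable of (read_a, read_b) pairs.
    species_of: dict mapping a read identifier to its species label."""
    total = 0
    same = 0
    for read_a, read_b in links:
        # Skip links where either read could not be assigned to a reference.
        if read_a not in species_of or read_b not in species_of:
            continue
        total += 1
        if species_of[read_a] == species_of[read_b]:
            same += 1
    return same / total if total else float("nan")


# Hypothetical assignments of reads to reference genomes.
toy_species = {
    "r1": "E. coli",
    "r2": "E. coli",
    "r3": "L. brevis",
    "r4": "L. brevis",
    "r5": "B. thailandensis",
}
# Four links, three of which stay within one species.
toy_links = [("r1", "r2"), ("r3", "r4"), ("r2", "r1"), ("r4", "r5")]

print(same_species_link_fraction(toy_links, toy_species))  # 0.75
```

A check like this depends on knowing which species each read came from, which is only possible with good reference genomes.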
The method was also extremely successful in joining chromosomes and plasmids from the same organisms together, which Beitel says came as a surprise. Over the long term, this may be the most crucial result of the experiment. DNA fragments from the same chromosome can theoretically be stitched together with deep enough sequencing; Eisen expects long-read sequencing technologies like PacBio and Oxford Nanopore to eventually make Hi-C obsolete for this purpose. Revealing which chromosomes and plasmids come from the same species, however, is a different story.
“The aspect of co-localizing DNA within a cell, like plasmid and chromosome, or different copies of a chromosome, does not ever come from long-read technology,” says Eisen. “My guess is that adaptations of Hi-C may be, in the short term, useful for linking together pieces of DNA in the same chromosome. But in the long run, it’s going to be more useful for co-localization of different pieces of DNA in the same cell.”
New Promises, New Challenges
Both the UC Davis group’s refinements to MCL and the University of Washington’s MetaPhase have been made freely available on GitHub for other researchers to use on their own microbial samples. Both teams’ papers were also published in open access journals. By choosing PeerJ, Beitel and his colleagues made their preprint, their peer review process, and huge amounts of their raw data open for viewing online. All of this should help encourage replication attempts, and new refinements to the process, that will prepare the Hi-C method for eventual use with real environmental samples.
“The more open the publication system can be the better, in my view,” says Eisen. “And I’m growing to like more and more the concept of open peer review, and the concept of publishing your preprint as soon as you submit it. PeerJ is one of the places that are leading the charge.” (Eisen has been a major advocate for open access, currently serving as chair of the PLOS Biology advisory board and as an academic editor at PeerJ. He is also a past winner of the Benjamin Franklin Award recognizing open science advocacy.)
Still, despite the efforts to make this research as open as possible, there are obstacles to its future development. One is that Hi-C is a notoriously troublesome, multi-step procedure. “The laboratory aspect of acquiring Hi-C data is exceptionally challenging,” Beitel acknowledges.
More fundamentally, there’s a lot of work to be done before this method will be trusted with natural samples, which will usually contain many more species, some of them at extremely low abundance.
“This method was not designed to solve this problem,” Eisen stresses. “I think there are many things you could do differently to do this well, but each of those ideas probably needs to be tested, both in a model system like we used, and then in a real-world system. I can easily imagine a three- to five-year project developing Hi-C as a metagenomic tool.”
And it’s not yet clear how one could prove that Hi-C is working correctly when it clusters a metagenome derived from a natural sample. Both groups that have performed this method so far have used species with good reference genomes, letting them check their answers against the best available data. In the field, this method would be assembling genomes of organisms no one has ever sequenced individually. On the one hand, this is a promising application for the method: it could give us our first glimpse at unknown microbes. On the other hand, the unfamiliar genomes it builds would have to be taken on faith.
“In my mind, that’s a big problem with Hi-C metagenomics,” says Beitel. “Once you infer this clustering solution in a community that doesn’t have reference genomes, how do you know that your answer is right? We don’t have a way to address that.”
Beitel is already busily thinking of other applications for the method, which could be adopted more quickly and trusted more readily. There are a few situations where biologists really do need to work with metagenomes of simple, synthetic communities. One of Beitel’s ideas is to use the Hi-C method on pools of BAC clones — a kind of living storage system for stretches of DNA — letting researchers keep different BAC clones together without losing track of where each read comes from.
He also suggests using Hi-C metagenomics in cancer studies, to find subpopulations of tumor cells that may harbor important mutations. “If you think about it, a tumor is a lot like a species that has, over time, diverged,” he says. “You have the original, unmutated version of someone’s genome, but then mutations accumulate, and you get multiple different clonal subpopulations.” With Hi-C, it might be possible to get a clearer picture of each separate lineage of tumor cells, rather than counting up all the mutations in a tumor together.
With luck, other groups will make a concerted effort to test the limits of Hi-C metagenomics. The method may one day illuminate the muddled communities that scientists are still struggling to put into context, but like all new techniques, it has to be approached with some caution.
“I think it’s incredibly promising,” Eisen concludes. “It’s very unique, and has all sorts of different potential uses to complement the toolkit that we have now. But put me down as a skeptic for now, as to what we do with the data once we get it from real communities.”