One Codex and Microbiology's Search Problem
By Aaron Krol
February 9, 2015 | Genomics, as anyone working in the field can tell you, has a data problem. There’s too much of it, in too many places, and it’s growing too fast for anyone to keep up. Thanks to next-generation sequencing technologies, scientists who ten years ago were wondering what to do with the first human genome are now wondering what to do with tens of thousands of them. Many of their most basic software tools have not kept pace.
“A lot of genomics has operated on this model of one default reference, or a few default references,” says Nick Greenfield, co-founder and CEO of Reference Genomics in San Francisco. That is, tools tend to be built to compare new sequences against a single well-characterized genome of the same species, a shortcut that’s good at flagging certain types of simple genetic variation but very inefficient for other purposes.
Take BLAST, a foundational tool for genetics that finds matching strings of DNA between different genomes. BLAST can be used as a “search” tool of sorts, combing through a database of many reference genomes to pull out the closest hits. The National Center for Biotechnology Information (NCBI), for instance, maintains huge reference databases of thousands of genomes for conducting BLAST searches, helping to make this program the most frequently used search function in the field. But BLAST, written in the 1990s, was built with the expectation that reference databases would be relatively small, and that users would also want to do some alignment, figuring out exactly how and where a new sequence lines up with the reference. As a result, says Greenfield, “it doesn’t scale very well with regards to reference size.”
Greenfield became aware of these problems in early 2013, when he got together with a computational geneticist named Nik Krumm who had just completed his PhD at the University of Washington. Greenfield’s background was in general data management — his most recent job had been working on fraud prevention software — but he was intrigued by the scale of the data bottlenecks in genomics.
“We were really interested in the microbial side,” he says, “because you had this computational problem and data management problem where you didn’t just have one reference E. coli organism anymore. You had 3,000 E. coli references, and a further 2,000 Salmonella references, and those numbers are ballooning very quickly.”
Together, Greenfield and Krumm founded Reference Genomics and built an online platform called One Codex as an alternative to the clunky search algorithms used in most genomics applications. One Codex is specifically geared to microbiology, a field where just figuring out which species are in a given sample is a continual challenge. The invisibility of microbes, the unwillingness of many to be cultured in the lab, and the sheer variety of organisms that can be found in a single sample all contribute to a situation where microbial geneticists struggle to catalogue all the species they’re working with.
“There are still a lot of papers that get published where, at the end of the day, the research conclusion is the identification of something in a sample,” says Greenfield. “That’s indicative of the fact that it’s still really hard, and people are spending a lot of time on it. We’d like to make it so researchers can spend more time on characterization or tracking, or whatever it is they’re interested in.”
The Tech in Biotech
Greenfield describes the One Codex search algorithm as a “pure k-mer approach.” Like most genomic search functions, it takes an input file of DNA sequence and splits it into a complete set of overlapping k-mers*, and then searches for matching k-mers within a database of reference genomes. In One Codex, that database contains nearly 25,000 bacterial genomes made public by the NCBI, plus hundreds of viral and fungal genomes.
Searches are made faster by the system Reference Genomics uses to index all those references. “A core part of what we’ve done is build a really large database, or key-value store, where the keys are those k-mers and the values are associated biological characteristics,” says Greenfield. In other words, the reference database has already been broken down into target k-mers, and each k-mer is stored alongside information on the organism it came from: the species, subspecies, strain, and higher-order taxonomic groups.
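The indexing idea can be illustrated with a toy key-value store. This is a minimal sketch, not One Codex's implementation: the function names (`kmers`, `build_index`), the choice of k = 4, and the two made-up reference sequences are assumptions for illustration, and a plain Python dict stands in for the compact data structures the company describes.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every overlapping k-mer in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer (key) to the set of reference labels containing it (value)."""
    index = defaultdict(set)
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index


# Hypothetical toy references; real genomes are millions of letters long.
references = {
    "organism_A": "GATTACAGG",
    "organism_B": "CCGATTAGT",
}

index = build_index(references, k=4)

# A k-mer present in both references, and one present in only the first.
shared = sorted(index["GATT"])
unique = sorted(index["TACA"])
print(shared)
print(unique)
```

Because the references are broken into k-mers once, ahead of time, each later query is a direct lookup by key rather than a scan through every genome.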
Unlike BLAST, the One Codex algorithm never tries to align its input reads against a reference. Instead, it simply compares the values of all its k-mer “hits” to arrive at the narrowest possible picture of which organisms are in a sample. If the sample contains a very well-characterized species like E. coli, for which there are thousands of high-quality reference genomes in the NCBI database, One Codex will typically suggest a specific strain. For novel or less well-described organisms, One Codex will back up to the narrowest taxonomic level of which it can be confident: say, an unknown species from the enterobacteria family.
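One simple way to picture this "backing up" is to keep only the part of the taxonomy on which every k-mer hit agrees. The sketch below is an assumption about how such a rule could work, not a description of One Codex's actual logic, which the company does not detail here. The toy `INDEX`, its lineages, and the function names are hypothetical, and k is set to 4 for readability.

```python
def kmers(seq, k):
    """Return every overlapping k-mer in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def common_lineage(lineages):
    """Return the longest prefix shared by all lineages (broadest rank first)."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the hits disagree at this rank, so stop here
        shared.append(ranks[0])
    return tuple(shared)


def classify(read, index, k):
    """Look up each k-mer of the read and report the narrowest agreed taxon."""
    hits = [index[km] for km in kmers(read, k) if km in index]
    if not hits:
        return ()  # no k-mer matched any reference
    return common_lineage(hits)


# Hypothetical index: k-mer -> lineage from family down to species.
INDEX = {
    "GATT": ("Enterobacteriaceae", "Escherichia", "E. coli"),
    "ATTA": ("Enterobacteriaceae", "Escherichia", "E. coli"),
    "TTAC": ("Enterobacteriaceae", "Salmonella", "S. enterica"),
}

specific = classify("GATTA", INDEX, 4)    # all hits agree down to species
backed_up = classify("GATTAC", INDEX, 4)  # hits agree only at family level
no_match = classify("CCCCC", INDEX, 4)    # nothing in the index matches
print(specific)
print(backed_up)
print(no_match)
```

A production classifier would also have to weigh how many hits support each taxon and tolerate sequencing errors; this sketch treats every hit as equally trustworthy.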
One of several ways of visualizing the results of a One Codex search. Image credit: Reference Genomics
Greenfield says that this minimalist search approach, combined with “some data structure innovations that relate to packing more of these k-mer value associations into smaller space and accessing them more quickly in memory,” results in computations over a thousand times faster than a BLAST search. In practice, that means One Codex can process millions of reads in just a few minutes, depending on some of the search parameters. The One Codex platform is currently in open beta, and users can try it for free at onecodex.com.
Reference Genomics is a bit unusual as a biotechnology startup without any wet lab operations, placing it at a crossroads between the tech and biotech worlds. Fortunately, Krumm and Greenfield found a sympathetic source of funding for their company in Y Combinator, an accelerator in San Francisco that has provided seed money and early guidance to major Silicon Valley companies like Dropbox and Airbnb.
Reference Genomics is part of the freshman class of biotechs supported by Y Combinator, one of five life sciences startups accepted to the accelerator in the summer of 2014. Of all these companies, Reference Genomics may be the most suited to Y Combinator’s traditional style of support, with the relatively low capital requirements and rapid scalability of a tech startup. “We’re focused on enabling life science, and we’re an applied science company,” says Greenfield. “[But] we can push code three times a day if we want to push code three times a day.” The Reference Genomics team has also raised a private financing round from undisclosed investors, and with a comfortable funding cushion and a working product, the company is now starting to think about ways to commercialize the One Codex platform.
One Codex, Many Models
Early adopters of One Codex have been a mixed bunch. Reference Genomics hasn’t run any kind of marketing campaign for the platform yet, and Greenfield says he found many of the first users through Twitter, where there is a very active bioinformatics community. “We’ve got a fairly large contingent of academics doing all sorts of things,” he says. “There are people doing cow microbiome, human microbiome, people doing contaminant screening of human samples making sure nothing has gotten into the lab, people doing pathogen-centric work, people doing environmental metagenomics.”
These users on the basic science side tend to be tinkerers, groups with at least some experience digging into the plumbing of bioinformatics programs and modifying them for their own ends. That’s an important customer base for Reference Genomics, and the company has been responsive to their needs, keeping its application programming interface open and working with users on custom features like adding new reference libraries for specific purposes. But ultimately Greenfield hopes to appeal at least as much to casual users, people who could make effective use of genetic information but need ready-to-go workflows before they can dive in.
“On the applied side, we’ve seen a lot of use in public health, both more clinically oriented and food safety oriented,” says Greenfield. A rapid genomic search tool could be a powerful platform for tracking outbreaks and the spread of infections in near-real time, or for identifying bacterial contaminants in food or water supplies. Government agencies and hospitals could be key customers if One Codex is able to support their needs without demanding a great deal of customization.
The Centers for Disease Control and Prevention (CDC) has already signaled its support, awarding Reference Genomics $200,000 in prize money this January for a “No-Petri-Dish” challenge to quickly spot Shiga toxin-producing E. coli in biological samples. A highly specific tool like One Codex is essential for this task: E. coli is a very common and usually benign species of bacterium, but strains that produce Shiga toxin are dangerous food contaminants that can cause diarrhea and, in the worst cases, kidney failure.
The “No-Petri-Dish” prize doesn’t guarantee Reference Genomics a contract with the CDC, but it does put the company on the health agency’s radar. Greenfield says that government groups like the CDC badly need new tools for the next-generation sequencing age, when spotting pathogens has evolved from a slow hands-on process to a faster one with a large digital component.
“We used to live in a world where the primary diagnostic tool involved culturing, and people sent those cultures to CDC, and they built big biobanks of freezers,” says Greenfield. While that system was time-consuming, it did have the advantage of giving the CDC a physical library of samples. Now that hospitals increasingly diagnose infections with molecular tests whose results are recorded digitally, the CDC doesn’t always have a comprehensive view of what’s out there.
“The second-order intelligence is often being lost,” Greenfield adds. “So I think a lot of the promise of [next-generation sequencing] on the infectious disease side is that it’s another way of getting that full picture, and perhaps a much more powerful way of getting that full picture.”
Moving forward, the Reference Genomics team is adding basic tools to One Codex that could be used for a whole suite of different applications. A recent update allowed users to make their analyses public, letting them share results with collaborators — or, in the case of outbreak tracking, with different public health teams.
In the medium term, the company also plans to create workflows for more specific use cases. Right now, a One Codex search can produce a list of the organisms in a sample, but users have to link out to the NCBI’s database to get more information on what those results mean. Newer workflows could put some of the most important information straight on the One Codex page. In outbreak detection, for instance, it might be useful to immediately flag information that suggests a strain might be resistant to antibiotics.
“We’ll build out end applications for a subset of things where we think we can do a really good job, and that’s what we’re starting to do now,” says Greenfield. At the same time, the program will remain open for users who want to take advantage of a rapid search tool for their own niche applications.
A commercial model for One Codex is still taking shape. Some users might eventually pay to run an enterprise version of the platform on their own hardware, a model that would fit well with government agencies’ IT protocols. The company could also take an approach similar to that of open source companies, common in Y Combinator’s world of Silicon Valley startups but still unusual in the life sciences, releasing the program for free but offering paid support and hosting. In the meantime, One Codex is free for all users on Reference Genomics’ own compute infrastructure.
“We’re still not fully committed to a path,” says Greenfield. “What we have decided is that we want to have a platform model, we want it to be open and extensible, and we want to build some end-to-end solutions.”
* A k-mer is a DNA sequence k letters long; the most common setting of BLAST, for instance, uses 11-mers. Because k-mers overlap each other, the complete set of k-mers that make up a read contains far more letters in total than the input read itself: a read of n letters yields n - k + 1 k-mers of k letters each. If, for example, the seven-letter sequence GATTACA were split into 4-mers, it would be read as four k-mers: GATT, ATTA, TTAC, TACA.
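The GATTACA example can be reproduced in a few lines of Python. This is only an illustrative sketch; the function name `split_into_kmers` is made up for this example and does not come from BLAST or One Codex.

```python
def split_into_kmers(sequence, k):
    """Slide a window of width k along the sequence, one letter at a time."""
    if k <= 0 or k > len(sequence):
        return []  # no k-mer fits in the sequence
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


result = split_into_kmers("GATTACA", 4)
print(result)  # ['GATT', 'ATTA', 'TTAC', 'TACA']

# The overlapping k-mers hold more letters in total than the original read.
total_letters = sum(len(kmer) for kmer in result)
print(len("GATTACA"), total_letters)  # 7 16
```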