Biodatomics' Open Source Approach
By Aaron Krol
November 6, 2013 | Imagine you’ve decided to create an end-to-end platform for analysis of genomic data. Your company will compete with industry giants like CLC bio and DNAnexus to offer comprehensive workflows that translate raw sequencing reads into information on the level of phenotypes, populations, drug interactions and all the other gene associations researchers need, and will aim to make a profit doing so. You’ll probably want to write new, proprietary tools for searching through the genome to draw out all that information. It’s likely you’ll leverage a large private dataset toward those tools. And almost certainly, you expect to license your bundle of tools exclusively to paid users.
Or just maybe, you’re Maxim Mikheev, co-founder and CEO of Biodatomics, and you have a different plan altogether.
Like many of his counterparts at competing gene informatics software companies, Mikheev is a molecular biologist, computer programmer, and aggressive entrepreneur. More unusually, he is also an active member of the open source community, and BioDT, the platform his company unveiled two weeks ago at the ASHG conference in Boston, is open source from top to bottom. None of the roughly 400 distinct tools included in BioDT are exclusive to Biodatomics; instead, popular open source tools like the Galaxy suite, Bowtie, and the Broad Institute’s Genome Analysis Toolkit are collected on a common platform to be stitched together however the user likes. BioDT’s own source code, of course, is also freely available to view and modify – as some open source licenses require of programs built on preexisting open source tools – and anyone is welcome to use the community version of the platform at no charge.
There are good reasons one would want to, say its creators. Without the need to build tools from scratch, Biodatomics’ programmers have instead focused on creating a top-of-the-line interface, with visualized results and intuitive operations. Even for sophisticated workflows and queries, says Mikheev, “end users don’t need to know programming languages.” Instead, workflows are created in a drag-and-drop manner, with users choosing from BioDT’s arsenal of tools and placing their operations in order – if desired, feeding the output data from one analysis as input data into another. BioDT also incorporates Impala, a query execution program that frees users from writing Perl scripts to search through their tables. Impala not only allows queries to be written in natural language, but also returns information at a greatly accelerated rate, on the order of just seconds. “We have virtually real-time queries,” Alan Taffel, Biodatomics’ Chief Marketing Officer, told Bio-IT World. “You just enter [your search term] into a field, and the table, or the subset of a table, that you’re looking for pops right up. That makes getting insights out of the data a lot quicker and easier.” Results can be displayed either graphically or in the traditional table format.
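For readers who think in code, the chaining idea behind the drag-and-drop workflows can be sketched in a few lines of Python. This is only an illustration of feeding one step's output into the next, not BioDT's implementation; the step names `trim_reads` and `uppercase_reads` and the helper `run_workflow` are hypothetical and stand in for real analysis tools.

```python
from typing import Callable, List

# A step takes the previous step's output and returns its own output.
Step = Callable[[List[str]], List[str]]


def trim_reads(reads: List[str]) -> List[str]:
    """Hypothetical step: discard reads shorter than 4 bases."""
    return [r for r in reads if len(r) >= 4]


def uppercase_reads(reads: List[str]) -> List[str]:
    """Hypothetical step: normalize bases to upper case."""
    return [r.upper() for r in reads]


def run_workflow(steps: List[Step], data: List[str]) -> List[str]:
    """Run the steps in order, passing each output on as the next input."""
    for step in steps:
        data = step(data)
    return data


# The ordered list plays the role of tools placed in sequence on a canvas.
workflow = [trim_reads, uppercase_reads]
result = run_workflow(workflow, ["acgt", "ac", "ttgaca"])
print(result)
```

Reordering or swapping entries in the `workflow` list changes the analysis without touching the steps themselves, which is the property a visual workflow builder relies on.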
True to form, Impala is itself an open source project from the company Cloudera. BioDT draws from a number of open source data management programs, including the JBoss Java application server from Red Hat, and crucially, the Hadoop framework for data processing. “Hadoop is wildly popular for big data analytics,” says Taffel. The Hadoop framework replicates large blocks of data across multiple machines, and preferentially schedules tasks to run on the same hardware where the relevant data is stored. This combination of parallel processing and minimal transfer of data between machines yields major returns when dealing with large datasets like those needed to make sense of the genome. Using Hadoop, Taffel adds, “leads to multiple orders of magnitude faster execution. So workflows that today, on other platforms, would take days or weeks to execute, we can execute in hours.” Biodatomics believes that this dramatically faster analysis, built on the resources available to an open source company, will provide a major advantage against competitors who aren’t legally able to take advantage of software like the Hadoop platform.
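The data-locality idea can be made concrete with a toy simulation. The Python sketch below is a simplified illustration under stated assumptions, not Hadoop's actual scheduler or block placement policy, which also account for details such as rack layout. The functions `place_replicas` and `schedule`, the round-robin placement, and the node and block names are all hypothetical.

```python
from typing import Dict, List, Tuple


def place_replicas(blocks: List[str], nodes: List[str],
                   replication: int = 3) -> Dict[str, List[str]]:
    """Copy each block onto `replication` nodes, round-robin.

    Assumes replication <= len(nodes) so that the copies land on
    distinct nodes. Real block placement is more elaborate.
    """
    count = len(nodes)
    return {
        block: [nodes[(i + k) % count] for k in range(replication)]
        for i, block in enumerate(blocks)
    }


def schedule(placement: Dict[str, List[str]], nodes: List[str],
             slots_per_node: int) -> Dict[str, Tuple[str, bool]]:
    """Assign one task per block, preferring a node that holds the block.

    Returns {block: (node, ran_locally)}.
    """
    free = {node: slots_per_node for node in nodes}
    assignment = {}
    for block, holders in placement.items():
        # First choice: a node that already stores a replica of the block.
        candidates = [n for n in holders if free.get(n, 0) > 0]
        if not candidates:
            # Fallback: any node with capacity; the data must be transferred.
            candidates = [n for n in nodes if free[n] > 0]
        if not candidates:
            raise RuntimeError("no free task slots left in the cluster")
        node = max(candidates, key=lambda n: free[n])  # least-loaded candidate
        free[node] -= 1
        assignment[block] = (node, node in holders)
    return assignment


nodes = ["n1", "n2", "n3", "n4"]
blocks = ["b0", "b1", "b2", "b3", "b4", "b5"]
placement = place_replicas(blocks, nodes, replication=2)
assignment = schedule(placement, nodes, slots_per_node=2)
for block, (node, ran_locally) in assignment.items():
    print(block, node, "local" if ran_locally else "remote")
```

Because every block has more than one copy, the scheduler usually finds a free machine that already holds the data, so tasks run in parallel without first moving large files across the network.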
A Product Years in the Making
The open source model is an unusual one for the end-to-end genomic informatics market, but it is familiar territory for Mikheev, who began work on his program over a decade ago in his native Russia. “[The program’s] original name was BioUML,” Mikheev told Bio-IT World. “We started developing this platform in 2002, and it was pointed toward microarray analysis and genetic pathways, gene networks,” the kinds of narrowly targeted analyses that represented the frontiers of practical genetic testing at the time. BioUML is still freely available for these tasks, but Mikheev moved on in 2006, traveling to the United States to join the National Institute on Aging and the University of Pittsburgh Medical Center. In the meantime, next-generation sequencing made it possible for even smaller laboratories to move from targeted panels to analyses on the genomic level.
In 2011, Mikheev decided that the same open source philosophy behind BioUML would also benefit researchers in this new sequencing environment, and formed Biodatomics to pursue a revamped platform. “I had seen how researchers at the university medical center struggled without good tools to analyze data from next-generation sequencing,” he says. “We decided to convert our platform from microarrays to next-generation sequencing.”
While opening a free state-of-the-art bioinformatics platform to the research community is a noble ambition, Biodatomics is still a for-profit company. The unconventional business model, however, doesn’t worry the Biodatomics management team, who take inspiration from companies outside the biological space that have made open source software a highly profitable enterprise. Taffel points to the example of Red Hat, a software developer centered on the open source Linux operating system. “Linux is available for free to anybody,” says Taffel. “Download Ubuntu and you’ve got it. And yet Red Hat is a billion-dollar company – so there are plenty of people who are interested in an enterprise-grade, fully-supported, fully-featured version of the software. Those people are willing to pay for it, and we have a product for them.”
The paid versions of BioDT add important capabilities, including much faster processing, full product support from Biodatomics, and security features that may be key for certain users. BioDT operates in the cloud, which offers functional advantages. An advanced real-time collaboration system allows users to simultaneously access and modify the same project, just as though they were cooperating on a Google Doc. Cloud access also provides users with an expanding dataset, wherein results from one user’s project can offer valuable tools and data for another’s. “We encourage users to put their workflows into the community,” says Taffel. “That’s one of the big benefits of being an open source platform.” But for private companies, this free access to data and project pipelines may be untenable, and for them, Biodatomics offers a Pro version with a private cloud, data encryption, and modifiable permissions. “That could be a better choice… [for companies that] want to keep things very proprietary, enclosed, and have high security needs,” adds Taffel.
Overall, Biodatomics maintains that its open source approach is not just a statement of principle, but also a sound business model. “We’re pretty confident that we’ve leapfrogged the market, in terms of both capability and speed,” says Taffel. Community users will get a taste of the platform’s tools, visual style, and drag-and-drop interface, while enterprise users enjoy what Biodatomics expects will be the fastest analysis time on the market. Although the public launch of BioDT was only announced two weeks ago, early adopters include high-profile users like the J. Craig Venter Institute and Digicon. As a wider user base starts to experiment with the platform, Mikheev and Taffel hope that these organizations will be just the first of many to follow Biodatomics into the world of open source genomic analysis.