Bridging the Gap: Bringing Genome Analysis Tools to the Masses with GenePattern
August 22, 2013
By Matt Luchette
August 22, 2013 | This past month, researchers from the Broad Institute released a software update to GenePattern, the Institute’s open-source genome sequence analysis software (and winner of a 2005 Bio-IT World Best Practices Award), which will allow programmers to upload their own analysis tools to an open database.
“We want to let them release their own tools into the wild,” said Dr. Michael Reich, the Director for Cancer Informatics at the Broad.
After being publicly released in 2004 by Jill Mesirov’s lab at the Broad, GenePattern has found a niche in providing researchers a platform for integrating any of the thousands of genomic analysis programs available on the web into a seamless pipeline.
Reich, a researcher in Mesirov’s lab, estimates that more than 10,000 genome analysis tools are available online from a number of developers. Many of these tools do not support the same file formats, which makes it difficult to use multiple programs to analyze genomic data: a small problem for a computer scientist, but a rate-limiting step for biologists who may not know their bytes from their Booleans. GenePattern makes it easier for less tech-savvy researchers to combine these programs, like Lego pieces, into a single workflow.
“The input to any step,” said Reich, “makes a compatible output” for the next tool in the pipeline. And they seem to be on to something: the software supports nearly 50,000 total users and handles thousands of analyses per week.
The update, said Reich, provides improved integration between GenePattern and GParc, GenePattern’s third-party software archive, making it easier for outside developers to “disseminate data easily” by contributing their own analysis programs to GParc. Reich hopes that combining GParc’s growing repository of third-party applications with GenePattern’s pipeline creation tools will further the program’s mission of making genome analysis software easy to use for non-programmers.
Additionally, the update allows users to easily add modules from the GParc repository to their pipeline, and it can tell the user which repository a module came from after an analysis is completed.
In his role at the Broad, Reich plans to pursue software projects that are “designed to serve the requirements of the world-wide genomics community.” GenePattern is one way he hopes to serve that community: by making powerful genome analysis programs more accessible, it helps bridge the gap between biology and bioinformatics.
Other open-source genome analysis programs on the web, such as the UNIX-based Tuxedo Suite of applications from Johns Hopkins University, require users to have programming experience to build the analysis workflow. GenePattern, by contrast, provides tools that help researchers without programming experience automate workflow creation online, without downloading any programs, and even run the analysis on the Broad’s computer servers. The feature is particularly useful for users who may not have the computer science experience to build their project from a command-line interface. The program, however, still allows users to manipulate its modules in MATLAB, R, or Java if they wish.
Aside from its module integration tools, one of the major benefits of GenePattern, Reich said, is that it records all of a pipeline’s analysis steps, allowing users to easily summarize their analysis methods. Mesirov calls this technique “accessible reproducible research.” Reich stressed that this feature is especially important for allowing other researchers to reproduce and check a lab’s work. “A paper is not research,” he said. “It’s an ad for research.”
GenePattern is also designed to integrate with other web-based genome analyzers. Galaxy, for instance, “is complementary to GenePattern,” said Reich. Both programs are part of the GenomeSpace Initiative, a collaboration between Mesirov’s and Aviv Regev’s labs at the Broad, which provides an environment for integrating multiple analysis programs from different developers into a single workflow.
Since its public release in 2004, GenePattern has been cited in over 20 papers that have expanded the program’s initial applications to a number of others, from flow cytometry to microRNA expression analysis.
Making bioinformatics tools more accessible to a wider audience may be a difficult problem for programmers, but through GenePattern, Reich and his colleagues hope to bridge the divide and “bring powerful technology together.”