Benjamin Langmead Named Fifteenth Annual Benjamin Franklin Award Winner
By Ryan Cross
April 14, 2016 | Benjamin Langmead, an assistant professor of computer science at Johns Hopkins University, was presented with the fifteenth Benjamin Franklin Award from Bioinformatics.org at the 2016 Bio-IT World Conference & Expo for his development of open source and cloud-based bioinformatics software.
Bioinformatics.org President Jeff Bizzaro presented the award. Bizzaro founded Bioinformatics.org in 1998, a time when he says there was much reluctance towards sharing. The award has been honoring open access life science projects since 2002, and is inspired by Benjamin Franklin’s famous refusal to patent his inventions.
Langmead got his start in computer science. “I hadn’t thought of genomics as an area that I could work in until I met Steven Salzberg,” he said. That was in 2007, when Langmead began graduate school at the University of Maryland. He attended a talk on next-generation sequencing where Salzberg, a computational biologist and 2013 Benjamin Franklin Award winner, discussed limitations software imposed on genomics research. Langmead approached Salzberg to see if his background in writing fast pattern matching software could be of use. Salzberg said, “Absolutely.”
Their collaboration resulted in the open source sequence read alignment tools Bowtie and Bowtie 2. The programs have over 10,000 citations and are used within over 50 bioinformatics programs.
In his acceptance speech, Langmead said, “The sequencing revolution is something that has potential to benefit all scientists.” But the practical barriers of time, funds, and lack of computational expertise can stand in the way. That’s where Langmead’s work publishing free software, tools, and educational resources comes into play.
Langmead said computer scientists have skills to help “level the playing field” for biologists. One skill is writing algorithms to perform the same amount of work in less time and without any additional computing power. “That’s just magic, doing more with less,” Langmead said.
A second skill is creating scalability. Designing scalable programs is essential for handling increasingly vast amounts of data and for balancing the computing power and computing time required for running a program.
Accessing the Crown Jewels
Langmead and his lab have spent a lot of time working with the Sequence Read Archive, or SRA, a common destination for sequencing data that he calls “the mother of all life science datasets.” The SRA currently contains about 5 petabases of data, or five million billion nucleotides. Eighteen months ago it held about 2.5 petabases, a doubling rate reminiscent of Moore’s Law.
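The arithmetic behind those figures can be checked in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the projection assumes the 18-month doubling simply continues, which the article does not claim.

```python
# One petabase is 10**15 bases, so 5 petabases is five million billion bases.
PETA = 10**15


def to_bases(petabases: float) -> int:
    """Convert a size in petabases to a raw count of bases."""
    return int(petabases * PETA)


def projected_petabases(current: float, months_ahead: float,
                        doubling_months: float = 18.0) -> float:
    """Project archive size, ASSUMING steady exponential growth.

    The 18-month doubling period comes from the two data points quoted
    in the article (2.5 -> 5 petabases); extrapolating it is an assumption.
    """
    return current * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    print(to_bases(5))                    # raw base count for 5 petabases
    print(projected_petabases(2.5, 18))   # reproduces the quoted doubling
```

Running the projection from 2.5 petabases over 18 months reproduces the 5 petabase figure quoted above; anything further out is only as good as the steady-doubling assumption.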
“Sequencing is cheap.” Langmead said that refrain is common and true. But that doesn’t make the SRA data cheap. SRA datasets include rare diseases, hard-to-obtain tissues, carefully prepared single-cell data, and multiple tissue sets from single individuals. “These are hard datasets to get,” Langmead said. “So I think it would be a mistake to think of the data in the SRA as somehow being cheap.”
Langmead describes the SRA as a “sort of miracle.” The U.S. arm almost closed in 2011 because of budget pressures, and another difficulty arises from investigators who don’t want to publicly deposit their data.
As a computer scientist, Langmead looks for ways to make the SRA easier for biologists to use. “You can’t just let the last ten years of big data technology go by without trying to apply it to the problem,” he said. “We have to find a way to grab onto the speeding truck and steer it towards science a little bit more.”
One of the Langmead lab’s most recent tools is Rail-RNA, a scalable software tool that can run on individual computers, computer clusters, or in the cloud. Rail-RNA is a spliced alignment program for studying exon-exon junctions.
Using Rail-RNA and a cloud computing service, Langmead’s group analyzed 50,000 human RNA sequencing samples from the SRA. The cloud has many benefits over institutional clusters, which may be busy, broken, or simply lacking in power. The group discovered that only 80% of splice junctions are found in existing gene annotations, and compiled the results into a “partially-digested data project” called Intropolis.
Next, they plotted the first appearance of all distinct splice junctions over time, and found that the number of known splice junctions reached an asymptote in 2013. “This has an interesting punctuation mark on the question of completeness of annotation,” Langmead said. If there was ever a good time to update gene annotations, “right around now would be a good time to do it,” he said.
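The idea of plotting first appearances can be illustrated with a small sketch. This is not the group's code or the Intropolis format: the input layout (pairs of a year and a junction identifier), the identifier strings, and the function name are all assumptions made for illustration. It records the earliest year each distinct junction is seen and returns the running total of distinct junctions per year, so a plateau shows up as a year that adds nothing new.

```python
from collections import Counter
from itertools import accumulate


def first_appearance_curve(observations):
    """Cumulative count of distinct junctions by year of first appearance.

    `observations` is an iterable of (year, junction_id) pairs; this input
    layout is hypothetical, chosen only to keep the example self-contained.
    """
    first_seen = {}
    years_seen = set()
    for year, junction in observations:
        years_seen.add(year)
        # Keep only the earliest year in which each junction was observed.
        if junction not in first_seen or year < first_seen[junction]:
            first_seen[junction] = year
    if not years_seen:
        return []
    new_per_year = Counter(first_seen.values())
    # Include every year in the observed range so flat stretches are visible.
    years = range(min(years_seen), max(years_seen) + 1)
    totals = accumulate(new_per_year[y] for y in years)
    return list(zip(years, totals))


if __name__ == "__main__":
    # Made-up toy data, not real SRA results.
    toy = [
        (2010, "chr1:100-200"), (2011, "chr1:100-200"),
        (2011, "chr2:300-400"), (2012, "chr3:500-600"),
        (2013, "chr3:500-600"), (2013, "chr2:300-400"),
    ]
    for year, total in first_appearance_curve(toy):
        print(year, total)
```

In the toy data, the final year contributes samples but no new junctions, so the running total stays flat, which is the kind of leveling off the article describes.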
Even though Langmead acknowledged that the SRA can be hard to use, he urged researchers to not think of it as a data graveyard. “It is actually the crown jewels,” he said.
Langmead emphasized the continued need for computational scientists with the ability to scale software to assist in life science endeavors, saying, “They can contribute to leveling the playing field for all labs to benefit from the huge amount of open access sequencing data that we now have available.”