The Broad’s Approach to Genome Sequencing (Part II)
Bio-IT World | Since 2001, computer scientist Toby Bloom has been the head of informatics for the production sequencing group at the Broad Institute. Her team is the one that has to cope with the “data deluge” brought about by next-generation sequencing. Kevin Davies spoke with Bloom about her team’s operation, successes and ongoing challenges.
(Part I of this series, with Rob Nicol and Chad Nusbaum, is here.)
BIO-IT WORLD: What’s the mission of your team?
TOBY BLOOM: Our goal is to be able to track everything that’s going into the lab from the time samples enter the building, through all standard processing of the data. My team is responsible for managing the sample inventory, keeping track of the projects, the LIMS (Laboratory Information Management System), which tracks processing events in the lab, the analysis pipeline, which at this point includes the standard vendor software, then alignment, various quality checks, generating metrics, and we do SNP [single nucleotide polymorphism] calling for fingerprinting. We have a data repository – a content management system that makes all that sequence data available to the researchers when it gets handed over to their side for further analysis.
Did your team build the LIMS?
We did, many times! A couple of times in the past we’ve looked at what’s on the market, but not recently. Because of the scale we’re at and how fast we change things, we usually find that any one product that’s out there is aimed at one of the things we do, but not all of them. We’ve had our own LIMS since before I got here, nine years ago, but for next-gen sequencing, we’ve had to rebuild much of that. We’re far enough ahead of the curve on most of this stuff that most of the LIMS out there wouldn’t be ready in time.
How big is your team and what is its chief expertise?
There are about 25 people. The vast majority are software engineers, but I have a couple of people in data management, and a couple who deal with the databases. Everybody else is writing code, mostly Java programmers.
What brought you to the Broad, or the Whitehead Genome Center as it was then?
I was just fascinated with what was going on! It was clear from as far back as ’95 that they were bringing in computer scientists to deal with the data challenges. I didn’t get here until ’01. I was looking to change positions, and it’s just a fascinating place to be. I didn’t know a lot about the biology, but it’s exciting to be a part of what the Broad is doing.
What do you do in terms of downstream processing?
We do a number of things in the software. We handle a variety of different technologies. We built a pipeline manager that allows us to specify the workflows that should be used for various types of analyses or sequencing. So if we’re doing RNA-sequencing for example, we’re doing something a little different than whole-genome shotgun or targeted sequencing. It [the pipeline manager] lets us handle many pipelines at once. It lets us pull in information from our instruments, our LIMS, and our sample repository to decide what to do on the fly. All of those pieces are integrated.
We’re doing 0.5-1 terabases a day. We’re doing a lot of processing! There’s a focus on high throughput and automation. Having it under our own control and being able to change rapidly is important.
Within the pipeline manager, we run the Illumina vendor software. Which parts run off the instrument changes over time. We started out doing the image processing off the instrument, but it got to the point where [Illumina] could do the image processing reliably enough on-instrument that we could use that. Then we started pulling the intensities from the instrument, instead of images. I hope that with the HiSeq 2000s, we soon get to the point where they do the base calling on the instrument. We’re still doing all the base calling in the pipeline right now, but maybe we’ll soon get to the point where we can take the base calls off the instrument and just do the downstream processing – creating BAM files, recalibration, alignment, deduplication, quality analysis, SNP calling, etc.
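To make the workflow idea concrete, here is a minimal Java sketch of the kind of dispatch Bloom describes: an ordered list of processing steps is chosen from run metadata, and base calling is included only when the instrument has not already done it. The class, workflow, and step names are illustrative assumptions, not code from the Broad’s pipeline manager.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: choose an ordered list of processing steps from
    // run metadata, in the spirit of the pipeline manager described above.
    // All names here are assumptions, not code from the Broad's pipeline.
    public class WorkflowSketch {

        public List<String> plan(String sequencingType, boolean baseCallsFromInstrument) {
            List<String> steps = new ArrayList<>();
            if (!baseCallsFromInstrument) {
                steps.add("base_calling");   // still done in the pipeline today
            }
            switch (sequencingType) {
                case "RNA_SEQ" -> steps.addAll(List.of(
                        "bam_creation", "spliced_alignment", "quality_analysis"));
                case "WHOLE_GENOME_SHOTGUN", "TARGETED" -> steps.addAll(List.of(
                        "bam_creation", "recalibration", "alignment",
                        "deduplication", "quality_analysis", "snp_calling"));
                default -> throw new IllegalArgumentException(
                        "No workflow registered for " + sequencingType);
            }
            return steps;
        }
    }

In the real system, as Bloom notes above, those decisions are driven on the fly by metadata pulled from the instruments, the LIMS, and the sample repository.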
Are you able to distribute Broad Institute resources such as the pipeline manager to the community?
We’re not proprietary about it, but because all our pieces are integrated, it’s sometimes hard to release pieces of code because they depend on our databases or other internal metadata. We’ve released some of our BAM file processing tools, the Picard tools, publicly. I’ve been trying to see if we could make our pipeline manager run in the Cloud, to make it available to other people. We’ve done some work to isolate our actual pipeline management from our database and internal structures, but it’s not ready to do that yet . . . The goal is to be able to flexibly create and change workflows, and run many at once. It’s focused on massively high throughput and very automated processing. We want to make sure that if something fails -- because we’re running 2,000 compute cores at a time, things will fail, servers will drop out -- it can track where everything is, what’s failed, what’s stuck and hasn’t turned up. Our goal is to be able to restart from the last step automatically without a lot of human intervention. In some ways we do that better than others but we’re always working on it. The goal is to push all of this stuff through without a lot of people.
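As a sketch of the restart behavior described here, the fragment below records each completed step on disk so that a rerun resumes where the failed run stopped. The checkpoint scheme and names are assumptions made for illustration, not the pipeline manager’s actual mechanism.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Hypothetical sketch of restart-from-last-step: persist the index of the
    // last completed step so a failed run can resume without redoing finished
    // work. Not the Broad's actual code.
    public class RestartableRun {

        private final Path checkpoint;   // e.g. one checkpoint file per lane or flowcell

        public RestartableRun(Path checkpoint) {
            this.checkpoint = checkpoint;
        }

        public void run(List<Runnable> steps) throws IOException {
            int done = Files.exists(checkpoint)
                     ? Integer.parseInt(Files.readString(checkpoint).trim())
                     : 0;
            for (int i = done; i < steps.size(); i++) {
                steps.get(i).run();                            // a rerun resumes here on failure
                Files.writeString(checkpoint, Integer.toString(i + 1));
            }
        }
    }

The design point is that the step which failed must be safe to re-run, so recovery needs little human intervention even when servers drop out mid-run.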
Do you function at all like a core lab?
We don’t function as a core informatics facility; in other words, we don’t provide a service for building custom software, but we will do specialized things for certain projects. We’re not here as a service, we’re here to support production, but we do get requests. We’re very much a production system, so we’re not [writing] research algorithms. We do take research algorithms when they’re working well enough and optimize them to make them more robust to run in a pipeline. There are specialized cases where [Broad] research groups will come to us and say, we need specialized processing of the data in the pipeline. So we have to do something different, e.g. for epigenetics or RNA-seq, things like that. Sometimes they’ll ask us to write specialized code, and we can sometimes, but we don’t always have the bandwidth.
We often hear about the “data deluge.” What’s it like facing the brunt of that?
From the informatics side, we were all prepared to see the data volume go up. We were ready for the hardware choices … same for the software. The big surprise wasn’t that or the amount of data coming off the machines. What had us playing catch up was the impact on the lab and the LIMS. It was more the change in the number of libraries we were making, the number of samples in the lab at once, than the actual amount of data – from my point of view. I’m not saying it was easy to scale the data, but we weren’t predicting the whole lab process would have to change so much because of the volume of samples. When we were doing large genomes on the [ABI] 3730s, a library would last a month or more. We didn’t have to worry about 1,000 samples in the same step in the lab at the same time and having to track them to make sure they didn’t get mixed up or lost. We built many layers of tracking in the LIMS that weren’t needed before. That’s been a major change.
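To illustrate the kind of tracking layer Bloom mentions, here is a minimal sketch of the per-sample event record a LIMS might keep so that a thousand samples sitting at the same lab step cannot be mixed up or lost; the field names are assumptions, not the Broad’s schema.

    import java.time.Instant;

    // Illustrative only: the kind of per-sample event a LIMS tracking layer
    // might record. Field names are assumptions, not the Broad's actual schema.
    public record SampleEvent(
            String sampleId,      // identity assigned when the sample enters the building
            String containerId,   // plate or tube barcode currently holding the sample
            String position,      // well position within the container, e.g. "A01"
            String processStep,   // the lab step just completed, e.g. "library construction"
            Instant recordedAt) { // when the event was captured
    }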
Other parts of the Broad’s sequencing operation borrow from proven factory automation methods. Can you apply any of those methods on the informatics side?
Well it’s not quite the same -- you can’t put tape on your screen! It’s more of a focus on how we do rapid iterations reliably. Things change in the lab very quickly. Your standard software engineering practices – gathering requirements and writing careful specifications and then doing careful design and building – aren’t a process that works in this environment, because by the time you get through all those steps, you’re building the wrong thing. We need to move rapidly, whereas that model is made for building software for processes that are already well established and just need to be automated.
We are often building software ahead of the process. We’re trying to get enough working that they can function and get the data they need, before they really know what they need from us. They’re doing ongoing process improvement . . . We have to focus on how we can identify what they need most, and how to change along with their process changes. We need to build it in small pieces and then add to it without rebuilding what we already did. In many ways, it’s agile software development. It’s as close to agile as anything else, but it doesn’t follow a (typical) agile process.
How much are you involved in the decision to put machines or new platforms into production?
We’re definitely part of that. It’s not just that we have to be satisfied that it’s ready to go, it’s that “we’ve got the software changes in that you need.” The Illumina software that runs in our pipeline has changed several times. We had to get that software into production too. When there are changes in data types and metrics, we have to change all that in our system.
Do you communicate much with the vendors on the software?
We often take beta versions of their software before it’s out. We’re expected to find their bugs! We can’t wait for their official release. We’re the first ones to get the instruments often, so yes, there’s a very definite interaction on the informatics side as well.
We have weekly calls with [Illumina] about the informatics. On the HiSeqs, we don’t want to pull the intensity data if we don’t have to. We’re just validating we can get the same results using their on-instrument base-calling; we need to understand the failure modes in the integration so we can make sure we’re not going to lose data.
What impact will the third-generation sequencing technologies such as PacBio and Ion Torrent have on your team?
The long reads will help with a number of kinds of analyses downstream, but they don’t affect the production software I build as much. We’re looking forward to having long reads.
I try to make sure I know what all of the things are on the horizon that might show up and what the informatics implications are. Sometimes it matters, sometimes we don’t have to do much to prepare. We initially thought the biggest difference might be the size of the data, the bytes/base might be different on different instruments. On the HiSeq, Illumina has gotten down to essentially one byte/base, so that difference -- where you didn’t have to pull images -- has gone away. There are some that have much higher volumes of data than others.
The differences in the [sample] prep process matter to us, because it matters what our LIMS can handle. We watch but we don’t jump into active building until we have a machine in house and we think it’s time to ramp up.
What have you been doing in the Cloud?
There are a couple of reasons for exploring the Cloud. One is small centers that don’t have the IT infrastructure to be able to handle the volumes of data and the complexity of the processing. That’s not an issue for us but it is for small centers. The other side is the big collaborative projects, where you have many, many centers sharing data -- many centers producing data, and many centers that are then processing the data, e.g. 1000 Genomes or TCGA. (TCGA we can’t yet put on the Cloud because of security concerns.)
For some of the groups that want to do analysis, getting the data back and forth from NCBI or other centers is a burden. If the data could be put in one place and everyone could move the compute to the data . . . When you get to these very big projects, moving the compute to the data seems very much more efficient than having to keep moving the data every time you want to compute. So that’s the model: can we put the data in one place and move the compute to the data as needed?
That’s the experiment. I’m not saying we’ve gotten there. It’s very different from some of the Grid models, which provide the compute -- essentially, if you don’t have the compute you need, you can just borrow it, but you have to get your data there. … This is very much: let’s not move the data -- the data is the problem. I have a grant to experiment with my pipeline. One of the experiments is: could we put a pipeline manager up there [Amazon EC2] with a lot of the standard analysis steps, and let people go to the Cloud and figure out how to use it for their own pipelines and workflows? I don’t think we’re there yet. This only works if it’s easy to get the data into the Cloud. It’s clearly not the kind of application that was targeted originally by the public Cloud vendors.
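A rough, back-of-the-envelope calculation shows why moving the data is the sticking point; the dataset size and link speed below are assumed for the example and are not figures from the interview.

    // Back-of-envelope arithmetic with illustrative numbers. The interview gives
    // only the 0.5-1 terabase/day production rate; the dataset size and bandwidth
    // here are assumptions for the sake of the example.
    public class TransferEstimate {
        public static void main(String[] args) {
            double datasetTb = 100.0;   // assumed shared dataset size, in terabytes
            double linkGbps  = 1.0;     // assumed sustained bandwidth to one center, in Gbit/s
            double seconds   = datasetTb * 8e12 / (linkGbps * 1e9);
            System.out.printf("Copying %.0f TB over a %.0f Gbit/s link takes roughly %.1f days.%n",
                              datasetTb, linkGbps, seconds / 86400.0);
            System.out.println("That cost repeats for every center that pulls its own copy;");
            System.out.println("moving the compute to a single copy of the data avoids it.");
        }
    }

Under those assumptions the copy alone takes more than a week per center, which is the argument for putting the data in one place and bringing the compute to it.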
Are putting the data and running the pipeline in the Cloud two separate issues?
They are. We all submit these data to NCBI or EBI, and those two data repositories are already putting some of their data up on the Cloud to make it more available. Whether it goes on a public Cloud or not is not the question. If we’re already sending data to NCBI and EBI, and those guys exchange and replicate all the data anyway, could that be the foundation for the data already being there? NCBI isn’t about to provide all the compute for the world. But the notion is we already have a central repository, if that were part of a Cloud-like infrastructure -- whether on the Amazon Cloud or another commercial Cloud, or a private Cloud -- is that kind of architecture useful?
The jury is still out on the Cloud and how to do it. Is the Cloud model a helpful model for the community? We won’t know that until we can run things enough to test it. Can we make it work on the existing Cloud infrastructure? The jury is still out on that one also.