Distributed Bio’s Chris Smith on the Rise of iRODS
By Kevin Davies
December 23, 2010 | During 13 years at Platform Computing, Christopher Smith became something of an expert in integrating high-performance computing (HPC) systems and designing the infrastructure surrounding Platform LSF. He then became a principal product architect, providing a more holistic view of all platforms across different verticals, including life sciences.
Last fall, he and Giles Day, formerly at Pfizer, set up their own bio-IT consultancy, Distributed Bio. The firm’s early projects include work with pharma and biotech companies, a project leveraging the cloud for a biofuels company, and two projects with iRODS (Integrated Rule-Oriented Data System), including one with the Broad Institute.
“We provide a multidimensional view of things. I understand the technology for grids, clouds and clusters,” says Smith, who was VP of standards for the Open Grid Forum from 2007 to 2010. But he had an itch to deal more directly with client issues. “Customers compartmentalize you when you are a software vendor; they don’t bring you in to help develop entire solutions. I wanted to get deeper with these customers and go end-to-end with their solution.”
“I’m very interested in how people are dealing with the data deluge, managing instrument data, and how they tie that into processing,” he continues. Much of that centers on next-generation sequencing (NGS) data, and especially human genome data. “It’s intuitively more relevant – it’s human genomes and diseases. All things we can touch.”
Distributed Bio focuses on the macro-level. “A lot of people can find a storage provider or a solution provider, and they’ll do just fine. We take more of a view of the pipeline and the scientific workflows involved. We have a long horizon and think ahead -- how to architect a solution not just for today but anticipating the future.” That’s where Day’s big pharma experience comes in. “Giles has that understanding of what people are trying to accomplish,” says Smith, who calls himself the “propeller head” in the team.
Image analysis is another vexing problem for life scientists. “Take your digital camera,” says Smith. “You have no idea what you have on your disk anymore. If I’m trying to tie image data to experiments or lab analyses, and group them in interesting ways, it becomes very difficult.” Complicating matters further is that NGS, imaging and other data all exist in silos.
Silo Mentality
“Places like the Broad or Sanger Institute will have a data problem whether it’s in house or not. Using any systems on the premises just adds to the complexity. The Sanger folks are pretty public about testing the waters with Amazon. The big centers will figure out where it best fits. It will be around collaboration, and is intended to provide a more viable option for giving people access to datasets. Right now, the onus is on the consumer to download and use it. It makes a lot of sense for medium-sized organizations to bring the processing to the Cloud.”
Smith expects Amazon to gain traction with small- and medium-sized biotechs as more companies deploy NGS instrumentation. “It’s a quandary – do we buy clusters or racks?” says Smith. “The economics of owning your own hardware is not so clear cut compared to the value proposition for Amazon. Everyone’s happy with a co-lo, Amazon’s just taken it to the nth degree. I’m very bullish on the cloud, and so far, Amazon has been the most innovative.”
But there are some emerging alternatives to Amazon. “If you know exactly what sort of computing you’re doing, there’s definitely a place for those kinds of niche players. It’s just about getting the economics right. I think Penguin has a play, SGI, probably a few others. Amazon innovates – look at the GPU instances -- but there’s an opacity to Amazon’s offering that some people don’t like as much.”
A frequently mentioned concern with the cloud is data security, but Smith counters that Amazon “arguably run their data center better than many other companies. But perception is reality. The fact that Amazon does a good job of keeping your data safe is immaterial – it’s the trust level.”
iRODS Interest
An open-source data management system that Smith is particularly excited about is iRODS. “Reagan Moore [DICE Center, UNC] promoted the notion of metadata driven file management. These systems came out when grids were still fashionable.”
Smith’s practical interest in data management tools dates back to scheduling jobs at Platform, where data location became an important consideration. “A pet interest of mine is to figure out how to make a scheduler more intelligent – where it places jobs with respect to data? You need a metadata service for the scheduler to use. Where is this file located? Where are the replicas?”
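The scheduling idea Smith describes can be illustrated with a minimal sketch. It assumes a hypothetical in-memory replica catalog (a plain dictionary standing in for a metadata service) and a hypothetical `pick_host` helper; neither is part of Platform LSF or iRODS. The scheduler simply prefers the host that already holds the most input bytes.

```python
from collections import defaultdict


def pick_host(inputs, replica_catalog, file_sizes, candidate_hosts):
    """Return the candidate host that already holds the most input bytes.

    All names here are hypothetical, for illustration only:
    replica_catalog maps a file name to the set of hosts holding a replica,
    file_sizes maps a file name to its size in bytes.
    """
    local_bytes = defaultdict(int)
    for name in inputs:
        for host in replica_catalog.get(name, ()):
            if host in candidate_hosts:
                local_bytes[host] += file_sizes.get(name, 0)
    # Sorting first makes ties (including "no replica anywhere")
    # resolve to the alphabetically first host, so results are repeatable.
    return max(sorted(candidate_hosts), key=lambda h: local_bytes[h])


catalog = {"a.bam": {"node1", "node2"}, "b.bam": {"node2"}}
sizes = {"a.bam": 100, "b.bam": 50}
hosts = ["node1", "node2", "node3"]
best = pick_host(["a.bam", "b.bam"], catalog, sizes, hosts)
```

A real scheduler would also weigh load, queue policy and network cost; the sketch only shows why the scheduler needs to ask "where are the replicas?" before it places a job.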
iRODS is a flexible solution for deciding what to do with large volumes of data and how to organize, locate and replicate them. The system initially took off in communities such as high-energy physics, astronomy, and digital libraries. “As I saw Platform Computing’s customers struggling with data, located across gazillions of file heads, or a massive Isilon or Panasas cluster, I thought of my digital camera problem. That layer of manageability was missing. iRODS is a key piece that helps manageability,” says Smith.
Smith sees two key aspects to iRODS. First is the ability “to annotate data with metadata over and above the usual Unix time-based stuff. This is the part that will be useful to end users.”
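To make the metadata idea concrete, here is a minimal sketch of annotating files with descriptive attributes and then finding them by those attributes rather than by name or timestamp. The `MetadataCatalog` class and the attribute names are hypothetical and held in memory; this is not the iRODS API, only an illustration of the approach.

```python
from collections import defaultdict


class MetadataCatalog:
    """Hypothetical in-memory catalog mapping file paths to attributes."""

    def __init__(self):
        self._attrs = defaultdict(dict)

    def annotate(self, path, **attrs):
        """Attach or update descriptive attributes on a file path."""
        self._attrs[path].update(attrs)

    def find(self, **criteria):
        """Return sorted paths whose attributes match every criterion."""
        return sorted(
            path
            for path, attrs in self._attrs.items()
            if all(attrs.get(key) == value for key, value in criteria.items())
        )


catalog = MetadataCatalog()
catalog.annotate("/data/run42/lane1.fastq", project="tumor-panel", instrument="HiSeq 2000")
catalog.annotate("/data/run42/lane2.fastq", project="tumor-panel", instrument="HiSeq 2000")
catalog.annotate("/data/img/plate7.tif", project="imaging-pilot", instrument="microscope")

tumor_files = catalog.find(project="tumor-panel")
```

The point of the design is that the query names what the data is, not where it happens to sit on disk.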
Second, the rules engine capability makes data management very powerful. “You can execute a number of rules on how you manage data over time. For example, expiry – I have data living on very expensive tier 1 storage. At a certain time, I migrate to tier 2. You can do very complex things -- automatic replication to disaster recovery sites, checksumming, combat bit rot a little.”
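The kinds of rules Smith mentions can be sketched in a few lines. The example below is a plain-Python illustration, not iRODS rule language: the `DataObject` record, the tier names, the 90-day threshold and the choice of SHA-256 are all assumptions made for the sketch.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataObject:
    """Hypothetical record for one managed file."""
    path: str
    tier: str          # "tier1" (expensive) or "tier2" (cheaper); assumed names
    created: datetime
    checksum: str      # SHA-256 hex digest recorded at ingest


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def apply_expiry_rule(obj, now, max_age=timedelta(days=90)):
    """Move tier-1 data older than max_age to tier 2; return True if moved."""
    if obj.tier == "tier1" and now - obj.created > max_age:
        obj.tier = "tier2"  # a real system would also copy the bytes
        return True
    return False


def verify_checksum(obj, data: bytes) -> bool:
    """Detect bit rot by comparing current content with the stored digest."""
    return sha256_of(data) == obj.checksum


payload = b"ACGTACGT"
obj = DataObject("/data/run42/lane1.fastq", "tier1",
                 datetime(2010, 6, 1), sha256_of(payload))
moved = apply_expiry_rule(obj, now=datetime(2010, 12, 23))
intact = verify_checksum(obj, payload)
corrupted = verify_checksum(obj, b"ACGTACGA")
```

In a rule-driven system such checks would run on a schedule or in response to events such as ingest, rather than being called by hand.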
For the Broad Institute and other genome centers, Smith says large datasets that are, say, three months old can be moved. “Is this dataset part of a collection that’s meaningful? Let’s put a project time stamp on it.”
iRODS is open-source and “as free as you can find someone to install it,” says Smith. “You don’t have to worry about licensing costs. There’s a very active mailing list for support. I think they have a vision of providing a little more commercial support.” Smith says he found iRODS “incredibly easy” to install – barely 30 minutes to compile and install.
Getting the most out of iRODS requires users to understand their goals, their archiving rules, the style of deployment, and how to collect their data centers into a zone.
“I think anybody could benefit,” says Smith. “It provides something the file system is missing. Managing the data deluge on a large workstation otherwise requires Windows Index Search or Mac OS X Spotlight or Search. It’s not well directed, you get a lot of false positives. It requires you to sift through the data. The metadata approach is much more structured.”
Smith wonders if the software bundled with NGS machines, such as Illumina’s HiSeq 2000, couldn’t be made iRODS compatible. “If these mechanisms of annotating can be automatic, that reduces the burden on end users and it provides value. You avoid structured naming and the accidents around that. We need to get to a point where file structure is meaningful to end users.”
Overall, Smith’s take is that “people need to start thinking about metadata. It’s almost cliché in certain circles, but I think everyone will benefit from it. The file system is not a sustainable object store at scale. You can try other things, but people have to start thinking from the point of view of the scientific pipelines.”
“iRODS is very good – the systems using it now are such large scale, with multiple sites, that it’s the only kind of system that really works.”
Editor's Note: There will be much more on iRODS in the January-February 2011 issue of Bio-IT World.