Research and the Public Cloud: Potential and Pitfalls

June 3, 2013

By Melissa Chua 
 
June 3, 2013 | SINGAPORE—Jumping on the public cloud bandwagon could go a long way toward alleviating the data storage, retrieval, and archival woes currently experienced by the research industry, said Chris Dagdigian, founding partner and director of technology at The BioTeam. Dagdigian gave his annual “Trends from the Trenches” keynote address at Bio-IT World Asia last week in Singapore.
 
Challenges, however, exist around governance and regulation, including the need for guidelines on how data is shared between scientists. The tremendous disconnect between the rate of innovation in the lab and the rate of innovation in IT is the chief driver behind the move to infrastructure-as-a-service (IaaS) clouds in the research industry, he said.
 
It has never been easier for scientists to generate, acquire, or download data, and they are doing so faster than the IT industry can make disk drives bigger, said Dagdigian. With lab protocols changing from month to month, the days when IT could solve storage problems simply by throwing in more disk drives are long gone; IT can no longer predict what will happen in the next six to 12 months.
 
IaaS clouds such as Amazon Web Services (AWS) should be viewed as pressure-release solutions to the burgeoning data challenge, due to their ability to respond instantly to changing demand. In-house IT departments will never be able to store data as efficiently as Google or Microsoft, and most cloud providers can sell access to storage at lower prices than in-house implementations, he said.
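That elasticity is, in practice, just an API call away. As a rough illustration (not something from the talk), the sketch below uses the boto library for Python to request an EC2 compute node on demand and hand it back once a job finishes; the region, AMI ID, key pair, and instance type are placeholders.

```python
# Illustrative sketch only: assumes the boto library is installed and AWS
# credentials are available in the environment; the AMI ID, key pair, and
# instance type below are placeholders.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Ask for capacity only when the workload actually arrives...
reservation = conn.run_instances(
    "ami-00000000",             # placeholder machine image
    instance_type="m1.large",   # placeholder instance size
    key_name="my-keypair",      # placeholder SSH key pair
    min_count=1,
    max_count=1,
)
instance = reservation.instances[0]
print("Launched instance %s" % instance.id)

# ...run the analysis on the instance, then release it so nothing sits idle.
conn.terminate_instances(instance_ids=[instance.id])
```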
 
A proper cloud strategy is the way forward for labs, Dagdigian emphasized. The risks of going without one are high: if companies don’t give researchers the capabilities they require, they will face staff retention problems, lost opportunities, and setbacks in product development, he said.
 
The cloud is especially beneficial for small labs that are funded according to the time it takes to generate results, said Dagdigian, adding that the cloud’s pay-as-you-use model allows labs to avoid the upfront expenditure associated with buying storage in-house. Other advantages include cloud models that allow the cost of data downloads to be shifted to the parties requesting the data, rather than having such costs borne internally.
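One concrete mechanism for that cost shifting is Amazon S3’s “Requester Pays” setting, under which the account downloading the data, rather than the bucket owner, is billed for the transfer. Below is a minimal sketch of enabling it with the boto library; the bucket name is a placeholder.

```python
# Illustrative sketch: assumes the boto library and AWS credentials;
# "shared-sequencing-data" is a placeholder bucket name.
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket("shared-sequencing-data")

# Switch the bucket to Requester Pays so that collaborators who download
# the data are billed for the transfer instead of the lab hosting it.
bucket.set_request_payment(payer="Requester")
print("Requester Pays enabled on %s" % bucket.name)
```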
 
A pure cost comparison between in-house storage and cloud storage is problematic, Dagdigian said, adding that it would be easy to manipulate the figures to show a sound cost case for the cloud, and vice versa. Be suspicious of people on both sides of the fence and be aware that each has an agenda, he advised.
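To see how easily the arithmetic can be tilted, consider the deliberately crude sketch below. Every figure in it is hypothetical and invented for illustration; the point is that the verdict flips depending on which cost lines are counted, not what the numbers themselves are.

```python
# Back-of-the-envelope comparison with made-up numbers, purely to show how
# the conclusion changes with the list of included costs. None of these
# figures come from the talk or from any real price list.
TB_STORED = 200      # hypothetical archive size
MONTHS = 36          # hypothetical planning horizon

cloud_per_tb_month = 40.0            # hypothetical object-storage price
cloud_total = TB_STORED * MONTHS * cloud_per_tb_month

array_purchase = 250000.0            # hypothetical in-house storage array
hardware_only = array_purchase       # count hardware alone: in-house "wins"

power_cooling = 1500.0 * MONTHS      # hypothetical facilities cost
admin_share = 2500.0 * MONTHS        # hypothetical slice of a storage admin
offsite_copy = array_purchase        # second site to match cloud redundancy
fully_loaded = hardware_only + power_cooling + admin_share + offsite_copy

print("Cloud, 36 months:         $%10.0f" % cloud_total)     # 288000
print("In-house, hardware only:  $%10.0f" % hardware_only)   # 250000
print("In-house, fully loaded:   $%10.0f" % fully_loaded)    # 644000
```

The comparison, in other words, is only as honest as the list of costs behind it.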
 
Hype and marketing aside, IaaS clouds are the real deal, said Dagdigian, adding that a more accurate cost comparison for cloud storage must account for the fact that data in the cloud is geographically redundant. The cost of cloud storage in AWS is also expected to fall in the coming years, so the range of cases where cloud storage makes sense is only going to widen.
 
It is important to recognize the limits of sales staff when planning a cloud strategy for a research environment, Dagdigian advised. Salespeople do not understand the breadth and diversity of life science applications, and many think code needs to be rewritten for Hadoop when, in reality, only two or three applications are worth rewriting, re-architecting, and re-engineering, said Dagdigian.
 
The Elephant in the Room 
 
There is no one-size-fits-all research design pattern, said Dagdigian, adding that high-performance computing is all about the effective deployment of building blocks. Each organization needs three tested cloud design patterns: one for handling legacy scientific applications and workflows, one for the special applications worth re-architecting, and one for Hadoop and big data analytics.
 
Dagdigian recommended the open-source MIT StarCluster as a baseline for the first pattern, describing the cluster-computing toolkit as an ideal building block for dynamic cluster farms. It would be better to start with StarCluster than to re-engineer anything from scratch, he said.
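For readers who have not used it, StarCluster is driven by a single INI-style configuration file plus a handful of commands. The excerpt below is a hedged sketch of what a minimal cluster template looks like; the credentials, key name, and AMI ID are placeholders, not values from the talk.

```ini
# ~/.starcluster/config -- minimal illustrative template; the credentials,
# key name, and AMI ID are placeholders.
[global]
DEFAULT_TEMPLATE = smallcluster

[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>
AWS_USER_ID = <your-account-id>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
CLUSTER_USER = sgeadmin
NODE_IMAGE_ID = ami-00000000
NODE_INSTANCE_TYPE = m1.large
```

With a template like this in place, “starcluster start mycluster” brings up a ready-to-use compute cluster on EC2, “starcluster sshmaster mycluster” logs into its head node, and “starcluster terminate mycluster” tears the whole thing down again.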
 
The second pattern is the one that gives IT the most freedom, said Dagdigian. There are many published best practices for re-architecting special applications for the cloud, but it is crucial to take the possibility of cloud vendor lock-in into account before embarking on the re-architecting process.
 
The hype around big data and Hadoop is real, said Dagdigian, referring to the third cloud design pattern. This pattern is architecturally different from the other two, but data query methods and the analysis of structured and unstructured data are the way of the future.
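As a concrete illustration of why wholesale rewrites are rarely needed (a sketch under assumed defaults, not an example from the talk), Hadoop’s streaming interface lets ordinary scripts serve as the mapper and reducer, so a simple tally over unstructured text can be expressed as two short Python programs and handed to the framework unchanged.

```python
# mapper.py -- illustrative Hadoop Streaming mapper: reads raw text from
# stdin and emits one "<token>\t1" record per token.
import sys

for line in sys.stdin:
    for token in line.strip().split():
        print("%s\t1" % token)
```

```python
# reducer.py -- illustrative Hadoop Streaming reducer: input arrives sorted
# by key, so counts can be summed with a simple running total per key.
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, count = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        current_count += int(count)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))
        current_key, current_count = key, int(count)
if current_key is not None:
    print("%s\t%d" % (current_key, current_count))
```

The pair is then submitted with the hadoop-streaming jar shipped with the distribution (the exact jar path varies), passing mapper.py and reducer.py via the -mapper, -reducer, and -file options.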
 
Challenges, however, may arise in planning a big data or Hadoop strategy, particularly when senior management is involved. Find out exactly what senior management wants, advised Dagdigian, adding that both Hadoop and broader big data efforts can be attempted on the same hardware or software stacks. Investing in software engineering specific to big data may or may not be necessary.
 
Getting onto the cloud is technologically easy; the hardest part is overcoming procedural and legal challenges, said Dagdigian, citing real-life cases where firms spent two years of precious time discussing the legal aspects of a cloud strategy with lawyers.
 
Research firms need a cloud strategy today, or scientists might simply bypass the IT department, warned Dagdigian. With Amazon Web Services just a credit card away, scientists may take that credit card and solve their own problems, caring more about results than about issues such as privacy and security.