Clouds, Supercomputing Shine at Bio-IT World Cloud Summit
By Kevin Davies
September 13, 2012 | SAN FRANCISCO—A variety of cloud computing and supercomputing resources and applications shone brightly at the Bio-IT World Cloud Summit* in San Francisco this week.
Miron Livny (University of Wisconsin) opened the three-day conference by discussing the Open Science Grid (OSG), which played a critical role in providing the compute power for the provisionally successful search for the Higgs boson earlier this year.
The OSG logged some 712 million CPU hours last year, almost 2 million CPU hours/day, on 1 petabyte (PB) of data. Other applications include genome-wide analysis of structural variants and modeling the 3D conformation of DNA.
Future challenges, Livny said, included what he called the “portability challenge” and the “provisioning challenge.” The former was ensuring that a job running on a desktop can also run on as many “foreign” resources as possible. The latter was being addressed by using targeted spot instances in the Amazon cloud, with prices dropping below 2 cents/hour. “Use it when the price is right, get out as fast as possible when the price is wrong,” Livny advised.
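To make that advice concrete, here is a minimal sketch of a price-gated policy: poll the current EC2 spot price and keep work running only while it stays under a ceiling. The boto3 library, the instance type, the region, and the 2-cent ceiling are illustrative assumptions; this is not the OSG's actual provisioning mechanism.

```python
# Minimal sketch of a "use it when the price is right" spot-instance policy.
# Assumptions: boto3 is configured with AWS credentials; the instance type,
# region, and price ceiling below are arbitrary examples, not OSG settings.
from datetime import datetime, timedelta, timezone

import boto3

PRICE_CEILING = 0.02          # dollars/hour, i.e. "below 2 cents/hour"
INSTANCE_TYPE = "m5.large"    # hypothetical instance type

ec2 = boto3.client("ec2", region_name="us-east-1")

def current_spot_price():
    """Return the lowest recent Linux spot price for the chosen instance type."""
    history = ec2.describe_spot_price_history(
        InstanceTypes=[INSTANCE_TYPE],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    )["SpotPriceHistory"]
    return min(float(h["SpotPrice"]) for h in history)

price = current_spot_price()
if price < PRICE_CEILING:
    print(f"${price:.4f}/hr is below the ceiling: submit or keep jobs running")
else:
    print(f"${price:.4f}/hr is too high: drain jobs and get out")
```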
Jason Stowe (CEO, Cycle Computing) reviewed Cycle’s successes in spinning up 50,000-core high-performance clusters on Amazon, such as a project with Schrödinger and Nimbus Discovery to screen a cancer drug target. A virtual screen of 21 million compounds ran in three hours at a cost of less than $5,000/hr. Today, using AWS spot instances, Stowe said, the same analysis could be run for as little as $750/hr.
The winner of Cycle Computing’s recent $10,000 Big Science Challenge competition also presented at the conference. Victor Ruotti (Morgridge Institute) is about halfway through his ambitious experiment using the cloud to conduct an extensive pairwise comparison of RNA-seq signatures from 124 embryonic stem cell samples. By performing a total of some 15,000 alignments, Ruotti intends to create a sequence-based index to facilitate the precise identification of unknown ES cell samples.
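As a rough check on that figure: aligning the reads from each of the 124 samples against an index built from every other sample gives 124 × 123 pairings. The pairing scheme below is our back-of-the-envelope assumption, not a description of Ruotti's exact workflow.

```python
# Back-of-the-envelope count of pairwise alignments among 124 RNA-seq samples.
# Assumption: every sample is aligned against every other sample (ordered pairs).
samples = 124
ordered_pairs = samples * (samples - 1)   # 15,252 -- "some 15,000 alignments"
unordered_pairs = ordered_pairs // 2      # 7,626 if direction does not matter
print(ordered_pairs, unordered_pairs)
```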
So far, Ruotti has analyzed 64 samples and logged about 580,000 CPU hours (roughly 66 CPU-years). The experiment has run for 3-4 days using 6,000-8,000 cores, generating 20 TB of data. Stowe said Cycle is planning a new competition in the near future.
Vijay Pande (Stanford University), best known for creating the folding@home initiative, discussed progress in molecular dynamics simulations. Microsecond timescales are where the field is, but millisecond scales are “where we need to be, and seconds are where we’d love to be,” he said. Using a Markov State Model, Pande’s team is studying amyloid beta aggregation with the idea of helping identify new drugs to treat Alzheimer’s disease. Several candidates have already been identified that inhibit aggregation, he said.
Borrowing some of the same ideas as folding@home, Wu Feng (Virginia Tech) described Project MOON (MapReduce On Opportunistic eNvironments), a distributed computing initiative that harnesses the power of some 550 dual-core Intel Mac computers in the Virginia Tech Math Emporium.
“Institutional clusters are expensive,” said Feng, noting that Japan’s K computer cost over $1 billion and consumes $10 million annually in cooling costs, while the National Security Agency is building a $2-billion data center in Utah. The collective compute power in the Math Emporium is equivalent to that of a modest supercomputer, said Feng, although average processor unavailability hovers around 40%.
Super Compute
Several speakers highlighted major supercomputer installations, such as Japan’s K Computer, described by Makoto Taiji (RIKEN). The project began in 2006, and the machine, located in Kobe, Japan, has an estimated cost of $1.25 billion. For that, one gets 80,000 nodes (640,000 cores), memory capacity exceeding 1 PB (16 GB/node), and 10.51 PetaFlops (3.8 PFlops sustained performance). Its 3D-torus network delivers 6 GB/s of bandwidth, bidirectional, in each of six directions.
Power consumption is about 20 MW, for a power efficiency roughly half that of Blue Gene. Taiji said the special features of the K Computer include high bandwidth and low latency. Anyone can use the K computer—academics and industry—for free if results are published. Life sciences applications make up about 25% of K computer usage, including protein dynamics in cellular environments, drug design, large-scale bioinformatics analysis, and integrated simulations for predictive medicine.
A drug screening pipeline uses the K Computer over several days to help whittle a library of 1 billion compounds down to a single drug candidate. However, for molecular dynamics applications, Taiji acknowledged that Anton (the supercomputer designed by D. E. Shaw Research) is about 100x faster. As a result, Taiji and colleagues are building MDGRAPE-4, a new supercomputer whose performance will rival Anton’s, with a target of modeling 20 microseconds for a 100,000-atom system. It is due online in 2014.
Robert Sinkovits (San Diego Supercomputer Center) described Gordon, a supercomputer that makes extensive use of flash memory and is available to all academic users on a competitive basis. “It’s very good for lots of I/O,” said Sinkovits.
A great Gordon application, said Sinkovits, will, among other things, make good use of the flash storage for scratch/staging; require the large logical shared memory (approximately 1 TB of DRAM); be a threaded application that scales to a large number of cores; and need a high-bandwidth, low-latency inter-processor network. The Gordon team will turn away applications that don’t fully meet these requirements, he said, but he singled out computational chemistry as one particularly good match.
Gordon features 64 dual-socket I/O nodes (using Intel Westmere processors) and a total of 300 TB of flash memory. Other features include a dual-rail 3D-torus InfiniBand (40 Gbit/s) network and a 4-PB Lustre-based parallel file system capable of delivering up to 100 GB/s into the computer.
Another domestic supercomputer was introduced by Weijia Xu (Texas Advanced Computing Center, TACC). The Stampede supercomputer should be online early next year, featuring 100,000 conventional Intel processor cores out of a total of 500,000 cores, along with 14 PB of disk, more than 272 TB of RAM, and a 56-Gbit/s FDR InfiniBand interconnect.
From China, Nan Li (National Supercomputer Center, Tianjin) described Tianhe-1A (TH-1A), the top-ranked supercomputer in China, with a peak performance of 4.7 PFlops. (The computer was ranked the fastest in the world two years ago.) Applications range from geology and video rendering to engineering, and include a number of biomedical research functions. Users include BGI and a major medical institute in Shanghai. Li indicated this resource could also be made available to the pharmaceutical industry.
Gary Stiehr (The Genome Institute at Washington University) described the construction of The Genome Institute’s new data center, required because of the unrelenting growth of next-generation sequencing data. “The scale of HPC wasn’t the challenge—the time scale caused by rapid, unrelenting growth was,” said Stiehr.
The new data center required more power and cooling capacity, and had to support data transfers reaching 1 PB/week. The issue, said Stiehr, was whether to move the data to the compute nodes or to use the nodes’ internal storage and analyze the data where it was stored.
Mirko Buholzer (Complete Genomics) presented a new “centralized cloud solution” that Complete Genomics is developing to expedite the digital delivery of genome sequence data to customers, replacing the current system of shipping hard drives, fulfilled by Amazon via FedEx or UPS. One hundred genomes sequenced to 40x coverage consume about 35 TB of data, or a minimum of 12 hard drives, said Buholzer.
The ability to download those data was appealing in principle, but to where exactly? Who would have access? Complete plans to give customers direct access to their data in the cloud, providing information such as sample ID, quality control metrics, and a timeline or activity log. For a typical genome, the reads and mappings make up about 90% of the total data, or 315 GB. (Evidence and variants make up 31.5 GB and 3.5 GB, respectively.)
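Those figures are straightforward to sanity-check. The short calculation below reproduces the breakdown Buholzer described; the 3-TB drive capacity used to arrive at roughly a dozen drives is our assumption.

```python
# Sanity check of the per-genome data breakdown described by Buholzer (GB).
reads_and_mappings = 315.0   # reads and mappings, ~90% of the per-genome total
evidence = 31.5
variants = 3.5
per_genome_gb = reads_and_mappings + evidence + variants   # 350 GB
print(reads_and_mappings / per_genome_gb)                  # ~0.90
total_tb = 100 * per_genome_gb / 1000                      # 100 genomes -> ~35 TB
print(total_tb, round(total_tb / 3))                       # ~12 drives at 3 TB each
```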
Customers will be able to download the data or push it to an Amazon S3 bucket. The system is currently undergoing select testing, but Buholzer could not say whether anyone had agreed to forgo their hard drives just yet.
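For customers who do take the digital route, retrieving a delivery from an S3 bucket could look something like the generic boto3 sketch below; the bucket and object key names are placeholders, not Complete Genomics’ actual delivery interface.

```python
# Generic sketch of downloading delivered genome data from an Amazon S3 bucket.
# The bucket name and object key are placeholders, not real endpoints.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-genome-deliveries"            # hypothetical bucket
KEY = "order-1234/sample-A/reads_mappings.tar"  # hypothetical object key

# download_file handles large objects with multipart transfers under the hood,
# which matters when a single genome's reads and mappings run to ~315 GB.
s3.download_file(BUCKET, KEY, "reads_mappings.tar")
```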
*Bio-IT World Cloud Summit, Hotel Kabuki, San Francisco; September 11-13, 2012.