Meet Tanuki, a 10,000-core Supercomputer in the Cloud

April 25, 2011

By Kevin Davies

April 25, 2011 | Bio-IT World | For the past couple of years, Connecticut-based Cycle Computing has been helping academic and industrial organizations run high-performance computing projects, from internal management software to research algorithms and simulations, in the Amazon EC2 cloud, and enjoying “hockey stick” growth in the process. “Researchers can sign in on EC2 really quickly. We’ve made a big push to make that process as simple as possible,” says founder and CEO Jason Stowe.

That effort recently scaled new heights with the debut of Tanuki, a 10,000-core supercomputer – in the cloud. (Tanuki is named after the Japanese god of virility and gluttony.)  

Last year, Cycle began spinning up several clusters in the 1,000–2,000-core range. Stowe blogged about their success, including one large GPU-based cluster, and not surprisingly started receiving a few requests. But this past January, he decided to push the envelope a bit further. “What if we could add a zero to that?” he wondered. “Rather than spin up 1,000 cores, let’s try a 10,000-core cluster. Would anyone be interested in that scale?”

In other words, just how big a file system could Cycle handle inside a cloud environment such as Amazon EC2?  

Stowe had been in discussions with Genentech, which was running a series of protein-binding analyses that had prompted its informatics team to start investigating cloud computing applications last fall. “As part of a proof-of-concept with Cycle, we ran a two-hour, 4,096-core test,” says associate scientist Jacob Corn. “Since that went smoothly, we decided to get some work done on a real problem, and push 10,000 cores for eight hours on a scientific problem.”

Corn and colleagues study the prediction and design of protein interactions, and use computational methods to evaluate hundreds of thousands of possibilities. “Then I cherry pick the most promising looking outcomes to bring into the real world and test in the lab,” he says.   

While Stowe briefed Genentech on what would be required to run 80,000 core-hours, he also checked that his engineering team could handle the relevant infrastructure. The team stress-tested the file system and the configuration management software (Chef). He elected to use the Condor scheduling software, which he says scales fairly straightforwardly.
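
The article does not show Cycle’s actual job setup, but for readers unfamiliar with Condor (now HTCondor), a batch of independent jobs is typically described in a submit file and handed to the scheduler in one shot. The Python sketch below generates and submits such a file; the wrapper script, file paths, and resource requests are placeholders, not Genentech’s or Cycle’s configuration.

```python
# Illustrative only: generate an HTCondor submit description for a batch of
# independent jobs and hand it to the scheduler. The wrapper script, paths,
# and resource requests are placeholders, not Cycle's actual setup.
import os
import subprocess
import textwrap

NUM_JOBS = 10_000  # one queued job per candidate to evaluate

os.makedirs("logs", exist_ok=True)

submit_description = textwrap.dedent(f"""\
    universe     = vanilla
    executable   = run_candidate.sh        # hypothetical per-job wrapper
    arguments    = $(Process)              # each job receives its own index
    output       = logs/candidate_$(Process).out
    error        = logs/candidate_$(Process).err
    log          = logs/batch.log
    request_cpus = 1
    queue {NUM_JOBS}
""")

with open("batch.sub", "w") as f:
    f.write(submit_description)

# condor_submit is the standard Condor/HTCondor submission command.
subprocess.run(["condor_submit", "batch.sub"], check=True)
```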

Another key issue was load testing. “When you request that many nodes from a cloud provider, you find many cases where the server won’t recognize the disk,” Stowe explains. “We built software to do all the error handling – rather than that being a problem, where a single node on the grid sucks in thousands of jobs and tells you they failed because it can’t read its own hard disk, we handle that automatically. It’s completely transparent to the user, who doesn’t even realize there [might have been] an error node involved.”  
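
Cycle’s error-handling code is not public, but the idea Stowe describes (catching a “bad” node before it can drain the queue) can be illustrated with a simple health probe run at boot, before the node joins the scheduler pool. The following Python sketch is an assumption-laden illustration, not Cycle’s implementation; the scratch path and shutdown policy are invented for the example.

```python
# Hypothetical sketch of the kind of check Stowe describes: before a worker
# advertises itself to the scheduler, verify its local disk is readable and
# writable; otherwise keep it out of rotation so it cannot pull in jobs and
# fail them. Illustrative only -- not Cycle's implementation.
import os
import subprocess
import tempfile

SCRATCH_DIR = "/mnt/scratch"   # assumed local scratch mount on the worker

def disk_is_healthy(path: str) -> bool:
    """Write, read back, and delete a small probe file on the given disk."""
    try:
        with tempfile.NamedTemporaryFile(dir=path, delete=False) as probe:
            probe.write(b"probe")
            probe_path = probe.name
        with open(probe_path, "rb") as f:
            ok = f.read() == b"probe"
        os.remove(probe_path)
        return ok
    except OSError:
        return False

if disk_is_healthy(SCRATCH_DIR):
    # Only a healthy node joins the pool; condor_master starts the daemons.
    subprocess.run(["condor_master"], check=True)
else:
    # Flag the node for replacement instead of letting it accept work.
    subprocess.run(["shutdown", "-h", "now"], check=False)
```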

In short, the challenges lay less on the server side than in getting everything to scale. Was encryption working on all nodes? Could Cycle handle dead nodes? What about the security of Genentech’s data?

Stowe placed one phone call to Amazon to ask if there were any recommended times for the experiment. “Amazon’s a great partner, but they had next to no involvement,” says Stowe. “I mostly just wanted to tell them [we were doing this]!”  

March Madness  

On March 1, Stowe looked on as Genentech pushed the button, thereby submitting 10,000 jobs to their queue. That simple action effectively harnessed more cores than the 115th-ranked computer on the Top 500 supercomputing list at the time. Stowe wanted to be there in person because “it was such a crazy, momentous idea. I was expecting to have to do stuff. But it was actually a pretty boring exercise. Nothing crazy going on -- it was pretty awesome.”   

The job ran for eight hours, utilizing 10,000 cores across 1,250 servers with approximately 8.75 terabytes (TB) of RAM aggregated across all machines. “That’s a lot of infrastructure, but the entire cluster was up in 30-45 minutes,” says Stowe. And yet the end-user effort took just a few seconds. The cost of the exercise was $1,060/hour, including all infrastructure charges as well as Cycle’s own fees. (Stowe says there is no upfront cost unless the client hires Cycle for professional services.)
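
As a quick sanity check on those figures (the arithmetic below is derived from the reported totals, not quoted in the article), the numbers work out to eight cores and roughly 7 GB of RAM per server, and about $8,500 for the full eight-hour run:

```python
# Back-of-the-envelope arithmetic on the reported totals; the derived
# per-server and total-cost figures are not quoted in the article.
cores, servers, ram_tb = 10_000, 1_250, 8.75
hours, rate_per_hour = 8, 1_060

print(cores / servers)               # 8.0 cores per server
print(ram_tb * 1000 / servers)       # 7.0 GB of RAM per server
print(hours * rate_per_hour)         # 8480, i.e. roughly $8,500 for the run
```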

Having run the same code on Genentech’s internal clusters, Corn says the quality of the results from Tanuki was essentially identical. But the time savings were considerable. “I had been running these simulations internally for a few weeks, and estimated that I probably had 1-2 more weeks remaining,” says Corn. “With the 10,000-core [cloud] cluster, that time was slashed down to eight hours.” Stowe says Genentech was moving from in silico analysis to in vitro experiments within a matter of days.  

[On the infrastructure side, Stowe says Cycle used the Linux CentOS 5 (community enterprise) operating system; Chef for configuration management and encryption; and Condor as the scheduler, which has handled environments of up to 40,000 cores for other investigators. CycleServer manages the HPC clusters, while CycleCloud.com handles the orchestration – server requests and aspects of provisioning the cluster. “From a data perspective, we had lots of compute for the data – about 1 TB shared file system, head nodes running all the time.”]
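
The article does not describe CycleCloud’s orchestration code, but the “server requests” piece amounts to asking EC2 for a large fleet of instances in one call. The sketch below uses boto3, a present-day AWS SDK rather than anything Cycle ran in 2011; the AMI ID is a placeholder and the instance type is simply an 8-core EC2 type of that era.

```python
# Illustrative only: a raw EC2 request for a fleet of worker servers, using
# the boto3 SDK (not Cycle's 2011 tooling). The AMI ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-00000000",    # placeholder for a CentOS 5 worker image
    InstanceType="c1.xlarge",  # an 8-core, ~7 GB instance type of that era
    MinCount=1250,             # request the whole fleet in one call
    MaxCount=1250,
)
print(len(response["Instances"]), "instances launched")
```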

Stowe admits that spinning up 2,000 cores is “pretty pedestrian now,” but that is twice the capacity he used seven years ago to render a movie for Disney (see “Cycle Computing’s Tour de Cloud,” Bio-IT World, November 2009). CycleCloud is a horizontal platform – the software can be used by more or less anybody.

“We’re going to be spinning up a second customer with a 10,000-core environment in the next few weeks,” says Stowe. “If there’s a use case, we’ll probably be doing it.” The only limitation Stowe foresees is the finite capacity of the cloud provider itself. “We’re interested in going up to 12,000, 20,000, 25,000 cores. As we have users for it, we’ll do it. We’ve shown the 10,000 [cores] is canonically easy to just push a button and do it.”

Stowe says Cycle is “cloud provider agnostic and scheduler agnostic. There are other infrastructure providers out there. But Amazon has its game together, including per hour billing.”   

“The real story is what this can do for science,” he says. “The only person who can tell that is the client.”  

Ed. Note: For more on Tanuki, see this Cycle Computing blog entry.