Data Conversations: Cycle, University of Arizona Researchers Model Strength Of The Cloud
By Allison Proffitt
February 17, 2017 | When May Khanna ran her first library of 50,000 compounds against a protein target on Friday, she got only four hits. Only four of 50,000 small molecule compounds docked to a protein target linked to ALS. It didn’t bode well for the public presentation scheduled for Monday.
“That’s a super low, super low hit. That was really scary!” Khanna recalled this week in an interview. “At that point we thought, do we back out of this? There were only four hits. We’ve never seen such a low hit on target!”
Khanna, an assistant professor of Pharmacology at the University of Arizona College of Medicine in Tucson, was planning to present at the opening of the Arizona RNA Salon, a group of researchers, undergraduates, and grad students interested in RNA science-based activities. Khanna had won funding for the Arizona RNA Salon from the RNA Society in November, and about 50 researchers each in Tucson and Phoenix were gathering for the inaugural webcast presentation.
Khanna hoped to put together a bold demonstration of what she believes is the future of drug discovery: to dock a million compounds to a target in silico.
The pieces came together over time. At an amyotrophic lateral sclerosis (ALS) meeting, she learned of an intriguing protein target. She contacted Schrödinger, whose Glide application she had used before for molecular docking simulations, and asked for enough licenses to run a docking demonstration on one million compounds. She got a library of one million small molecules from researchers at the University of California, San Francisco.
Next step: compute.
Khanna contacted the high-performance computing (HPC) cluster staff at the University of Arizona and shared her vision: about an hour of live docking so the students could grasp the power of in silico drug design. The response was not encouraging; HPC suggested screening 10,000 compounds.
“Ten thousand compounds? That doesn’t sound cool! That’s not the power I want to be able to give [the students]!” Khanna said.
But while doing a Google search on Schrödinger, Khanna found coverage of Schrödinger’s previous work with Cycle Computing. In 2012, Schrödinger and Cycle used Glide to screen 7 million compounds for a protein target. Khanna reached out and was impressed with Cycle’s enthusiastic response. The demonstration for RNA Salon was set.
Where In The Haystack?
For this demonstration, Cycle and Glide were run on Google Cloud using pre-emptible virtual machine (VM) instances. Google’s pre-emptible instances are cheap with a predictable price point (as opposed to AWS spot instances, which can vary in price), but they can turn off without warning. That’s where Cycle Computing comes into play. CycleCloud segments jobs and starts and stops instances as necessary. The software can dynamically size the cluster based on the number of jobs in queue and replace pre-empted instances.
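The sizing-and-replacement idea can be illustrated with a minimal sketch. This is not CycleCloud's API or its actual scheduling logic; the function names, the one-job-per-slot model, and the instance cap are all hypothetical, chosen only to show how a queue-driven cluster might decide what to start or stop on each scheduling pass.

```python
import math


def target_instances(queued_jobs: int, running_jobs: int,
                     slots_per_instance: int, max_instances: int) -> int:
    """Instances needed to hold every queued and running job, capped at a limit.

    Hypothetical model: each instance offers a fixed number of job slots.
    """
    needed = math.ceil((queued_jobs + running_jobs) / slots_per_instance)
    return min(needed, max_instances)


def plan_actions(active: set, preempted: set, target: int) -> dict:
    """Decide how many instances to start or stop on this scheduling pass.

    Pre-empted instances no longer count as capacity, so the gap they
    leave is filled by new starts, and their jobs go back in the queue.
    """
    alive = active - preempted
    delta = target - len(alive)
    return {
        "start": max(delta, 0),
        "stop": max(-delta, 0),
        "requeue_from": sorted(preempted),
    }


# Example: 300 queued jobs, 16 slots per instance, a cap of 400 instances.
target = target_instances(300, 0, 16, 400)
plan = plan_actions({"vm-a", "vm-b", "vm-c"}, {"vm-b"}, target)
```

In this toy model a pre-emption simply shrinks the live set, so the next pass requests a replacement without any special-case handling.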
Schrödinger’s in silico molecular docking happens in two phases: the LigPrep package converts 2D structures to the 3D format used in the next stage; then Glide performs the Virtual Screening Workflow. To get the simulation done as efficiently as possible, the workload was split into 300 smaller jobs.
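As a rough illustration of that kind of split, the sketch below divides a library of one million compounds into 300 near-equal index ranges. The article does not say how the jobs were actually partitioned; the even split by compound count and the function name are assumptions made for the example.

```python
def split_into_jobs(n_items: int, n_jobs: int) -> list:
    """Split n_items into n_jobs contiguous (start, end) ranges.

    Sizes differ by at most one; the end index is exclusive.
    """
    base, extra = divmod(n_items, n_jobs)
    ranges = []
    start = 0
    for i in range(n_jobs):
        # The first `extra` jobs each take one additional item.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# One million compounds across 300 jobs (assumed even split).
jobs = split_into_jobs(1_000_000, 300)
```

Under this assumption each job would cover about 3,333 compounds, small enough that losing one to a pre-empted instance costs little rework.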
The first stage consumed 1,500 core-hours of computation over about an hour and a half. The Glide stage used up to 5,000 cores of n1-highcpu-16 instances, taking about four hours and costing $192, Jason Stowe, Cycle Computing’s CEO, told Bio-IT World.
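A quick back-of-the-envelope calculation puts those figures in perspective. The sketch assumes each n1-highcpu-16 vCPU is counted as one core (the instance type has 16 vCPUs) and treats the quoted durations as exact, which the article gives only approximately.

```python
import math

# Figures quoted in the article.
ligprep_core_hours = 1_500
ligprep_wall_hours = 1.5
glide_peak_cores = 5_000

# Assumption: one vCPU is counted as one core; n1-highcpu-16 has 16 vCPUs.
vcpus_per_instance = 16

# Average number of cores busy during the first stage.
avg_ligprep_cores = ligprep_core_hours / ligprep_wall_hours

# Instances needed to supply the peak core count in the Glide stage.
peak_instances = math.ceil(glide_peak_cores / vcpus_per_instance)

print(f"Average cores in first stage: {avg_ligprep_cores:.0f}")
print(f"Peak Glide instances: {peak_instances}")
```

On these assumptions the first stage averaged roughly 1,000 cores, and the Glide stage peaked at a little over 300 instances.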
The demonstration was a success. It started the evening before the presentation, and wrapped up during the RNA Salon. But the biggest surprise was not that it worked, but how well. Glide revealed 600 compounds that look—in silico—to be candidates to target the ALS protein.
It’s not a linear progression from the four compounds revealed by the 50,000 run, but that’s a finding that doesn’t surprise Stowe.
“The problem with needle-in-a-haystack problems is when you’re not looking in the right part of the haystack, you can really miss things that are useful,” Stowe said. “This is a statistical problem that we see over and over again! It doesn’t matter if you’re doing life sciences drug design or financial services work or manufacturing design work. All of those different fields you really do get a much, much better answer—and potentially the right answer—by being able to run more simulations than you would otherwise.”
“Yes, exactly!” Khanna interjected. “We would have honestly given up if we had tested this particular target with the 50,000 library [and only found 4 potential molecules].” If Khanna hadn’t already committed to using CycleCloud to present docking at the seminar, “we would have probably walked away from it.”
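The gap between the two runs can be made concrete with a short calculation. The sketch compares the hit counts reported above and, purely as an illustration, asks how likely four or fewer hits would be if the 50,000-compound library had been a random sample of the larger one. That random-sample premise and the Poisson model are assumptions for the example, not something the researchers claimed.

```python
import math

small_hits, small_size = 4, 50_000
large_hits, large_size = 600, 1_000_000

# Hits expected in the large library if the small run scaled linearly.
linear_extrapolation = small_hits * large_size / small_size

# How far the observed result exceeded that extrapolation.
enrichment = large_hits / linear_extrapolation

# Expected hits in 50,000 compounds at the large library's hit rate.
expected_small = large_hits * small_size / large_size


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for a Poisson-distributed count with the given mean."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i)
               for i in range(k + 1))


# Chance of four or fewer hits under the random-sample assumption.
p_four_or_fewer = poisson_cdf(small_hits, expected_small)
```

Linear scaling would predict 80 hits in a million compounds, while the run found 600, or 7.5 times as many. Under the random-sample assumption, seeing so few hits in the first run would be extremely unlikely, which is consistent with Stowe's point that the smaller library was simply looking in a different part of the haystack.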
Conversing With Data
But Stowe doesn't simply advocate for more, more, more. He sees computing science at the cusp of a new way of thinking about and working with data—a major paradigm shift in in silico research.
“We think there’s an order of magnitude productivity increase for any profession that relies on computing by just being able to take advantage of dynamic capacity in cloud,” he said. “Where I think the real opportunity is, is where end users can get live answers back… I would be much happier if I could give [Khanna] the ability to run tests against 10 million families of compounds and then peel the onion on that as results come off the machine and pick a new top ten.”
No one can justify that sort of compute internally, Stowe said. “I want them to have more interactive answers. That’s where being able to go to an external cloud provider and saying, ‘Give me 50,000 cores,’ and then enabling them to have interactive answers back is going to be really exciting.”
That sort of low overhead scheduling with real-time answer delivery is what Cycle is working on now, Stowe said. He wants to remove the delay between asking a question and getting an answer, enabling researchers to “have conversations with the data.”
Khanna is a believer, and she’s convinced that students need to be learning this new model of research now.
“We’re going to be doing this again with Cycle, I hope, and what we’d like to do in future courses is start from the beginning of the class and [let students] come up with a target. Then they go with Cycle and set this up together with them.” Khanna hopes to mimic the whole drug discovery process—from target to early leads—within a course, giving students access to all the compute they’d need to explore the process.
“I’m really very excited about having the students push this forward.”