Clouds And Collaboration: How St. Jude Built The Team That Built The St. Jude Cloud
By Allison Proffitt
August 16, 2018 | It’s been years in the making, but in April of this year St. Jude Children’s Research Hospital launched the St. Jude Cloud, an online data-sharing and collaboration platform that provides researchers access to the world's largest public repository of pediatric cancer genomics data. The platform allows scientists to explore more than 5,000 whole-genome, 5,000 whole-exome, and 1,200 RNA-Seq datasets paired with clinical data from more than 5,000 pediatric cancer patients and survivors. By 2019, St. Jude expects to make 10,000 whole-genome sequences available on the St. Jude Cloud.
St. Jude didn’t build the cloud environment alone; it is a collaboration between the Memphis, Tenn., pediatric cancer research hospital, Microsoft, and DNAnexus. What started with St. Jude’s isolated bioinformatics frustrations has expanded to a vision for global collaboration in pediatric oncology.
A Cloud Of Their Own
In 2015, the St. Jude bioinformatics team was committed to sharing their data and tools, but internally they faced plenty of obstacles to getting work done. “One FTE was recruited just to upload data to enable data sharing,” Jinghui Zhang, the chair of Computational Biology at St. Jude Children’s Research Hospital, explained. Other team members were spending time troubleshooting for researchers around the world. (Read more, CIO Perspective: Creating Community On Campus At St. Jude)
And while all the St. Jude data was shared on the European Genome-phenome Archive (EGA), members of Zhang’s team were still helping outside researchers download the data. If a download was corrupted, someone would have to go back to the St. Jude archive to retrieve the original files. “Some of these had already been put on tape!” Zhang said. “It was very awkward.”
Scott Newman, who now leads the clinical genome analysis team in Zhang’s lab, was working at Emory University at the time. “I found an interesting mutation in high-grade glioma. And I wanted to know, is this just a one-off? Or is it found in a few more [patients], then it’ll be generally quite interesting,” Newman recalls. He applied for access to a St. Jude dataset of high-grade glioma whole genomes, and was approved. “They said, ‘Sure, that’s a great use of the data. Here’s a link to an encrypted download.’ I kicked off the encrypted download and it took ages. It took weeks. Then it took months… It took me nine months to actually get the data. And the frustration was, I had one question about one gene. And I had to wait nine months to get the data to actually ask the question. I turned out to be right, of course.”
But a change was coming. When NCI launched its Genomics Cloud, Zhang said, her team began evaluating how to create a cloud of their own. As both data consumers and data providers, Zhang’s team had a vision for the functionalities they wanted in a cloud platform.
Data analysis and visualization tools should work on the cloud, Zhang believes. “This is really our vision. To make the cloud a space for laboratories that do not have huge local infrastructure, make the cloud a common platform for accessing the data and analyzing the data.”
She and her group reached out to Microsoft, and about two years ago Microsoft hosted the St. Jude team at its Redmond, Washington, campus, explained Geralyn Miller, lead of Microsoft Genomics Services.
“They shared with us some of their challenges: a fixed-size research organization that had some of these complex tools and pipelines that they had to manage and maintain. Conceptually, it was taking away a lot of focus from the broader, downstream problems they wanted to solve around pediatric oncology,” Miller said.
Meanwhile, Microsoft was working on moving their Microsoft Genomics Services—traditionally conducted in an academic research lab model—“out of the lab, per se, and into the hands of the world,” Miller explained. Microsoft was building a service to let customers do the initial phase of genomics analysis, “so they can then take the output of this, and then go off and do… the machine learning, the deep learning, and the other types of analytics that are going to drive the insights we need to move the industry forward.”
It was a “Wow” moment for the Microsoft team, Miller said. The St. Jude problem is “kind of the same problem we’re trying to address with what we’re doing with [Microsoft Genomics Services]. Let’s enable researchers to have this commoditized work done very easily in the cloud, so they can stick their focus to the net new.”
Together, Microsoft and St. Jude found what Miller calls “a great synergy,” starting with a research collaboration. “[Microsoft’s] genomics group was very interested in finding a collaborator to showcase their new genomics pipeline,” Zhang said, and St. Jude had unusual and challenging data to play with. The St. Jude team shared hard-to-call genomes as the Microsoft team iterated on its service and checked that the results were accurate and precise.
The next step, of course, was to start realizing the St. Jude vision for a workable cloud. “That’s when DNAnexus got involved,” Miller says.
The DNAnexus Decision
Although Microsoft and DNAnexus have an existing relationship, the St. Jude team did their own due diligence in choosing a platform partner. After narrowing the field to five finalists, Newman made side-by-side comparisons over the course of a year, ultimately choosing DNAnexus.
“There were three things that drove our decision,” Newman explained. “Number one was security. They appear to have the best-thought-out security strategy. This is important data; we’ve got to keep it secure. Number two was ‘develop-ability’. Jinghui, being a tools person, wants to port her tools to the cloud. And of all the platforms, this is by far the easiest to do it. I’m a biologist by training, not a programmer, and I’m able to port tools in record time now using their containerization system. And the third thing was their enthusiasm to work with us; co-developing with them has been relatively easy.”
For DNAnexus, the enthusiasm came easily. “One can’t have contact with them… or go on site [in Memphis], and not basically say, ‘What can we do to help you?’” said DNAnexus CEO Richard Daly. “It’s obviously such a good thing to do.”
And the dataset that St. Jude has compiled was extremely compelling for the company that powers the PrecisionFDA platform, MOSAIC for microbiome research, and the Rady Children’s network.
“The concentrated set of samples St. Jude had—there’s about 10,000 to date and consistently growing—and the fact that they have all the records, cleanly and consistently integrated, will lead to faster insight out of that data,” explained Brad Sitko, VP of Finance, Operations, and Corporate Development at DNAnexus. “One of the challenges with messy data across healthcare, is currently how it’s collected, how it’s organized. St. Jude felt that in particular they’d be able to accelerate [research] by having a really well-curated and collected dataset.”
So far, St. Jude has deposited more than 5,000 whole-genome, 5,000 whole-exome, and 1,200 RNA-Seq datasets from more than 5,000 pediatric cancer patients and survivors into Microsoft Azure. “We are quickly approaching 1 petabyte [of data],” explained Clay McLeod, manager of the St. Jude Cloud development team. The goal is to host 10,000 whole genomes by the beginning of 2019.
Managing The Data
These data have been generated from three large St. Jude-supported genomics initiatives: the St. Jude-Washington University Pediatric Cancer Genome Project, designed to understand the genetic origins of childhood cancers; the Genomes for Kids clinical trial, focused on moving whole genome sequencing into the clinic; and the St. Jude Lifetime Cohort study (St. Jude LIFE), which conducts comprehensive clinical evaluations on thousands of pediatric cancer survivors throughout their lives.
St. Jude Cloud hosts BAM files; coding and non-coding somatic and germline SNVs and indels; and copy number and structural alterations. Before launch, Microsoft provided use of its genomics pipeline to remap all of the existing St. Jude data into a uniform dataset, and it has committed to two years of free cloud storage.
It’s a lot of data. “When we were scoping out the project, we thought, How much is it going to [cost] to store a petabyte in some cloud platform? The number was eye-watering!” said Newman. But the St. Jude Cloud project is structured so investigators don’t absorb any of those costs. “When we give access to that data to somebody, they’re not paying to store the data themselves, because it’s not really a copy, it’s more of a symbolic link. I can apply for access to the whole half petabyte and stick it in a secure cloud project and it’s not going to cost [me] anything to store that.”
Access to the data happens in levels. Non-identifiable data (e.g., somatic alterations, genotype frequency, cancer diagnosis, and demographics) can be viewed immediately by anyone using the interactive genome browser. Researchers can apply to a Data Access Committee for patient-level data access and explore it using St. Jude’s visualization tools without downloading or storing any data for themselves. Researchers may also upload their own data in a private, password-protected environment to explore using tools available on the St. Jude Cloud platform.
“For example, if someone is interested in seeing a recurrent mutation in a specific patient or certain genes, you could potentially just go to the website; if you’re granted data access, you could view the BAM files directly on the website without having to do the download,” Newman said.
Outside researchers cannot re-contact patients. The data are anonymized and shared purely for research purposes. But, Zhang said, if a researcher has a collaborator at St. Jude, the St. Jude researcher may be able to re-contact patients.
Building The Tools
From the beginning, the St. Jude team wanted to make sure the St. Jude Cloud wasn’t just a portal to a valuable pediatric oncology dataset. They wanted researchers to be able to do real work on the platform.
“Data is one thing. Having data is really important to drive value. But it’s not just about the data. It’s about the tools,” said Brady Davis, VP of strategy and marketing at DNAnexus. “Starting with having a platform that’s secure, compliant, and able to actually allow people to come in in a secure and compliant way. Having data there is what brings people in, which was critically important. And then having the tools available to make more sense of that data.”
The tools available on St. Jude Cloud help both experts and non-specialists work with genomics data. They include validated data analysis pipelines and interactive visualization tools that make it easier to draw discoveries from large datasets.
ProteinPaint, for instance, is a genomic visualization engine developed at St. Jude. ProteinPaint visualizations allow users to rapidly navigate through the genome and identify genetic changes linked to cancer development. St. Jude Cloud tools also produce custom visualizations of the user’s own research data for exploration or comparison with St. Jude-generated data.
It’s the right balance of data and tools that will bring more and more users to the platform, Davis said, calling it a “continual feeding mechanism.”
Future Visions
All three of the collaborators emphasized the community aspects of the St. Jude Cloud. Currently, St. Jude counts 500 registered users, representing 342 institutions, as “power users.” More than 9,700 unique users from 33 countries have visited the site.
“We didn’t want to just build this ecosystem for them called St. Jude Cloud and walk away and hope it is successful,” Davis said. “The goal is to continue to put [in] more data—more patient data, more data from not only St. Jude but from other research partners that want to invest time and energy into uploading data. Creating this to continue to be the world’s leading infrastructure for cancer and kids.”
Data and results can already be securely shared with collaborators within the platform, and Davis talked about additional efforts to build a “sticky” user environment with challenges and other interactions so that other researchers are encouraged to share data and engage with the St. Jude Cloud.
For Microsoft, Miller is excited about the opportunities the St. Jude Cloud affords her diverse team of computer scientists, physicians, mathematicians, physicists, bioengineers, and more to analyze the genotype and phenotype data together and apply machine learning, deep learning, and other analytics. But she also sees great value in the concept of the platform itself.
“By enabling a platform for researchers to come together and collaborate in a fashion that’s easy to do—I’m not sharing a thumb drive with you, I’m not sharing a hard drive with you, I can basically bring my data and my compute to this space in the cloud—we fundamentally believe that this will accelerate the pace of innovation,” she said. “That’s why we’re doing this.”