Intel's Cancer Cloud: Complementing NCI Collaboration
By Joe Stanganelli
July 11, 2016 | Intel may have missed out on a major government grant, but that hasn't stopped it from developing a collaborative cloud for genomicists and bioinformaticians that it wanted to build all along.
In late 2005, the National Cancer Institute (NCI), in partnership with the National Human Genome Research Institute, launched The Cancer Genome Atlas (TCGA). TCGA was, and remains, an ambitious effort to categorize and catalog dozens of cancer types, along with the plethora of genomic quirks that can contribute to them.
Despite (or, debatably, because of) ample federal funding and major technological improvements that continue to drive down the cost of sequencing a genome, TCGA, like many other big-data efforts, quickly ran into a problem. The data volumes were simply too large, and they continue to grow more unwieldy.
"Between 2014 and 2018, production of [NGS] data is going to exceed 2 exabytes," warned Tony Kerlavage, NCI's chief of cancer informatics at the Center for Biomedical Informatics and Information Technology, in a workshop presentation at this year's Bio-IT World Conference & Expo. "So in this environment, obviously the standard model… of computational analysis where a university will download public data [and] publicly available software,… mine their local data, and… locally develop software… just totally breaks down."
Going to the cloud was the obvious answer—particularly given the Obama Administration's "cloud first" policy that has resulted in opportunities for subsidies for cloud users along with substantial datacenter streamlining. Enter the NCI Cloud Pilot.
Kerlavage related that "the NCI released a broad agency announcement [serving as] a specialized type of RFP" for three R&D projects to encourage cloud-based genomic-research innovation to support TCGA. A multi-step competitive process began in 2013. The NCI's final selections for its Cloud Pilot were publicized in October 2014. The winners were the Broad Institute, the Institute for Systems Biology, and Seven Bridges Genomics, the last being the only commercial organization selected.
Of course, several organizations that applied did not make the NCI's cut. One such also-ran: Intel.
Intel Outside
Despite not winning its NCI bid, Intel has made substantial progress with its own Collaborative Cancer Cloud (CCC).
In an onsite interview on the first day of the Bio-IT World Conference & Expo this year, Ketan Paranjape, Intel's General Manager of Life Sciences and Analytics, told Bio-IT World that after NCI rejected Intel's proposal for a cancer cloud project, Intel decided to just go ahead and build out its own proposal anyway—NCI funding or no.
I prodded Paranjape for more details about Intel's cloud solution and the extent to which it competes with the NCI Cloud Pilot projects; I specifically brought up the notion of direct competition with the Seven Bridges Cancer Genomics Cloud (CGC). Paranjape was quick to correct me.
"We don't want to call it compete; we call it complementary," said Paranjape. "We just want to build a network and connect it to other networks [in a] collaborative environment."
The notion seems dubious (after all, both Intel and Seven Bridges are commercial, for-profit organizations operating in the healthcare and life-sciences space)—at first. But one day after my interview with Paranjape, Intel announced a partnership with the Broad Institute (another NCI Cloud Pilot winner) to make the latter's Genome Analysis Toolkit (GATK) software available on Intel's CCC.
Paranjape made a compelling case, pointing to the problem of data silos in healthcare and the life sciences.
"If I have access to 4,000 genomes, and you have access to 4,000 genomes, then why shouldn't we have access to 8,000 genomes?" Paranjape asked rhetorically. "96% of data [is] being locked away… All us here have to play together with each other. [Working] collectively as a group… should be the end state."
Later in the conference, David Delaney, Chief Medical Officer of SAP, echoed Paranjape's statistic—give or take a percentage point—in a presentation titled "Innovating, Reimagining, and Digitally Transforming Personalized Medicine." According to Delaney, 97% of patient data is used for treatment and then—except for the occasional retrospective study—never again.
"At the point of decision ... it's effectively like trying to drink through a tea stirrer," said Delaney, pointing out that the oceans of data that could theoretically be made accessible are at once too vast and too siloed to effectively search through or access in any meaningful way.
"Each cancer center… has enormous sets of genomic data—but most of that data is siloed," affirmed Ethan Cerami, Knowledge Systems Group Director of Dana-Farber Cancer Institute's Department of Biostatistics and Computational Biology. "By combining and integrating data sets, we can drive science forward faster and ultimately generate better outcomes for all our patients. The Intel platform provides a unique and novel solution to enable broader sharing, and we are excited to help them pioneer this field."
Indeed, Dana-Farber is one of three partners in the Intel CCC pilot; the other two are Oregon Health & Science University (OHSU) and the Ontario Institute for Cancer Research (OICR). All three of these partners are working together on a variety of projects (including, notably, research to identify previously unknown cancer-causing mutations) and coding new tools to aid their efforts, using the CCC.
The partners seem well matched as collaborators.
"We are working with Intel, OICR, and OHSU to develop a one-year pilot project to demonstrate secure genomic data sharing across our three institutions," explained Cerami. "Each of our institutions currently perform some type of genomic sequencing on patients, and the goal of the pilot project is to pool genomic data across all three centers and make it available for joint computation."
Does It Have to Be Cancer?
Despite Paranjape's insistence that the Intel CCC is intended to complement rather than compete, he does not shy away from boasting about Intel's solution.
"[We] will be the first ones to actually do something with it!" Paranjape excitedly added, contending that the Intel CCC is well ahead of the NCI cancer clouds of Seven Bridges, Broad, and ISB—and further noting that Intel's CCC partners already have specialized tools virtualized on the CCC. Paranjape went on to point out that Intel and its partners are open-sourcing the tools on the CCC—anticipating completion of this process somewhere between October 2016 and April 2017.
These are laudable points of pride, but Paranjape's primacy claim here may be a bit much.
"We have…226 tools and workflows on the CGC today," Brandi Davis-Dusenbery, who is in charge of the CGC project as part of her role as a scientific program manager at Seven Bridges Genomics, told Bio-IT World in a separate interview. "As far as APIs, I think it’s in the 45 or 50 range—and these allow researchers to do anything: uploading data, adding metadata, starting a computational job, downloading the results, adding collaborators, [and] creating projects."
Paranjape's pride in the CCC seems clear, and he sees varied applications for it. "It doesn't have to be cancer," he told me, envisioning Intel's Collaborative, er, Cancer Cloud used to research other pathologies, like autism and AIDS. Even if the business direction of Intel's cloud seems foggy, it is easy to envision a scenario in which Fortune 51 Intel—having already done a lot of the legwork on its NCI Cloud Pilot pitch—decided to make the most of its once-rejected work.
The Killer Feature of Flexibility
Nonetheless, Paul C. Boutros, an OICR informatics and biocomputing investigator and University of Toronto professor, says that the kind of flexibility that Paranjape describes is a big plus for his organization. One of the three major projects OICR is pursuing with the Intel CCC draws directly on this flexibility, through the distributed machine-learning capabilities built into Intel's cancer-but-not-necessarily-cancer cloud solution.
"Imagine that Hospital A has a patient cohort and [has] identified some important biomarker that could be clinically useful [and] they want to validate it with a patient at Hospital B," posits Boutros. "This isn't easy, for many reasons: patient consents or government regulations can make it difficult to share data, legal data transfer agreements can be time-consuming to negotiate, and individual researchers may be competitive. The [Intel] CCC can bypass these issues by allowing Hospital A to validate their results at Hospital B without having to transfer data anywhere, and without Hospital B needing to see all the details of Hospital A's discovery."
This, Paranjape told me, is where one of the largest benefits of Intel's CCC lies: the perks of a virtualized cloud where the relevance of privacy-sensitive "geographical boundaries" can be minimized by leaving data in place.
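The idea of leaving data in place and sending the computation to it can be illustrated with a minimal Python sketch. It is not Intel's implementation: the `HospitalSite` class, the `PatientRecord` fields, and the threshold-based biomarker rule are all hypothetical, chosen only to show Hospital A's check running inside Hospital B and returning nothing but aggregate counts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PatientRecord:
    marker_level: float  # hypothetical biomarker measurement
    responded: bool      # hypothetical clinical outcome


Job = Callable[[List[PatientRecord]], Dict[str, int]]


class HospitalSite:
    """Keeps patient records private and runs submitted jobs locally."""

    def __init__(self, name: str, records: List[PatientRecord]) -> None:
        self.name = name
        self._records = list(records)  # never returned to callers

    def run(self, job: Job) -> Dict[str, int]:
        # Only the job's aggregate result leaves the site.
        return job(self._records)


def make_validation_job(threshold: float) -> Job:
    """Build Hospital A's check: do patients above the threshold respond?"""

    def job(records: List[PatientRecord]) -> Dict[str, int]:
        flagged = [r for r in records if r.marker_level >= threshold]
        responders = sum(1 for r in flagged if r.responded)
        return {"n_flagged": len(flagged), "n_responders": responders}

    return job


# Hospital B's cohort stays inside Hospital B (toy data for illustration).
hospital_b = HospitalSite("Hospital B", [
    PatientRecord(0.9, True),
    PatientRecord(0.8, True),
    PatientRecord(0.7, False),
    PatientRecord(0.2, False),
])

# Hospital A ships the job and sees only the summary counts.
summary = hospital_b.run(make_validation_job(threshold=0.6))
print(summary)  # {'n_flagged': 3, 'n_responders': 2}
```

A real deployment would also need to vet submitted jobs and guard against results that leak individual records; the sketch leaves both out.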
"We rely on the data centers being offered by [each] entity," said Paranjape. "We tap into that cluster to get that compute, [and] will just send you the compute."
Thus, beams OICR's Boutros, with the help of Intel's distributed machine-learning, "We are creating novel machine-learning algorithms that can work efficiently in this context, where the learning is distributed across multiple independent sites."
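As a rough illustration of learning that is "distributed across multiple independent sites," the sketch below fits a one-parameter model with distributed gradient descent: each site computes a gradient on its own data and shares only that sum and a record count with a coordinator. This is a generic textbook technique, not the algorithm Boutros's team is developing; the function names and the toy data are assumptions made for the example.

```python
from typing import List, Tuple

# Each site holds (x, y) pairs that never leave the site.
Site = List[Tuple[float, float]]


def local_gradient(w: float, data: Site) -> Tuple[float, int]:
    """Summed squared-error gradient for the model y = w * x, computed on site."""
    grad = sum(2.0 * (w * x - y) * x for x, y in data)
    return grad, len(data)


def federated_fit(sites: List[Site], lr: float = 0.05, rounds: int = 200) -> float:
    """Coordinator loop: broadcast w, collect per-site gradients, update w."""
    w = 0.0
    for _ in range(rounds):
        results = [local_gradient(w, site) for site in sites]
        total_n = sum(n for _, n in results)
        # Average gradient over all records, without ever seeing the records.
        avg_grad = sum(g for g, _ in results) / total_n
        w -= lr * avg_grad
    return w


# Toy data: three sites whose records all follow y = 2x.
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0)],
]

w = federated_fit(sites)
print(round(w, 6))  # 2.0
```

The coordinator recovers the same slope it would get from the pooled data, even though it only ever handles gradient sums and counts.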
Dana-Farber's Cerami, meanwhile, calls the CCC's ability to enable joint computation on data sets "a killer feature" for all three of the primary CCC partners, because no data are directly shared in the process.
"This enables us to compute on much larger data sets," said Cerami, "while ensuring the security and protection of our own data sets."
To be certain, the entire healthcare and life-sciences sector has clamored for years for new technological solutions that make the power of the cloud accessible while working around the compliance and data-privacy issues that normally come with the territory. (This subject was, in fact, one of the running themes of this year's Bio-IT World Conference & Expo.) The CCC's capability in this respect is therefore impressive and in high demand but, to be fair, it is not exclusive to Intel.
"To address [big-data] challenges, the [NCI] cloud pilots are establishing systems where the data are colocated with the computational capability—and APIs provide secure data access," Kerlavage told his workshop audience. "So here, applications are brought to the data rather than bringing data to the applications. The goal here is to democratize access to these data and create a cost-effective way to provide compute to the cancer research community."
And, as with democratization, everyone has something to bring to the table.
"At the time of the [NCI] award, we’d spent about 5 years building our infrastructure," said Seven Bridges' Davis-Dusenbery. "[Meanwhile,] it’s clear that the Broad Institute has some of the best tools that the researcher community uses, and then the Institute of Systems Biology has really powerful visualizations—so I think this makes kind of a nice complementary portfolio."
Now, Intel can join that group. Whether seen as a collaborator, a competitor, or a complementor, it does not matter—because Intel's CCC is enabling impressive collaborative work in its own right.