Intel Joins Up with Cancer Centers to Build Secure Networks for Patient Data Sharing
By Bio-IT World Staff
September 2, 2015 | This summer, a paper in PLOS Biology made headlines by predicting that, in the coming decade, the study of genomics will outstrip data-hungry services like YouTube and Twitter in its storage demands (doi:10.1371/journal.pbio.1002195). As sequencing genomes becomes a standard part of healthcare, researchers and care providers will face new stresses on their data centers: in the volume of data they have to manage and, just as importantly, in the resources needed to analyze, share, and protect it.
For many hospitals, this is a pressing worry. Alongside imaging data and ever more sophisticated clinical records, genomics is forcing these institutions to quickly remake themselves as computing powerhouses. But for creators of compute infrastructure like Intel, this state of affairs is an opportunity. Whether by building custom data centers, or through cloud services that rest on their hardware, these companies stand to pick up a great deal of new business in the healthcare system ― as long as the tools are in place to take full advantage of their processors when demand ramps up.
“If you think about precision medicine at scale, this is an exascale computing challenge,” says Eric Dishman, General Manager of the Health and Life Sciences Group at Intel. “It challenges everything we and the industry know about databases, storage, processors, the fabric that connects the data center.”
Recently, while working on a new, 1.5-petabyte private cloud for the Knight Cancer Institute at Oregon Health and Science University (OHSU), Dishman became convinced that the rate of data growth for these clients would soon make it impossible for any single care provider to work purely inside its own data center. Physicians at the Knight Institute were already sequencing the genomes of their patients’ tumors, trying to find similar cancer cases to learn what treatments led to the best outcomes. To find close matches, these physicians would have to reach out to other cancer centers by phone and email, hoping someone else’s database would hold a clue to the most effective therapy.
For that process to scale up to the point of being a routine part of cancer care and research, oncologists will need new tools to search their partners’ databases as easily as they search their own, or the big public databases like The Cancer Genome Atlas. “The vast majority of the data is not in the public datasets,” Dishman says. “It’s sitting in the private clouds of all the hospitals and clinics, and that’s the data you have to tap into for the research on clinical outcomes to get real.”
Now, in collaboration with the Knight Cancer Institute and two additional, unnamed cancer centers, Intel is building the software infrastructure to connect these private clouds. The resulting Collaborative Cancer Cloud (CCC) will allow users to send queries to their partners’ databases without retrieving or transferring large quantities of data. As a result, they will have access to a hugely expanded set of patient records, while tamping down on both compute demands and security and privacy concerns.
Intel does not have a major line of business in selling software, and both the company and its academic partners plan to open source the code behind CCC. “At the end of the day, Intel’s interest in this area is the huge numbers of Xeons it’s going to take to actually store and process this data,” says Dishman, referring to Intel’s brand of microprocessors. “If we just sequenced all 1.65 million Americans who are diagnosed with cancer each year, that would drive more than four exabytes of data. We want to accelerate that market, which will be huge for Intel’s core technologies.”
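As a rough check on the scale Dishman describes, the two figures in his quote can be divided to get an implied data volume per patient. The sketch below assumes decimal units (1 exabyte = 10^18 bytes, 1 terabyte = 10^12 bytes) and treats "more than four exabytes" as exactly four, so the result is a lower bound rather than a figure from Intel.

```python
# Back-of-the-envelope estimate from the figures quoted above.
# Assumption: decimal units (1 EB = 10**18 bytes, 1 TB = 10**12 bytes).
PATIENTS_PER_YEAR = 1_650_000      # Americans diagnosed with cancer each year (quoted)
TOTAL_BYTES = 4 * 10**18           # "more than four exabytes", taken as a floor

# Implied data volume per sequenced patient, in terabytes.
per_patient_tb = TOTAL_BYTES / PATIENTS_PER_YEAR / 10**12

print(f"Implied data per patient: at least {per_patient_tb:.2f} TB")
```

In other words, the quote implies a little over two terabytes of data for each patient sequenced.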
A Middleware Project
Most if not all of the analysis tools that will be available in the Collaborative Cancer Cloud are already in widespread use. The project is adopting popular open source tools for genomic analysis like the Broad Institute’s GATK, although in some cases Dishman says that Intel is “optimizing” these workflows to run more efficiently on Intel microprocessors. Similarly, CCC will not mandate any particular kinds of patient records to be stored in databases connected through the project. Users could choose to include genomic data; disease, treatment, and outcome histories; and any number of test results or biomarkers that could help refine the complete picture of an individual patient’s health.
What is novel is the process for looking at data in a partner institution’s cloud or data center. CCC takes advantage of standard file formats and APIs supported by the Global Alliance for Genomics and Health to let users write queries that can be interpreted across institutions, requesting data that matches key characteristics. A query might, for instance, search for patients who have both a certain genetic variant in their tumors and a positive outcome after treatment. “The query gets contained within a secure, encrypted virtual machine,” says Dishman, “and it gets sent as a secure container to the multiple sites to say, ‘Do you have anybody who looks like this?’ And the query then runs on those local machines to find the data.”
Because the query runs locally, it never transfers raw data to the institution that sent out the query. Instead, the data is scrubbed of clear identifiers, like a patient’s name or birth date, before being returned with an anonymous patient ID. When Intel and the Knight Cancer Institute introduced CCC last month, a key part of the announcement was validation that these queries could be sent between three partner centers without leaking any unauthorized data.
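The pattern described above, in which a query travels to each site, runs against local records, and returns only de-identified matches, can be illustrated with a minimal sketch. This is not Intel's middleware: the class names, fields, and the keyed-hash scheme for anonymous IDs are all assumptions made for illustration, the patient records are invented, and the encrypted virtual machine that carries the query in CCC is not modeled at all.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    # Hypothetical record layout; real clinical records are far richer.
    name: str
    birth_date: str
    tumor_variants: frozenset
    outcome: str


@dataclass(frozen=True)
class Query:
    # "Do you have anybody who looks like this?"
    variant: str
    outcome: str


class Site:
    """One institution's private data store (hypothetical)."""

    def __init__(self, site_name, secret, records):
        self.site_name = site_name
        self._secret = secret          # bytes; never leaves the site
        self._records = list(records)  # raw data; never leaves the site

    def _anonymous_id(self, record):
        # Assumption: a keyed hash stands in for whatever ID scheme a site uses.
        msg = f"{record.name}|{record.birth_date}".encode()
        return hmac.new(self._secret, msg, hashlib.sha256).hexdigest()[:12]

    def run_query(self, query):
        # The query executes locally; only scrubbed matches are returned.
        matches = []
        for record in self._records:
            if query.variant in record.tumor_variants and record.outcome == query.outcome:
                matches.append({
                    "site": self.site_name,
                    "patient_id": self._anonymous_id(record),
                    "variant": query.variant,
                    "outcome": record.outcome,
                })
        return matches


def federated_query(sites, query):
    # Send the same query to every partner site and pool the scrubbed results.
    results = []
    for site in sites:
        results.extend(site.run_query(query))
    return results


# Invented example data for two hypothetical sites.
site_a = Site("site_a", b"secret-a", [
    PatientRecord("Alice Example", "1960-01-01", frozenset({"BRAF V600E"}), "positive"),
    PatientRecord("Bob Example", "1955-05-05", frozenset({"KRAS G12D"}), "positive"),
])
site_b = Site("site_b", b"secret-b", [
    PatientRecord("Carol Example", "1970-07-07", frozenset({"BRAF V600E", "TP53 R175H"}), "positive"),
    PatientRecord("Dan Example", "1948-08-08", frozenset({"BRAF V600E"}), "negative"),
])

results = federated_query([site_a, site_b], Query("BRAF V600E", "positive"))
for row in results:
    print(row)
```

The requesting institution sees only which sites hold matching cases and an opaque ID for each one; names and birth dates stay where they are.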
Of course, removing patients’ names is not bulletproof protection against the transfer of sensitive information. A determined enough party could still identify individuals through their clinical records or genomic signatures. “All of the issues around re-identifying somebody through clinical and omics data still obtains,” says Dishman, including the responsibility of all users to comply with federal data protections under the Health Insurance Portability and Accountability Act (HIPAA). Organizations using CCC will have to authorize and credential all their partners, and may choose to set different access controls regarding what data can be passed through different CCC networks they belong to.
“The magic here is that you’re now doing this across data sets that you couldn’t access any other way,” Dishman continues. “The barriers at this point are going to be less about the technology, and more about the legal contracts and trust that centers develop with one another.”
The plan for CCC now is to begin open sourcing the middleware used for these secure queries at the beginning of 2016, at the same time it makes the identities of all its initial participating cancer centers public. “We’re trying to do crawl-walk-run,” says Dishman. “We announced the technical proof points. Now we’re going to prove these three institutions can meaningfully share data clinically and for research.”
At that point, the creators will have to settle on a commercial model to keep the Collaborative Cancer Cloud going, and open up the service to additional users ― some of whom might set up fully independent networks on their own clouds. In addition to patient matching in clinical care, CCC could be used for research projects, sending much broader queries that perhaps return less specific data about patient histories.
It could also expand to other disease areas. Dishman suggests Alzheimer’s and mental health as examples of fields where scientists and physicians need much larger patient pools to draw meaningful conclusions about genomic risk factors. “Our focus is on solving problems that are keeping omics and imaging files from being used because they’re so large,” he says. “There’s nothing inherent in this that would limit it only to cancer.”