De-Identifying Genomic Data With Hashing Technology

By Deborah Borfitz

April 24, 2019 | The human genome is personally unique and identifying—even identical twins are very different genetically—which has created a big privacy problem for researchers. Researchers have shown that research participants can be re-identified using genomic data alongside genealogical databases and public records. An algorithm can literally “pick someone out of a crowd” using their DNA, says Kenneth Park, M.D., vice president of offering development at IQVIA.

The worldwide supply of genomic data continues to mushroom, most notably from public initiatives in the UK, Iceland and China, Park says. But it will take more than just data volume to unlock the potential of genomics research to identify new molecular targets for drugs and understand their mechanisms of action, comparative effectiveness, and safety for specific populations, he adds. Addressing the privacy risks is one prerequisite. If people fear their private genome data will be breached, they might think twice about participating in medical research and sharing their information.

Two approaches have traditionally been taken to data privacy protection, says Park, the most popular of which is process controls. Data gets sequestered in a secure area, with normal identifiers removed and access subject to review and approval. Limitations of the approach are that it “relies on the good behavior of users, data is still identifiable and in the United States you can’t link to other de-identified data without going through the consent process.”

The second strategy, using technologies such as homomorphic encryption, has other challenges, such as adding computation time and complexity to already stressed servers, Park says.

Tokenizing Genetic Variants

IQVIA has come up with an “alternative approach” to de-identifying genomic data to ensure anonymity and better secure data, says Park. It involves using hashing technology to generate tokens that represent different genetic variants, which is different from just using encryption, although the terms hashing and encryption are often used interchangeably.

As Park explains it, researchers looking at non-identified, patient-level data will be able to do analysis against the tokens to see, for example, that token ABC123 correlates with type 2 diabetes at a certain P value. But they won’t know the identity of the individuals with that token, or even what genetic variant the token represents.
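As a rough illustration of the idea Park describes, the sketch below uses a keyed one-way hash (HMAC-SHA256 from Python's standard library) to turn variant identifiers into opaque tokens, then tests each token for association with a condition. The article does not say which hash function, key handling, or statistical test IQVIA uses, so the secret key, the `tokenize` and `fisher_one_sided` helpers, the placeholder variant names, and the eight-person toy cohort are all assumptions made up for this example.

```python
import hashlib
import hmac
from collections import Counter
from math import comb

# Hypothetical secret held by the data custodian, never by the researcher.
SECRET_KEY = b"example-key-not-for-production"


def tokenize(variant: str, key: bytes = SECRET_KEY) -> str:
    """Map a variant identifier to an opaque token with a keyed one-way hash."""
    digest = hmac.new(key, variant.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]  # truncated only to keep the example readable


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact test on a 2x2 table.

    a = cases with token, b = controls with token,
    c = cases without token, d = controls without token.
    Returns P(at least `a` carriers among cases) under no association.
    """
    n_total = a + b + c + d
    n_cases = a + c
    n_carriers = a + b
    upper = min(n_cases, n_carriers)
    tail = sum(
        comb(n_cases, k) * comb(n_total - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_carriers)


# Invented toy cohort: (variants carried, has the condition).
cohort = [
    ({"variant_A"}, True),
    ({"variant_A", "variant_B"}, True),
    ({"variant_A"}, True),
    ({"variant_A"}, True),
    ({"variant_B"}, False),
    ({"variant_B"}, False),
    (set(), False),
    ({"variant_B"}, False),
]

# The custodian tokenizes before release; the researcher sees only tokens.
released = [({tokenize(v) for v in variants}, has) for variants, has in cohort]

case_counts, control_counts = Counter(), Counter()
for tokens, has_condition in released:
    (case_counts if has_condition else control_counts).update(tokens)

n_cases = sum(1 for _, has in released if has)
n_controls = len(released) - n_cases

# Per-token association test, run without knowing what any token stands for.
p_values = {}
for token in sorted(set(case_counts) | set(control_counts)):
    a, b = case_counts[token], control_counts[token]
    p_values[token] = fisher_one_sided(a, b, n_cases - a, n_controls - b)
    print(f"token {token}: cases={a} controls={b} p={p_values[token]:.4f}")
```

The keyed hash is what separates this from encryption in the sketch: there is no decryption routine, so a researcher holding only tokens has nothing to reverse.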

Only aggregated results get detokenized, Park continues. No information about the variants gets lost in translation because metadata is tagged to the tokens.
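One way to picture that step is a custodian-side lookup table, built when the tokens are issued, that maps each token back to its variant and tagged metadata and is consulted only for summary-level results. The sketch below is a guess at the general shape, not IQVIA's design: the `vault` table, the `detokenize_aggregate` helper, the metadata fields, the example numbers, and especially the minimum group size are hypothetical and do not come from the article.

```python
import hashlib
import hmac

SECRET_KEY = b"example-key-not-for-production"  # hypothetical custodian secret
MIN_GROUP_SIZE = 5  # hypothetical threshold; an assumption of this sketch


def tokenize(variant: str) -> str:
    """Keyed one-way hash of a variant identifier (truncated for readability)."""
    digest = hmac.new(SECRET_KEY, variant.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


# Placeholder variants and metadata, invented for illustration.
variant_metadata = {
    "variant_A": {"gene": "GENE_1", "consequence": "missense"},
    "variant_B": {"gene": "GENE_2", "consequence": "synonymous"},
}

# A hash cannot be reversed, so the custodian keeps a token -> details table.
vault = {
    tokenize(variant): {"variant": variant, **meta}
    for variant, meta in variant_metadata.items()
}


def detokenize_aggregate(result: dict) -> dict:
    """Attach variant details to an aggregated result; refuse small groups."""
    if result["n_participants"] < MIN_GROUP_SIZE:
        raise ValueError("group too small to detokenize")
    return {**result, **vault[result["token"]]}


# An invented summary-level finding, as a researcher might submit it.
aggregate = {"token": tokenize("variant_A"), "n_participants": 1200, "p_value": 0.003}
print(detokenize_aggregate(aggregate))
```

Because the metadata travels with the lookup entry rather than with the patient-level rows, the variant details reappear only once a result has been summarized.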

The genomic de-identification technology is at the core of IQVIA’s recently announced E360 Genomics genotypic-phenotypic database solution, and remains under development, Park says. But it has already had “significant engagement” from healthcare providers, biopharmaceutical companies, government agencies, and patient advocacy groups. Some of these institutions work with genomics data themselves and others are interested in licensing the technology.

IQVIA’s role will include aggregating genomic data from multiple sites to get closer to the “holy grail sample size” of around one million individuals, says Park. Genomic sequences will go through the privacy de-identification process before being received by IQVIA, he notes. IQVIA will also bring in additional, non-identified clinical data from multiple sources to provide a more complete phenotypic picture for individuals in the sample.

Enlarging the E360 Ecosystem

Last October, IQVIA and Genomics England announced a collaboration to develop a platform that will connect clinical and de-identified genomics data to accelerate treatment advancements for patients. This alliance will enable faster and more efficient drug research, more robust evidence to support treatment value, and greater access to personalized medicines.

Using IQVIA’s E360 platform, authorized researchers will have privacy-protected, technology-enabled access to Genomics England’s patient-consented, de-identified data to create custom clinical-genomic datasets and run leading-edge analytics on genomics and observable traits.

It’s an effort to ensure the E360 platform is sufficiently diverse not only in terms of genetics but also in where individuals are raised and where they seek healthcare, says Park. “Our customers are interested in understanding the interplay between genetics and the environment… and through our federated analytics methodology they’re able to ask the same research question across multiple populations and come back with a combined answer.”
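The federated pattern Park mentions can be sketched in a few lines: each site answers the same question against its own records and returns only counts, and a coordinator pools those counts. The article gives no detail on IQVIA's methodology, so everything here is assumed for illustration, including the site names, the toy records, the `local_query` and `combine` helpers, and the premise that all sites issue tokens under the same scheme so that a token means the same thing everywhere.

```python
from typing import Dict, List, Optional, Set, Tuple

Record = Tuple[Set[str], bool]  # (tokens carried, has the condition)


def local_query(records: List[Record], token: str) -> Dict[str, int]:
    """Run at a site: count carriers of a token; no patient rows leave the site."""
    carrier_flags = [has for tokens, has in records if token in tokens]
    return {
        "carriers": len(carrier_flags),
        "carriers_with_condition": sum(carrier_flags),
    }


def combine(summaries: List[Dict[str, int]]) -> Dict[str, Optional[float]]:
    """Run at the coordinator: pool per-site counts into one answer."""
    carriers = sum(s["carriers"] for s in summaries)
    affected = sum(s["carriers_with_condition"] for s in summaries)
    return {
        "carriers": carriers,
        "carriers_with_condition": affected,
        "proportion": affected / carriers if carriers else None,
    }


# Invented per-site data; in practice each list would stay inside its own site.
sites = {
    "site_1": [({"tok_1"}, True), ({"tok_1", "tok_2"}, False), ({"tok_2"}, False)],
    "site_2": [({"tok_1"}, True), ({"tok_1"}, True), ({"tok_2"}, True)],
}

summaries = [local_query(records, "tok_1") for records in sites.values()]
combined = combine(summaries)
print(combined)
```

Pooling raw counts is the simplest possible combination rule; a production system would more likely weight or meta-analyze per-site estimates, which this sketch does not attempt.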

IQVIA intends to develop similar partnerships in other countries and continents, Park says. “We expect we’ll be able to represent the full diversity of chronic, widespread diseases as well as rare diseases and cancers, but how long that will take largely depends on how quickly people get sequenced around the world.”

E360 Genomics, one of several new modules on the E360 platform, is “only a starting step” toward the promise of precision medicine and personalized healthcare, says Park, and will help address search capacity and data storage issues. IQVIA is also developing an E360 Genomics analytics platform, to launch in Q4 2019, which will solve some of the computational issues, he adds. From a policy standpoint, IQVIA is working to ensure its technology “is an option for conducting privacy-preserving research.”