The Rise and Importance of Population-Scale Genome Programs

September 20, 2024

Contributed Commentary by Asha S. Collins, PhD, DNAnexus 

With the wealth of population-scale projects going on, we have entered a golden age of genomics. Based on efforts such as the All of Us program in the U.S. and the UK Biobank, among many others, we are poised to discover more than we ever knew about natural genetic diversity, what’s considered “normal” human biology, and the continuum of health and disease.

It’s also worth noting the sheer magnitude of people volunteering to participate in these projects. With so much news about the erosion of trust in science and institutions, the enrollment numbers in these population-scale studies provide a much-needed counter perspective. Millions of people around the world are willing to provide their own genomic and clinical information with the goal of accelerating science and contributing to the health of communities beyond their own. 

The generosity of so many people—and the collective faith they have in the value of these projects—means that we as a scientific community have a deep responsibility to deliver on the potential of these programs to expand scientific understanding and to improve healthcare. I believe honoring participants means we must strike a better balance between protecting their data and ensuring access to it for the advancement of science.  

First and foremost, we must continue to be good stewards of all participant data. Study participants have entrusted data providers with their biological information and, in some cases, clinical data and DNA, as well. We must prioritize keeping these data safe and creating environments that help ensure they are being used solely for their intended purpose. We must also understand the continued evolution of threats against these data sets, such as the potential for genetic discrimination in nations that have not yet adopted protective laws. These data are deeply personal and deserve the strongest protection. Trusted research environments for population programs were built with this goal in mind: providing a platform where researchers are able to analyze a specific dataset at scale without compromising data security or privacy. 

Experience handling large-scale healthcare data has allowed us to create and refine processes that keep those data secure. As these security controls and capabilities continue to evolve, we must also track the latest approaches for accelerating insights from large-scale data. There is still a tremendous amount of insight waiting to be gathered from individual datasets, and there are readily apparent benefits to cross-analyzing information between data enclaves.

It’s time we set out best practices for how to provide access to these multimodal data, ideally with the goal of allowing researchers to securely access and analyze data across population cohorts so that they can mine novel and more impactful insights. This would make it possible to unlock additional potential from all these different population-scale programs by allowing for a more diverse and inclusive understanding of global communities. While tremendous progress has been made by organizations such as the Global Alliance for Genomics and Health, the outcomes of this work have not gained consistent industry adoption. We must now put these concepts into practice to protect participants while maximizing scientific discovery. 

I propose that a collaboration model will be the best answer for this, with population-scale programs partnering with a trusted research environment provider to ensure their data are securely accessed and analyzed. Ensuring that participant data remain secure also requires restricting access to approved researchers across a variety of different programs. If this is implemented correctly, researchers could tap into these data sets in order to boost discovery across the scientific community, not just in countries or organizations with the resources and infrastructure to support this. It would require a flexible and nuanced approach to security that could keep data safely within its original repository while making it possible to call data for analysis temporarily when justified, with secure and appropriate connections to each data source.

This approach has already been evaluated for meta-analysis of two programs, All of Us and the UK Biobank. In a paper published last year, scientists used cloud-based trusted research environments to assess two approaches to computing results across these data sets. First, the researchers performed a meta-analysis, where they mined data for a genome-wide association study separately within each program’s dashboard, and then exported and analyzed those results again together. Next, they performed a pooled analysis, where they called data from both projects and analyzed them together in a single workspace. 

While both approaches revealed novel findings, it was the pooled analysis that really stood out. Nearly 1.5 million variants were identified only in the pooled analysis, representing variants with lower minor allele frequencies, which are more likely to be missed. Of the variants hitting a specific significance threshold, those found in the pooled analysis had higher scores for being deleterious than the ones in the meta-analysis.

In addition to scientific impact, the authors compared cost and complexity for each approach. They determined that the meta-analysis required far more computational steps than the pooled analysis and involved more human labor. If additional population-scale programs were added, the meta-analysis would become even more complex while the pooled analysis would largely stay the same. 
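For readers less familiar with the distinction, the two strategies can be sketched on simulated data. This is only an illustration under stated assumptions: the commentary does not say which statistical methods the paper used, so the fixed-effect inverse-variance weighting below is a common textbook choice rather than the authors' pipeline, and the cohorts, sample sizes, allele frequency, and effect size are all hypothetical.

```python
import math
import random


def ols_slope(x, y):
    """Return (slope, standard_error) for a simple linear regression y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def inverse_variance_meta(estimates):
    """Combine (effect, se) pairs using fixed-effect inverse-variance weights."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    effect = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return effect, math.sqrt(1.0 / total)


def simulate_cohort(rng, n, maf, effect):
    """Hypothetical cohort: genotypes 0/1/2 at one variant plus a noisy trait."""
    genotypes = [sum(rng.random() < maf for _ in range(2)) for _ in range(n)]
    trait = [effect * g + rng.gauss(0.0, 1.0) for g in genotypes]
    return genotypes, trait


rng = random.Random(0)
cohort_a = simulate_cohort(rng, 2000, maf=0.05, effect=0.3)  # stand-in for program A
cohort_b = simulate_cohort(rng, 3000, maf=0.05, effect=0.3)  # stand-in for program B

# Meta-analysis: fit each cohort separately, then combine only summary statistics.
meta_effect, meta_se = inverse_variance_meta(
    [ols_slope(*cohort_a), ols_slope(*cohort_b)]
)

# Pooled analysis: bring individual-level records together and fit once.
pooled_effect, pooled_se = ols_slope(
    cohort_a[0] + cohort_b[0], cohort_a[1] + cohort_b[1]
)

print(f"meta-analysis: effect={meta_effect:.3f}, se={meta_se:.3f}")
print(f"pooled:        effect={pooled_effect:.3f}, se={pooled_se:.3f}")
```

The meta-analysis path needs one model fit per cohort plus a combining step, so its number of steps grows with each added program, whereas the pooled path is a single fit over the combined records. That single fit, however, requires individual-level data from both cohorts to be analyzable in one workspace.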

Whether it’s a meta-analysis, a pooled analysis, or some other approach, it’s clear that computing across population programs will streamline discovery and allow us to get to new findings faster. Ultimately, this concept should help unleash new drug discovery and development efforts so insights can be translated into clinical utility. As we move forward, it will be important for us to understand how to properly deploy cutting-edge tools, including potentially utilizing generative AI tools for processing clinical imaging data, and to expand datasets to include proteomics and other key modalities for a broader view of biology. 

In the spirit of honoring participants’ contributions, we should also aim to share our results, pipelines, and workflows back to the community. This does not have to include anything proprietary; there will be many non-competitive tools that could help scientists around the world drive innovation. If we do this right, we will be protecting participant data while also benefiting from each other’s learnings. 

Implementing a collaborative model that enables population-scale programs to partner with a trusted research environment provider will facilitate responsible access to their data for cross-cohort analysis. This will allow us to make the most of the data shared with us by so many study participants around the world, powering scientific breakthroughs and ultimately having a positive impact on patient care. 

Asha S. Collins, PhD, is the Senior Vice President and General Manager for Population Programs and Biobanks at DNAnexus. She can be reached at acollins@dnanexus.com.