UK Biobank Pivoting to Platform-Only Model For Big Data Sharing
By Deborah Borfitz
June 12, 2024 | When it comes to big data, the largest amount comes from whole genome sequencing (WGS). At the population scale, WGS produces petabytes of information that for sharing purposes requires cloud computing capabilities.
That realization came early to UK Biobank, the large-scale biomedical database and research resource built on the altruism of half a million largely healthy Brits who “donated bits of themselves” in the form of information and biomedical samples between 2006 and 2010, according to Mark Effingham, Deputy CEO of UK Biobank. Six years ago, as those blood samples started being sequenced by an industry consortium (comprising Amgen, AstraZeneca, GSK, and Johnson & Johnson), plans were being laid with DNAnexus for the creation of an online research analysis platform where the resulting data could be securely shared and flexibly analyzed.
UK Biobank launched its Research Analysis Platform (UKB-RAP) in January 2022 and added the world’s largest tranche of WGS data to it last November. Researchers the world over can explore the WGS data alongside up to 10,000 variables detailing the health and lifestyle of each participant. About 7,000 users are now on the platform seeking to better understand the biological drivers of a wide range of diseases and other health conditions, Effingham says.
UK Biobank itself doesn’t do any research, adds Effingham. Its mission is to make the deidentified data available to support a wide range of researchers—provided they are working for the good of public health. This is crowdsourcing at its best, he says, with scientific minds around the world all applying their creativity to the data to generate new findings and insights.
The UKB-RAP supports research not just by providing access to data at scale but also by putting a growing repository of sophisticated computational and analytical tools at researchers’ fingertips. It fits well with UK Biobank’s “democratization agenda” of ensuring researchers at institutions in low- and middle-income countries can engage with the data in “exactly the same way” as their well-resourced counterparts in the U.S. and the UK, says Effingham.
For its work to make WGS data globally available via the purpose-built, cloud-based platform, UK Biobank brought home honors as a 2024 Innovative Practices Award winner at the Bio-IT World Conference & Expo in April. Sequencing the whole genomes of half a million people was the most ambitious sequencing project of its kind ever undertaken and took five years and over £200 million of investment to bring to fruition, he says.
The ‘Big Bet’
UK Biobank was set up by Wellcome, a global philanthropic organization, and the Medical Research Council, a UK government funding body. Discussions about the creation of a nationwide longitudinal study had been ongoing since at least 2000 and it was to some extent a “big bet” that the return to public health would outdo what could be accomplished by the alternative of funding many small research studies, says Effingham. “It had mixed reviews to begin with,” he recalls.
In many respects, the bet has already paid off. Over 30,000 researchers from more than 90 countries have registered to use the goldmine of data to help answer their research questions and produced over 10,000 peer-reviewed studies looking into cancer, diabetes, Alzheimer’s, depression, tinnitus, and heart disease, Effingham reports.
“Nothing is in it for the participants themselves,” he adds, beyond the satisfaction of helping improve the health of future generations. It took six years simply to get to the recruitment stage, including building protocols regarding how the biobank would be set up and what information should be collected.
UK Biobank “really benefited from leading lights across academia,” down to the exact wording on health surveys to extract the most meaningful information from participants without being overly burdensome. Touch screen questionnaires (e.g., detailed health and lifestyle history) were part of the baseline assessment, together with a face-to-face interview with a study nurse, an assortment of physical measurements (e.g., hand grip, spirometry, and bone density), and standardized collection of 55 milliliters of sample (blood, urine, and saliva) “with no specific intended purpose in mind.”
Subsets of the participants have also gone on to wear a 24-hour activity monitor for a week (100,000 people), undertake repeat measures (20,000), and have their heart, brain, and abdomen scanned as part of the world’s largest imaging project (over 85,000 of the targeted 100,000 participants have already contributed). Participants, all falling between the ages of 40 and 69 at recruitment, each consented to having UK Biobank follow their health over time via linkage to their healthcare records.
Assessments were undertaken in 22 centers in Scotland, England and Wales and took approximately two hours. The biological samples went into long-term storage, the only exception being hematology assays completed on whole blood from participants since fresh samples are required for the testing, Effingham says.
Genetic Considerations
From the outset, UK Biobank intended to understand the genetic characteristics of participants and their contribution to disease, says Effingham, noting that the open science organization was still in its formative stages when the first essentially complete sequence of the human genome was generated. The assumption was that it would be many years before large-scale DNA genotyping could happen.
It was in fact a relatively short wait. In 2013, the UK government’s Department of Health and Social Care provided funding for UK Biobank to extract DNA from donated blood and have U.S.-based Affymetrix put those samples through a genotyping chip that measures 850,000 disease-associated genetic markers. When that dataset was released in 2017, it was reported that Oxford University scientists were able to impute a further 90 million other genotypes from each participant, since areas of the genome tend to be passed on together from parent to child.
UK Biobank’s data access policy accommodated co-development work subsequently pursued by a consortium of pharmaceutical companies—comprising Regeneron Pharmaceuticals, AbbVie, Alnylam Pharmaceuticals, AstraZeneca, Biogen, Bristol Myers Squibb, Pfizer, and Takeda—which generated whole exome sequencing data on over 470,000 participants. The companies were granted exclusive access to the samples for nine months, after which the resulting sequencing data became available to researchers everywhere, says Effingham.
The exome represents the 2% of the genome that encodes proteins and is the part thought to be most informative for any kind of disease, he notes. Exome sequencing was “very impactful in moving the field forward” and opened the opportunity for the subsequent consortium (Amgen, AstraZeneca, GSK, and Johnson & Johnson) to come together in 2018 to fund WGS on all 500,000 participants.
The Vanguard study, funded by the Medical Research Council, sequenced the first 50,000 individuals. A £100 million investment from the consortium, matched by a similar amount from the government and charity, covered the other 450,000, says Effingham.
Platform Build
DNAnexus and Amazon Web Services (AWS) were chosen as UK Biobank’s technology partners for building the big-data-sharing UKB-RAP in 2020 following a competitive procurement process, Effingham says. The dataset was initially for exclusive use of WGS project funders and during this period they provided feedback to help improve its usability.
The UKB-RAP initially hosted 10 petabytes of data, which researchers downloaded through transfer use agreements to analyze within their own on-premises environment. Since then, the platform has grown to hold over 30 petabytes of data, including the WGS dataset.
The large-scale data sets generated from the biological samples through exome and whole genome sequencing were too large for downloading. Allowing full data access effectively requires a way for researchers to work in situ with all the necessary analysis tools to make sense of all the genetic information, Effingham says.
“We will soon be a platform-only model and have been working on ways to broaden the tools and training so that even researchers who have never used giant datasets before can start to tackle [them],” he continues. “The opportunities are enormous and for us to scale research, democratize access, maximize the impact of the data our participants have donated, and truly transform health around the world, we need to have a platform that works for everyone.”
The existing user base includes statistical epidemiologists using spreadsheets for data analysis and machine learning experts who want to apply algorithms on imaging data sets. So, UK Biobank is working with DNAnexus to extend the utility of the UKB-RAP to help support research using other data types, says Effingham. Already added to the platform is RStudio, an integrated development environment for the popular programming language R, to support researchers wanting only to do basic statistical analyses.
Jupyter notebooks are also available on the UKB-RAP, providing a highly visual and interactive way to engage with very large-scale data sets, he adds. They have been quickly gaining in popularity with data scientists involved in clinical research over the past few years.
Just the Beginning
Accessing data in UK Biobank comes with a fee, which is intended for cost recovery rather than to recoup the cost of setting up the taxpayer-subsidized resource, says Effingham. In recent years, a tiered fee structure has emerged that is tied to the size of the requested datasets, with consistent fees for all except early career researchers and those in lower- and middle-income countries, who are offered a discounted rate.
In early April, UK Biobank formally launched the Global Researcher Access Fund that covers application costs of approved researchers at institutes from less wealthy countries, he says. Among the fund’s chief contributors are AstraZeneca, Bristol Myers Squibb, Johnson & Johnson and Regeneron. Courtesy of AWS, UK Biobank is also able to make available up to half a million dollars of credits per year that can be awarded to early career and low-to-middle-income country researchers to be used exclusively in the UKB-RAP.
The embedded toolset in the platform could grow to include statistical analysis tools beyond RStudio, such as Stata and SAS, to accommodate individual preferences, says Effingham. DNAnexus has already produced a Swiss Army Knife app as a starting point for many common bioinformatics manipulations that is heavily used for genome-wide association studies. UKB-RAP also allows users to flexibly bring their own analysis tools and algorithms to the environment.
One of the novelties of the platform is that many different research groups can have their own project workspace and lens into the data, Effingham says. A single master copy of all the data gets immediately shared as needed and permitted to those digital spaces, which is one of the key advantages of cloud computing. AWS can scale up or down power to its cloud servers based on the computational requirements of the research.
The biggest challenge facing UK Biobank currently is managing the change process for researchers who have long worked in familiar, on-premises environments, he says. Invariably, “lightbulb moments” occur once they see the power of the UKB-RAP to support and accelerate their research projects.
For UK Biobank, and science in general, this is just the beginning, says Effingham, pointing to a few of the many learnings it has enabled thus far. These include the fact that type 1 diabetes is not just a childhood disease, and the discovery of the genes associated with protection against obesity and type 2 diabetes.