The RGC UK Biobank Consortium Data Delivery And Cohort Browser
By George Asimenos
June 28, 2019 | In April, we announced the winners of the inaugural Bio-IT World Innovative Practices Awards. These awards are given with the goal of celebrating projects that highlight excellence in bioinformatics, basic and clinical research, and IT frameworks for biology and drug discovery. While only three projects were rewarded, we can't help but recognize our honorable mention for their efforts in elevating the critical role of information technology in modern biomedical research. – The Editors
The UK Biobank (UKB) has collected and developed a biospecimen and data resource on over 500,000 individuals. This resource has proven valuable to pharmaceutical companies, healthcare organizations and the scientific community, and has already led to more than 170 publications revealing novel associations with important epidemiological markers. In collaboration with a consortium of pharma companies, the Regeneron Genetics Center (RGC) has undertaken the exome sequencing and analysis of all 500,000 samples, using the DNAnexus Platform to host the dataset and run Regeneron's software pipeline. To increase the value to the RGC UKB consortium, Regeneron and DNAnexus partnered to construct a combined database of the UKB genomic and phenotypic data that can be explored through an innovative "geno/pheno cohort browser" user interface. Part of the DNAnexus Apollo Platform, the cohort browser was designed to democratize data access, giving diverse teams the ability to browse through thousands of phenotypic fields and millions of genomic variants across hundreds of thousands of samples, and to build cohorts.
"There is a tremendous amount of knowledge waiting to be gleaned from the UK Biobank collection," said Jeffrey Reid, PhD, Vice President and Head of Genome Informatics & Data Engineering at Regeneron. "DNAnexus has been a key partner in helping us navigate the complexities of generating and delivering this data to the UK Biobank and our pharma partners in the exome sequencing project."
The objective of this project was to provide a comprehensive delivery experience by combining scalable cloud tooling with a visual data integration solution. In sequencing the first 100,000 UKB samples, Regeneron generated more than 800,000 files, which posed a substantial technical challenge for consortium members to consume. The DNAnexus Platform made it possible for Regeneron to address this challenge with a cloud-based solution that enables diverse teams of both technical and non-technical users to intuitively explore the underlying genomic and phenotypic UKB dataset.
DNAnexus deployed a cohort browser web application on its Apollo Platform to enhance the data exploration experience with interactive visualization, filtering, and browsing of integrated phenotypic and genomic information. The cohort browser allows researchers to explore the UKB dataset using human-friendly metadata in field names and values, and has been designed to provide an intuitive experience across the dataset's 3,000 phenotypic fields and 15,000,000 sites of genomic variation. The browser lets any user quickly visualize hundreds of thousands of samples. Even scientists without intricate knowledge of the dataset can query and filter across phenotype fields and build cohorts that meet specific criteria. The browser scans billions of data points in real time and returns intuitive phenotypic and genomic visualizations and aggregation tables.
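To make the idea of building a cohort from combined phenotype and genotype criteria concrete, here is a minimal Python sketch. It is not the cohort browser's implementation or API. The `Participant` class, the `build_cohort` function, the field names, and the sample and variant identifiers are all hypothetical, and a real deployment would run such filters on a distributed query engine rather than an in-memory loop.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical record combining phenotype fields and carried variants."""
    sample_id: str
    phenotypes: dict
    variants: set = field(default_factory=set)


def build_cohort(participants, phenotype_filters, required_variant=None):
    """Return sample IDs that satisfy every phenotype filter and, if given,
    carry the required variant.

    phenotype_filters maps a field name to a predicate on that field's value;
    a missing field is passed to the predicate as None.
    """
    cohort = []
    for p in participants:
        passes_phenotypes = all(
            predicate(p.phenotypes.get(name))
            for name, predicate in phenotype_filters.items()
        )
        carries_variant = (
            required_variant is None or required_variant in p.variants
        )
        if passes_phenotypes and carries_variant:
            cohort.append(p.sample_id)
    return cohort


# Illustrative, made-up data (not UKB records).
participants = [
    Participant("S1", {"age": 61, "smoker": False}, {"chr1:1000:A:G"}),
    Participant("S2", {"age": 45, "smoker": False}, {"chr1:1000:A:G"}),
    Participant("S3", {"age": 70, "smoker": True}, set()),
    Participant("S4", {"smoker": False}, {"chr1:1000:A:G"}),  # age missing
]

filters = {
    "age": lambda v: v is not None and v >= 50,
    "smoker": lambda v: v is False,
}

cohort = build_cohort(participants, filters, required_variant="chr1:1000:A:G")
print(cohort)
```

Expressing each criterion as an independent predicate mirrors how a user might stack filters one at a time in a browser interface; the cohort is simply the set of samples that pass all of them.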
On the same day that Regeneron authorized the release of the 800,000 files associated with the initial UKB exome release, designated individuals from each participating consortium member were able to access the complete, analysis-ready dataset on the DNAnexus Platform and either download the entire 265 TB dataset to local infrastructure or efficiently copy it to another cloud location. The pharmaceutical and biotech companies then had full control over user-, group-, and project-level access to the dataset within their own organizations.
Since the browser launched in late January 2019, multiple users from each organization have had access and have commented on the ease with which they can explore and visualize this massive dataset, describing the browser as a success. Participating pharmaceutical companies have been able to answer direct questions about the combined UKB genomic and phenotypic dataset in seconds. These questions would otherwise have taken these organizations days to answer with their existing data resources and internal processes. Further ROI data points will be quantified as the cohort browser gains adoption.
The browser was deployed using DNAnexus Apollo, the big data science exploration platform. DNAnexus Apollo leverages Apache Spark technology and builds on top of it with optimizations suited to large-scale genomic and phenotypic datasets. During ingestion, the genomic data is enhanced with functional annotations and frequency calculations, then stored in a way that allows billions of genotypes to be queried in seconds. Likewise, the phenotype querying functionality incorporates any hierarchical structure (such as the ICD10 diagnostic codes) or special-case codes present in the underlying UKB phenotypic data schemas. Regeneron uses the DNAnexus access control framework to designate which pharma team leads have access, and each team lead uses it to further manage access within their organization.
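Two of the ingestion and query ideas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Apollo's implementation: the function names are hypothetical, genotypes are assumed to be encoded as alternate-allele counts (0, 1, or 2) with `None` for a missing call, and the ICD10 hierarchy is approximated by prefix matching on codes written without a dot (for example `I210` under `I21`). Prefix matching does not capture chapter- or block-level groupings such as `I20-I25`.

```python
def allele_frequency(genotypes):
    """Alternate-allele frequency at one site for diploid samples.

    genotypes: iterable of alternate-allele counts (0, 1, 2), or None
    for a missing call. Returns None if no sample was called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    # Each called diploid sample contributes two alleles.
    return sum(called) / (2 * len(called))


def samples_with_diagnosis(diagnoses, query_code):
    """Return sample IDs with any diagnosis at or below query_code.

    diagnoses: dict mapping sample ID to a list of ICD10 codes.
    A code is treated as a descendant of query_code when it starts
    with query_code (an assumption; see the note above).
    """
    return sorted(
        sample_id
        for sample_id, codes in diagnoses.items()
        if any(code.startswith(query_code) for code in codes)
    )


# Illustrative, made-up data (not UKB records).
site_genotypes = [0, 1, 2, None]
diagnoses = {
    "S1": ["I210", "E119"],
    "S2": ["I219"],
    "S3": ["J459"],
}

print(allele_frequency(site_genotypes))
print(samples_with_diagnosis(diagnoses, "I21"))
```

Computing frequencies once at ingestion, rather than at query time, is one way a system can keep interactive queries fast; querying a parent code and receiving all of its descendants is what lets a user filter on a broad diagnosis without listing every subcode.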
DNAnexus has demonstrated that visual analytics are an effective approach to extract knowledge from biobank-scale datasets. The scientific advancements obtained from the UK Biobank initiative support the value of both the immense logistical efforts needed to obtain such datasets and the novel computational tooling required to make use of the results. It is clear at this point that effective data integration of phenotypic variables and genomic datasets is an essential component of precision medicine. DNAnexus points the way to future applications to further extract actionable information from large, complex phenotypic and genomic datasets for the development and application of novel therapies.
The ingestion of the UKB dataset was a collaborative effort between Regeneron, DNAnexus, and the UK Biobank. Regeneron, as the primary applicant, extracted the phenotype fields; the DNAnexus scientific team operated DNAnexus Apollo, reporting back to the UK Biobank any data omissions and corner cases. DNAnexus would like to thank the UKB partners for their support and the improvements made to the UKB data showcase throughout this process.
George Asimenos is the Chief Technology Officer at DNAnexus. Part of DNAnexus since its inception, George has been involved in the design and implementation of the DNAnexus product line and its application in key accounts around the world. He currently heads strategic projects, exploring ways in which DNAnexus technology can be used to craft novel experiences that transcend traditional genomics boundaries. He can be reached at george@dnanexus.com.