Josh Denny On The All Of Us Research Program, Data Management On A National Scale
By Allison Proffitt
November 6, 2017 | Vanderbilt University has been biobanking patient specimens since 2007, when Josh Denny launched BioVU. Today, the biobank has nearly 250,000 samples, some of which have associated medical records dating back to the early 1980s. “That gives you an ability to do in silico prospective studies from a long time ago—essentially for free with the cost of the healthcare that we approved,” Denny told the audience at the Leaders in Biobanking Congress event held last month in Nashville, adjacent to Vanderbilt’s campus.
Denny hopes the Precision Medicine Initiative will empower the same kind of studies.
Denny is a physician, professor of biomedical informatics and medicine, and director of Vanderbilt’s Center for Precision Medicine. He has been part of the external working group on building a research cohort for precision medicine since the National Institutes of Health first announced the NIH Precision Medicine Initiative. Denny currently leads the All of Us Research Program Data and Research Center.
Denny has expertise in sharing data as well. Vanderbilt is part of the eMERGE (electronic Medical Records and Genomics) network, a national consortium organized and funded by the National Human Genome Research Institute (NHGRI) in 2011 that combines DNA biorepositories with electronic medical record systems for large-scale, high-throughput genetic research in support of implementing genomic medicine.
One of the first studies the eMERGE network took on, in 2011, looked at autoimmune hypothyroidism. Researchers took patients who had been genotyped across five different health systems—about 13,000 patients at the time—and built an algorithm to identify a hypothyroidism phenotype. The algorithm evaluated billing codes, lab values, timing, and procedure codes, and added some “very, very lightweight text mining,” Denny said. The study identified a new genetic risk factor for autoimmune hypothyroidism. (American Journal of Human Genetics; DOI: 10.1016/j.ajhg.2011.09.008)
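The kind of rule-based phenotyping algorithm Denny describes can be sketched in a few lines. This is an illustrative mock-up, not the published eMERGE algorithm: the code, threshold, and field names are invented for the example.

```python
# Hypothetical sketch of a rule-based EHR phenotyping algorithm in the
# spirit of the eMERGE hypothyroidism study. The ICD code, medication
# list, and TSH threshold below are illustrative placeholders.

HYPOTHYROID_ICD9 = {"244.9"}          # billing codes suggesting hypothyroidism
REPLACEMENT_MEDS = {"levothyroxine"}  # thyroid replacement therapy

def classify(patient):
    """Return 'case', 'control', or 'unknown' for one patient record."""
    has_code = bool(HYPOTHYROID_ICD9 & set(patient["icd9"]))
    on_med = bool(REPLACEMENT_MEDS & set(patient["meds"]))
    high_tsh = any(v > 4.5 for v in patient["tsh"])  # lab-value rule (mIU/L)
    # "very, very lightweight text mining": a keyword scan over notes
    note_hit = any("hypothyroid" in n.lower() for n in patient["notes"])

    if has_code and (on_med or high_tsh or note_hit):
        return "case"
    if not (has_code or on_med or high_tsh or note_hit):
        return "control"
    return "unknown"  # ambiguous records are excluded from the analysis
```

Combining independent evidence streams this way (codes, labs, medications, notes) is what lets such algorithms reach the high positive predictive value a genetic association study needs.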
Denny’s team calls the approach a phenome-wide association study—PheWAS—and he says it has proven fruitful. In the earliest cohort of 13,000 patients, “we discovered about 60 new things,” he said. “And 90% of people participated in more than one study, so there’s a lot of potential to reuse people with this kind of dense scientific information.” More recently, with a pool of 37,000 genotyped patients, eMERGE researchers have been able to impute HLA types, an important factor in many autoimmune diseases. (Science Translational Medicine; DOI: 10.1126/scitranslmed.aai8708)
These early successes with the eMERGE network set the stage for what the All of Us program could be, Denny said.
The Million Person Cohort
The Precision Medicine Initiative was launched by President Obama in his January 2015 State of the Union Address. Within the initiative, the All of Us Research Program plans to sequence and gather longitudinal data from at least 1,000,000 re-contactable volunteers. Electronic health record data will form a key part of the All of Us (AOU) Research Program and offer a passive way to look at exposures, Denny said.
Volunteers start by enrolling, giving consent, and taking a series of health questionnaires online. Then they can donate biospecimens, share lab values, and submit electronic health records. Volunteers report blood pressure, body mass index, heart rate, height, hip and waist circumferences, and weight. Volunteers donate blood and urine. “If for some reason we can't collect blood, which is about 2% of the time right now, we use saliva,” Denny said. The 35 million to 40 million samples that will make up the AOU cohort are banked at the Mayo Clinic.
Data on exposures and habits will be collected through next-generation technologies and smartphones, Denny added. The consent process, in particular, will be evolving, Denny said. There is ongoing research on the consent process and the levels of consent needed as the program proceeds.
“We want to create big, diverse datasets available for things like machine learning, which is becoming, obviously, a hot topic now. We want to engage people richly and engage diversity,” Denny said. European ancestry dominates our understanding of the genome right now, and though Denny points out that diversity has improved since 2009, non-Europeans are still grossly underrepresented.
“This leads to us thinking certain variants are pathogenic for instance, that aren't,” he explained. “We find they’re very prevalent in other populations. This lack of diversity in our studies actually dramatically affects everyone.” The All Of Us program hopes to build a cohort of about half under-represented populations, and Denny reported that the beta phase is on target.
One early project of the Data and Research Center has been validating health surveys, developing and testing materials to be inclusive of diverse populations and diverse education levels. A top priority, he said, is providing Spanish translations.
“Our baseline surveys that we have out right now are the survey based on the demographics and lifestyle, which includes habits and exposures; and overall health. Then we have a number of others in development… there're probably about five [surveys] in various stages in development and maturity,” Denny said.
Data Management For All Of Us
Beyond the surveys, the Data and Research Center manages all the study data that comes in, makes it available to researchers, handles quality control along with the security and privacy of that data, and builds the needed infrastructure.
“That includes the protocol for creating labels for specimens and printing those labels. Facilitating collection of the physical measures, and the process of recruiting people to individual sites, and dashboards of our progress and completion rates,” Denny said.
Incoming data—including survey data from the web-based tools, data entered by healthcare professionals, lab reports, direct EHR connections, VCF files, etc.—go into a broad data repository and get curated over time, Denny explained. “Then we're going to apply patient algorithms to remove obvious identifiers for a lot of this content to reuse and sync them with other data sources.”
The Data and Research Center has been working on a protocol called Sync for Science (S4S), which is designed as a standard representation to encapsulate essentially all of your healthcare data, Denny explained. S4S allows individuals to access their health data and send it to researchers in support of the goals of the Precision Medicine Initiative, and it was developed in conjunction with electronic health record vendors Allscripts, athenahealth, Cerner, drchrono, eClinicalWorks, Epic, and McKesson.
“You start with the research app, in this case, and then you connect to your patient portal, mediated through the app, a handshake process that’s standardized, and then the patient at that point can authenticate sharing of that data back,” Denny explained. “We have a pilot that we will be starting soon with four EHR vendors, which includes one of the largest ones, in 14 different sites in a small number of individuals to test how this works.”
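S4S builds on standard web-authorization patterns (OAuth 2.0 with FHIR data APIs), so the handshake Denny describes can be sketched as the first step of an authorization-code flow. The endpoints, client ID, and scope string below are made-up placeholders, not real All of Us or vendor values.

```python
from urllib.parse import urlencode

# Hypothetical sketch of the patient-mediated S4S handshake.
# All endpoint URLs and identifiers here are illustrative assumptions.

def authorization_url(portal_authorize_endpoint, client_id, redirect_uri):
    """Step 1: the research app sends the participant to their patient
    portal, asking permission to read their health records."""
    params = {
        "response_type": "code",    # OAuth 2.0 authorization-code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "patient/*.read",  # read-only access to the record
    }
    return portal_authorize_endpoint + "?" + urlencode(params)

# Step 2 (not shown): the participant logs in and approves sharing; the
# portal redirects back with a code the app exchanges for an access
# token, which it then uses to pull the participant's records.
```

The point of standardizing this handshake across vendors is that the research app only has to implement it once, regardless of which EHR system holds the participant's records.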
The creation of the S4S standard protocol should help the All of Us program, Denny said, by making it easy for volunteers to share their medical records. S4S also meets the needs of Meaningful Use Stage 3 and the requirements for an API, so there’s an incentive for health care systems as well. “It can obviously enable liquidity of healthcare data for much, much larger populations than just those in our study,” Denny said.
The Arrow Back
One of the key components of the Precision Medicine Initiative and the All of Us Research Program is the intent to return data to the volunteer. “That arrow back to participants—this bidirectional exchange with participants—is something that is really innovative,” Denny said. “A few have done it, but certainly not on this scale of potentially at least a million people.”
And Denny definitely expects there to be data to return. Vanderbilt’s PREDICT program is a pharmacogenomic program that looked at five drug-genome interactions and provided prescribing recommendations to affected individuals. In the first cohort of 10,000 patients, Denny said, 91% had an actionable variant for one of the five drugs.
“It's common to find problems with pharmacogenetics that would predict something for drug prescribing,” Denny said. “We are going to see this as a much more common thing, and a chance to improve some people’s health potentially.”
But in order to do that, the data need to be accessible. The All Of Us program is favoring a centralized data model, Denny said. “Our idea is to actually bring the researchers to the data as the size of these data will be challenging to compute on for many people. It also allows us to enforce better security protocols and things like that.”
Datasets will be divided into tiers, and access will vary across the tiers. Researchers will need to be identified and verified; they will sign codes of conduct and certify that appropriate human-subjects and research ethics training has been completed, Denny said. Within the most public dataset, obvious identifiers will be removed, and there will be increasing risk of re-identification in successive tiers. Probably, Denny said, the All of Us program will mediate any re-contact centrally through the program. “In most cases you probably don't have to know the identity of a person.” With more sensitive data—genomic data, clinicians’ notes—Denny expects researchers to access data through an NIH eRA Commons ID. Research aims would be public, a point mandated by the 21st Century Cures Act.
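Tiered access of the sort Denny outlines amounts to checking a researcher's credentials against the requirements of each tier. The tier names and required credentials in this sketch are illustrative assumptions, not the program's actual access policy.

```python
# Hypothetical sketch of tiered data access. Tier names and credential
# requirements are invented for illustration.

TIER_REQUIREMENTS = {
    # identifiers already removed, so no special credentials
    "public": set(),
    # verified identity plus a signed code of conduct
    "registered": {"verified", "code_of_conduct"},
    # sensitive data (genomes, notes): also ethics training and an
    # NIH eRA Commons ID
    "controlled": {"verified", "code_of_conduct",
                   "ethics_training", "era_commons_id"},
}

def can_access(tier, researcher_credentials):
    """A researcher may enter a tier only when every required
    credential for that tier is present."""
    return TIER_REQUIREMENTS[tier] <= set(researcher_credentials)
```

Expressing each tier as a set of requirements makes the policy auditable: adding a new obligation to a tier is a one-line change, and the subset test enforces it everywhere.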
Validated researchers will be able to access data through web-based tools offering both a point-and-click environment and computational and statistical tools. The program wants common analyses to be easily done by all users, Denny said, but also to equip “power users” to dig deeper. “We're going to use something like Jupyter Notebooks; we'll support programming languages and statistical packages like R,” he said.
The All of Us program is nothing if not ambitious, but Denny remains undaunted. “We're pretty bullish based on our success in things like eMERGE. It's an iterative process, it's not going to be perfect from day one.”
The strength of the Data and Research Center’s approach, Denny said, is keeping the raw data repository, and using technologies like natural language processing, machine learning approaches, geospatial analyses, and integration of different kinds of data sources to create a curated data repository. “This is going to be a journey; it's certainly not a destination. But by having this model it allows us to always keep the raw data and get it smarter over time.”