Identifying Opportunities At The Interface Of Life Sciences And Tech
December 16, 2019 | Geraldine A. Van der Auwera calls herself a benevolent dictator. She boasts the pedigree of pre-doctoral and postdoctoral years in microbiology labs, and then a shift into computation. Now, as Director of Outreach and Communications, Data Sciences Platform, at the Broad Institute, she works with all camps—scientists, biomedical researchers, data scientists, technologists.
“Having experienced the struggles of each group as they try to understand the others, I think I’m positioned to help them communicate and identify the areas of opportunity where they can help each other most effectively,” she says.
Van der Auwera’s role, then, is less dictator and more champion for technology in the life sciences, reaching across boundaries to make needed connections.
On behalf of Bio-IT World, Mana Chandhok, producer of the upcoming Digitalization of Pharma R&D conference, recently spoke to Van der Auwera about her role in making those connections and what the life sciences need to move forward.
Editor’s Note: Chandhok, Conference Producer at Cambridge Healthtech Institute, is planning a conference dedicated to Digitalization of Pharma R&D at Bio-IT World West, March 1-4, 2020, in San Francisco. Van der Auwera will be speaking on the program; their conversation has been edited for length and clarity.
Bio-IT World: Your Twitter account describes you as a communicator at the interface of life sciences and technology. Can you explain a bit more about what that means?
Geraldine A. Van der Auwera: Sure. So really what it means is that I see my role as translating between different groups of people: people who are scientists and biomedical researchers; people who are more in the data sciences sphere, developing methods and algorithms for analyzing data; and people on the side of pure technology and infrastructure, for example the Cloud vendors who are trying to provide services for the other groups. And so my role is to help these different groups communicate across what divides them, and what divides them is often the jargon, the differences in what they mean and what they’re looking for.
A lot of my day is spent guiding our internal teams in how to communicate effectively to the people who use our services and our products, and at the same time advocating for the research community, advocating for technology groups, and again finding places where I can streamline the conversation and smooth out the process.
And so I have experience accumulated from some different facets of my career. I trained originally as a microbiologist; I spent my formative pre-doctoral and postdoctoral years in microbiology labs, mainly in a wet lab setting, doing molecular microbiology and some classic microbiology. So I’ve accumulated a lot of experience as a researcher, as an investigator on that front. Then through a combination of chance, opportunity, and personal affinity, I got closer to the computational side of things and closer to the technology. And for the past, I would say, seven years at the Broad, a lot of my work has been focused on helping support their research community in their use of computational tools. And so through that I’ve accumulated a lot of experience with data science and technology. And having experienced the struggles of each group as they try to understand the others, I think I’m positioned to help them communicate and identify the areas of opportunity where they can help each other most effectively.
What is your favorite part of the job?
My favorite part is when I can facilitate and hopefully witness one of those aha moments. When maybe it’s a group of researchers who understand that they need to use this newfangled Cloud thing, but they’re feeling lost and they’re not sure how to even get started. And I’ll have the pleasure of giving them a demo or a tour or some kind of introductory session to orient them. And you can see the pieces falling into place and people realizing, ah, now I see how this is going to help me in my work.
Sometimes it can be about the technology. Sometimes it’s more about the tools and algorithms. I’ve done a lot of that with GATK (Genome Analysis Toolkit) in the past. On the life sciences and biology side there’s often not been a lot of formal training in computational science and technology, and people come to the table with some anxiety and some feeling of being overwhelmed by the technicalities. If I can get them to the point where they realize it’s actually simpler than they feared, and get them to the place where they understand how this is going to work for them and how they’re going to be able to do their research with these tools, that gives me great pleasure.
That sounds like it can be really rewarding. Is there anything that you want the world to know about some of the Broad’s offerings?
There’s a lot of software that we produce and put out there for the community to use. Now we operate data science platforms as a service. We provide technology and other kinds of services to the biomedical research community. The biggest thing, I would say, is that we’re striving to make these resources available to the worldwide community in a way that’s open and transparent and that will mesh with the efforts of others. We’re involved in several partnerships with a common goal of building an open ecosystem that solves problems, solves challenges, but doesn’t lock people in. So it’s not like you have to pick a vendor and then you’re locked into a particular solution.
We really want it to be an ecosystem where you can come find a solution to some of your research needs and make progress, but still be able to utilize other solutions, whatever solutions are right for you. It’s actually part of a more formal project vision called the Data Biosphere. DataBiosphere.org spells out the principles that we’re trying to pursue here of openness and transparency: building resources that are community driven so that we’re sure that they’re actually addressing real needs of the research community, and doing so in a way that’s collaborative and open so that anybody can participate if they want to.
Do you think the industry is going to embrace open source with open arms one day? Do you think we’re headed in the right direction?
I think we’re headed generally in the right direction. I think much of the industry already has embraced open source to some extent. Certainly in technology there’s a huge amount that is built on open source software already. On the methods development side, the field of bioinformatics, which underlies a lot of the method development in the field, has been based on open source principles since day one, with goals of sharing openly and reusing each other’s work, because ultimately you’re working toward the same goal of improving outcomes for people.
So I think there’s already a lot that’s founded on open source principles. Because there is a huge increase right now in the opportunity for generating large amounts of data and utilizing those in a biomedical research context, I think there are some checkpoints where we have to be careful that we continue to follow these open source principles to make sure that what we’re building evolves in the right direction. And certainly the Data Sciences Platform at the Broad made some very strong commitments to working on open source principles, and we seek out partners who have the same mindset. I think as we show the value of this approach, this will continue to convince some of the players who might not be completely on board yet.
There’s a huge amount of value, I think, in openness and transparency in the tooling and the platforms so that the tools can be interoperable. You can look at the difficulties that we have with medical records because there are these large systems that are entirely proprietary and that don’t work with each other. That causes huge difficulties for the downstream research that could use that data to provide meaningful improvements to health care. I think that is an example of something we have to be very careful around.
From the Human Genome Project to where we are now, there’s been a wide fascination following the progress of genomics. With sequencing now driving precision medicine forward, one of the biggest challenges is validating, analyzing, and interpreting the data. What do you think is going to help us overcome these challenges?
Well, at the risk of sounding like a broken record, openness, transparency, and interoperability. There are a couple of challenges here. I think there are some technical challenges in making sure that we respect the privacy of the people whose data we’re working with. We have a very strong focus on security and privacy and making sure that we can honor that. But beyond that, we need to have strong standards so that we can understand and compare the results of the analyses that are done. We’re talking about massive amounts of data and very complex methodologies. And so it’s crucial to make sure that what we’re doing is effective, in the sense that we’re discovering real things and not just artifacts. Because one of the problems with really large amounts of data is that it becomes easier to discover things that are artifacts, not real biological phenomena. To ensure the quality of the research we’re doing, we need to be able to verify the validity of the approaches. The best way to do that is through transparency and making sure that other groups can replicate and reproduce our results.
To briefly disambiguate, replicating the findings of a study typically means using orthogonal approaches to figure out whether the insights about how biology works are still supported when we look at the problem from a different angle. So that’s verifying the biological validity of the findings of a study. When I’m talking about reproducibility, I mean that I’m able to rerun the analysis that you performed under the same conditions and get the same results. Given the same inputs and the same code, I can rerun it and get the same results.
Both of those are difficult and the way we get there is by sharing our methodology openly, making sure that we have benchmarking procedures and validation procedures that are going to help us achieve that goal.
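As an editorial illustration of the reproducibility definition above (same inputs, same code, same results), here is a minimal Python sketch. It is not a Broad or GATK tool: the toy `run_analysis` function, the input values, and the seed are all hypothetical, and hashing a canonical JSON rendering of the output is just one simple way to compare two runs exactly.

```python
import hashlib
import json
import random


def run_analysis(values, seed):
    """Toy stand-in for an analysis pipeline (hypothetical): a bootstrap
    estimate of the mean with an approximate 95% interval."""
    rng = random.Random(seed)  # a fixed seed makes the resampling deterministic
    n = len(values)
    means = []
    for _ in range(200):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return {"mean": sum(values) / n, "ci_low": means[4], "ci_high": means[194]}


def fingerprint(result):
    """Hash a canonical JSON rendering so two runs can be compared exactly."""
    canonical = json.dumps(result, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same inputs, same code, same seed: the two fingerprints should match.
inputs = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5]
first = fingerprint(run_analysis(inputs, seed=42))
second = fingerprint(run_analysis(inputs, seed=42))
print("reproducible:", first == second)
```

Real pipelines have many more sources of variation (software versions, reference files, hardware), so a check like this is only the simplest version of the idea.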
You mentioned patient privacy while you discussed all the openness, the transparency, the privacy, the security. I’ve heard some discussion of how synthetic data creation can eventually help create datasets that still allow patients to keep their privacy. How would that work?
It does very much tie back to what I was saying. Synthetic data can give us a way to provide reproducible examples. When somebody publishes a method or a paper, a lot of the time they’ve developed a method specifically for that paper and for the analysis. And what we see often is that researchers are not able to share the patient data, for good reason. So that poses the problem: How can you effectively show that your method works without actually showing the data that it was applied to? For some kinds of analysis it’s not a problem; you can substitute some standard public data and show that the tool works as designed. But for some of the more detailed analyses, and especially in the tertiary analysis space (identifying genes in which you have clusters of mutations that are associated with a particular trait or disease), you can’t just run that on any public data. The first step to trying to apply somebody else’s analysis is to reproduce it as they originally did it. Because otherwise if you just take the code and run it on your own data, you have no way of knowing whether the results are being produced as expected or if there’s an error somewhere in how you’re running it.
That’s where synthetic data come in. The idea is basically to create simulated data that doesn’t belong to anybody in particular, but that has the desired traits and associations built in. Then you can run the analysis and demonstrate that the tool pulls out the expected results, but do it from data that is completely public, sharable.
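To make the idea concrete, the following is a minimal editorial sketch in Python, not a method described in the interview. It simulates a small cohort in which one variant is deliberately tied to a trait, then checks that a very crude association scan recovers it. Every name and number here (`make_synthetic_cohort`, the 0.3 allele frequency, the 0.2 baseline risk, the effect size) is a hypothetical choice for illustration.

```python
import random


def make_synthetic_cohort(n_people, n_variants, causal_index, effect, seed):
    """Simulate genotypes (0/1/2 alt-allele counts) and a binary trait.

    All parameters are hypothetical knobs for illustration only.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n_people):
        # Two independent draws per variant, each alt allele with frequency 0.3.
        genotypes = [
            int(rng.random() < 0.3) + int(rng.random() < 0.3)
            for _ in range(n_variants)
        ]
        # Baseline risk of 0.2, raised by `effect` per copy of the planted variant.
        risk = min(1.0, 0.2 + effect * genotypes[causal_index])
        cohort.append((genotypes, rng.random() < risk))
    return cohort


def strongest_association(cohort, n_variants):
    """Return the variant whose carriers differ most from non-carriers in trait rate."""
    best_index, best_gap = None, -1.0
    for j in range(n_variants):
        carriers = [trait for genotypes, trait in cohort if genotypes[j] > 0]
        others = [trait for genotypes, trait in cohort if genotypes[j] == 0]
        if not carriers or not others:
            continue
        gap = abs(sum(carriers) / len(carriers) - sum(others) / len(others))
        if gap > best_gap:
            best_index, best_gap = j, gap
    return best_index, best_gap


cohort = make_synthetic_cohort(2000, 20, causal_index=7, effect=0.3, seed=1)
found, gap = strongest_association(cohort, 20)
print("planted variant recovered:", found == 7)
```

Because no record corresponds to a real person, a dataset like this could be shared alongside the code, which is the property the answer above is describing.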
That’s one classic use case. The other kind of use case is for benchmarking. If you have multiple tools that purport to do the same thing, you need a way to figure out how they compare in terms of performance. Not just how fast they run, but how accurate are they? How susceptible are they to artifacts and imperfections in the data generation process? The quality of the benchmark is limited by the quality of the data.
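The accuracy side of such a benchmark can be sketched as follows. This is an editorial example under stated assumptions: the truth set and the two call sets are invented, `score_calls` is a hypothetical helper rather than part of any real benchmarking tool, and precision, recall, and F1 are standard metrics chosen here for illustration.

```python
def score_calls(truth, calls):
    """Compare one tool's variant calls against a known truth set."""
    truth, calls = set(truth), set(calls)
    true_pos = len(truth & calls)
    false_pos = len(calls - truth)
    false_neg = len(truth - calls)
    precision = true_pos / (true_pos + false_pos) if calls else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented truth set, as would be known by construction for synthetic data.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 900, "T", "C"),
}
# Hypothetical tool A finds three true variants plus one artifact.
tool_a = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 500, "A", "C"),
}
# Hypothetical tool B is more conservative: two true variants, no artifacts.
tool_b = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A")}

scores = {name: score_calls(truth, calls)
          for name, calls in (("tool_a", tool_a), ("tool_b", tool_b))}
for name, result in scores.items():
    print(name, result)
```

With synthetic data the truth set is known exactly, which is what makes this kind of comparison possible; how much it says about real data depends on how realistic the simulation is.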
There are different ways of generating synthetic data; it’s an active area of research. There are many kinds of data: raw sequencing reads, or variants that have already been identified, or omics data types, or phenotype information that comes from either medical records or lifestyle questionnaires. The more we push the realism of the synthetic data, the more likely we’ll be able to provide reproducible examples of datasets that we can bundle with publications. On the other hand, if we’re generating synthetic data that is too clean, too perfect, that doesn’t capture some of the systematic biases that are produced by the instruments, which by nature are imperfect, then we have some limitations in terms of what the benchmarks can tell us about the methodologies.
One very exciting opportunity is how we can use machine learning approaches to develop synthetic data sets that are even more realistic, that capture the quirks of the biology, but also the quirks of the instrumentation that is used to generate the data in the first place.