FAIR Data, A Fair Desideratum
April 11, 2019 | Scientific discovery and innovation are not solitary undertakings. Advancement depends on the ability to build upon the work of others, so it is imperative that experimental data be easily retrievable and of reliable quality. The principles of FAIR data—Findability, Accessibility, Interoperability, and Reusability—seek to improve the way data are collected and stored, with the goal of preserving datasets from original research while allowing them to be reused or combined in novel ways.
On behalf of Bio-IT World, Kaitlyn Barago interviewed Tom Plasterer, Director, Bioinformatics, and Mathew Woodwark, PhD, Director, Research Bioinformatics, Data Science and AI, BioPharmaceuticals R&D, about how they are using FAIR principles in their work at AstraZeneca.
Editor’s note: Kaitlyn Barago is a Conference Producer for the Bio-IT World Conference & Expo 2019 held in Boston, MA, this April 16-18. Mathew Woodwark will be a keynote presenter on the 16th, exploring the value of obtaining, using, and preserving FAIR data. Their conversation has been edited for length and clarity.
Bio-IT World: As it matures, the philosophy of FAIR data has evolved from focusing on making data fit into FAIR principles to utilizing FAIR data in research. What are some of the ways that you are using FAIR data at AstraZeneca?
Tom Plasterer: There are a couple of different places. I think we started down this journey with competitive intelligence (CI). The idea was: can we take multiple external sources of clinical trial and drug portfolio data and synthesize them into a FAIR-compliant knowledge graph that would then allow us to seamlessly address questions in the CI space? That’s really where we started a couple of years ago with the CI 360 program, which has been written about in Bio-IT World and won an award a couple of years ago.
From there it’s evolved into a program called Integrative Informatics that Mathew and I lead. What we’re trying to do here is integrate a lot of translational science cross-omics data—anything from transcriptomics, proteomics, metabolomics, clinical chemistry, and patient surveys, pretty much anything in the space of pre-clinical research up through early clinical research—with the idea, again, of using FAIR principles to populate a knowledge graph so we can flexibly ask questions across it for the scientific community. It’s really those two main areas where we started, and I think now it’s starting to expand a little more toward earlier R&D and also toward the clinic. But the sweet spot has really been around translational research.
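Editor’s note: as a rough illustration of the knowledge-graph pattern Plasterer describes, the sketch below uses rdflib in Python to load assertions from two hypothetical ‘omics sources into one RDF graph and ask a single question that spans both. Every dataset name, URI, and predicate here is invented for illustration; this is not AstraZeneca’s actual model.

```python
# Minimal knowledge-graph sketch: two hypothetical 'omics sources,
# one cross-source query. All URIs and predicates are invented.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/sketch/")
g = Graph()

# Assertions that might come from a transcriptomics dataset...
g.add((EX.TP53, EX.upregulatedIn, EX.TumorCohort1))
# ...and from a proteomics dataset, loaded into the same graph.
g.add((EX.TP53, EX.encodesProtein, EX.p53))
g.add((EX.p53, EX.detectedIn, EX.TumorCohort1))

# One query spans both sources because they share identifiers.
q = """
SELECT ?gene ?protein WHERE {
    ?gene ex:upregulatedIn ?cohort .
    ?gene ex:encodesProtein ?protein .
    ?protein ex:detectedIn ?cohort .
}
"""
for row in g.query(q, initNs={"ex": EX}):
    print(f"{row.gene} is corroborated at the protein level by {row.protein}")
```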
Mathew Woodwark: Yeah, and to add to that, the word ending on Integrative Informatics is important. It’s integrative, not integrated. The idea is that you can bring together datasets to answer a specific question in a transient way. That’s partly because we don’t want to go back down the path of an enterprise data warehouse, where we’ve all been before. The other reason this is important from a FAIR perspective is that we want to be able to pull together datasets and integrate them so that we can query them. To do that, we need to know that they comply with the FAIR principles, so that we don’t have to do a huge amount of QC and checking on that data at the point of integration. We’ve already got an idea of how integratable they are before we run the query.
I think the other evolution in terms of using FAIR data this way is integrating our semantic world with our analytics world. Some of that analytics is large-scale ‘omics: domain-aware applications and large, complicated analyses. Some of it is cleverer ways of querying our clinical data. We want to represent as much as we can in the knowledge graph, so that we can use semantic techniques to run those queries, but we recognize that we need a hybrid approach for some types of data that fit better into either a flat file or a relational structure.
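Editor’s note: a minimal sketch of what such a hybrid might look like, assuming a semantic layer that selects subjects and a flat file that holds their measurements. The cohort, subject IDs, and lab values below are hypothetical (Python, rdflib plus pandas).

```python
# Hybrid query sketch: the semantic layer picks the subjects,
# the tabular layer supplies the measurements. All data invented.
import pandas as pd
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/sketch/")
g = Graph()
g.add((EX.Cohort42, EX.studiesDisease, EX.IPF))
g.add((EX.Cohort42, EX.hasSubject, EX.subj_001))
g.add((EX.Cohort42, EX.hasSubject, EX.subj_002))

# Semantic step: which subjects are in cohorts studying this disease?
rows = g.query(
    "SELECT ?s WHERE { ?c ex:studiesDisease ex:IPF ; ex:hasSubject ?s . }",
    initNs={"ex": EX},
)
subjects = {str(r.s).rsplit("/", 1)[-1] for r in rows}

# Tabular step: pull those subjects' lab values from a flat file.
labs = pd.DataFrame({
    "subject": ["subj_001", "subj_002", "subj_003"],
    "crp_mg_per_l": [3.1, 7.4, 1.2],
})
print(labs[labs["subject"].isin(subjects)])
```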
What are some of the challenges that you have faced in adopting FAIR data in your organization?
TP: There are two different types of challenges: technology challenges and cultural challenges. Technology is probably the easier of the two. We’d been waiting for a number of things to mature in the technology vendor and partner space before we were able to do this at scale. I think that’s now changed, so it’s less of an issue. A couple of things really helped on the technology side. One, the IMI Open PHACTS project really proved that this sort of approach is ready for prime time. Certain pieces they built within their project aren’t still around, but a lot of the patterns for how you do linked data at scale—which eventually became some of the patterns for how you do FAIR data at scale—came out of Open PHACTS. That community is now driving the FAIRplus project. Other parts are coming to the fore, too—the Pistoia Alliance FAIR Data Advisory Board and FAIR toolkit. You’re also starting to see very nice bespoke toolkits emerging around things like FAIR data hackathons. I know some of those will be shown at the Bio-IT hackathon.
The cultural challenges are multiple and layered. If we look at it from the science perspective, scientists are pretty much trained to work toward a paper, and even in our setting a lot of the effort is focused on clinical trial submission. Having it in your mindset to begin with to treat data as an asset—to collect it in such a way that you can reuse it later for questions you hadn’t considered—researchers just don’t think like that, at least not yet. I think that’s one of the first major things that needs to change. For scientists, it’s about collecting your data in such a way that reuse is built in, as best as you can possibly anticipate it, rather than collecting it for one particular purpose.
MW: Tom paints a stark picture, as if we don’t do it at all. Actually, it’s starting to happen. The overriding driver is still to test that hypothesis in that experiment, whether it’s research “in a Petri dish” or a clinical trial. Here’s what I’m trying to test; I need to collect my data; I need to test my hypothesis. In the past, the next step has been to write it up, whether as a paper, a report to a portfolio team, or a submission for clinical trials. But what we’ve got people thinking about now, for example in our early clinical development teams, is generating the data for reuse. In fact, I was just talking to our biological screening team about how we generate that data for reuse. How are we tagging that data? How are we structuring our experimental design so that the data is comparable from one day to the next? Or, if necessary, how are we generating our data in large blocks to reduce variability so that we can compare it? This thinking is starting now, because people are recognizing the value of not only pooling data so that you’ve got larger numbers for better statistical power, but also bringing together a lot more corroborating evidence from different data types.
TP: From the IT side, there’s been reluctance to embrace new technologies, especially as we’ve developed so many patterns around thinking of data in tables or in a relational database. I think we’ve gotten very good at taking a particular structure of tables within tables within tables and making that work to some extent. But that’s not really how science approaches a problem. You have a mismatch between how data is typically collected and how applications are typically served, and the way that scientists want to think about it. Scientists want to think about all the connections between their data. So they’ll think: how do I collect this set of gene transcripts, how does this relate to behavior in a pathway, how does that pathway influence the disease, how does that disease manifest itself in a population? It’s inherently looking across all of these different data types to address the science and health questions, and that’s not how IT thinks about it. You need to get IT to switch their thinking and their structures to a flexible, graph-style way.
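Editor’s note: to make the graph-style framing concrete, here is a minimal, hypothetical sketch of the chain Plasterer describes—transcript to pathway to disease to population—where a single SPARQL property-path query walks the whole chain. Every identifier is invented.

```python
# Hypothetical transcript -> pathway -> disease -> population chain.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/sketch/")
g = Graph()
g.add((EX.transcript_A, EX.memberOf, EX.WntPathway))
g.add((EX.WntPathway, EX.implicatedIn, EX.ColorectalCancer))
g.add((EX.ColorectalCancer, EX.prevalentIn, EX.NorthernEuropeCohort))

# One property-path query traverses the whole chain at once,
# something a fixed relational schema makes much harder to express.
q = """
SELECT ?t ?pop WHERE {
    ?t ex:memberOf/ex:implicatedIn/ex:prevalentIn ?pop .
}
"""
for row in g.query(q, initNs={"ex": EX}):
    print(f"{row.t} links, via pathway and disease, to {row.pop}")
```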
MW: For a long time, molecular biologists were quite reductionist: this gene codes for this protein that has this effect in this system. What people are going back to—because that was the thinking before we became reductionist—is thinking holistically. We weren’t reductionist when we didn’t have any data; people could think about things theoretically. Then we got enough data that we had to reduce it down to practice. Now, because people have not only enough data but enough computing power, we’ve gone back to asking: what do all of these different levels of observation tell me about my problem?
So the classic example is: I’ve got a variant in the genome, and I see some disease association through some change in phenotype in patients. What’s all the corroborating evidence in terms of how that gene is expressed and how the protein is expressed? What does that mean in terms of how the cellular systems behave?
We have to think about things in a more holistic way now to solve more complex problems. To do that, the data has to be comparable, and it has to be generated for reuse. And this goes back, again, to being able to use FAIR data as a vehicle to enable that comparison. We do talk about FAIR data internally, but it’s more tractable in terms of getting people to engage by saying: if you treat your data in this way, you’ll be able to answer a different set of questions that you can’t answer now. You’ve got to tie it back to the impact on the scientists and say, for example, “You will come up with a pattern that lets you take non-invasive samples from a patient and detect whether they may have a higher likelihood of disease.”
TP: That’s the first time we’ve actually mentioned the principles and we’ve been talking for 15 minutes. So that just shows you that it’s really about the science and the sort of things that you either couldn’t do because your data wasn’t FAIR, or you now can do, and can now think about, because it will be FAIR.
What are some of the ways that you’re seeing FAIR data and the FAIR principles enabling applications of AI and machine learning in your organization?
TP: We’ve now seen this in a couple of different places, both inside and outside. One of the things we see in our industry is that people are healthy skeptics when a new technology comes along. We’ve been dealing with the big technology vendors coming in and saying, “I’ve got a hammer, let’s use it on your problem” for quite some time. Because of that, there’s a feeling that AI might be just another hammer. We need it to be specific to the problems in our industry, so the machine learning, artificial intelligence, and deep learning industry needs to think about the specific use cases where it makes sense.
But before we can even get started there, we need clean, well-sorted data. In commerce and other industries, they can get over data problems because they don’t have the same data variety; they have far more data volume and can still get signal out of it, even if the data isn’t well cleaned up. I think the belief within R&D and life science is that if you don’t have it cleaned up, because of the variety, you have no chance. This is where FAIR and AI/deep learning can complement each other really, really well. If you can figure out the well-scoped problem for which AI is just another analytic data transformation, and you’ve provided really well-structured, clean data—ontologically, semantically well-described data to begin with—then you have a chance. I think that’s where we get excited. But we have to hit that threshold first.
MW: We spend a huge amount of time tidying up data before we can run any analytics. If we were able to generate it in a way that made it more consistent before that point—even if it’s just better tagged—we would get a huge benefit out of it.
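Editor’s note: as a small illustration of what “better tagged” can mean in practice, the sketch below normalizes free-text tissue labels to a controlled-vocabulary identifier before analysis. The sample data and mapping are invented for illustration (Python/pandas).

```python
# Hypothetical tag clean-up: map free-text tissue labels onto a
# controlled vocabulary so datasets become comparable downstream.
import pandas as pd

raw = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "tissue": ["Liver", "liver ", "hepatic tissue", "LIVER"],
})

# Illustrative mapping to an ontology-style identifier.
VOCAB = {"liver": "UBERON:0002107", "hepatic tissue": "UBERON:0002107"}

raw["tissue_id"] = raw["tissue"].str.strip().str.lower().map(VOCAB)
print(raw[["sample", "tissue_id"]])  # four spellings, one identifier
```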
When it comes to AI and machine learning, we have a pretty good handle on machine learning. There’s the training set and the test set. You’ve got a set of rules that you define. The toolkits are well established. As long as you can generate a reasonably clean set of data and you’ve got a well-defined hypothesis, you’re good to go. We use it a lot.
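Editor’s note: a minimal sketch of the well-trodden train/test workflow Woodwark describes, using scikit-learn on synthetic data rather than any real assay.

```python
# Standard supervised-learning workflow on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a cleaned, well-tagged dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```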
On the AI side of things, trying to use a deep learning approach to come up with obfuscated factors, or whatever it may be, is harder. That’s partly because people thought that if we throw all our data at it and put some magic over the top, then we’ll get some patterns. Well, you do get patterns, but how meaningful are they? So where we are now—going back to challenges in adopting FAIR data in the organization—it’s about generating the right datasets to enable you to use these techniques.
It starts with the data, and it starts with the questions. If you’ve got those right, then the tools you use to interrogate it aren’t irrelevant, but they become less of a problem. The tech is the enabler, not the barrier. I think that’s a key thing we’re learning: you can’t just throw these tools at a mismatched set of data where you haven’t done enough QC to see whether it aligns properly.
The other thing is that people believe there are nuggets in that corpus of data you’ve got. There are, but how valuable are they when the key decisions you needed to make on that dataset have already been taken? So yes, there’s probably some value in mining project data from projects that failed for competitive or other non-scientific reasons, because you might find new indications, or you might find that actually there’s a bias theory you can go after against that dataset. But if you’re looking at data from things you’ve already made a decision on, with a successful outcome, it may be less useful in that way. Instead of starting with the data and trying to work out what you can find in it, start with the question. Work out which data you need to generate and then which tools you need to query it with.
What do you think it will take for FAIR data to become a standard, and do you think it will ever get to that point?
TP: My take on this is that, to some extent, it’s going to have to wait until people like myself and Mathew have moved on, although I think we’re at least advocates of it. It has to become part of the standard toolkit—part of your graduate school requirements and part of your granting process. We see this in places like the G20 support of FAIR data, and in the idea that if you’re going to get ‘omics grants in Europe at all for any sort of cross-omic analysis, you have to set aside a certain percentage for data stewardship so that your data doesn’t just disappear. I think as young scientists are trained in this, and it becomes part of what they have to do—to learn how to verify your data and think about sharing and reuse—it will slowly change.
Now, that doesn’t help us very much at the point we are in our careers, but there are other things that can. I already mentioned the FAIRplus project; a major component of it is putting together a cookbook showing how to FAIRify your data. The Pistoia Alliance, again, is producing a FAIR toolkit, so that helps. Things like the Bio-IT hackathon are enormously helpful, especially since it’s two days where one can come in knowing nothing about FAIR data and, at the end of it, have a pretty good sense of where it’s going. That’s one of the things I’m interested in: what are all the new accelerants, so that we can really start to use this to transform the industry, and what are the barriers stopping that? Obviously we don’t want to wait for the next generation of scientists to make it all the way through, but a lot of it is getting people to recognize the value of having data verified, and asking whether they are willing to either a) change how they create their data, so it’s more FAIR to begin with, or b) go back and retrospectively clean some of it up. That’s only going to happen if there’s a proper economic incentive for doing that clean-up work.
MW: Yeah, I was talking to a group of forty 14-year-olds yesterday about FAIR data and what it means, so we’re trying to plant the seed really early now—though I’m pretty sure those 14-year-olds won’t come to benefit the industry before my time is done. We need to start getting the message out early, as Tom said, but we also need to highlight the benefits of following this approach. If the idea of reusing data and querying across multiple data types to come up with more insights on your problem gets traction—as it’s shifting in our organizations now—so that it doesn’t all have to be generated de novo within your project and you can link it to other data, then FAIR data is the enabler for that. If people get that and understand it, that’s what drives it.
We’re doing a lot of exploratory reuse of our own clinical data now, and that’s because the human is the best model organism to develop drugs for the human. When we look at being able to use other systems, we’re always extrapolating, and to a degree, guessing how that will turn out when we get it into humans. If we can reuse that data, have that data more easily available, have that data interoperable and reusable, then we get a much greater chance of success at the other end in terms of making drugs that are tailored to the right conditions. So there’s a great economic driver for us as pharma companies to be able to reuse our data. It has benefits in lots of different ways.
We need to get people to understand that the enabler for reusing the data is that you can find it, you can get to it, you can join it up, and then you’ve got the ability to reuse it, whether the constraints are permissions or technical ones. We haven’t hoisted a flag that says we have to do FAIR data because it’s a great acronym; we’re saying we need to be able to find our data, get access to it, and join it up so that we can answer complex questions. You may not even use the word FAIR for that. But you’re trying to get those principles embedded so that people get the value out.
TP: There is tension between data reuse and our desire, as scientists, to explore openly when we don’t know what the questions are. There are also GDPR and other privacy concerns, especially in Europe but also in the US. People do not want somebody else to come and take advantage of their medical data. We have to come up with a way to address both of those concerns. If you look at places where patients willingly contribute their data—places like PatientsLikeMe and Cancer Commons—the general thought is that if it’s being used for research purposes, and to help people “like me,” then people are pretty willing to share. Coming up with ways we can all play nice here and do this in a responsible manner is really important.
On the flip side, and we’ve seen this many times within pharma, the use of clinical data is very, very restricted and very much tied to consent around a single question. Having it tied to a single question designed to get a drug approved doesn’t necessarily mean it’s going to help with the next trial, the next set of patients, or an indication that’s close to the one you’ve tested but not quite the same. There are places like that where we need to be a little more open with the public about how we want to use the data, why we want to use it, and how it’s going to help. And maybe that’s not something pharma needs to lead and push; it’s a general idea the research community needs to get across.