Building A SEAL Team Of Data Science Experience
September 6, 2018 | At the 2018 Bio-IT World Conference & Expo, we dedicated the Wednesday morning plenary session to a panel discussion on data science. It was a rich conversation. Panelists laid out their own definitions of “data science”, discussed how to build and structure a data science team, and debated what role expertise (data science or domain) should play in experimental design. As we talked, we collected questions from the audience—more than we were able to use. Instead of abandoning the community’s contributions, I’ve collected the top-voted questions from the panel and offered them back to our community. Want to weigh in? Download the question list and join the conversation. A.P.
John Reynders, Vice President of Data Science, Genomics, and Bioinformatics, Alexion Pharmaceuticals
Bio-IT World: What is the optimal environment to grow data science?
John Reynders: The optimal environment to grow data science, I'd say, is one that's very multidisciplinary. One where you have a very large dynamic range, where you have access to the strategic critical questions facing the organization, facing the enterprise. But also, the hands-on expertise and skills that can connect that strategic challenge with a strategic solution. It’s not ideal when things are separated: strategy talks over here and then the data science analytics that might help address the challenges over there. Having those in close proximity is important.
Sometimes data science gets misclassified as a sub-discipline of one of its component fields, whereas it is actually a field with many, many different dimensions. It's algorithms; it's stats; it's machine learning; it's data engineering; it's a lot of these pieces together. The team should be curious and explorative. There has to be a willingness to take risks and try things. It's okay to try something and have it not work; that leads to learning and a different approach. Data science needs a supportive environment that allows people to take risks and answer questions in a lot of different ways. You have to support that risk taking; it is essential to pushing the envelope of innovation that data science requires.
During the panel we talked about the size of the team, and your team is small: six people, right?
Right.
How do you ensure enough diversity for a multidisciplinary team, but still keep it fairly small?
Consider a Navy SEAL team: these teams are pretty small, but each member brings a very powerful and diverse set of skills to the table. They are fairly senior, seasoned people who have a diversity of experiences over their careers.
You can have a hundred people each with six experiences, or you can have six people each with a hundred experiences. I'm going to err on the side of the more seasoned experienced folks, with a diverse set of experiences.
What are the different levels of maturity in a data science lab? What does the roadmap look like?
There are many examples of capability maturity models out there: step one, then step two, three, four, five. I've seen some folks attempt to do that with data sciences. But I have a simple rule: the proof of capability maturity is in the pudding. What questions did you answer?
We can spend a lot of time building a data lake with a significant level of engineering effort, and it could be high-quality cloud computing with effective data extraction, transformation, and loading—and here's the data lake! But if nothing comes out of it—except for the occasional swamp monster—then it hasn’t really answered a question. It is not “build it and the questions will come.” Maturity for me is, “does the organization value, trust, and count on the data sciences analytics to answer the critical questions?” Are you on the Bat Phone: “Oh my goodness, here's what we're facing, call the data science folks, we need that insight!” That to me is what maturity is—it is more than effort; it must be coupled with impact.
What skill sets would you recommend graduate student researchers have? Does domain expertise still matter in the era of data science, machine learning, and AI?
Yeah, that's a great question. Back in the day—I mean more of when it was informatics, before it became data sciences—I would have erred on the side of someone having fairly decent domain knowledge. Then they learn the toolkit along the way. But the data sciences tool kit has become so deep, so vast, so rich, and so powerful, I'd switch it now the other way: someone with deep data sciences background and familiarity with the domain.
I'm probably not going to take someone from finance and apply them to biotech if they have zero life sciences background; that's probably a bit of a stretch. But someone who is deep in data sciences and has some background experience in life sciences: that's the model we're seeing nowadays in terms of how people are finding opportunities, how employers and employees are connecting in the space.
Now as for the training, I recommend a very broad toolbox. On the one hand you can go deep, but there are a lot of capabilities out there. Do you really need to hand-code your own multilayer feed-forward artificial neural network? Probably not, because there are ANN chips you can buy. There's TensorFlow from Google, which enables a broad range of machine learning pipelines on dataflow graphs. Very bright folks and cutting-edge companies have built these tools—the key for the data scientist is to know how to apply them and to choose the right tools to address the questions at hand.
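As a minimal sketch of that point (not from the interview), fitting an off-the-shelf feed-forward network with an existing library such as scikit-learn takes only a few lines; the dataset here is synthetic and purely illustrative:

```python
# Minimal sketch: use an off-the-shelf feed-forward neural network
# (scikit-learn's MLPClassifier) rather than hand-coding backpropagation.
# The dataset is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for a real life-sciences dataset (features -> binary label).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; the library handles initialization, training, etc.
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```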
It's similar to how applied mathematicians train. Applied mathematicians learn a lot of tools for their toolbox: perturbation theory, asymptotics, modeling, simulation, etc. But then they actually learn how to apply those tools in a few different areas, maybe fluid mechanics and plasma physics and geothermics, right?
So what they've learned is how to apply a toolbox. I'd say the same is very applicable to data sciences. With a broad toolbox, you don't necessarily know how to build every tool in the toolbox, but you need to know how to use every tool in the toolbox and actually practice it. Don't just practice it in one discipline; practice it in a few. That's what the ideal master's program would look like: a broad, broad set of experiences, applied in a few different domains. You know the nature of each tool, but you don't necessarily need to know how to build each tool.
How do you define data curation?
There are probably a lot of different ways to define that. I would say it's wherever you actually need the human insight: looking through a publication and extracting the key pieces of information that then feed into a database, going through the literature to understand genotype-phenotype correlations, or going through the literature to understand various epidemiological assessments. As powerful as data sciences is, and as powerful as machine learning is, there's still going to be this interface between the frontier of human knowledge and its representation and how that's captured into a machine-readable and computable form. That interface is what I would define as curation.
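To make “machine-readable and computable form” a little more concrete, here is a minimal, hypothetical sketch of a curated genotype-phenotype record; every field name and value below is illustrative rather than part of any system he describes:

```python
# Hypothetical sketch of a curated, machine-readable record extracted from a
# publication by a human curator. Field names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class GenotypePhenotypeAssertion:
    gene: str            # e.g., a gene symbol
    variant: str         # e.g., HGVS-style variant notation
    phenotype: str       # e.g., an ontology term identifier
    source_pmid: str     # the publication the curator extracted this from
    curator: str         # who made the judgment call
    confidence: str      # curator-assigned confidence

record = GenotypePhenotypeAssertion(
    gene="EXAMPLE1",
    variant="c.100A>G",
    phenotype="HP:0000000",
    source_pmid="PMID:00000000",
    curator="jdoe",
    confidence="high",
)
print(record)
```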
How big a problem is messy data: missing metadata, data that's not FAIR? Is it a roadblock, and can you discuss tricks and tips to streamline the data cleaning part?
I would defer to others who might have a little more expertise in messy data management. I would, however, point to a paradigm that I think is really intriguing in this space. I don't know to what degree I could name companies explicitly, but let's just say there are classes of companies out there that do this interesting melding of machine learning with human learning. It's very applicable to messy data: humans go in, look at the data, and sort through the messiness to apply a degree of structure and relational insight. This then feeds into the machine learning, which says, “hmmmm, well let me learn how the humans tried to clean this up and now I will go try and clean it up myself!”
These are very clever systems that also pose questions back to the humans: "I'm trying to do this. Does it look like X, Y, or Z?" And you say it looks like Y. So you use human insight to help inform the machine learning, and it becomes a virtuous circle of human learning informing machine learning, informing human learning, informing machine learning. If I look at the sprawl of data that exists across enterprises, whether it's siloed, whether it's incomplete, whether there's no master data management strategy, I would offer that the few different companies I've seen with this bicameral approach of human and machine learning represent very promising approaches to messy data.
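The human-in-the-loop pattern he describes resembles what is often called active learning. A minimal sketch, assuming scikit-learn, synthetic data, and a stubbed-out ask_human function standing in for the curator, might look like this:

```python
# Minimal sketch of the human/machine "virtuous circle": a model learns from a
# small set of human-labeled records, asks a human about the records it is
# least certain of, and retrains on the answers. Data and ask_human are stubs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                      # stand-in for messy-record features
true_labels = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in "ground truth"

def ask_human(indices):
    # Placeholder for a curator answering "does it look like X, Y, or Z?"
    return true_labels[indices]

# Initial human pass over a small sample.
seed_idx = rng.choice(len(X), size=20, replace=False)
labels = {int(i): int(l) for i, l in zip(seed_idx, ask_human(seed_idx))}
model = LogisticRegression()

for _ in range(5):
    idx = list(labels)
    model.fit(X[idx], [labels[i] for i in idx])
    uncertainty = np.abs(model.predict_proba(X)[:, 1] - 0.5)  # near 0.5 = least sure
    # Machine asks about its most uncertain records; human answers; repeat.
    queries = [int(i) for i in np.argsort(uncertainty) if int(i) not in labels][:10]
    for i, l in zip(queries, ask_human(queries)):
        labels[i] = int(l)

print(f"Records labeled by humans: {len(labels)} of {len(X)}")
```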
ROI and economics are not to be forgotten when talking about data science. How do you calculate ROI of your efforts?
The ROI case is made when you answer a critical question with data sciences. For example, earlier in my career I saw an early-stage machine learning company identify a previously unknown toxicology signal. For me, it was having our head of toxicology in the room, seeing the insight this system generated, and his response being, "Oh my goodness, we never would've seen that." Boom! The entire system paid for itself ten times over with that one insight.
At another company, we used a variety of machine learning techniques, and they helped us understand a new indication for a molecule in the pipeline. So, boom! Huge value. It's these insights and answers to questions, the ones that create significant value, that make the business case – and build confidence in the investment.
To me it always comes back to that “aha moment”. That toxicologist that went, "Oh my goodness, we never would've seen that." Those are the value stories that are absolutely critical. I'm less compelled by these throughput, systems engineering-y ways of describing value creation. It is the significant value created by compelling, concise, and clear answers to strategic questions that makes the ROI case.
Over the course of your career, have you found that your management has always responded well to that?
In any new environment, one of the most important things to do is get your head around, “What are the strategic and hard questions facing the enterprise?” Then there's a step of being able to help colleagues understand the art of the possible. Would it be helpful to be able to answer this question or that question? Fortunately, what happens in a lot of those conversations is a colleague responds, "We can answer that? That would be incredibly valuable!" Now, with added clarity, you go off and tackle the question. You do that once or twice, and all of a sudden it creates that support and appreciation for what data sciences can do to lend insight. It does start with a lot of work on the part of the data sciences team to really understand the hard, strategic questions facing the company, and the art of the possible—but this is the critical road to take to ensure strategic impact.
So you're sort of training them to understand what questions are even possible? Then delivering on those questions.
It’s more a matter of framing the problem and solution in business terms. There’s a huge risk with data sciences shops that you just disappear into a blizzard of buzzwords: "Well, using an artificial neural network and blah blah with a self-organizing map blah blah data lake." This does not instill confidence in a team’s ability to have business relevance—let alone impact. It starts with learning the business domain, understanding those most challenging questions and then coming back with rigorous answers to those questions in business context, not data science buzzwords.
During the panel you said one of the things you looked for most when you were hiring is a proven ability to learn new information. Is that what you're talking about?
Yes indeed, there's all sorts of machine learning, statistics, high-performance computing, artificial intelligence, and math to be learned. However, in delivering a solution, our senior executives don't care about how elegant the math is. They don't care about what particular machine learning techniques you use. They want an answer to the question. These questions, however, change. And our job as data scientists is to track along—quickly learning the new and necessary domain knowledge to address the next critical question. And, very importantly, this many times means adding new tools to the data sciences toolbox. Not every question calls for a particular machine learning hammer, and assuming it does is a dangerous trap. A data scientist needs extreme learning agility, both in absorbing new domains to understand the key questions and in ensuring they are applying the most appropriate data sciences approach to the problem at hand. It's rather challenging and exciting, and the most successful data scientists I find are those that have an insatiable appetite and ability to learn, with a dash of deep curiosity.
Are there areas you feel are appropriate for data science tools but which have not really embraced them yet?
If you had asked me 10 years ago where we'd be applying data sciences, I wouldn't have predicted the thing my team is finding most exciting now: the broad application of data sciences. There's a risk that data sciences can be locked in a corner somewhere, doing fine work but answering a limited set of questions, like very specific pathway biology questions in discovery.
I'm becoming more and more intrigued and amazed—whether it's manufacturing, whether it's commercial, whether it's strategy, whether it's business development—at the enterprise-level application of data sciences. Core functions. Internal processes. There's just so many ways that it can be applied.
Finally, have you had experience where data science methods are misused, because the implementer didn't really understand them?
I've been very blessed so far. I've not had that many wrong answers generated over the years, and part of that is being super-careful about the assumptions and very clear on the limitations of what a model can inform. When we're describing the answer to a question, we are really clear about here's what you have to believe: the provenance of the data, the error bounds, and the critical assumptions.
And then, especially the more strategic a question is, or the greater the enterprise impact, you look for parallel methods or paired approaches. I've had situations where I was personally involved in doing the data sciences for a very critical enterprise question, and I asked a colleague on my team to assemble an independent and parallel analysis so we could compare results from completely independent approaches. It was nice to see that they overlapped in their predictions—this provided us even greater confidence in our recommendation.
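As a minimal, illustrative sketch of that pairing idea (assuming scikit-learn and synthetic data, not his actual analyses), one could answer the same question with two unrelated model families and check how closely their predictions agree:

```python
# Illustrative sketch: answer the same question with two unrelated model
# families and check how well their predictions agree before trusting either.
# Data and models are stand-ins, not the analyses described in the interview.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Two deliberately different approaches to the same question.
model_a = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_b = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

agreement = (model_a.predict(X_test) == model_b.predict(X_test)).mean()
print(f"Independent approaches agree on {agreement:.0%} of held-out cases")
```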