Trends from the Trenches 2021: Where Bio-IT Stands on Digital Transformation, Data Transformation, AI/ML
By Allison Proffitt
October 5, 2021 | In keeping with his newly self-titled “Friends from the Trenches” billing, Chris Dagdigian wasn’t the only speaker at the Bio-IT World Conference & Expo Trends from the Trenches session. He also welcomed Adam Kraut, Karl Gutwin, and Fernanda Foertter—all BioTeam (in Foertter’s case, former BioTeam) consultants who—like Dagdigian—are regularly involved with a host of clients’ workflows and willing to call the trends as they see them.
The three reported on trends in digital transformation, data transformation, and AI/ML.
Kraut led the charge, tackling Digital Transformation as the first trend worth dissecting. A “conglomeration of a bunch of other buzzwords”, digital transformation has been happening in the life sciences for the past 20 years, Kraut contended, and comprises cloud computing, big data, the internet of things, and AI. “You mix that together with some organizational dynamics, and that’s apparently digital transformation.”
Kraut is no great fan of the term, but he conceded that “digital transformation” can represent a way of thinking about how data can impact an organization. He posed steering questions: How can data inform and influence company decisions? How can data science be embedded in product development? What fundamentals does the company need to perform data science well? How can it build a healthy and robust data ecosystem to leverage expertise and inform company decisions?
Building a Healthy Data Ecosystem, Healthy Teams
A healthy data ecosystem—the set of infrastructure and services that empower a community of data scientists and engineers to make decisions and influence the business—is the goal of digital transformation, Kraut believes. BioTeam has collected characteristics of healthy data ecosystems, and Kraut listed having a set of data principles to guide decisions, preserving data integrity at the origin or instrument, having a culture of data citizenship where users view the data as belonging to the company and consider secondary use, treating pipelines and infrastructure as code, having shared workspaces and experiment tracking, and applying continuous delivery mindsets to AI and ML. He dug deeper into several.
Adopting common languages, file formats, data dictionaries, and ontologies is valuable for building a cohesive and functional data ecosystem, he said. “Cross-functional teams really need to be on the same page when it comes to common languages. If you’re going to combine computer engineers, computer scientists, all kinds of specialists—you really need to develop a common language or you’re not going to be able to communicate very effectively.”
He advocated for adoption of standard semantics, APIs, FHIR formats, the GA4GH APIs, and even standard file formats and programming languages, choosing options with the broadest range of application across your tools and platforms. Common formats help teams move faster and reduce friction.
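To make that point concrete, here is a minimal sketch of working against one of those shared standards: a Python call to a FHIR-compliant server’s standard read interaction. The server URL and patient ID are placeholders, not anything demonstrated in the talk.

```python
import requests

# Hypothetical FHIR server base URL; any FHIR R4-compliant endpoint
# exposes resources over the same REST conventions.
FHIR_BASE = "https://fhir.example.org/r4"

def fetch_patient(patient_id: str) -> dict:
    """Fetch a Patient resource as JSON using the standard FHIR read interaction."""
    resp = requests.get(
        f"{FHIR_BASE}/Patient/{patient_id}",
        headers={"Accept": "application/fhir+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    patient = fetch_patient("example-id")  # placeholder resource ID
    print(patient.get("resourceType"), patient.get("id"))
```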
Infrastructure as code is meant to make our lives easier, not harder, Kraut said, and he highlighted the progression of tools the life sciences have had to choose from: Puppet, then Chef, now CloudFormation and Helm. “Tools like Nextflow and Airflow and CWL are great things because now we can specify the entire end-to-end data pipeline as code; it can be treated like code,” he said. He also highlighted the Amazon Web Services tool ParallelCluster (v3.0), which offers HPC clusters as code, calling it a “personal favorite tool.”
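A hedged illustration of what “infrastructure as code” can look like in practice: the short Python sketch below uses boto3 to stand up a deliberately tiny CloudFormation stack from a template. The stack name and template are hypothetical, and real deployments would usually run from a CI pipeline rather than an ad hoc script.

```python
import boto3

# Minimal CloudFormation template describing a single S3 bucket;
# real stacks for HPC clusters or data pipelines would be far larger.
TEMPLATE_BODY = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  RawDataBucket:
    Type: AWS::S3::Bucket
"""

def deploy_stack(stack_name: str) -> str:
    """Create a CloudFormation stack and wait until creation completes."""
    cfn = boto3.client("cloudformation")
    cfn.create_stack(StackName=stack_name, TemplateBody=TEMPLATE_BODY)
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_name

if __name__ == "__main__":
    deploy_stack("demo-data-infra")  # hypothetical stack name; requires AWS credentials
```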
While Kraut said infrastructure as code should help, he warned about over-investing in automation if it doesn’t actually make your life easier. Infrastructure is not as easy to test as software, he pointed out, and what used to be simple infrastructure changes can now involve dozens of lines of code, pull requests, and pipeline builds. He also flagged the risk to onboarding new team members “when they have to learn this massive stack of stuff just to be effective.”
DevOps is a particularly useful lens for viewing ML and AI, Kraut said. “Instead of dev-test-release like we have with software, you’re now thinking of build-test-deploy of machine learning models and how you can quickly put your trained models into the real world.” The advantages include shortened feedback loops; using end-to-end automation for versioning, testing, and deployments; and eliminating manual handoffs. A DevOps mindset helps speed that process, he said.
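One way to picture that build-test-deploy loop is an automated promotion gate. The sketch below, using an assumed accuracy threshold and a stand-in scikit-learn dataset, only versions and persists a freshly trained model if it clears the gate on held-out data.

```python
import json
from pathlib import Path

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

ACCURACY_GATE = 0.90  # hypothetical promotion threshold

def build_test_deploy(model_dir: str = "models", version: str = "v1") -> bool:
    """Train, evaluate, and only persist ('deploy') the model if it passes the gate."""
    X, y = load_breast_cancer(return_X_y=True)  # stand-in for real training data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    if accuracy < ACCURACY_GATE:
        print(f"Model rejected: accuracy {accuracy:.3f} below gate {ACCURACY_GATE}")
        return False

    # Version the artifact and record its metrics alongside it.
    out = Path(model_dir) / version
    out.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out / "model.joblib")
    (out / "metrics.json").write_text(json.dumps({"accuracy": accuracy}))
    print(f"Model {version} deployed with accuracy {accuracy:.3f}")
    return True

if __name__ == "__main__":
    build_test_deploy()
```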
Finally, Kraut highlighted the characteristics of team health, an essential part of digital transformation. Skills are perishable, Kraut contends. A team must have a continuous learning mindset to stay sharp. At BioTeam, he shared, consultants have a Cloud Club and an AI Skills Club so they can train together, achieving new certifications. “Iron sharpens iron. Problem-solving is the ultimate skill set now. The teams that can learn the fastest are going to be the most effective.”
A team mentality naturally shuns technological unicorns. “I think it’s a real mistake to try to find this world-class AI expert or ML expert if you want to build an effective team… Data science is a team sport,” Kraut said, echoing the urging of Ramesh Durvasula, VP and Information Officer at Eli Lilly from an earlier talk. Diversity is key in high-performance teams. “Look outside your network,” Kraut said.
Data You Can Find, Move, and Use
Karl Gutwin was up next with an even buzzier buzz word: data transformation. If digital transformation means updating the IT strategy, data transformation means using the data itself to improve the business deliverables, he said. Gutwin highlighted three trends within the space worth exploring: data flow automation, data commons, and FAIR data.
“I apologize if you feel I’m leading off with the most boring topic in existence: moving data from point to point,” Gutwin said about data flow automation. But, he said, it’s still “super important. We are seeing consistently as we talk to scientists as part of our assessments in the work we’re doing with our clients: they’re spending a substantial amount of time just simply moving data from system to system.”
Gutwin highlighted a recent project with a mid-sized pharma company that wanted to gather data from dozens of instrument types and make it searchable for scientists in both raw and processed formats. “One of the big concerns was how complex is this going to end up being?”
He recommended using off-the-shelf tooling when possible—he named Apache Airflow but highlighted that there are many options—while balancing custom components against off-the-shelf tools. Gutwin echoed Kraut in reminding the audience that sometimes less automation is preferable to keep the solution tractable for the organization.
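For a flavor of what that off-the-shelf data flow automation looks like, a minimal Apache Airflow DAG might chain an ingest task and a processing task, as in the hypothetical sketch below; the DAG name, schedule, paths, and task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_raw_data(**context):
    """Placeholder: copy raw instrument output into a landing area."""
    print("Copying raw files from the instrument share to the landing zone")

def process_data(**context):
    """Placeholder: convert raw output into a searchable, processed format."""
    print("Parsing raw files and loading metadata into the search index")

with DAG(
    dag_id="instrument_data_flow",   # hypothetical pipeline name
    start_date=datetime(2021, 10, 1),
    schedule_interval="@hourly",     # poll for new instrument output hourly
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_raw_data", python_callable=ingest_raw_data)
    process = PythonOperator(task_id="process_data", python_callable=process_data)

    ingest >> process  # run processing only after ingest completes
```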
A data commons is a trendy way to make a company’s data usable once it’s gathered. The “commons” is the most important part of a data commons, Gutwin said: it brings multiple user communities together to promote data sharing and reduce data siloing. It’s not just a data lake.
Building a data commons is a non-linear and multi-factorial choice, Gutwin said. You can use an existing platform such as Gen3, the University of Chicago platform that BioTeam used for a data commons at Bristol-Myers Squibb. Companies can choose to build a data lake that is “commons oriented”. Or companies can build the commons from scratch. There is no straightforward answer, Gutwin said.
For the BMS example, Gutwin reported that the team began with a Gen3 commons, but has continued to evolve the offering as needs have changed. Expect this, Gutwin said. “A commons is not a fixed thing by any stretch of the imagination.” Commons will change focus over time, and needs will expand as different audiences join, requiring custom portals or workflows.
In both data flow automation and data commons, FAIR data is an excellent guiding star for data-centric designs, Gutwin said, preparing data for both use today and unanticipated uses tomorrow. The acronym FAIR—findable, accessible, interoperable, reusable—really progresses in difficulty, Gutwin observed. He focused on interoperable. “I believe the interoperability of data is challenging, but it’s also the thing that we have the most chance of being tractable in our technical solutions.”
Interoperability does not happen naturally, Gutwin said, drawing from BioTeam’s work with the NHLBI BioData Catalyst. He encouraged companies to use standards whenever possible—and to look for them first before designing. He also highlighted GA4GH standards, specifically the genomic data toolkit, regulatory and ethics toolkit, and data security toolkit.
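As an example of leaning on a GA4GH standard rather than inventing one, the sketch below queries a hypothetical DRS (Data Repository Service) host for a data object’s metadata over the spec’s standard `/ga4gh/drs/v1/objects/{object_id}` route; the host and object ID are placeholders.

```python
import requests

# Hypothetical DRS-compliant host; any GA4GH DRS v1 server exposes the same route.
DRS_HOST = "https://drs.example.org"

def get_drs_object(object_id: str) -> dict:
    """Fetch metadata (size, checksums, access methods) for a DRS data object."""
    resp = requests.get(f"{DRS_HOST}/ga4gh/drs/v1/objects/{object_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    obj = get_drs_object("example-object-id")  # placeholder object ID
    print(obj.get("name"), obj.get("size"))
```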
If, after looking, there are no applicable standards, demand them of vendors, Gutwin said. “By pushing standards forward from the perspective of community requests and community adoption, I think we’ll be able to push ourselves even farther forward toward interoperability.”
Real Talk about AI/ML in Life Sciences
Bringing a tone and candor most similar to Dagdigian, Fernanda Foertter took the stage as the final “Friend” reporting on the latest in AI/ML. “It is not ready for most of us,” she opened. “If any of you are under any impression that somehow you’re going to say, ‘I have all this data and I’m going to implement AI at my organization and we’re going to cure cancer,’ you are absolutely dead wrong.”
It will get better, Foertter reassured the audience. You’ll need more data, better data infrastructure, and long-term planning, she said. “I’m here to reframe where you can put AI in your organization and how you can use it.”
The reality, Foertter said, is that AI is primarily a research tool now, and that’s how it should be used in life sciences organizations. She summarized reality in the life sciences today: swimming in data, all of it is disconnected, there is little continuity, the person who understood any of it left—but the marketing department has already published an announcement about how you’re using AI.
AI and ML require more than just having the data or even finding it. “Just because you search it doesn’t mean you will find it. Just because you find it doesn’t make it usable,” Foertter said, reiterating a point she made in a workshop earlier at the event. FAIR is hard, she emphasized, agreeing with Gutwin’s contention that it gets harder as you advance through the acronym.
But life sciences companies are making progress, and Foertter highlighted the top AI/ML best practices.
- Hire data curators, she advised. Pick “failed scientists” who are tired of doing science but love and understand the data and the technology.
- Hire ethicists. “Do not begin doing AI—especially if you’re in healthcare—without an ethicist by your side. Just don’t,” she warned. “It’s just asking for trouble in the future.”
- Foertter encouraged tagging data as it comes off instruments and finding other ways of tagging data within existing processes (a minimal sketch of the idea follows this list).
- Start applying AI/ML to a new instrument or a new project, she encouraged. It’s too hard to address the whole legacy data lake. Start with the new.
- Begin with internal processes. “Make the lives of your internal scientists easier,” she said. “Do not think AI is going to be something you’re going to push through FDA approval. That’s hard!” Instead, use some AI to help your researchers decide which targets to push through the drug discovery process.
- Find ways to buy or share data. “You do not have enough data to do your own work. Period,” Foertter said. You must find ways to buy or share data, and thankfully there is a lot of work being done on federated and safe data sharing. “In the future, leveraging other people’s data, and the secret sauce being the processes—not your data—is going to be the way you’re going to leverage AI.”
- Reuse mature algorithms from Google and Facebook. Don’t reinvent the wheel, she implored. “This is not a research exercise,” she said to any company scientist seeking to create their own neural network model. “That’s a Ph.D. for somebody else. You’re in production; you’re not here to do research projects.”
- Solve narrowly-focused, well-defined problems. Most of the AI/ML goals within life sciences companies are far, far too broad.
- And finally, Foertter recommended hiring consultants and having access to a variety of experiences, not just topical education.
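To illustrate the tagging point above, here is a minimal, hypothetical sketch of capturing provenance as data lands: a sidecar JSON written next to each raw instrument file. The field names are illustrative, not a recommended schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def tag_raw_file(raw_path: str, instrument_id: str, sample_id: str) -> Path:
    """Write a sidecar JSON of minimal provenance tags next to a raw data file."""
    raw = Path(raw_path)
    tags = {
        # Hypothetical minimal tag set; a real schema would follow an agreed data dictionary.
        "instrument_id": instrument_id,
        "sample_id": sample_id,
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "file_name": raw.name,
        "file_size_bytes": raw.stat().st_size,
    }
    sidecar = raw.parent / (raw.name + ".meta.json")
    sidecar.write_text(json.dumps(tags, indent=2))
    return sidecar

if __name__ == "__main__":
    # Example: tag a freshly written run file (path and IDs are placeholders).
    Path("run001.raw").write_bytes(b"example instrument output")
    print(tag_raw_file("run001.raw", instrument_id="seq01", sample_id="S-0042"))
```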
In all of these best practices, Foertter applauds companies laying the groundwork for AI success, even if a commercial-facing AI app is still in the future.
Her list of company worst practices was generally the opposite of the best ones: ignoring ethics and bias in the data, starting with too-big historical datasets, targeting AI for customer-facing apps, building custom models, trying to solve broad, poorly-defined problems, and believing the company’s own data is sufficient.
But she also flagged a few things she admitted were controversial. She cautioned against buying AI startups. Most AI startups are working with public data. “They are not going to do something extremely novel without better data, and there are not a lot of public datasets out there. Unless there are AI startups that are generating their own data in some way, and then doing the modeling themselves, chances are they’re not going to be any different than anybody else.” Great ideas can’t be well-tested on the same data everyone else has.
Foertter also warned against outsourcing proprietary data lakes, where one vendor has promised to build the lake and tag your data for you. “If you can find a way to work within open source, you’ll be better off than staying within a vendor proprietary ecosystem. If there’s a vendor out there promising to tag your data for you, you’ll be stuck with that vendor for a very long time. Transferring out of that vendor will be a problem.”
Finally, Foertter warned against a “wait and see” approach. “I know I just said that AI is not ready, but dabbling in AI right now is going to improve the way you organize your data for when it becomes ready.” Now is the time, she said, to get a grasp on how hard it is to get the data you need and begin laying the groundwork for data collaborations for the future. “Make 2022 the year of building good infrastructure for your data.”