Pharma-Genedata Collaboration Builds ROI-Rich Tool for Omics Data Exploration
By Allison Proffitt
May 31, 2022 | Innovative Practices Award—Merck KGaA, Darmstadt, started exploring ways to improve their biomarker capabilities in 2017. “Biomedical scientists had only limited access to our clinical data,” explained Eike Staub, Senior Director, Head of Oncology Bioinformatics at Merck KGaA, Darmstadt, in his presentation this month at the Bio-IT World Conference & Expo. “We did not have, really, the mechanisms to make biomarker data from our own studies or our clinical studies broadly available for research. Instead, data were often isolated in silos, not accessible, and this severely limited our capabilities for generating insight from our clinical study data.” It was, he said, a data desert.
So Staub and his team set to work building the X-OMICS platform, a “next-generation digital platform for our biomarker research,” he said. “We wanted one platform to integrate all sorts of biomarker data from all technologies: from clinical, pre-clinical, public, real-world data.” In addition, the team wanted self-service machine learning and AI tools so that researchers could explore and analyze the data on their own, all while keeping permissions and access rights in place without getting in users’ way.
The first version was up and running by 2020, Staub said, and Merck KGaA has seen real efficiency gains since then.
Merck KGaA chose to develop the X-OMICS platform in conjunction with Genedata, a Swiss software provider with 25 years’ worth of experience digitalizing biopharmaceutical research and development.
“Right from the beginning we had an eye to productizing what we were doing with Merck—a cost-sharing model there,” explained Marc Flesch, Head of Profiler BU, Product, Genedata, and Staub’s co-speaker during the presentation.
The X-OMICS platform has three layers: a raw data processing layer, a data integration layer, and an analytics layer. All three are built on Genedata’s Profiler platform, a cloud-native solution based on a highly modularized microservice architecture. The architecture is provided by Amazon Web Services and facilitates data durability, compute elasticity, platform resilience, cost efficiency, and continuous validation.
The raw data processing layer is a GCP-grade repository for raw big data coming off instruments. These data are then processed into condensed molecular readouts from genomics, immunohistochemistry, transcriptomics, cell labeling, and more, Staub explained. Data here are curated and provided by bioinformaticians and digital pathologists.
“The raw data processing layer comes with algorithms wrapped into modules, which can be combined into workflows. And then these workflows can automatically be run to perform multiple analysis steps,” Staub explained. Many workflows have already been developed for specific use cases: tumor mutation burden, microsatellite instability, and more.
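The idea of wrapping algorithms into modules and chaining them into workflows can be illustrated with a minimal Python sketch. This is not Genedata Profiler's API: the function names, the sample structure, and the simplified tumor mutation burden calculation (mutations per megabase of covered sequence) are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical module type: takes a sample record and returns an enriched copy.
Module = Callable[[Dict[str, Any]], Dict[str, Any]]


def make_workflow(modules: List[Module]) -> Module:
    """Chain modules so that each module's output feeds the next one."""
    def run(sample: Dict[str, Any]) -> Dict[str, Any]:
        for module in modules:
            sample = module(sample)
        return sample
    return run


def count_mutations(sample: Dict[str, Any]) -> Dict[str, Any]:
    # Illustrative module: count the variant calls attached to the sample.
    return {**sample, "mutation_count": len(sample["variants"])}


def tumor_mutation_burden(sample: Dict[str, Any]) -> Dict[str, Any]:
    # Simplified TMB: mutations per megabase of covered sequence.
    return {**sample, "tmb": sample["mutation_count"] / sample["covered_mb"]}


# Combine two modules into one workflow and run it on a toy sample.
tmb_workflow = make_workflow([count_mutations, tumor_mutation_burden])
result = tmb_workflow({"variants": ["v1", "v2", "v3"], "covered_mb": 1.5})
print(result["tmb"])
```

Because each module has the same input and output shape, new steps can be added to a workflow without changing the existing ones.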
From there, the condensed data move to the data integration layer, where they are linked with patient or study metadata and ontologies. At this point, “user rights come into play and data access restrictions are handled in the background,” Staub said. Again, a workflow system is used to accomplish curation and integration. “We made basic data tagging mandatory for everyone who wants to channel data into our platform. This has been shown to be a key success factor!” he said, enabling data Findability and Accessibility.
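A mandatory tagging rule of this kind amounts to a gate at ingestion time. The sketch below shows one way such a check could look in Python; the required tag names and function names are hypothetical, since the presentation did not specify which tags Merck KGaA requires.

```python
from typing import Any, Dict, Set

# Hypothetical set of mandatory tags; the real list was not given in the talk.
REQUIRED_TAGS = {"study_id", "data_type", "species"}


def missing_tags(metadata: Dict[str, Any]) -> Set[str]:
    """Return the mandatory tags that are absent or blank."""
    missing = set()
    for tag in REQUIRED_TAGS:
        value = metadata.get(tag)
        if value is None or not str(value).strip():
            missing.add(tag)
    return missing


def accept_dataset(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Reject a dataset at ingestion if any mandatory tag is missing."""
    missing = missing_tags(metadata)
    if missing:
        raise ValueError(f"Missing mandatory tags: {sorted(missing)}")
    return metadata


complete = {"study_id": "STUDY-A", "data_type": "RNA-seq", "species": "human"}
print(missing_tags(complete))               # nothing missing
print(missing_tags({"study_id": "STUDY-A"}))  # two tags missing
```

Rejecting untagged data at the point of entry is what makes later search and catalog features reliable, because every dataset is guaranteed to carry the same minimum metadata.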
“We have also been able to recently come up with data catalog functionality. This means we can provide different summaries of our data to inform us of the content, for example, at the level of studies, level of patient, level of genes and so on,” he said. “This fulfills a basic need in the organization to know where the data is and what the data is.”
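As a rough illustration of what catalog summaries at different levels mean, the Python sketch below counts toy records by study, patient, or gene. The record layout and the `summarize` helper are assumptions for this example and do not reflect how the X-OMICS catalog is actually implemented.

```python
from collections import Counter
from typing import Dict, List

# Toy records standing in for integrated biomarker measurements.
records: List[Dict[str, str]] = [
    {"study": "STUDY-A", "patient": "P1", "gene": "TP53"},
    {"study": "STUDY-A", "patient": "P2", "gene": "KRAS"},
    {"study": "STUDY-B", "patient": "P3", "gene": "TP53"},
]


def summarize(rows: List[Dict[str, str]], level: str) -> Dict[str, int]:
    """Count how many records exist for each value at the chosen level."""
    return dict(Counter(row[level] for row in rows))


# The same records summarized at the level of studies, patients, and genes.
for level in ("study", "patient", "gene"):
    print(level, summarize(records, level))
```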
Analysis for All
Finally, the integration layer serves data to the analytics layer, which is designed for two groups: users adept in data science, such as bioinformaticians and data-savvy biologists, and those who would like to explore but don’t have data science experience, such as citizen data scientists, clinicians, and others.
For bioinformaticians, the analytics layer is fully integrated with RStudio Workbench, which allows users to leverage the power of both R and Python for data transformation, processing, quality control, advanced analytics, and visualization. RStudio Connect allows easy sharing of RShiny dashboards and RMarkdown reports, while the RStudio Package Manager ensures validatable dependency management for regulated analytics based on R and Python.
The analytics layer is also fully integrated with the GitLab platform, offering source code control for analytics users to ensure the accuracy, traceability, and reproducibility of data analyses required for a validated system. In addition, GitLab’s CI/CD pipeline allows for automated platform validation and scheduling of workflows managed by Genedata Profiler, ranging from data ingestion and integration to analytics and AI applications.
“We focus on community tools widely known to the public,” Staub said. “We believe that this is very important, because this guarantees us that new employees and also externals—for example, freelancers—can be hired and quickly on-boarded.”
But most of the users are data consumers, Staub said, and they explore the data using interactive apps or reports. The number of apps has grown steadily since the launch of X-OMICS, he said, evidence of the platform’s adoption.
For users who don’t have R or GitLab expertise, the platform includes more than 150 analysis apps in the X-OMICS AppSpace. “Over the course of two years, mainly the bioinformaticians but also other scientists in the organization have come up with a large amount of apps. Many of them are COVID-specific, but also a lot of them are general purpose.” Some of the apps are also made available to the public. For instance, the Merck KGaA team recently released RosettaSX publicly.
Whether they use the RStudio or GitLab platforms or the apps, analysts and consumers receive only data from the integration layer that they are entitled to see, Staub emphasized.
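The principle that every access path sees only entitled data can be sketched as a single filter applied before anything is served. The Python example below is a simplified assumption of group-based entitlements; the dataset names, group names, and `visible_datasets` function are hypothetical and not taken from the platform.

```python
from typing import Any, Dict, List, Set


def visible_datasets(datasets: List[Dict[str, Any]], user_groups: Set[str]) -> List[str]:
    """Return names of datasets that share at least one group with the user."""
    return [d["name"] for d in datasets if d["allowed_groups"] & user_groups]


# Hypothetical datasets, each tagged with the groups allowed to read it.
datasets = [
    {"name": "trial-rnaseq", "allowed_groups": {"oncology", "biostatistics"}},
    {"name": "rwe-genomics", "allowed_groups": {"rwe"}},
    {"name": "public-reference", "allowed_groups": {"oncology", "rwe", "immunology"}},
]

print(visible_datasets(datasets, {"oncology"}))
```

Placing the filter in the layer that serves the data, rather than in each app or notebook, means a new analysis tool inherits the same restrictions without extra work.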
Culture and Practical ROI
During deliberations, the Bio-IT World Innovative Practices Award judging team highlighted the data culture work evident in this entry, and it’s a theme Staub echoed independently in his presentation.
“One of the main achievements of the last two years was certainly a cultural one having to do with data democratization,” he said. “I think users feel now that they can get access to data outside of their own work area much easier than before. Data sharing has become the default rather than the exceptions.”
Data are now findable within Merck KGaA, he said. He reports that users love the tools available in the AppSpace, and he thinks that is a motivation to share data and “participate in our community.”
But Merck KGaA didn’t only measure ROI in terms of data sharing inclinations. Staub reported major efficiency gains for biomarker research.
Within two years, the X-OMICS platform enabled more than 10,000 genomic NGS profiles to be quality controlled, processed, and integrated within the platform; hundreds of pre-clinical and clinical omics datasets comprising 100B rows of data to be prepared and harmonized; more than 300 datasets including metadata to be curated and made available to over 200 scientists throughout the organization; more than 5,000 biomarker-related analysis questions to be processed; four releases of RWE datasets spanning multiple indications including molecular data of different modalities to be integrated; and 39 biomarker and target analysis projects to be conducted on the RWE biomarker/omics data within 9 months.
The team estimates that the 39 biomarker and target analysis projects have saved the company $2.54M and 117 weeks of development time.
The X-Omics platform—developed by Genedata in collaboration with Merck KGaA, Darmstadt, Germany—is the cornerstone of Merck KGaA’s R&D digitalization strategy, the company said in its entry form. “By increasing accessibility to data and cloud computing analytical technologies, the X-Omics platform breaks data silos enabling seamless data exploration by a wide range of users, as well as real-time inter- and intra-organizational collaboration from remote locations.”