Janssen’s Data Ecosystem and the Role of Data Managers
By Allison Proffitt
October 20, 2021 | At the Bio-IT World Conference & Expo last month, Aleksandar Stojmirovic and Weiwei Schultz, both of Janssen’s Data Science group, explained the data management mission and vision of Janssen R&D.
The team, Stojmirovic said, originated in the immunology therapeutic area but moved to Data Science several years ago and now seeks to deliver end-to-end solutions for translational research data across Janssen R&D, focusing on discovery, translational science, biomarkers, and data science. The group’s mission is to organize and manage Janssen research data according to FAIR principles (findable, accessible, interoperable, reusable), building an integrated and standardized data ecosystem that helps the company derive deep insights and make timely decisions for its therapeutic portfolio.
Janssen used to be a highly decentralized organization, Stojmirovic explained, with research data acquisition, storage, and analysis all centered on individual therapeutic areas. That approach, he said, led to missing data management workflows, a patchwork of metadata schemas, and inconsistent data curation and storage. As a result, data were not traceable upstream or downstream of an individual team, and there was no way to search or compare data across therapeutic areas.
“Data would get lost,” he summarized. “Our team was formed to unify research data management based on the foundations established in the immunology therapeutic area.”
The new approach is guided by the FAIR principles. The first priority is a unified data management strategy, enabling data reuse and integrative analysis across the company and breaking silos within and between therapeutic areas and functions.
“Target development programs… of course they are related within therapeutic areas,” Stojmirovic said. “Sometimes the same target may have been examined before and the data may be available within your organization but may not be able to be found.”
The team worked to scale cross-functional initiatives for target and biomarker identification, patient stratification, and disease understanding. These initiatives would tear down not only data silos but also group silos: “Not only to bring data together, but to foster connectivity between people and data,” he said. Fostering people-focused connection supports continuity of operations, Stojmirovic added: when turnover happens within the organization, data and analyses are not lost.
Changing Culture
This was only the first step in a much larger cultural change effort to cultivate a data stewardship mentality across the organization so that data and metadata are handled consistently, ownership is shared, and common workflows flourish.
“It’s not enough just for an analyst to be involved in a portfolio program. We don’t want this attitude of, ‘Just give me the data and I’ll give you the results.’ We need to ensure that everyone shares responsibility for ownership and execution of data management best practices,” Stojmirovic said.
The model Janssen has chosen to facilitate this ecosystem is a data manager who sits on the Data Management Team, working with data engineers, curators, and bioinformaticians, but also liaising with stakeholders including data generators, consumers, owners, data scientists and analysts, and senior therapeutic area leadership, as well as with compliance and Janssen Business Technology.
“Data managers—and the Data Management team as a whole—are really at the center of this ecosystem, corresponding with everyone,” Stojmirovic explained.
Schultz flagged this as a key challenge to the launch of the ecosystem. Data managers had to establish trust with stakeholders and data owners. She recommended that managers “work with them to find optimal data management solutions that keep them in the loop.”
Ecosystem Structure
The structure of the new ecosystem was meant to ensure that data were available in a permanent repository, annotated at the study and experiment levels, and quickly turned over to analysts. Processed data and analyses would then be stored with the raw data in the permanent repository for reuse.
The ecosystem is organized into three stages, Schultz explained: ingest; storage and annotation; and visualization and cataloging.
All raw research data—whether internally or externally generated, along with minimal metadata annotations as YAML files—are ingested first into a staging area, either on the cloud or on premises. From there, the responsible data manager transfers the data to their permanent storage location on the cloud. Permanent storage is partitioned by therapeutic area and organized in a hierarchical structure of disease, program, study, and experiment. Visualization and cataloging then happen via internally developed platforms.
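The ingest flow described above can be sketched in code. The snippet below is a minimal illustration, not Janssen’s actual implementation: the metadata field names and path layout are hypothetical, inferred only from the description of a YAML annotation and a therapeutic-area/disease/program/study/experiment hierarchy.

```python
# Hypothetical sketch of the ingest step: a dataset arrives in staging with a
# minimal metadata annotation, and the data manager derives its permanent
# storage prefix from that annotation. Field names are illustrative only.
from pathlib import PurePosixPath

# Minimal metadata that might accompany a dataset at ingest (illustrative).
metadata = {
    "therapeutic_area": "immunology",
    "disease": "ulcerative_colitis",
    "program": "PROG-001",
    "study": "STUDY-042",
    "experiment": "EXP-0007",
}

def permanent_path(meta: dict) -> PurePosixPath:
    """Build the hierarchical permanent-storage prefix from the annotation."""
    return PurePosixPath(
        meta["therapeutic_area"],
        meta["disease"],
        meta["program"],
        meta["study"],
        meta["experiment"],
    )

print(permanent_path(metadata))
# -> immunology/ulcerative_colitis/PROG-001/STUDY-042/EXP-0007
```

In practice the annotation would travel as a YAML file next to the raw data, and processed results would later be ingested through the same path-building step so they land beside the raw data.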
Metadata models were another challenge, Schultz said. The team sought a metadata model that required minimal effort to populate while still being usefully descriptive.
“End users can only access data from permanent storage,” Schultz explained, and they work with the data on local or cloud-based user workspaces. Processed data and analysis are ingested the same way raw research data are, passing, again, by the responsible data manager to be stored with the raw data.
Among the internally developed tools data owners and managers use to facilitate the ecosystem, Schultz mentioned JRD Annotator, a web application providing standard schemas, attributes, and vocabularies; and BioViz, which offers visualization by expression, contrast, genes, and more.
The efforts have paid off. Since August 30, 2021, more than 770 experiments and 400 TB of data have been processed through the workflow across all therapeutic areas, Schultz reported. The team has realized a 75% reduction in the time required to finalize ingestion of a single dataset.
But Schultz emphasized that Janssen’s data management story is one of continual evolution. It is never too late, she said, to implement an enterprise-grade data management strategy for your organization.