Building A Commons: How Bristol-Myers Squibb And BioTeam Used Gen3 To Build A New Data Paradigm

By Allison Proffitt

January 22, 2020 | Last March, Robert Grossman, director of the Center for Data Intensive Science (CDIS) at the University of Chicago, took the stage at Bio-IT World West. He was outlining his vision for the evolution of data commons to data ecosystems. Along the way he plugged Gen3, the open-source platform for building data commons.

“Gen3 is how data commons are made,” he said then, but, “it’s not just a data commons software stack. It offers a steppingstone to data ecosystems.”

Coincidentally, Grossman was giving his pitch right as a new Gen3 data commons was coming to life.

Big Pharma, Big Plans

“It’s serendipitous, really, how things have come together,” Daniel Huston told Bio-IT World in October.

Huston is Lead IT Business Partner, Translational Bioinformatics at Bristol-Myers Squibb, and in January of 2019 he started looking to build a new data culture at BMS. Pharma companies are really good at collecting data for one specific purpose and nothing else, he said. To change the system, Bristol-Myers Squibb launched Silo Breaker, an initiative to improve the culture of data sharing at BMS and apply FAIR principles to the company data.

“We’re trying to change the model to be more tech focused… applying more data science principles to our data,” he said. Instead of just applying data to answer one particular question, BMS is seeking to explore the data on hand—starting with in-house genomics clinical trials data—and let it inform the types of questions to ask.

But first they needed a tool or platform to enable that.

“We had some ideas,” Huston said. “We’d been looking at the Genomic Data Commons as a standard to follow. What they’ve done is really impressive and we knew this wasn’t something a commercial software product could solve.” For one, Huston said, it’s probably not a commercially viable business model. In fact, he believes, this is a big enough problem that it won’t be solved without input from many industry players and stakeholders.

The Silo Breaker team wanted a cutting-edge solution with the most modern architecture. “At the speed at which science moves, if you don’t have modern approaches to what you’re doing from a software standpoint, you’re already behind,” Huston said.

To help navigate this process, BMS brought in the BioTeam, a consultancy known in the life sciences for its expertise and due diligence in tools exploration. Together, BioTeam and the Silo Breaker team identified the priorities.

At the top of the list: a solution that adhered to FAIR data principles. FAIR data principles require that data be findable, accessible, interoperable, and reusable. While often applied in the open source community, the principles can be applied to any organization that wants to enhance the reusability of its data holdings.

“We are trying to change the paradigm of control first, and access afterward,” Huston explained. He wants the attitude at BMS to be: Our data are open within the constraints of our access policy, instead of, Data are restricted unless you can get the right permissions. “It’s a different mindset that we’re trying to instill within Bristol-Myers; there’s an openness to the data itself.”

BMS also places a strong emphasis on datasets and analyses that are reproducible, which means a reliance on metadata and standard models for data collection. “Coming up with a standardized metadata model is really important,” Huston explained. Those metadata (what studies, programs, and assays generated the data) enrich the dataset, helping researchers find and replicate the data.

Even with a paradigm shift toward data openness, the Silo Breaker team knew that user identity needed to be a foundational component of the platform to enable authentication and authorization, and that stewardship and data governance needed to be addressed.

And finally, users needed to be able to access the tools and applications available at BMS for data analytics. “We have a whole host of tools in house that can do all sorts of advanced analyses, but people didn’t know where to go to find those applications. They didn’t know what was available, what interfaces we had, and what tools were there,” Huston said.

With the needs assessment complete, two tasks loomed: BMS needed a tool that would allow everyone to find the well-characterized data they need, and a catalog of all the applications, tools, and analyses in use.

The solution? Gen3.

Common Problems, Uncommonly Solved

Bill Van Etten, senior scientific consultant at BioTeam, was introduced to Gen3 while working with the NIH and NCI on cloud initiatives and was intrigued. BioTeam launched a hackathon to build a data commons on Gen3 and get firsthand experience.

Building a Gen3 data commons involves defining a data model, using Gen3 software to autogenerate the data commons and associated API, and importing data. From there, users can create synthetic cohorts, use platforms such as Terra, Seven Bridges, and Galaxy to analyze those cohorts, and develop container-based workflows, applications, and Jupyter or RStudio notebooks. Van Etten doesn’t call the platform simple, but he’s excited about what’s there.

John Jacquay, Scientific Systems Engineer at BioTeam, has been actively involved in setting up the Gen3 data commons for BMS, and he, too, is impressed with Gen3’s modern architecture and offerings. Gen3 was architected to be both cloud-native and cloud-agnostic, Jacquay explains. AWS, Google Cloud, and Azure are supported, and Gen3 uses services from cloud providers as opposed to targeting the virtual machine or bare-metal infrastructure.

Gen3 is built on a modern development stack using Python, Golang, Node.js, and React, Jacquay continues. It is easily scalable thanks to a microservice architecture. In the BMS commons, for instance, Van Etten says the team is using a dozen or more microservices—some directly from Gen3, others Jacquay has forked for custom development. “Because of the nature of the way Gen3 is structured, it lends itself well to adding tools,” Jacquay says.

And maybe most interestingly, the Gen3 data model is metadata-centric. “Rather than have an object or file-oriented data system as a starting place and building metadata from these files, [Gen3] actually starts from a metadata-first perspective,” Jacquay says. “You build your metadata then you attach files and objects to that metadata, which allows for some really interesting permission structures.”
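The metadata-first idea Jacquay describes can be sketched in a few lines of Python. This is an illustration only, not Gen3 code: the class names, fields, project names, and bucket paths are all hypothetical. It shows a metadata record being created first, files being attached to it afterward, and access to a file being decided by the project of its parent record.

```python
from dataclasses import dataclass, field


@dataclass
class FileObject:
    guid: str  # hypothetical identifier, not a real Gen3 GUID
    url: str   # hypothetical object-store location


@dataclass
class MetadataRecord:
    submitter_id: str
    project: str
    properties: dict
    files: list = field(default_factory=list)


def visible_files(records, user_projects):
    """Return files whose parent metadata record belongs to a readable project."""
    return [f for r in records if r.project in user_projects for f in r.files]


# The metadata records exist before any file does...
sample_a = MetadataRecord("sample-001", "project-a", {"tissue": "blood"})
sample_b = MetadataRecord("sample-002", "project-b", {"tissue": "skin"})

# ...and files are attached to them afterward.
sample_a.files.append(FileObject("guid-1", "s3://example-bucket/sample-001.bam"))
sample_b.files.append(FileObject("guid-2", "s3://example-bucket/sample-002.bam"))

allowed = visible_files([sample_a, sample_b], user_projects={"project-a"})
print([f.guid for f in allowed])  # ['guid-1']
```

Because permissions hang off the metadata rather than off a file path, the same file could in principle be exposed or hidden simply by changing which record it is attached to.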

The data model, or data dictionary, “is both a schema and the records within that graphical structure,” Van Etten says. “The distinction for me is the model has no data in it, like a schema, whereas the dictionary has both the schema and data. Dictionary is a more recent term, and different people use it differently.”

The dictionary includes both technical information (for instance, the path to an S3 bucket, who owns the data, when they were created, file locations) and the research metadata (project names, study names, analyses, demographic data points, etc.). Metadata are normalized and highly relational. “But don’t think this metadata normalization will negatively affect performance,” Jacquay warns. “Gen3 includes an ETL [Extract, Transform, Load] framework out of the box, enabling you to transform and replicate the dictionary in Elasticsearch. You get the best of both worlds: normalized data structures and relationships for input, validation, and data harmonization. Denormalized data for fast querying and search.”
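The trade-off Jacquay describes, normalized records for input and denormalized documents for search, can be illustrated with plain Python dictionaries. This sketch does not use the Gen3 ETL framework or any search-engine client; the node names (project, case, sample) and every field are assumptions made up for the example.

```python
# Normalized input: each fact is stored once and linked by ID.
projects = {"p1": {"name": "project-a"}}
cases = {
    "c1": {"project_id": "p1", "sex": "female"},
    "c2": {"project_id": "p1", "sex": "male"},
}
samples = [
    {"id": "s1", "case_id": "c1", "tissue": "blood"},
    {"id": "s2", "case_id": "c2", "tissue": "skin"},
]


def flatten(samples, cases, projects):
    """Join each sample to its case and project, producing one flat document."""
    docs = []
    for sample in samples:
        case = cases[sample["case_id"]]
        project = projects[case["project_id"]]
        docs.append(
            {
                "sample_id": sample["id"],
                "tissue": sample["tissue"],
                "sex": case["sex"],
                "project": project["name"],
            }
        )
    return docs


docs = flatten(samples, cases, projects)
print(docs[0])
```

Each flat document repeats the project and case attributes, so a search index can filter on any field without performing joins at query time.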

The metadata-first architecture is essential for providing the F—findable—in FAIR, Van Etten points out. “All of your files in a research environment have a globally unique identifier associated with them, and they can also exist in one or more locations in one or more object stores. They’re interconnected by the metadata attributes.”
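A minimal sketch of the identifier scheme Van Etten describes follows, assuming a simple in-memory index. The `ObjectIndex` class and the bucket URLs are hypothetical and are not the Gen3 indexing service; only the idea of one globally unique identifier resolving to one or more storage locations comes from the description above.

```python
import uuid


class ObjectIndex:
    """Toy index mapping one globally unique ID to one or more locations."""

    def __init__(self):
        self._locations = {}

    def register(self, url):
        guid = str(uuid.uuid4())
        self._locations[guid] = [url]
        return guid

    def add_location(self, guid, url):
        self._locations[guid].append(url)

    def resolve(self, guid):
        return list(self._locations[guid])


index = ObjectIndex()
guid = index.register("s3://example-bucket/run1/reads.fastq.gz")
index.add_location(guid, "gs://example-mirror/run1/reads.fastq.gz")
print(index.resolve(guid))
```

Metadata records would reference the identifier rather than a path, so a file can be copied or moved between object stores without breaking the links to it.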

Because of the metadata interconnectivity, researchers can perform faceted searches to find cohorts of data on which to do analyses. Cohorts can be exported into workspaces (Jupyter or RStudio notebooks), and analyses can be performed through the web user interface.
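Faceted search of this kind can be sketched over a handful of flat documents. The field names and values below are invented for illustration, and the functions are a generic take on faceting rather than the Gen3 query interface.

```python
from collections import Counter

# Hypothetical flattened metadata documents, one per case.
docs = [
    {"case": "c1", "diagnosis": "melanoma", "assay": "RNA-seq"},
    {"case": "c2", "diagnosis": "melanoma", "assay": "WES"},
    {"case": "c3", "diagnosis": "NSCLC", "assay": "RNA-seq"},
]


def facet_counts(docs, field_name):
    """Count how many documents carry each value of a field."""
    return Counter(d[field_name] for d in docs)


def select_cohort(docs, **filters):
    """Keep documents whose value for every filtered field is in the allowed set."""
    return [
        d for d in docs
        if all(d.get(name) in allowed for name, allowed in filters.items())
    ]


print(facet_counts(docs, "diagnosis"))
cohort = select_cohort(docs, diagnosis={"melanoma"}, assay={"RNA-seq"})
print([d["case"] for d in cohort])  # ['c1']
```

The counts drive the filter panel a user sees, and the selected cohort is what would be handed to a notebook workspace for analysis.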

“Execution is happening on cloud compute against the object storage,” Van Etten explains. “It gives you a sharable web user interface for doing analyses where the compute and the storage are next to each other in the cloud. You’re not downloading files.”

After the BioTeam’s exploration, Van Etten was blown away by the possibilities. In March of 2019 he pitched it to the Silo Breaker team at BMS; they were equally impressed. “The main core was that it offered capabilities that basically weren’t possible in any commercial tool,” Huston explained.

Getting the larger BMS leadership on board required a bit more work. The pharma often uses open source products, Huston said, but IT was still more comfortable with established SaaS products, commercial software systems that already exist and have been proven. Ultimately key BMS stakeholders were convinced by the value of the Gen3 community in the genomics domain in particular, Huston said.

“This is technology that is valuable on the merits of its being open source... All of the industry has been involved in thinking about these sorts of things, and that’s what’s been baked into the Gen3 philosophy.”

And BMS is equally committed to the community. “The project is two things,” Huston said. “Half of it is learning the Gen3 platform ourselves—all the things it can do—the other half is applying it specifically to BMS. Either end of that spectrum is giving BioTeam full freedom to give back what they’ve learned and what they’re using specifically with us back into the code branch.”

With green flags across the board, the newest Gen3 data commons, the BMS Genomics Data Hub, got started.

Community Service

The Gen3 support team lives within the Center for Data Intensive Science (CDIS) at the University of Chicago under the leadership of Robert Grossman. Chris Meyer has been with the group for about two years. After earning a PhD in evolutionary biology and working in robotics, he was enticed by Grossman’s vision. Now he serves as a liaison between Gen3 users and the development team.

Gen3 serves as the foundation for 12 data commons that work closely with the support team including Brain Commons, BloodPAC, NIAID Data Hub, and more. But because all of the Gen3 APIs are open, there are likely more commons built on Gen3. Most of the Gen3 commons deal with biological data, though the OCC Environmental Data Commons hosts datasets from NASA and NOAA. The type of data that fills the commons is entirely defined by the data model.

Gen3 is developed as a completely open source platform. “Anyone can go on GitHub right now and download our software and modify it, and even propose changes to our software,” Meyer explained. “We can either integrate those or not. Bob [Grossman], in particular, is a big proponent of everything being open source and interoperable.”

For that reason, anyone can log in to any data commons built on Gen3, but, once inside, the data can be protected with stringent authorization tools. Some data commons are fully open; anyone can access and download data. Other commons sponsors maintain their own limited-access whitelist, Meyer explained. Query gateways or analysis gateways are another access option. Users send a query to the data commons, which processes it internally and then returns the results to the user.

“We’ve written applications where somebody can come in and not see the raw data for genomics data, and they can select a cohort based on a disease diagnosis and they can do things like a genome wide association study and we give them back the results,” Meyer explained. “They don’t actually see the patient genotypes individually, but they get the report.”
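The gateway pattern Meyer describes can be sketched as a function that reads individual records internally but returns only an aggregate report. Everything here is an assumption for illustration: the patient records are made up, and a simple allele-frequency summary stands in for a real genome-wide association study, which is far more involved.

```python
# Hypothetical protected records; callers never receive these directly.
# "genotype" is the count of alternate alleles (0, 1, or 2) at one variant.
_PATIENTS = [
    {"id": "p1", "diagnosis": "disease-x", "genotype": 2},
    {"id": "p2", "diagnosis": "disease-x", "genotype": 1},
    {"id": "p3", "diagnosis": "healthy", "genotype": 0},
    {"id": "p4", "diagnosis": "healthy", "genotype": 1},
]


def allele_frequency_report(diagnosis):
    """Return cohort size and alternate-allele frequency, never per-patient rows."""
    cohort = [p for p in _PATIENTS if p["diagnosis"] == diagnosis]
    if not cohort:
        return {"diagnosis": diagnosis, "n": 0, "alt_allele_frequency": None}
    alt_alleles = sum(p["genotype"] for p in cohort)
    return {
        "diagnosis": diagnosis,
        "n": len(cohort),
        # Each patient carries two alleles at the variant.
        "alt_allele_frequency": alt_alleles / (2 * len(cohort)),
    }


report = allele_frequency_report("disease-x")
print(report)
```

The caller selects a cohort by diagnosis and gets a report back; the individual genotypes stay inside the gateway.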

Meyer and the Gen3 team at CDIS provide that sort of application help to anyone who needs it. Plus plenty of developers and teams like BioTeam are exploring on their own. Support and pull requests come through the Gen3 Slack channel and GitHub, often from users no one even knew were building a data commons, Meyer said.

Jacquay is particularly active in the Gen3 Slack community and in submitting pull requests to GitHub. He’s noted several improvements he’d like to see from Gen3, some of which are already on the Gen3 roadmap, such as management capabilities brought into the user interface.

“The Gen3 community is awesome,” Jacquay says. “The thing I like about it the most is how responsive these developers are. If you come to them with a question or a technical problem, you actually get it answered… The Slack channel provides a more direct channel for communication. You can talk directly to the people writing the code. You can work directly with them. It’s near real-time communication.”

Ecosystem Engineers

That level of connection and community will be foundational to realizing Grossman’s vision of the commons of commons, or data ecosystems. And Van Etten believes the Gen3 architecture is also paving the way. When multiple organizations are using either identical or sufficiently similar data dictionaries, he says, data can be combined to do broader analyses.

“If there was ever to be another black plague, you wouldn’t want to rely on little teeny research groups to do some research and make a publication five years later. You’d want all the researchers in the world to be interconnected, sharing data,” Van Etten says.

Gen3 is already enabling data sharing among researchers studying other infectious diseases. “With one of our data commons focusing on infectious disease, we have three data commons that are interoperating,” Gen3’s Meyer explains. The three data commons, focused on AIDS, tuberculosis, and the microbiome, have unique components to their data models, but each also shares common elements. “These common elements in the data models act as a sort of ‘backbone’, which makes cross-commons queries possible,” Meyer explains. “If you were interested in looking at AIDS and tuberculosis, you could run a query that could find a cohort of people from both of those data commons and run an analysis over those.”

The three groups are all divisions of one very large national health organization, Meyer explained, but each has data stored in various locations, and historically there has been very little knowledge shared between the groups. The groups are opposed to merging their data into a single data commons because, for example, the TB group doesn’t want to see AIDS/Flu-specific variables in their data model, and the microbiome group doesn’t want to see TB-specific variables, etc.

“This is one problem the ‘data ecosystem’ is trying to solve,” Meyer said. “The data ecosystem is an extension of our Gen3 Framework Services for building Data Commons, and it allows queries to run across multiple resources while still allowing each resource to have its own unique data model and leave its data stored in its original location.” There is no copying, or moving data required as long as a resource has open or queryable APIs. And a “resource” doesn’t even need to be a Gen3 data commons. It can include non-Gen3 data portals and data lakes internal to an organization or those built by competitors.
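A cross-resource query over shared backbone fields might look like the following sketch. The `Commons` class, the backbone field names, and the records are all hypothetical; the point is only that each resource keeps its own extra fields and its own data, while a query written against the shared fields can run over all of them.

```python
# Hypothetical shared "backbone" fields present in every resource's data model.
BACKBONE = ("subject_id", "age", "sex")


class Commons:
    """Toy resource that keeps its own records and answers queries locally."""

    def __init__(self, name, records):
        self.name = name
        self._records = records

    def query(self, predicate):
        """Apply the predicate in place and return only backbone fields."""
        return [
            {"source": self.name, **{k: r[k] for k in BACKBONE}}
            for r in self._records
            if predicate(r)
        ]


aids = Commons("aids", [
    {"subject_id": "a1", "age": 34, "sex": "F", "viral_load": 1200},
    {"subject_id": "a2", "age": 61, "sex": "M", "viral_load": 50},
])
tb = Commons("tb", [
    {"subject_id": "t1", "age": 45, "sex": "F", "sputum_smear": "positive"},
    {"subject_id": "t2", "age": 29, "sex": "M", "sputum_smear": "negative"},
])


def federated_query(resources, predicate):
    """Send the same backbone-level query to every resource and merge results."""
    results = []
    for resource in resources:
        results.extend(resource.query(predicate))
    return results


cohort = federated_query([aids, tb], lambda r: r["age"] >= 40)
print([(c["source"], c["subject_id"]) for c in cohort])
```

Neither resource has to adopt the other's disease-specific variables, and no records are copied into a merged store; only the matching backbone rows leave each resource.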

In this case, the data ecosystem is meant to be a “one-stop shop” for finding studies, Meyer explained. It has a Dataset Browser for identifying studies of interest across resources using selectable filters like “research focus”, and it has a Data Explorer for identifying cohorts (groups of patients) based on data elements like demographics, comorbidities, and other health history or clinical attributes.

Back at BMS

“I should be totally transparent. We haven’t proven it out yet. We think we have something unique here, but it hasn’t been fully realized yet,” Dan Huston said in October. “As we started to build out the architecture, Gen3 became more and more central to it. It was kind of interesting how that played out,” he said. It was a bit of a snowball effect. One Gen3 component led to another microservice and then another. “It became more and more central. Essentially it’s the core product that we’re using for the [Silo Breaker] initiative.”

But the pharma has since taken a big step forward. BMS launched the BMS Genomics Data Hub on December 20 for 232 users in the translational bioinformatics group. “It was a nice present for the holidays,” Huston laughs.

The whole Gen3 platform is now live and in use, Huston says. “Everything from managing projects, substantiation of the data model,… as well as all the querying capabilities that it has,” to analysis functions. The application catalog of analysis tools that was identified in the earliest needs assessment is available as well.

The overarching Silo Breaker investment program continues for 2020, and Huston already has a list of next steps. The team is working to integrate the BMS Genomics Data Hub with other tools used at BMS—Seven Bridges and cBioPortal, for instance. And more datasets are being added.

“We have a lot of systems in place, and we really want to make sure that everything has a purpose, and everything fits together really well. In a way, Gen3 is the system that fits on top of all of that. Once that is in place, all the other things that are adjacent to it—our immediate next steps, our growth aspect to it as we build integrations with these other systems. It keeps growing from there.”

Huston is also working with BioTeam on other tools including a micropublishing feature that will allow BMS scientists to share results and findings internally within the BMS community, and incentives for company researchers to add data to the system. The goal is for the BMS Genomics Data Hub to “build on itself, so it’s a community-oriented system rather than just a regular IT system or platform that houses data,” Huston explains.

The data model remains the central component of the whole application, Huston says. “It kind of puts a stake in the ground. Now we have a data model. Now we can hold our data up against it and see how it holds up. Does it fit within the model? Are there things we can improve on in how we collect data up front so it better fits within that model?”

“It’s a virtuous cycle,” Huston says. “That’s the cool thing.”

Editor's Note: Robert Grossman, Chris Meyer, Bill Van Etten, and Daniel Huston will be among the speakers in a session on Data Commons in Practice at the 2020 Bio-IT World Conference & Expo. They will be sharing further updates and answering questions.