Reevaluating Data Platforms: The Future of Decentralized Data
By Irene Yeh
May 22, 2024 | At the 2024 Bio-IT World Conference & Expo, artificial intelligence (AI) was the main topic stirring up discussions and debates, especially generative AI and its data platforms. Data platforms are crucial tools for AI models, yet there are several barriers that hinder their utilization, efficiency, and accessibility—and none of them are technical or data-related.
Karl Gutwin, principal consultant at BioTeam, discussed these barriers during the Future of Decentralized Data session at Bio-IT World. “I want to call out the human barriers, the ones that are harder to remember and need special focus,” Gutwin stated.
The Two Biggest Barriers
During the presentation, Gutwin highlighted two main obstacles to data platform progress. The first is the data accessibility gap: the divide between people who have the time, skills, and resources to use available technical tools and people who do not have access to these tools. This is fundamentally a systemic equity problem, according to Gutwin, and it can be addressed through the technical design of systems by focusing on low-code users and data scientists.
“The second major human barrier I want to call out is FAIR,” Gutwin said. “But I assert that FAIR data is not about the data itself. It’s about the technical systems that allow for interchange of data… FAIR is only going to be able to be achieved when we have a diversity of technologies that are all speaking a common language.”
Data platforms are meant to provide a predictable and validated data model for the platform's users. However, if a data platform requires data engineers or other experts to load the data, then it can devolve into gatekeeping. In other words, as with the first barrier, only those with knowledge of the technology and process can use the platform. As a result, non-expert users cannot achieve their goals. The question is, how do we close this gap and ensure that users of all backgrounds and levels of expertise can use these platforms?
The Four Principles
Gutwin argued that while it is possible to achieve greater utilization through upskilling users, relying on upskilling only results in excluding a category of users or forcing them through a gatekeeping process. Though some platforms claim to provide an all-in-one solution, these can easily turn into walled gardens that are accessible only through expensive fees, which is another form of gatekeeping.
So, what can be done to keep the gap from widening and to provide equal access? Gutwin listed four principles that can shape the ideal platform.
Extract, Load, Transform (ELT)
Extract, load, and transform is described as loading data, defining the schema the data were loaded into, and then having the system transform that data into the desired target. According to Gutwin, there are fundamental issues with the process of defining a schema first and then transforming the data because, for biomedical data, there is no one true model. This principle holds that a system should be able to accept any data through a plug-in style interface and should incorporate a transformation engine at the core of the platform.
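To make the ELT idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any platform Gutwin presented: the names `register_loader`, `load`, and `transform`, and the sample field names, are all hypothetical. Loaders are registered as plug-ins, data are stored as they arrive, and a small transformation step maps them to whichever target model a user needs.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical plug-in registry: one loader function per input format.
LOADERS: Dict[str, Callable[[str], List[Record]]] = {}


def register_loader(fmt: str):
    """Decorator that adds a loader plug-in for the given format name."""
    def wrap(fn: Callable[[str], List[Record]]):
        LOADERS[fmt] = fn
        return fn
    return wrap


@register_loader("csv")
def load_csv(text: str) -> List[Record]:
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


@register_loader("json")
def load_json(text: str) -> List[Record]:
    # Values are stringified so both loaders yield the same record shape.
    return [{k: str(v) for k, v in row.items()} for row in json.loads(text)]


def load(fmt: str, text: str) -> List[Record]:
    """Load step: keep records as they arrive, with no target schema imposed."""
    return LOADERS[fmt](text)


def transform(records: List[Record], mapping: Dict[str, str]) -> List[Record]:
    """Transform step: rename source fields into a target model chosen later."""
    return [
        {target: rec[source] for source, target in mapping.items() if source in rec}
        for rec in records
    ]


raw = load("csv", "subject,wt_kg\nS1,70\n")
shaped = transform(raw, {"subject": "patient_id", "wt_kg": "weight_kg"})
```

Because the mapping is supplied at transform time, the same loaded records can be reshaped into a different target model later without reloading them, which is the point of deferring the schema decision.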
Metadata
In today’s industry, it is common practice to store metadata in a data catalog, dictionary file, or wiki, which is always external to the data. This means that the metadata sits at some distance from the data itself, and that distance can turn into a bigger problem when data transformation is a core element of scientific discovery. Transformations alter data, which can break the connection between the metadata and the resulting data. As such, data transformations need to be informed by the metadata, which means that the data platform needs to be able to support specific transformations depending on the data elements present in the data. If users can define their own data elements, then there needs to be a push toward common data elements: things that represent the core understanding of one’s data, questions, and critical data points within their organizations.
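One way to picture a metadata-informed transformation is the short Python sketch below. It is only an assumption-laden illustration: the `Column` type, the `to_kilograms` function, and the unit table are hypothetical and not taken from the talk. The metadata (here, a unit) travels with the values, the transformation reads it to decide what to do, and the output carries updated metadata so the link is not lost.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Column:
    """A data element whose metadata is stored together with its values."""
    name: str
    unit: str
    values: List[float]


# Hypothetical conversion table: factor to multiply by to reach kilograms.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def to_kilograms(col: Column) -> Column:
    """Convert a mass column to kilograms, guided by its unit metadata."""
    if col.unit not in TO_KG:
        raise ValueError(f"cannot convert unit {col.unit!r} to kg")
    factor = TO_KG[col.unit]
    # The result keeps the element name and records the new unit.
    return replace(col, unit="kg", values=[v * factor for v in col.values])


weights = Column(name="body_weight", unit="g", values=[500.0, 1500.0])
converted = to_kilograms(weights)
```

A column whose unit is not in the table is rejected rather than silently passed through, which is one simple way a platform could tie the allowed transformations to the data elements actually present.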
Federation Over Centralization
Gutwin described science as “fractally distributed.” Every individual is part of a group, which is part of an organization, and so on. As such, technology needs to be built to reflect and support this fractal distribution. Specifically, a data platform needs to be able to align between the individual and the community of users on the platform. The platform should also be able to resolve technical problems involved in data interchange and alleviate concerns about how data is stored and accessed. From the user’s perspective, it should be as simple as creating a link that behaves like a file they uploaded. For this to work on a global level, instead of creating new protocols for performing data interchange, there should be more reuse of existing open protocols, thus allowing other platforms implementing these same concepts to coexist in this ecosystem.
Scale
Platforms should not require expert knowledge to deploy. The average user should be able to understand how to navigate and use them. Data platforms also should not charge fees or require mandatory accounts, as this would encourage gatekeeping.
“And if you can say, ‘Yes, I've got a platform, and it covers three of these four,’ then I will challenge you, why do you not seek to cover them all?” Gutwin added.
The Grand Vision
To build the ideal model, Gutwin urged that these barriers be broken down so that platforms can interoperate. There needs to be an “intentional focus against siloing” and an effort to ensure that the platform can operate, or at least run, at any scale. Ultimately, the goal is to create a platform that will allow experts and non-experts to access a global ecosystem of biomedical data that meets any user in any use case at any scale and with any kind of data.