How Deleted and Siloed Data Are Slowing Discovery
Contributed Commentary by Adam Marko, Panasas
August 5, 2022 | Over the past decade, advancements in instruments and software have driven life sciences breakthroughs that can drastically improve global health outcomes. Innovative techniques such as cryogenic electron microscopy (cryo-EM) and cross-linking mass spectrometry (CX-MS) are poised to usher proteomics into a new era, while improvements in next-generation sequencing (NGS) technologies continue to accelerate genomics – both research areas that bring us faster and more accurate biomarker identification, clinical diagnostics, and drug discovery processes. More exciting still, new integrative approaches utilizing artificial intelligence and machine learning (AI/ML) may soon bring the dream of mainstream precision medicine closer to reality.
There are incredible promises on the horizon. But to fulfill them, researchers must maximize the value of the huge datasets that these high-throughput technologies generate. Successful organizations will be the ones that pursue practices that align with the FAIR Guiding Principles for scientific data management published in Scientific Data in 2016, which ensure that data is kept Findable, Accessible, Interoperable, and Reusable in order to promote cooperative research and AI/ML efforts. These principles were proposed by a diverse set of stakeholders – representing academia, industry, funding agencies, and scholarly publishers – who understand that “[g]ood data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation.”
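To make the FAIR principles more concrete, the sketch below shows what a FAIR-aligned metadata record for a hypothetical sequencing dataset might look like. The field names, identifier, and repository URL are illustrative assumptions on my part, not part of the FAIR paper or any specific repository schema.

```python
# Hypothetical example of a FAIR-aligned metadata record for a sequencing dataset.
# All field names and values are placeholders; real repositories define their own schemas.
fair_record = {
    "identifier": "doi:10.xxxx/example-dataset",       # Findable: globally unique, persistent ID (placeholder DOI)
    "title": "Whole-genome sequencing of sample cohort A",
    "access_url": "https://repository.example.org/datasets/example-dataset",  # Accessible: retrievable via a standard protocol
    "format": "FASTQ (gzip-compressed)",                # Interoperable: open, community-standard format
    "vocabulary_terms": ["FASTQ (controlled-vocabulary term placeholder)"],   # Interoperable: shared vocabularies
    "license": "CC-BY-4.0",                             # Reusable: explicit usage license
    "provenance": {                                     # Reusable: who generated the data, how, and when
        "instrument": "Illumina NovaSeq 6000",
        "generated_by": "Core Sequencing Facility",
        "date": "2022-06-15",
    },
}

print(fair_record["identifier"], fair_record["license"])
```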
But as of now, two common practices stand in the way of such good data management at many life sciences institutions. The first is what I have termed “Distribute and Delete,” and the second refers to the abundance of data silos within and between organizations. Both practices not only waste time and money, but more importantly, they hinder research collaboration and undermine AI/ML initiatives by diminishing the value of organizations’ most vital assets – their data. To move beyond these practices, organizations will need to embrace collaborative mindsets and invest in IT infrastructures that make their data easier to use, share, and repurpose.
The Downsides of Distribute and Delete and Data Silos
I coined the term “Distribute and Delete” to describe the common practice of research cores generating data for external users, agreeing to hold that data only long enough for the user to take it away, and then deleting it from the core infrastructure – in other words, “your data, your problem.” On the flipside, silo-based approaches to research that block networked scholarship reflect a traditional “my lab, my data” mentality. Both approaches do a disservice to life sciences data and, by extension, scientific progress as a whole.
In the context of Distribute and Delete, there are various reasons for resorting to the practice. To start, many facilities simply do not know what data truly needs to be kept long-term. They may also be limited by their funding and staffing resources. Most often, a short-sighted IT infrastructure bears the blame, such as when the organization’s storage systems lack the performance, scalability, and data insights required to manage the large data volumes coming in from the lab.
The Distribute and Delete model comes with multiple disadvantages: little or no visibility into what happens to the data, or to the science that follows, once it has been handed off; data duplication and the inefficiencies that result; and no opportunity to implement FAIR principles. This model of data management prevents large datasets from being analyzed in aggregate and individual efforts from being reanalyzed. It also makes external data sharing more difficult and substantially less likely.
While Distribute and Delete typically stems from technical and financial limitations, breaking down data silos may present more of a challenge simply because they are so embedded in the culture of life sciences. Silos form when research teams or departments isolate the data that their projects generate from the rest of their organization by storing it in a repository that does not connect to other applications in the IT infrastructure.
Just like the Distribute and Delete model of data management, these silos have multiple roots. An organization’s lack of standards for data collection and storage practices may be one. Frequently, new workloads emerge that the existing data storage systems cannot support, so new silos are added, and the fragmentation escalates. Limited communication can also make teams’ goals appear unrelated, so they may fail to recognize how they would benefit from sharing their data. Of course, competitive incentives play a role as well, and some researchers hoard their data simply because they see no benefit to themselves in sharing it.
The disadvantages of these data silos are extensive. When data is stored in scattered silos, the researchers who do want to share their data struggle to do so. Formatting incompatibility often makes analysis difficult, and data security and compliance issues commonly crop up. Work and effort are unnecessarily duplicated, wasting valuable time and money. Silos can also foster a culture of inequality where teams and organizations with the most siloed data hold the most power.
The most serious downside, which both of these practices share, is that they stand in the way of automated workflow solutions. Deleted or inaccessible data contributes nothing to the development of AI/ML applications that can fast-track research pipelines – lost potential that could mean the difference between a drug discovery cycle that takes months and one that takes years.
These Practices Are Not Sustainable
For life sciences organizations seeking to get the most value out of their data and break new ground, investing in IT infrastructures and strategies that support the storing, processing, sharing, and reusing of massive data volumes is now imperative.
One of the primary ways that organizations can avoid data silos and Distribute and Delete practices is by consolidating their disparate storage systems onto a centralized and scalable storage platform, one that delivers the reliability, capacity, and performance that all their applications demand. That storage platform needs to provide data insights that give researchers a clear overview of the entire storage environment, enabling them to manage their data effectively and with ease. Beyond that, organizations will need to encourage communication and collaboration between research teams.
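As a rough illustration of the kind of data insight a consolidated namespace makes possible, the minimal sketch below walks a hypothetical shared research directory and reports per-project capacity and data age. The root path, directory layout, and report format are assumptions for the example, and the sketch is not tied to Panasas or any other storage product.

```python
#!/usr/bin/env python3
"""Illustrative sketch: summarize per-project capacity and data age across a
single consolidated research namespace. The root path is a placeholder."""
import os
import time
from pathlib import Path

ROOT = Path("/data")  # hypothetical consolidated research namespace

def summarize(project_dir: Path):
    """Return (total bytes, days since the most recent file modification)."""
    total_bytes = 0
    newest_mtime = 0.0
    for dirpath, _dirnames, filenames in os.walk(project_dir):
        for name in filenames:
            try:
                stat = (Path(dirpath) / name).stat()
            except OSError:
                continue  # skip files that vanish or are unreadable
            total_bytes += stat.st_size
            newest_mtime = max(newest_mtime, stat.st_mtime)
    age_days = (time.time() - newest_mtime) / 86400 if newest_mtime else None
    return total_bytes, age_days

if __name__ == "__main__":
    for project in sorted(p for p in ROOT.iterdir() if p.is_dir()):
        size, age = summarize(project)
        age_text = f"{age:.0f} days since last write" if age is not None else "no files"
        print(f"{project.name:30s} {size / 1e12:8.2f} TB   {age_text}")
```

Even a simple report like this helps answer the questions that Distribute and Delete sidesteps: what data exists, where it lives, and whether it is still being used.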
Pursuing smarter and more collaborative data management strategies and policies will ultimately benefit everyone involved. Researchers’ work will achieve broader impact and increased visibility, and their data-rich workflows will become more compatible with AI/ML applications that can automate and accelerate previously time- and labor-intensive steps in the research pipeline. Stakeholders will save immensely on the costs of deleted, hidden, or unused data. And the public stands to gain the most from new discoveries that will drive medicine forward.
Adam Marko has 15+ years of experience as both a researcher and an IT professional analyzing data and meeting the informatics needs of life sciences organizations. Adam is the Director of Life Science Solutions at Panasas, where he is involved in driving all aspects of market development in Life Sciences, including working with field sales, marketing, and engineering. He can be reached at amarko@panasas.com.