Archive First, For The Future
Contributed Commentary by Jeff Hughes
June 13, 2018 | When we think about the industries generating the most data, the life sciences space is perhaps the most exciting because of its potential to improve quality of life and save lives. The data generated by life sciences research has been used to diagnose cancer and other diseases and develop new forms of treatment—and this is only the beginning.
Providing the infrastructure needed to enable these discoveries is one of the most rewarding aspects of building a data management platform. Many of our customers are generating more data than their legacy systems can handle, and their lack of data management infrastructure holds them back from focusing on the science.
The life sciences industry has experienced one innovation after another that multiplies the amount of data being generated, outpacing enterprise IT. The example we’ve all heard about is the explosion of genomic data, driven first by innovations in sequencing equipment and now by the use of that data in diagnostics and treatment. Strategies that previously dealt with much smaller-scale problems and terabytes of data are now tasked with managing petabytes of data and billions of files, and this data growth is still picking up pace.
It’s crucial for life sciences IT to invest in the right infrastructure now, to manage their growing data and prepare for the future. In particular, archiving stands out as perhaps the most important thing life sciences organizations can do to manage their data.
Archive Adds Value To Data
Many organizations treat archive as an afterthought, or don’t think about it at all. I’d argue that this is, in fact, a missed opportunity, especially for the life sciences industry.
Archiving data pays dividends down the road in terms of organizational efficiency and, in particular, productivity. Having a central indexed repository for research data lets researchers easily know what’s where, making it possible to maximize the utility of the data.
Unfortunately, many life sciences organizations don’t have the infrastructure in place to archive effectively. Without a data archive strategy, research organizations end up storing data across different research groups and locations, resulting in confusion and organizational inefficiencies when the data is moved, modified, and used in analysis and computing.
Not having an archive strategy also means that data from previous studies is lost, along with its potential value. A 2013 study found that 80% of scientific data is lost within two decades and the odds of sourcing datasets decline by 17% each year.
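To get a feel for what a 17% annual decline means over time, here is a small sketch. It assumes, purely for illustration, that the decline compounds at a constant rate every year and that it applies to the odds (not the probability) of obtaining a dataset. The function name is hypothetical and the study itself may model this differently.

```python
def remaining_odds_fraction(years: int, annual_decline: float = 0.17) -> float:
    """Fraction of the original odds left after `years`.

    Assumes a constant multiplicative decline each year, which is an
    illustrative simplification rather than the study's own model.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_decline) ** years


if __name__ == "__main__":
    for years in (1, 5, 10, 20):
        fraction = remaining_odds_fraction(years)
        print(f"{years:>2} years: odds at {fraction:.1%} of the original")
```

Under that assumption the odds fall to a small fraction of their starting value within two decades, which is why capturing data early matters.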
Organizations that don’t invest in their archive strategy end up losing old data that could be useful in future studies. Because storing old data in a cost-effective archive tier is cheap compared to the cost of reproducing that data, archived data is a treasure trove of cheap research. Organizations that don’t archive potentially throw away an abundance of low-cost, valuable research!
Life sciences organizations need to invest in a strong, reliable, and scale-out archive tier built to handle billions of files. This archive tier should enable easy search and discovery for end users, and be delivered as a service to reduce management overhead for overburdened life sciences IT.
Archive First
Not only is archiving one of the most important things life sciences organizations can do to manage their data, it’s actually the first thing they should do.
In a recent interview, life sciences technologist Chris Dwan advocated for an “archive first” approach to data management. He argued that this approach has several key benefits for life sciences organizations.
First, it preserves every valuable piece of data along with its metadata at the time of creation. This ensures that organizations can always recover the original copy and its precious metadata even if data is modified or processed in a way that loses the metadata. Researchers will always be able to refer back to the original data and answer any questions that come up, like which instrument the data came from and when it was created.
Second, if the organization is going to back up the data anyway (which it really should), archiving first kills two birds with one stone. Even though it seems expensive because the data is stored twice, it’s actually a simple way to protect data against loss, and it can be done cost-effectively with an economical archive tier.
Finally, the simplicity and reliability of this approach are a huge bonus for life sciences organizations, which tend to have leaner IT teams that need simple and elegant solutions for managing growing data.
Archive the data first, and then don’t worry about it!
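The first benefit above, capturing each file together with its metadata at the moment of creation, can be sketched in a few lines. This is a minimal illustration, not a description of any particular product: the `archive_first` helper, the directory layout, and the metadata field names are all hypothetical, and a real archive tier would stream large files instead of reading them into memory.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_first(source: Path, archive_dir: Path, instrument: str) -> Path:
    """Copy `source` into `archive_dir` and write a JSON metadata sidecar.

    Hypothetical helper: layout and field names are illustrative only.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Reads the whole file into memory; fine for a sketch, not for huge files.
    data = source.read_bytes()
    digest = hashlib.sha256(data).hexdigest()

    # Name the archived copy by its checksum so that later changes to the
    # working copy can never overwrite the original.
    archived = archive_dir / f"{digest}{source.suffix}"
    sidecar = archive_dir / f"{digest}.json"
    if archived.exists() and sidecar.exists():
        return archived  # already archived; keep the original metadata

    shutil.copy2(source, archived)
    mtime = datetime.fromtimestamp(source.stat().st_mtime, tz=timezone.utc)
    metadata = {
        "original_name": source.name,
        "sha256": digest,
        "size_bytes": len(data),
        "instrument": instrument,  # supplied by the caller
        "source_mtime_utc": mtime.isoformat(),  # proxy for creation time
        "archived_utc": datetime.now(timezone.utc).isoformat(),
    }
    sidecar.write_text(json.dumps(metadata, indent=2))
    return archived
```

Calling `archive_first(Path("run1.fastq"), Path("archive"), instrument="sequencer-01")` as soon as an instrument writes a file would leave an immutable copy and a sidecar recording which instrument produced it and when, so the working copy can be modified freely afterwards.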
Archive For The Future
There’s a propensity to see archive as a sort of backwards-facing solution, because the term is so tied to historical archives and protecting old data. However, I encourage life sciences organizations to view archive as a future-facing investment that will reap huge rewards over the long run.
The life sciences community is increasingly realizing the need for better data management as it moves towards open science. Preserving and protecting data for the future will pave the way toward open science, where researchers are able to share the data they generate to accelerate the pace at which breakthrough discoveries happen.
Many funders now require researchers to submit data management plans with their proposals, and having an organizational archive strategy will only become more valuable as effective data management becomes a bigger focus in the life sciences space.
Ultimately, having a modern archive tier will open up exciting opportunities for life sciences organizations, including advanced analytics, machine learning/artificial intelligence, and data integration.
Modern archive solutions that support the usage of data in analytics, processing, and computing, such as in ML/AI workflows, allow organizations to use their vast stores of data to generate valuable research.
In addition, being able to consolidate data from multiple sources will prove extremely valuable as data integration opens opportunities for deeper research. As Dwan explained, healthcare is just beginning to see the benefits of data integration that the digital marketing world has enjoyed for years. When researchers are able to combine data from multiple sources in ways that have never been done before, we’ll experience incredibly transformative discoveries and technologies.
Life sciences organizations with a strong archive tier that can act as a central repository for their research data will have an edge in pursuing these opportunities.
There’s no question that archive is the most important piece of IT infrastructure life sciences organizations can invest in to prepare for the future. Archiving protects and consolidates data while making it accessible for analysis and computing, and ultimately enables organizations to pursue the most exciting research opportunities of our time.
Jeff Hughes is cofounder and CTO at Igneous Systems. Prior to his role at Igneous, Jeff was Director of Engineering for the Isilon Storage Division of EMC. Jeff previously held numerous engineering management positions at Isilon Systems, before its acquisition by EMC. Jeff received his Bachelor of Science degree in Computer Science from the University of Washington. He can be reached at jeff@igneous.io.