Match Cutting-Edge Research With Next Generation Data Solutions
Contributed Commentary by Josh Gluck
July 2, 2018 | Ask any researcher, and they will confess that there are never enough resources to optimally fuel the innovation engine. This is particularly true in genomics research and precision medicine, disciplines that have progressed rapidly in the last decade and show tremendous potential to transform the way we understand, treat, and—in the future—cure some of the world’s most complex diseases.
Researchers know, too, that their data infrastructures—including storage—are struggling, and in many cases, failing, to keep pace with burgeoning requirements driven by exponential growth in data and the demand from next generation algorithms and pipelines. Further compounding the problem, most organizations are not well positioned to move rapidly toward embracing emerging technologies, such as machine learning and artificial intelligence (AI), that can open even greater frontiers in the journey to precision medicine.
With data emerging as the undisputed driver for 21st Century medicine, it is becoming clear that we cannot expect to design our next generation data infrastructure on the technologies of the last century—technologies that were never intended to handle today’s extreme data volume and workloads.
So Much Data…and Potential
By 2020, at its current rate of data acceleration, genomic sequencing and analysis will produce 1 exabyte of stored data per year. By 2025, that requirement is expected to reach 1 zettabyte (one trillion billion bytes) per year.
Through the combined efforts of researchers at various universities, private industry partners, and other healthcare data experts, as many as 500,000 human genome sequences were available as of 2017. That number is expected to double every 12 months, with single institutions setting goals of obtaining up to 2 million unique genomic sequences. When one considers that sequencing a single genome requires roughly five terabytes of raw data storage, it is easy to understand the urgent need for platforms that can support exabyte scalability, data reduction, and a total cost of ownership that allows institutions to realize the value of the data they generate, process, and retain.
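To put these figures in context, the following back-of-envelope sketch (in Python) projects raw storage from the numbers cited above: roughly 500,000 genomes in 2017, a doubling every 12 months, and about five terabytes per genome. The doubling rate and projection horizon are simple assumptions for illustration, not a forecast.

```python
# Back-of-envelope projection using the figures cited in this article.
# Assumptions (taken from the text and treated as rough inputs, not a forecast):
#   - ~500,000 human genome sequences available as of 2017
#   - the count doubles roughly every 12 months
#   - ~5 TB of raw storage per sequenced genome

TB_PER_GENOME = 5
GENOMES_2017 = 500_000

def projected_raw_storage_eb(year: int) -> float:
    """Projected raw storage in exabytes (decimal units: 1 EB = 1,000,000 TB)."""
    doublings = year - 2017
    genomes = GENOMES_2017 * (2 ** doublings)
    return genomes * TB_PER_GENOME / 1_000_000

for year in range(2017, 2026):
    print(f"{year}: ~{projected_raw_storage_eb(year):,.1f} EB of raw genomic data")
```

Even under these crude assumptions, raw genomic storage starts in exabyte territory and climbs toward zettabyte scale within a decade, which is exactly the pressure on infrastructure described here.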
Time to Accelerate
While the number of big-data platforms and storage systems has increased significantly over the past few years, putting the potential for increased computational power into the hands of research facilities, many of these environments have yet to deliver an agile, performant, cost-effective and manageable solution.
One of the contributing factors has been reliance on legacy technologies, especially when it comes to storage. Legacy storage, based on mechanical spinning disks pioneered in the 1950s, was not designed for these workloads and has become a growing bottleneck for researchers. Nor is it equipped to support the new frontier of AI, deep learning (DL), and graphics processing units (GPUs). The ability to store and process very large datasets at high speed is fundamental to AI. DL (a form of machine learning that loosely mimics the way information is processed in the nervous system) and GPUs (processors designed to rapidly render images, animations, and video) are massively parallel, but legacy storage technologies were designed in an era with an entirely different set of expectations around speed, capacity, and density.
Compute and networking operations have continually exploited the performance rewards delivered by the doubling, roughly every two years, of the number of transistors that can fit on a chip. Now it is time for data centers to take advantage of the same potential in their storage systems and begin building new foundations with data platforms reimagined from the ground up for the modern era of intelligent analytics.
Several key characteristics define data-centric architecture and storage requirements for the next-gen genomics era:
- Silicon-optimized to support gigabytes per second of bandwidth per application, as opposed to disk-optimized storage. The performance of solid-state technology exceeds that of hard-disk-drive-based storage many times over (a rough sizing sketch follows this list).
- A highly parallel application architecture that can support thousands to tens of thousands of composite applications sharing petabytes of data, versus tens to hundreds of monolithic applications consuming terabytes of data siloed to each application.
- Elastic scale to petabytes that allows organizations to pay as they grow, with perpetual forward compatibility.
- Full automation to minimize management resources required to maintain the platform.
- The ability to support and span multiple cloud environments from core data centers to edge data centers, as well as across multi-cloud infrastructure-as-a-service (IaaS) and software-as-a-service (SaaS) providers.
- An open development platform versus a closed ecosystem built on complex one-off storage software solutions.
- A subscription-consumption model that supports constant innovation and eliminates the churn of endlessly expanding storage to meet growing needs and refreshing it every three to five years.
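As referenced in the first item above, a rough sizing sketch helps make these characteristics concrete. The per-application bandwidth, application count, and shared dataset size below are illustrative assumptions only, not measurements or vendor specifications; the point is how quickly per-application gigabytes per second aggregate into terabytes per second when thousands of pipelines share petabytes of data.

```python
# Hypothetical sizing sketch for a shared, highly parallel data platform.
# All inputs are illustrative assumptions, not measurements or vendor specs.

GB_PER_S_PER_APP = 2       # assumed bandwidth each analysis pipeline needs (GB/s)
CONCURRENT_APPS = 5_000    # assumed composite applications sharing the platform
SHARED_DATASET_PB = 10     # assumed size of the shared genomic dataset (PB)

# Aggregate bandwidth the platform must sustain, in TB/s.
aggregate_tb_s = GB_PER_S_PER_APP * CONCURRENT_APPS / 1_000
print(f"Aggregate bandwidth target: ~{aggregate_tb_s:.1f} TB/s "
      f"({CONCURRENT_APPS:,} apps x {GB_PER_S_PER_APP} GB/s each)")

# Time for one full pass over the shared dataset at that aggregate rate.
seconds = SHARED_DATASET_PB * 1_000 / aggregate_tb_s
print(f"One full pass over the {SHARED_DATASET_PB} PB shared dataset: "
      f"~{seconds / 60:.0f} minutes")
```

Even modest per-application targets multiply into aggregate throughput that disk-optimized systems were never sized to deliver, which is why the silicon-optimized, massively shared design described above matters.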
Without question, genomic and precision medicine are changing lives today. From here, the desire of the research and medical community is to safely and effectively accelerate progress and the positive impact these approaches have on patient outcomes. The ability to effectively and rapidly gather, manage, analyze, and gain insight from massive data stores is fundamental to this quest. It is time to advance the journey with a data-centric infrastructure designed for the genomic era.
Josh Gluck is the Vice President of Global Healthcare Technology Strategy at Pure Storage where he is responsible for Pure's healthcare solutions technology strategy, market development and thought leadership in healthcare. He can be reached at jgluck@purestorage.com.