The Next Digital Arms Race In Life Sciences
Contributed Commentary By David Hiatt
August 23, 2017 | Life sciences research is in the middle of a digital arms race. As more research is conducted, more data is produced and more storage is required. It’s no secret that although research is leading to important discoveries, the pace of data creation is far outstripping the ability to store and analyze it. In fact, according to Gartner, only 5% of data created has been analyzed. Counter-intuitively, this threatens to slow the pace of important discoveries because more grant money must be allocated to sharing and preserving research data.
Next generation sequencing (NGS) is a key contributor to this phenomenon. Human whole genome data sets are typically hundreds of gigabytes in size, and current figures indicate that sequence data is doubling every seven to nine months, yet sequencing is still in its infancy. In 2014, an estimated 228,000 genomes were sequenced; by 2017 that figure is expected to jump to 1.6 million.
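To put those numbers in perspective, here is a rough back-of-the-envelope sketch. The per-genome size (200 GB) is an illustrative assumption consistent with "hundreds of gigabytes," and the projection simply extrapolates the seven-to-nine-month doubling rate cited above; neither is a precise figure.

```python
# Back-of-the-envelope projection (illustrative assumptions, not the article's figures):
# ~200 GB of raw data per whole genome, sequence data doubling roughly every 8 months.

gb_per_genome = 200                 # assumed average raw data per genome, in GB
genomes_2017 = 1_600_000            # genomes expected to be sequenced in 2017

total_pb_2017 = genomes_2017 * gb_per_genome / 1e6   # 1 PB = 1e6 GB
print(f"~{total_pb_2017:.0f} PB of raw sequence data from 2017 alone")

# If volume keeps doubling every eight months, project five years forward:
doubling_months = 8
growth = 2 ** (5 * 12 / doubling_months)
print(f"~{growth:.0f}x growth by 2022 -> roughly {total_pb_2017 * growth / 1000:.0f} EB")
```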
But what will happen when genomic sequencing is widely adopted in consumer healthcare and clinical settings? Or when longitudinal studies for patient care become routine? And when pathological and radiological imaging data is added to our electronic health records?
And genomics is only part of the problem. The field of connectomics maps the neural connections and pathways of the brain, and relies on nanometer-resolution electron microscopy to visualize these connections. The largest data sets are now in the 100-terabyte range and are expected to reach petabyte scale soon, driven largely by faster, higher-resolution electron microscopes. Detectors and imaging facilities coming on the market in the next three to five years are expected to produce data in excess of one terabyte per second! Researcher Dorit Hanein of the Sanford Burnham Prebys Medical Discovery Institute says that her current Titan electron microscope produces high-resolution images at a rate of 40 frames per second; her new microscope will push that rate to 400 frames per second.
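A simple sketch shows how frame rates translate into the sustained data rates a storage system must absorb. The detector geometry here (4,096 x 4,096 pixels at 16 bits per pixel) is an assumption for illustration, not a specification of the instruments mentioned above.

```python
# Sustained data-rate sketch for a hypothetical direct electron detector:
# 4096 x 4096 pixels at 16 bits (2 bytes) per pixel.

bytes_per_frame = 4096 * 4096 * 2          # ~33.6 MB per frame

for fps in (40, 400):
    rate_gb_s = fps * bytes_per_frame / 1e9
    print(f"{fps:>3} frames/s -> ~{rate_gb_s:.1f} GB/s sustained, "
          f"or ~{rate_gb_s * 3600 / 1000:.0f} TB per hour of imaging")
```

Even at these assumed frame sizes, a tenfold jump in frame rate turns gigabytes per second into tens of terabytes per hour, and next-generation detectors are expected to push well beyond this.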
New Massive Scale Projects Generate More Data
New large-scale scientific projects, such as the Blue Brain Project, the Human Connectome Project, the 100K Genome Project, BGI's million human, plant, and animal genomes initiative, the Human Microbiome Project, the BRAIN Initiative, and the Cancer Moonshot, are the primary drivers of data growth.
These projects will generate hundreds of petabytes of data in total. On top of that, downstream analysis will generate even more data. The burden of discovery in life sciences is shifting from scientific methodologies to analytical frameworks and bioinformatics.
As Figure 1 shows, the cost of sequencing is dropping roughly fivefold per year, while the cost of computing is dropping only about twofold. Genomic sequencing used to be the most significant cost factor, but with the advent of NGS that is no longer the case: analysis is now the largest cost factor and the biggest bottleneck to discovery. It is also very time consuming.
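A quick illustration of how fast those curves diverge, assuming both costs start at the same arbitrary value:

```python
# Diverging cost curves: sequencing cost falling ~5x per year,
# compute/analysis cost falling only ~2x per year (arbitrary equal starting costs).

seq_cost = compute_cost = 1000.0            # assumed cost in year 0 (arbitrary units)
for year in range(1, 6):
    seq_cost /= 5
    compute_cost /= 2
    print(f"year {year}: sequencing {seq_cost:8.2f}   analysis {compute_cost:8.2f}   "
          f"analysis is {compute_cost / seq_cost:.0f}x the cost of sequencing")
```

Within five years of equal footing, analysis costs roughly a hundred times more than the sequencing itself.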
Enemies of Scientific Discovery
Scientific research depends heavily on both compute and storage infrastructures. Most institutes use high performance computing (HPC) platforms to improve time-to-results. But the analytical pipelines and tools vary greatly, depending on the analysis being performed. This puts a significant burden on compute resources and the supporting storage system, which are often designed for general use, rather than optimized specifically for genomic analysis.
It’s important to note that over the last 60 years computing power has increased a trillion-fold, while storage performance has increased only modestly. This misalignment between compute and storage performance has a serious impact on data analysis, especially as data set sizes continue to grow.
When data must be retrieved from storage, scientific discovery is delayed. Storage systems are slower than the solid-state memory inside the compute node by orders of magnitude.
Table 1 shows typical response times for various classes of storage. A common way to avoid I/O delays is to add enough memory inside the compute node that all required application data can be held in memory or a caching layer. However, this method does not work in all cases.
Putting Storage I/O into Perspective
Anytime an application requires data that is not readily available in cache or memory, an input/output (I/O) request to the storage array must be issued. Depending on the location of the requested data, this can be a painfully slow process.
For comparison in human terms, Table 1 illustrates the equivalent distances and times of retrieving data from various forms of storage. Consider for a moment the impact on collaboration between researchers on different continents!
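Table 1's exact values are not reproduced here, but the spirit of the comparison can be sketched with commonly cited latency figures (assumed, not the article's own numbers), scaling one nanosecond of machine time to one second of human time:

```python
# Human-scale analogy in the spirit of Table 1, using assumed, commonly cited
# latencies. One nanosecond of machine time is scaled to one second of human time.

latencies_ns = {
    "L1 cache":            1,
    "DRAM":                100,
    "NVMe SSD":            100_000,       # ~100 microseconds
    "Networked array":     1_000_000,     # ~1 millisecond
    "Spinning disk (HDD)": 10_000_000,    # ~10 milliseconds
}

for tier, ns in latencies_ns.items():
    human_s = ns                           # scaled wait, in "human seconds"
    if human_s < 3600:
        print(f"{tier:20s} -> a {human_s:,}-second wait")
    else:
        print(f"{tier:20s} -> a {human_s / 86_400:.1f}-day wait")
```

On this scale, a cache hit is a one-second pause while a read from spinning disk is a wait of several months.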
Three important considerations significantly impact life sciences research: 1) data accessibility, 2) system scalability, and 3) I/O latency.
- Data Accessibility: Accessibility corresponds to whether your data can be accessed whenever and wherever it is needed. Modern data centers generally have multiple storage systems, each designed for a particular use case or workload. The result is islands of storage that make data hard to access from other systems, which impedes collaboration.
- System Scalability: Storage systems are often tied to a specific vendor or hardware design, and in those cases they will only scale to a certain size or performance level. Storage systems generally scale in terms of capacity or performance; the most versatile systems can scale on both planes simultaneously and independently.
- Latency: Latency corresponds to how long the application waits before it receives a response from the system. Latency takes many forms; the most common are related to storage I/O, the network, and the software stack. Modern computers and storage systems use solid-state flash memory to reduce I/O latency.
Figure 4 shows how advancements in nonvolatile memory have progressively reduced these hardware latencies. Note that this corresponds to a direct-attached SSD, so network latency has been removed.

Software latency, however, remains unchanged, becoming proportionally larger with each successive hardware generation. The upshot is that applications have not been optimized to take advantage of modern SSD-based storage. The software stack has become the new bottleneck to storage performance.
Network latency exists whenever storage is separate from the compute cluster.
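A minimal way to see that software overhead is to time the same small read taken through the operating system's storage stack and taken straight from memory. The sketch below assumes a writable /tmp directory and a Unix-like system; even when the block is already in the OS page cache, the syscall path costs far more than a plain memory copy.

```python
# Minimal sketch of software-stack overhead on a read (assumes a writable /tmp
# on a Unix-like system). The same 4 KB block is fetched once through the OS and
# file system, and once straight from a buffer already held in memory.

import os
import time

PATH = "/tmp/latency_demo.dat"                 # hypothetical test file
with open(PATH, "wb") as f:
    f.write(os.urandom(64 * 1024 * 1024))      # 64 MB of test data

with open(PATH, "rb") as f:
    in_memory = f.read()                       # same data held in memory

fd = os.open(PATH, os.O_RDONLY)
t0 = time.perf_counter_ns()
os.pread(fd, 4096, 0)                          # through the storage stack
t1 = time.perf_counter_ns()
_ = in_memory[:4096]                           # straight from memory
t2 = time.perf_counter_ns()
os.close(fd)
os.remove(PATH)

print(f"through the storage stack: {t1 - t0:>10,} ns")
print(f"straight from memory:      {t2 - t1:>10,} ns")
```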
Alleviating the Bottlenecks
So how can you alleviate these forms of latency?
Storage latency can be reduced by moving data to flash-based storage. Network latency can be minimized by using a specialized InfiniBand (IB) network to connect to storage or by using server direct-attached storage (DAS). However, an IB network adds unnecessary cost and complexity, and DAS results in server-bound silos of data.
There has been interest in moving compute resources to the data, rather than the other way around, but this is not possible with traditional storage architectures. This has led to the concept of converged infrastructure (CI): a single device that contains storage, network, and compute resources, virtualized to make them more malleable.
Converged systems overcome many of the limitations of traditional storage systems because the core function of storing data has been designed to scale out in capacity and performance.
What Does This Mean for Life Sciences?
By the end of 2017, research using next-generation sequencing of patient genomes will have produced the equivalent of a stack of Blu-ray disks approximately 30 miles high.
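As a rough sanity check on that image, assuming standard single-layer discs of 25 GB, each 1.2 mm thick (media assumptions, not figures from the article):

```python
# Sanity check on the 30-mile Blu-ray stack (assumed media specs):
# 25 GB per single-layer disc, 1.2 mm per disc, 1,609,344 mm per mile.

discs = 30 * 1_609_344 / 1.2            # discs in a 30-mile stack
data_eb = discs * 25 / 1e9              # 1 EB = 1e9 GB
print(f"~{discs / 1e6:.0f} million discs, roughly {data_eb:.1f} exabyte(s) of data")
```

Under those assumptions, a 30-mile stack works out to about 40 million discs, or on the order of an exabyte of sequence data.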
To position themselves for the greatest potential for new discoveries, life sciences organizations should carefully consider strategies to reduce these bottlenecks. A converged infrastructure and storage technology that maximizes data accessibility, scales to billions of files and hundreds of petabytes, and minimizes latency is key to future success.
David Hiatt is the Director of Strategic Market Development at WekaIO. Throughout his career, Hiatt has specialized in enterprise IT, business, and healthcare. Previously, Hiatt was the healthcare and life sciences market development leader at HGST until 2016. Hiatt received an MBA from the University of Chicago. He can be reached at dave@weka.io.