Redefining Storage to Enable Digital Twins
Contributed Commentary by Adam Marko, Hammerspace Life Sciences Field CTO
January 30, 2025 | A digital twin is a digital model of a physical, real-world object, such as a jet engine or a living organism, often updated at high frequency from multiple data sources. The concept was developed in the late 1990s to early 2000s for manufacturing purposes and has since been applied to other disciplines. In life sciences research, the development of big data generators such as next-generation sequencing, the Internet of Things, and faster data transmission through upgraded mobile and local networks has increased the feasibility of digital twins.
In life sciences, digital twins enable in silico (fully digital) modeling of patients or other organisms based on a wide variety of data inputs. With these digital models, researchers can simulate hypothetical scenarios and monitor biological processes. These complex modeling techniques can even be extended to entire populations for use cases such as monitoring public health or environmental impacts on groups of organisms.
While the concept of digital twins is similar in life sciences and manufacturing, human bodies are significantly more complex than industrial machines and systems. In addition to this complexity, there is a wide spectrum of applications in life sciences, ranging from modeling drug response in single cells to modeling entire populations for public health efforts. This further amplifies the challenges surrounding data generation, management, and analysis.
Making effective use of the data required to advance digital twin research creates immense infrastructure challenges. The massive amount of data generated to support this research will only worsen storage and processing bottlenecks. Integrating old datasets with new simulations presents a complex challenge. Multidisciplinary research efforts also require real-time access to data for analysis across globally distributed storage environments, both cloud and on-premises. Critical to these studies is managing raw data, tracking metadata, and ensuring data accessibility for geographically dispersed teams.
AI is an important component of this field, and improving digital twin simulations requires powerful GPU, storage, and networking infrastructure for data processing and management. Integrating legacy data with new AI models will allow for more precise simulations, but at the cost of maintaining access to large legacy data sets. Data sets from previous studies cannot remain scattered across different locations without unified access; if they do, model retraining and access to regionally specific GPUs, whether on-premises or in the cloud, will be difficult, if not impossible.
Modern CPUs, GPUs, high-performance storage, and networking are essential for analyzing vast amounts of biological data. Large-scale data generation poses unique challenges, such as the analysis of high-resolution 3D image data. When this is extended to multiple time points or large patient populations, the problem is further amplified. Distributed patient data presents a challenge since data is generated from multiple sources and instruments, creating storage and integration bottlenecks. Seamless data movement between cloud and on-premises storage is likely required to get results rapidly.
A potentially high-impact application of digital twins is modeling patient response in drug development. Patients could be monitored for drug response in real time, which can inform dosing schedules. Models can be constructed to predict patient metabolism and adjust dosing amounts, and this information can be integrated with genomic data, helping to develop predictive panels that set correct drug doses based on genetic markers. This brings the healthcare and life science research fields closer to the goal of precision medicine.
What can organizations do to prepare their storage for digital twins research? Existing legacy storage systems with siloed data will not meet researcher needs; as a result, the current data storage model has to be redefined. Organizations need to be able to integrate hybrid cloud environments to facilitate real-time data movement and analysis from disparate sources. Solutions must allow for seamless data flow between various storage environments for distributed teams, particularly in geographically dispersed research. This distributed data is a reality of digital twins research. Without access to a single storage namespace, data sharing and collaboration are hindered and research stalls.
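To illustrate what unified access can look like from a researcher's point of view, the minimal sketch below uses the open-source fsspec Python library to read files from on-premises and cloud locations through a single interface. The paths are hypothetical, and this is only one way to approximate a shared view of distributed data, not a description of any particular vendor's namespace.

    import fsspec  # requires the s3fs package for the s3:// scheme

    # Hypothetical inputs: one file on an on-premises NFS mount,
    # one in a cloud object store, both read through the same code path.
    inputs = [
        "/mnt/sequencing/run042/sample.vcf",
        "s3://twin-project-archive/run017/sample.vcf",
    ]

    for path in inputs:
        # fsspec picks the storage backend from the URL scheme, so the
        # analysis logic does not change as data moves between tiers.
        with fsspec.open(path, "rt") as f:
            header_lines = sum(1 for line in f if line.startswith("##"))
        print(path, "->", header_lines, "VCF header lines")

The point of the sketch is that when every location is reachable through one logical view, the analysis code, and the team running it, no longer needs to know where the data physically lives.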
Furthermore, scalable, single-namespace storage is only part of a full-featured data platform. To find, use, and manage distributed data effectively, researchers must be able to define and apply metadata for files and objects. This allows researchers to locate, share, and reuse their data. With custom metadata, researchers can achieve greater workflow integration with existing pipelines.
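As a simple illustration of file-level custom metadata (a minimal sketch using Linux extended attributes rather than any specific platform's metadata service; the file path and tag names are hypothetical), a pipeline step could tag its outputs and a later step could read those tags to select inputs for a simulation run:

    import os

    # Hypothetical output file from an imaging pipeline.
    path = "/mnt/imaging/patient_042/scan_t1.nii"

    # Attach custom key-value metadata as Linux extended attributes
    # (user.* namespace; values must be bytes).
    os.setxattr(path, "user.study", b"cardiac-twin-pilot")
    os.setxattr(path, "user.timepoint", b"week-06")

    # A later step can read the tags back to decide which files
    # to pull into a digital twin simulation.
    for attr in os.listxattr(path):
        print(attr, "=", os.getxattr(path, attr).decode())

A full data platform would expose richer, searchable metadata than per-file attributes, but the workflow pattern is the same: tag data where it is created, then query those tags to assemble inputs for analysis.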
Research IT professionals must continue to innovate to meet the rapidly advancing needs of digital twins data infrastructure. The legacy model of large data sets on different storage platforms and in different locations will not work for digital twins efforts. Careful storage infrastructure planning must be in place so that scientists can find, reuse, and analyze data inputs for digital twins research in a single namespace model.
Adam Marko is an experienced professional in the life sciences sector, currently serving as the Life Sciences Field CTO at Hammerspace. Previously, Adam held the position of Director of Life Science Solutions at Panasas and was the Scientific Solutions Lead at Igneous. Adam's consulting background includes roles as Senior Scientific Consultant and Scientific Consultant at The BioTeam, Inc. He can be reached at adam.marko@hammerspace.com.