Trends from the Trenches Part 1: Infrastructure Upheaval
By Allison Proffitt
July 18, 2023 | After years of threatening to pass the baton, Chris Dagdigian was absent from this year’s Trends from the Trenches session at the 2023 Bio-IT World Conference & Expo. But while he was elsewhere for personal reasons in May, his BioTeam colleagues carried on, giving an overview of the trends and themes they see in their consulting business with as much candor and insight—if not brute force and speed—as the session has become known for.
Ari Berman, BioTeam CEO, kicked off the conversation with a review of several recent infrastructure trends. While once considered more settled, infrastructure has been a critical issue in 2023, and on-premises computing has been increasing in complexity, he observed. He blamed the pandemic and the backlog of data analysis that researchers focused on when they were isolated from their more traditional, data-generating work.
“Everyone went home into isolation with their labs shut down, and researchers were like, ‘Hey, I can pass my time by starting to analyze all this data!’”
The resulting traffic impacted both cloud environments and on-premises high performance computing environments. Coupled with supply chain issues that drove up prices and slowed delivery of additional hardware, “Local IT could barely maintain the now super-stressed infrastructure,” he said. Meanwhile cloud marketers launched a campaign that kicked off what he called the second great cloud migration.
Compute and Interconnect On-Prem
One of the biggest changes in the post-pandemic shakeup, Berman said, is the number of companies investing significantly in hardware. He reported a big push by life sciences organizations for as many GPUs (graphics processing units) as possible. He noted that the number of BioTeam customers seriously considering investing in NVIDIA SuperPODs—at a cost of $7M to $40M—is “shocking.”
He warned, however, that not all compute jobs are fit for GPUs. “This is part of the AI tunnel vision problem,” he said. Only 15%-25% of life sciences codes are accelerated for GPUs, he said. Most institutional clusters need to be general purpose, he advised, and Intel is no longer the only shop in town: both AMD and ARM have built strong CPU ecosystems in recent years.
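Before committing to GPU-heavy hardware, a quick benchmark can show whether a given workload actually falls into that accelerated 15%-25%. The sketch below is a minimal illustration of that kind of check, not BioTeam's methodology: it times a dense matrix multiply as a hypothetical stand-in for a real pipeline step, and it treats CuPy as an optional dependency for the GPU side.

```python
# A minimal sketch of sanity-checking GPU speedup before buying GPU-heavy
# hardware. The dense matrix multiply is a stand-in for a real pipeline step;
# CuPy is assumed to be optionally installed and is skipped if absent.
import time
import numpy as np

def time_cpu(n=4000, repeats=3):
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    return (time.perf_counter() - start) / repeats

def time_gpu(n=4000, repeats=3):
    try:
        import cupy as cp  # optional; return None if no GPU stack is present
    except ImportError:
        return None
    a = cp.random.rand(n, n, dtype=cp.float32)
    b = cp.random.rand(n, n, dtype=cp.float32)
    cp.cuda.Stream.null.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    cp.cuda.Stream.null.synchronize()  # wait for the GPU to finish before stopping the clock
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    cpu = time_cpu()
    gpu = time_gpu()
    print(f"CPU: {cpu:.3f} s per run")
    print(f"GPU: {gpu:.3f} s per run ({cpu / gpu:.1f}x speedup)" if gpu else "No GPU stack found")
```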
While interconnect has stayed fairly straightforward, Berman pointed out that NVIDIA’s purchase of Mellanox and Intel’s spin-out of Omni-Path (with Cornelis Networks) are worth watching.
Composable architecture is also particularly intriguing, Berman said. “These are really interesting because you can rethink how you’re building your HPC. You don’t have to make every node super general purpose.”
Forever a Problem: Storage and Networks
Storage is “forever a problem for every reason,” Berman said. Cost is always a limiting factor, and most organizations choose to optimize volume and cost over performance. At the petabyte level and above, both cloud and local storage are very expensive. Berman flagged Hammerspace as a group making a valuable play in the data management space.
IBM’s Spectrum Scale (formerly GPFS) is still the most common on-premises high-performance clustered file system software in the life sciences, he said. Lustre from DDN is improving; it still doesn’t handle small files well, though that is on DDN’s roadmap, he reported. Next-generation storage architectures from VAST, Weka, Pure and others are worth watching, he added; people are generally happy with them.
There’s been a lot of buzz about moving the compute to the data, but Berman dismisses the notion at its extreme. “I don’t care who says, ‘Let’s bring the compute to the data.’ You’re still going to have to move [data] at least once, probably a whole lot more. Let’s remember that,” he warned.
Cutting down on data movement is not a bad goal, he added, but lab equipment is only able to store the last few experiments, data sharing is now a standard requirement, and both backups and analytics will likely happen elsewhere. With some labs generating more than a petabyte of data a year, networks are non-trivial requirements.
But enterprise networks are just not built for science or large-scale data transfer, Berman said. Security is designed to mitigate risk, not enable science; networks are optimized for web and email traffic, not large, sustained data transfers; and expertise is split and siloed. FedEx is still the most-used high-speed data transfer method, he said.
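The arithmetic behind that FedEx quip is easy to reproduce. The back-of-envelope sketch below uses purely illustrative link speeds and an assumed 70% efficiency factor (not figures from the talk) to estimate how long a petabyte takes to move.

```python
# Back-of-envelope transfer-time estimates for a petabyte of data over links of
# different speeds. The 70% efficiency factor is an illustrative assumption for
# protocol overhead and contention, not a measured value.
PETABYTE_BITS = 1e15 * 8
EFFICIENCY = 0.7  # assumed fraction of line rate actually achieved

links_gbps = {
    "1 Gb/s enterprise uplink": 1,
    "10 Gb/s campus link": 10,
    "100 Gb/s research path": 100,
    "400 Gb/s research backbone": 400,
}

for name, gbps in links_gbps.items():
    seconds = PETABYTE_BITS / (gbps * 1e9 * EFFICIENCY)
    print(f"{name:30s} ~{seconds / 86400:6.1f} days per petabyte")
```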
For a long time, the best alternative was a Science DMZ or perimeter network, he said. The idea was developed by the Department of Energy to create a science-first network between national labs: a fast, low-friction solution whose security policies are built into the network equipment itself rather than enforced by traditional firewalls, which are meant to optimize many small data flows. The Science DMZ sits outside of the main enterprise network, so it can be flexible and effective, though the use cases are fairly narrow. “It’s a really good option as long as IT gets what it’s for and doesn’t decide it’s super insecure and lock it down even more—which we’ve seen,” Berman quipped.
Zero-trust and microsegmentation are the newest “giant buzzwords” in networking and security. While Science DMZs can be viewed as a band-aid solution for balancing security and data transfer needs, Berman says the zero-trust approach is a “viable path toward rearchitecting enterprise networks.” The approach rearchitects the entire network to be mission-based rather than risk-based: it trusts nothing by default (laptops, sequencers, microscopes, Internet of Things sensors, and more) but still allows a path for data to move. A zero-trust network identifies data sources and types (almost “application-aware routing”) and creates fast paths through the network when needed. This is hard, Berman concedes, and people are only beginning to try it, but he finds it promising.
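As a rough illustration of what that “application-aware routing” idea could look like in policy form, the toy sketch below grants a fast network path only to flows that match an explicit rule and sends everything else through the default, inspected path. The device classes, data types, and path names are hypothetical; this is not a description of any vendor's zero-trust product.

```python
# A toy zero-trust-style policy lookup: nothing is trusted by default, and a
# fast path through the network is granted only when a known device class,
# data type, and destination match an explicit rule. All names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    device_class: str   # e.g. "sequencer", "laptop", "iot-sensor"
    data_type: str      # e.g. "bcl", "tiff", "email"
    destination: str    # e.g. "hpc-scratch", "internet"

# Explicit allow rules; anything not listed gets the default (inspected) path.
FAST_PATH_RULES = {
    ("sequencer", "bcl", "hpc-scratch"),
    ("microscope", "tiff", "hpc-scratch"),
}

def route(flow: Flow) -> str:
    key = (flow.device_class, flow.data_type, flow.destination)
    if key in FAST_PATH_RULES:
        return "fast-path"           # low-friction, high-bandwidth segment
    return "default-inspected-path"  # standard enterprise security controls

if __name__ == "__main__":
    print(route(Flow("sequencer", "bcl", "hpc-scratch")))  # fast-path
    print(route(Flow("laptop", "email", "internet")))      # default-inspected-path
```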
There are also new networking technologies on the horizon, he added: 1 Tb networks are in early release, and 600-800 Gb Optical Transport Networks are already out. He encouraged anyone doing science to look into high-speed networks.
Second Great Cloud Migration
Cloud marketers have taken advantage of all of these challenges to kick off what Berman called the second great cloud migration, convincing resource-strapped and suddenly-distributed organizations that the cloud is the solution to their compute issues. Since the pandemic, many organizations have launched aggressive cloud migration programs, planning “cloud-first” or “all-cloud” transitions away from local architectures—some of which were working quite well. While Berman agrees that cloud is a good solution to some problems, he railed against an absolute, all-or-nothing approach.
The first cloud migration ran from 2008 to 2014, Berman said, and he admitted that BioTeam helped many clients migrate to AWS and close their datacenters. The draw then—and now—is cheap, easy-to-manage, endless compute power with less staff needed. But the reality then—and now—is that cloud can’t replace all local infrastructure, it requires specialized IT skillsets, it doesn’t meet every scientist’s needs, and—in some cases—costs are surprisingly high (10-50x more than a local datacenter).
Cloud has matured considerably since that first migration, Berman notes. There are more cloud providers, and competition has spurred huge innovation. There are cloud advantages that cannot be reproduced locally: containerization and portability of workloads make data sharing easier, and virtual orchestration and serverless technologies let advanced users architect very sophisticated environments. Deep learning applications and specialized hardware are huge draws, and at today’s data volumes, local storage is challenging. Real high-performance computing can be done in the cloud, and it can be done securely, though Berman notes “it’s yours to mess up.”
Cloud-based HPC is more relevant than ever, added Adam Kraut, BioTeam’s director of infrastructure and cloud architecture. Generating and simulating the volumes of data needed to train machine learning models requires high performance computing, he noted, and that can be very valuable in the cloud.
But using cloud as your storage solution can be complex, hard to price, and easy to overdo. Plus, Berman adds, once your data are in the cloud, it costs money to get them out again. And can scientists use cloud out of the box? “Absolutely not—100% no.” Berman is emphatic. Even sophisticated users need to architect their cloud environments before they can get going. It’s getting better and there are new services that are fairly easy—he mentions Amazon Omics and DNAnexus—but, “the reality is, it’s just a lot of services that you need to string together.”
Taking a Hybrid Approach
To balance the strengths and challenges, BioTeam now advocates a hybrid computing model that preserves data ownership, keeping the source of truth locally while still exploiting the capabilities of the cloud.
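One minimal reading of “the source of truth is kept locally” is sketched below: checksums for each dataset are recorded in a locally held manifest, and any cloud replica is verified against that manifest before it is trusted for analysis. This is an illustration under assumed requirements, not BioTeam's reference architecture; the upload and download callables stand in for whichever cloud SDK an organization actually uses.

```python
# A minimal sketch of one hybrid pattern: the local copy stays the source of
# truth, and any cloud replica is verified against a locally held checksum
# manifest before it is used. The uploader/downloader are plain callables so
# the sketch stays cloud-agnostic; all names here are illustrative.
import hashlib
import json
from pathlib import Path
from typing import Callable

def sha256sum(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate(path: Path, manifest: Path, upload: Callable[[Path], str]) -> None:
    """Record the local checksum (source of truth), then push a copy to the cloud."""
    digest = sha256sum(path)
    records = json.loads(manifest.read_text()) if manifest.exists() else {}
    records[path.name] = {"sha256": digest, "remote": upload(path)}
    manifest.write_text(json.dumps(records, indent=2))

def verify_replica(name: str, manifest: Path, download: Callable[[str], bytes]) -> bool:
    """Before trusting the cloud copy, check it against the local manifest."""
    record = json.loads(manifest.read_text())[name]
    return hashlib.sha256(download(record["remote"])).hexdigest() == record["sha256"]
```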
But Berman’s argument against an all-cloud business strategy is built on more than simply a strengths-and-weaknesses chart. There is a fundamental mismatch between cloud business models and long-term scientific research goals, he argues. Cloud providers are private, for-profit companies, not utilities. Their offerings can change at any time—and frequently do. Scientific studies, on the other hand, run five to 10 years at their shortest, with some comprising hundreds of years of data. How long would it take to move a study’s worth of data if the company no longer offered the services you need? Cloud companies right now are not even disclosing how long you would have to get your data back if they were to shut down. “This should concern you!” Berman warned.
One caveat, Kraut clarified: when we say “hybrid cloud,” people hear “cloud bursting,” or quickly sending overflow jobs to the cloud when local resources are at capacity. But both Berman and Kraut noted that bursting is “not a solved problem.” Networks aren’t designed for quick transfer of large datasets to the cloud for a burst of work, storage I/O never quite works, and there are software dependencies that are not easy to spin up.
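For readers new to the term, the bursting decision itself is simple to state even though, as Kraut noted, the surrounding data movement and software dependencies are not. The toy sketch below, with hypothetical thresholds and link speed, captures only the easy part: deciding whether an overflow job is worth sending to the cloud at all.

```python
# A toy bursting decision: send overflow jobs to the cloud only when the local
# queue exceeds capacity AND the job's input data can be staged within a given
# deadline. Thresholds and link speed are illustrative assumptions; the hard
# parts (staging data, recreating software environments) are not shown.
from dataclasses import dataclass

LOCAL_SLOTS = 512       # assumed local core count
LINK_GBPS = 10          # assumed effective uplink to the cloud
MAX_STAGE_HOURS = 4     # don't burst if staging the inputs takes longer than this

@dataclass
class Job:
    cores: int
    input_tb: float

def should_burst(job: Job, queued_cores: int) -> bool:
    local_full = queued_cores + job.cores > LOCAL_SLOTS
    stage_hours = (job.input_tb * 8e12) / (LINK_GBPS * 1e9) / 3600
    return local_full and stage_hours <= MAX_STAGE_HOURS

if __name__ == "__main__":
    print(should_burst(Job(cores=128, input_tb=1.0), queued_cores=500))   # True: queue full, data movable
    print(should_burst(Job(cores=128, input_tb=50.0), queued_cores=500))  # False: staging takes too long
```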
Please think of cloud as a capability enabler, not a capacity solution, Kraut pleaded.