Federations and Favor: Hydra Genomics Federation Tests Model
By Allison Proffitt
July 11, 2023 | Federated storage systems were a hot topic at the Bio-IT World Conference & Expo in May. Susmit Shannigrahi, Assistant Professor at Tennessee Tech University, reported on Hydra, a federated model for genomics use cases funded by the National Science Foundation.
Like the federated model presented in the plenary program by the University of Pennsylvania and Intel, which focused on a medical imaging use case, Hydra as a model is really content-agnostic, Shannigrahi said. He focused on genomics as a pilot use case, though, because of the needs of the genomics community.
The current data publication model does not scale, he pointed out. Genomics data are produced by all sorts of institutions, but “There’s no federation of those storages. If a scientist wants to publish data, either it goes to the institutional repositories or one of the cloud storages or one of the repositories hosted by NCBI,” Shannigrahi said. “What we wanted to do was create a simple mechanism where you can take existing storage and build a federation out of it.”
A storage federation for genomics should be able to support very large datasets with simple publication mechanisms and provide persistent storage at the file level, Shannigrahi said. Hydra is designed to replicate data both for backup and geographic access. Data security is provided via access control.
There are two main ways to go about building a data federation, Shannigrahi explained. In one case, metadata are published and replicated within the federation—perhaps data that are already shared in multiple spaces. “So if you have an existing system that you want to publish through Hydra, all you have to do is take the names or endpoints, and then publish the metadata through Hydra. Anyone can go and query it, and all you return—instead of returning the actual biometric log—you return the URIs for your file,” he said.
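In that metadata-only style, a federation query returns pointers rather than data. Here is a minimal, hypothetical sketch of the idea; the index, dataset names, and URIs are illustrative, not Hydra's actual interfaces:

```python
# Hypothetical sketch of a metadata-only federation: the federation indexes
# dataset names and endpoints, and a query returns URIs, not the file itself.
# All names, paths, and URIs below are illustrative.

# Metadata index: dataset name -> URIs where the file already lives
METADATA_INDEX = {
    "/genomics/projectX/sample01.fastq.gz": [
        "https://repo.university-a.example.org/data/sample01.fastq.gz",
        "https://mirror.university-b.example.org/data/sample01.fastq.gz",
    ],
}

def query(dataset_name: str) -> list[str]:
    """Return the URIs registered for a dataset rather than the data itself."""
    return METADATA_INDEX.get(dataset_name, [])

if __name__ == "__main__":
    for uri in query("/genomics/projectX/sample01.fastq.gz"):
        print(uri)
```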
But Hydra takes a different approach. “The way we are doing it, we are publishing the whole file into the system. We’re making copies and then replicating and making them available through the system.” Hydra links various data repositories that remain under institutional control. In this pilot, data were contributed by Clemson University; the University of California, Los Angeles; and Florida International University, while Tennessee Tech is the primary organization driving the pilot.
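Publishing the whole file, by contrast, means ingesting a copy into the federation's storage and then pushing replicas out to other nodes. A minimal sketch of that flow, assuming a hypothetical storage path and a stand-in transfer function, might look like this:

```python
# Hypothetical sketch of the "publish the whole file" approach: the file is
# copied into local federation storage and then pushed to peer nodes, rather
# than only indexing its metadata. Paths and replicate() are stand-ins.
import shutil
from pathlib import Path

LOCAL_STORE = Path("/hydra/store")  # hypothetical local storage root

def replicate(path: Path, node: str) -> None:
    """Stand-in for the actual transfer of a stored file to a peer node."""
    print(f"would transfer {path} to {node}")

def publish(file_path: str, dataset_name: str, replica_nodes: list[str]) -> None:
    """Copy a file into local federation storage, then push copies to peers."""
    dest = LOCAL_STORE / dataset_name.lstrip("/")
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(file_path, dest)      # ingest the whole file locally
    for node in replica_nodes:         # then replicate to peer nodes
        replicate(dest, node)
```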
The Hydra research prototype was tested on the National Science Foundation’s FABRIC testbed, Shannigrahi said, making use of the testbed’s high-speed links between nodes. So far, he reported, large-scale experiments have been run with 10 to 1,000 simulated users, and workflows have been deployed to various cloud and computing platforms.
Design Priorities
Transparency is a particular goal. “The institutions should be able to tell which data is where, how they’re being replicated, how they’re being managed, etc.,” Shannigrahi said. However, he also stressed that for users, the process of accessing data should be seamless. Within Hydra, naming conventions are agreed upon, and users can request a dataset by name without having to know where it is stored. If their access rights are sufficient, users then receive a pointer to those data.
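In other words, a user asks for a dataset by its federation-wide name, the system checks the user's access rights, and, if allowed, hands back a pointer to the data. A toy illustration of that resolution step, with a hypothetical catalog, permission table, and names, could be:

```python
# Toy sketch of name-based access with an access-rights check. The catalog,
# permissions, user, and dataset name are all hypothetical.
CATALOG = {  # dataset name -> location pointer
    "/hydra/genomics/projectX/run42": "https://node-clemson.example.org/store/run42",
}
PERMISSIONS = {"alice": {"/hydra/genomics/projectX/run42"}}

def resolve(user: str, dataset_name: str) -> str:
    """Return a pointer to the named dataset if the user's access rights allow it."""
    if dataset_name not in PERMISSIONS.get(user, set()):
        raise PermissionError(f"{user} may not access {dataset_name}")
    return CATALOG[dataset_name]  # pointer to where the data live, not the data
```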
Because Hydra uses a global namespace, Shannigrahi said, almost any technology can be used to request a file. “You can request it over HTTP; you can request it over FTP. You can use newer technologies like Named Data Networking, or you can expose a Globus endpoint if you want to. Everything should be supported.”
Each node maintains a global view of the data with metadata but does not hold all of the data. Data are replicated to other nodes based on each node’s FAVOR rating. “Here is the magic part,” Shannigrahi said. FAVOR is a tunable parameter that balances conflicting considerations for the nodes: cost of storage, bandwidth, and space. Hydra calculates a FAVOR rating and replicates data to the three geographically nearest nodes with the highest FAVOR ratings. Currently, Shannigrahi said, FAVOR is a global parameter—all files are considered equal. But he said the team is working on making FAVOR ratings more granular, so that, for example, one node may be able to prefer some types of data over others.
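Shannigrahi did not spell out the FAVOR formula, but the placement decision can be pictured as a weighted score over the considerations he named, followed by choosing replica sites among nearby nodes. The weights, fields, and tie-breaking in this sketch are assumptions for illustration only:

```python
# Hypothetical sketch of a FAVOR-style placement decision. The actual formula
# was not given; this assumes a simple weighted score over storage cost,
# bandwidth, and free space, then picks three nearby nodes as replica sites.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    storage_cost: float  # lower is better (e.g., $/TB/month)
    bandwidth: float     # higher is better (e.g., Gbps)
    free_space: float    # higher is better (e.g., TB)
    distance_km: float   # geographic distance from the publishing node

def favor(node: Node, w_cost=1.0, w_bw=1.0, w_space=1.0) -> float:
    """Toy global FAVOR score: reward bandwidth and free space, penalize cost."""
    return w_bw * node.bandwidth + w_space * node.free_space - w_cost * node.storage_cost

def choose_replica_sites(nodes: list[Node], k: int = 3) -> list[Node]:
    """Pick the k geographically nearest nodes, breaking ties by highest FAVOR."""
    return sorted(nodes, key=lambda n: (n.distance_km, -favor(n)))[:k]
```

A more granular FAVOR, of the kind Shannigrahi said the team is working toward, would amount to letting each node (or each data type) carry its own weights instead of the single global set used here.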
Nodes send out “heartbeat messages” to the network; node failure is clear as soon as a node goes silent. Because each node has a global view of the system, when a node fails, the missing replicas can be recreated elsewhere. If a node were to crash and then be relaunched almost immediately, Shannigrahi said, the system could end up with four replicas of a dataset (the original three plus a rescue copy created when the node crashed). In that case, he said, the extra replica could be manually deleted.
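Heartbeat-based failure handling of this kind can be sketched very simply: record the time of each node's last heartbeat, treat prolonged silence as failure, and trigger re-replication of whatever the silent node held. The timeout value and function names here are assumptions, not Hydra's actual code:

```python
# Hypothetical sketch of heartbeat-based failure detection and repair.
import time

HEARTBEAT_TIMEOUT = 30.0  # assumed seconds of silence before a node is presumed dead

last_heartbeat: dict[str, float] = {}  # node name -> time of last heartbeat

def record_heartbeat(node: str) -> None:
    last_heartbeat[node] = time.time()

def detect_failures() -> list[str]:
    """Return nodes whose heartbeats have gone silent past the timeout."""
    now = time.time()
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

def repair(failed: list[str]) -> None:
    for node in failed:
        # Each surviving node's global view knows which datasets lived on the
        # failed node, so the missing replicas can be recreated elsewhere.
        print(f"re-replicating datasets that were stored on {node}")
```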
Lessons Learned
Since the Hydra project launched in 2021, Shannigrahi said, it has demonstrated several points of principle. First, a distributed federation for scientific datasets can improve the findability and accessibility of data. As a loosely coupled, self-organizing, and scalable open-source federated system, Hydra can manage file metadata even when the underlying data are large.