What We’re Discovering About Scientific Computing During A Pandemic
By Allison Proffitt
May 18, 2020 | When an all-star panel convened for the second half of Bio-IT World’s Trends from the Trenches event earlier this week—following the traditional lightning round summary of industry trends delivered by Chris Dagdigian—the elephant “in the room”, of course, was that we were not in the same room.
Instead, Susan K. Gregurick, Associate Director, Data Science (ADDS) and Director, Office of Data Science Strategy (ODSS) at the National Institutes of Health; Eli Dart, Network Engineer, Science Engagement, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory; Matthew Trunnell, Data Commoner-at-Large and former Vice President and Chief Data Officer of Fred Hutchinson Cancer Research Center; Vivien Bonazzi, Chief Biomedical Data Scientist at Deloitte; and Fernanda Foertter, Senior Scientific Consultant for AI+HPC, BioTeam, joined Dagdigian and panel moderator Ari Berman, CEO of BioTeam, from their own living rooms and offices. But that didn’t keep the socially distanced panel from digging into infrastructure, standards, privacy issues, and more, all while staying COVID-19 aware.
[Editor's Note: If you missed the live session, you can still view the recording. It will be available on demand here: https://www.bio-itworldexpo.com/trends-in-the-trenches-webinar]
With regard to looking at scientific computing during a pandemic, Matthew Trunnell set the scope early on. “We have this tendency toward tech solutionism,” he said. “There are a lot of really hard problems to be dealt with right now. We tend to gravitate toward technical solutions first, and I am seeing this at every level. I sit on [the COVID-19 Technology Task Force], which has just focused on leveraging efforts like Google and Apple’s contact tracing as sort of a silver bullet.”
That focus, though, oversimplifies the conversation, he warned. According to Trunnell, the real challenge right now is “simply the biology part…I think we tend to underestimate that sometimes.” Our view is fragmented, and our information is incomplete, he observed. There’s a great deal of variation in how COVID-19 manifests and in how the disease progresses across different demographics.
“It really is a time when the only way we’re going to be able to do this quickly is to figure out how to share information more quickly,” Trunnell said.
Big Problems Moving Big Data
But data sharing—during a pandemic or otherwise—is a complex problem comprising the logistics of moving the data itself, the cultural and privacy concerns around it, and the standards and permissions that govern its use.
Eli Dart took on the logistics questions first. He said the goal is for data to be where it’s productive. Sometimes that means moving the compute to the data, but in some cases moving data can be a huge challenge for researchers. “Now they’re wrangling: how do I get my 20-terabyte dataset over to where it needs to be?” he said. “And that is just not a good use of scientists’ time.”
The vision is keeping “all the tech in the tech bucket” and letting humans orchestrate that in a way that is productive and scalable. “I don’t want to sort of just be Mr. Science DMZ over here,” Dart said—referring to the network architecture that he has pioneered—“But there’s a set of architectural models that allow us to do that in a way that’s consistently performing.” (Read more about Science DMZs and Dart’s take on the future of scientific computing.)
Dart believes that we are already part way to a fully networked vision, but the pressures of the pandemic may help to push us all the way. “If your workflow was already 100% network-based, and all you’re doing is orchestrating data movement, data placement, data analysis between large-scale infrastructure systems, you can be home and socially distanced—right?—and continue your work. If your work requires you physically carrying around USB hard drives, suddenly you’re stopped.”
Dagdigian advocated for cloud environments, though he acknowledged that nuances like egress fees still bug him. “The really nice thing about the cloud is the agility at which we can throw together multi-party collaborations,” he said. “It makes me sad, as an IT person, when I see a scientist spending hours or days of her time on simply moving data or tying up her corporate laptop doing a big data copy!”
Dagdigian praised Globus, a service out of the University of Chicago that provides secure, reliable research data management tools. “With Globus Connect, I can deliver to an [Amazon] S3 bucket. I can deliver to an on-prem storage system. I could drop it onto a data transfer node at one of my collaborators.” (Read more about Globus.)
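For readers curious what such a scripted handoff looks like, here is a minimal sketch using the Globus Python SDK (globus_sdk). The endpoint UUIDs, paths, and the pre-obtained access token are placeholders, and the sketch assumes the relevant Globus Connect endpoints are already configured; it is an illustration, not a workflow described by the panel.

```python
import globus_sdk

# Placeholders -- real values would come from your Globus account and endpoints.
TRANSFER_TOKEN = "REPLACE_WITH_TRANSFER_ACCESS_TOKEN"
SOURCE_ENDPOINT = "REPLACE_WITH_SOURCE_ENDPOINT_UUID"     # e.g., a lab's data transfer node
DEST_ENDPOINT = "REPLACE_WITH_DESTINATION_ENDPOINT_UUID"  # e.g., an S3-backed or on-prem collection

# Authenticate against the Globus Transfer service with the token.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe the transfer: checksum verification and a human-readable label.
tdata = globus_sdk.TransferData(
    tc,
    SOURCE_ENDPOINT,
    DEST_ENDPOINT,
    label="20 TB dataset handoff",
    sync_level="checksum",
)
tdata.add_item("/data/experiment_run/", "/incoming/experiment_run/", recursive=True)

# Submit the job; Globus manages retries, integrity checks, and notifications.
task = tc.submit_transfer(tdata)
print("Submitted Globus transfer, task id:", task["task_id"])
```

The point of the example is the division of labor Dagdigian describes: the scientist states what should move where, and the service handles the long-running copy instead of a tied-up laptop.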
In fact, Dagdigian is hoping that the pandemic may prove to be the last straw for data transfer by hard drive. “If this situation we’re in now causes the death of physical schlepping around of USB drives, that’ll be a great outcome for science,” he said. “I think maybe this might be the kick in the pants that we need.”
Coronavirus As Catalyst
The next challenge may be gathering the types of data most likely to fuel breakthroughs, and getting them into usable formats.
“We can no longer be islands of knowledge. I think we have to be a world of knowledge, and I think we need unified approaches to do that,” Bonazzi said. She warned that just giving lip service to data sharing would be too simplistic. “We don’t want to just get to the end point. We have to pay attention to the data and its structure. That touches data governance methods,” data cleanliness, and interoperability, she said.
Groups have been working on standards for data sharing—Bonazzi highlighted the Research Data Alliance (RDA) and the Global Alliance for Genomics and Health (GA4GH) as examples. Susan Gregurick pointed out the Radiological Society of North America’s laudable work on implementing rapid and flexible imaging standards and associated ways of collecting data.
At NIH, Gregurick said, the agency is working on data sharing, data platforms, and data strategies which will be shared through RFPs, funding announcements, and webinars. And it’s not just NIH, Gregurick emphasized. “This is happening at NSF; it’s happening at DOE; it’s happening at VA. We’re all working together. It’s an unprecedented amount of collaboration and it’s really been invigorating and exhausting!”
Gregurick likened our current situation to a moment of catalytic evolution—a chance to leap forward. She envisions enlisting not just NIH and other government entities, but healthcare providers and academic researchers nationwide to use the same data standards, map them to a common data model, and share them. “We would really be doing the community a great favor,” she said.
Rights, Permissions, Security: Oh My!
Thus the last hurdle may be privacy. Berman jumped straight to the point: Have we overblown privacy a little bit? Do we spend too much time worrying about what might happen to our data?
It depends on which data, Dart replied pragmatically. There’s a big difference, he said, between somebody’s pre-publication chemistry paper being leaked and a nefarious actor identifying me in a way that has personal consequences for my relationship with the rest of society. But the truth is, we aren’t very good at differentiating between the two. “Humans are not good at reasoning about low-probability, high-cost outcomes,” Dart said.
Trunnell agreed. “We’re not actually thinking about it very constructively. In general, we don’t know how to think about these problems very well, and so we tend to take an extreme position.”
Yet time is of the essence. “I think what we need to do is fly the plane and build it at the same time,” Bonazzi said. We will make mistakes, she admitted, but, “I think science is about that, right? It’s about realizing we take and test something, get it wrong, and improve it.”
We don’t have to start from scratch, Dart interjected. Though the stakes are much lower in the physical sciences—there’s not as much of a risk of negative human impact—we can still apply some of the learnings. “Now what I would hope is that we can take the technologies out of the rest of the physical sciences and apply a policy layer that allows us to make decisions about who gets to do what with what data, and still get all the scalability that we can get in the physical side.”
Fernanda Foertter argued that the coronavirus pandemic is exactly the right kind of puzzle for AI: a complicated problem with many variables being tested. And while using human health data raises well-known issues of data privacy and of sharing data across national borders, those challenges can be addressed with a federated learning architecture, Foertter suggested.
Federated learning was first introduced by Google in 2017 and used to train autocorrect models for texting. For biomedical data—like COVID-19 patient data—data owners would train a model locally using only their own data. Each model is then shared with an aggregation server, which creates a consensus model of all the accumulated knowledge from the data owners, even though the raw data never leave their institutions.
“In theory, you could ameliorate or not have to deal with these issues of border and information security and do the [model] training on site,” Foertter explained. It’s a proposal from computer science that could scale quite nicely to the challenge at hand.
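To make that aggregation step concrete, here is a minimal, illustrative sketch of federated averaging in plain Python with NumPy. The model (a simple logistic regression), the synthetic data, and the three hypothetical sites are inventions for illustration only; they are not drawn from Foertter’s remarks or from any real COVID-19 dataset.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's local training: a few gradient steps of logistic regression
    on data that never leaves the institution."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        grad = X.T @ (preds - y) / len(y)      # logistic-loss gradient
        w -= lr * grad
    return w, len(y)

def federated_round(global_w, sites):
    """Aggregation server: average the locally trained models,
    weighted by each site's sample count, into one consensus model."""
    updates = [local_update(global_w, X, y) for X, y in sites]
    total = sum(n for _, n in updates)
    return sum(w * (n / total) for w, n in updates)

# Hypothetical example: three "hospitals" with synthetic 4-feature data.
rng = np.random.default_rng(0)
sites = [(rng.normal(size=(100, 4)), rng.integers(0, 2, 100)) for _ in range(3)]

global_w = np.zeros(4)
for _ in range(10):                            # ten communication rounds
    global_w = federated_round(global_w, sites)
print("consensus model weights:", global_w)
```

Each simulated site runs its gradient steps on data it never shares; the server only ever sees model weights, which it averages in proportion to each site’s sample count before sending the consensus model back out for the next round.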
“We are itching and dying to get this framework—essentially of data sharing and data utilization—ready so that the promise of AI can really come true,” she said.
“There is an opportunity here for technology. I think if we keep waiting for the best-case scenario—assuming everything is easily shareable and we could, in some way, maintain privacy—we’re still going to have issues with where the data is located physically,” Foertter said. “We just have to, not necessarily ignore, but find ways that we can work around the system that exists today.”
And there’s really no better time to try.
“The idea that we can’t do anything unsafe, we need to get away from that,” Dart said. “This idea that we can’t take any risks, that’s not going to work anymore.”