Ruth Marinshaw on the State of Data in the Academic Sphere
By Irene Yeh
June 13, 2024 | Data is a double-edged sword in the industry. It is a necessary tool for technological development, especially artificial intelligence (AI), but the excess generation of data has become a thorn in the side of researchers and companies. Universities have their own set of problems to deal with. Ruth Marinshaw, CTO of Research Computing at Stanford University, discussed the state of data in the academic sphere and the challenges facing academic institutions in May’s Trends from the Trenches podcast episode.
Data is Not Cheap
One of the biggest challenges is the cost of data. “From a facility perspective [and] from a financial perspective, these systems and capabilities are not inexpensive,” said Marinshaw. The high cost of developing, curating, and documenting data leads to deeper issues, mainly that money determines access to resources and opportunities.
Compared to Silicon Valley giants and other hyperscalers, academic institutions struggle to compete and drive science forward simply because they have less financial backing. The disparity is “scary and disturbing” and needs to be addressed to stop gatekeeping.
“Campus computing will still be important, but no single campus can deliver the AI resources at the scale that we really need to continue to drive science forward,” added Marinshaw. So, what can be done to help?
Marinshaw cites programs, such as the National AI Research Resource (NAIRR), that provide partnerships between government agencies and private companies to encourage smaller developers and researchers to have a hand in the field. She also highlighted the importance of keeping an eye on what is needed, which has allowed Stanford faculty to establish programs like NAIRR. These programs and the people behind them are critical to ensuring that everyone—whether they are corporate-affiliated or university-affiliated—has access to the resources needed for their research and technological advancements.
FAIR is “No Small Task”
Though the FAIR principles (findable, accessible, interoperable, and reusable) aim to optimize the reuse of data and to create an efficient, accessible system, putting them into practice is a colossal challenge.
“I don’t want to say that the vision of FAIR is a fairytale,” said Marinshaw. “But I think it’s not as easy as it sounds.” She compared FAIR to the early days of cloud, where several people had the misconception that they could move their compute to the cloud, “think magical thoughts,” and have the cloud easily solve their problems.
Marinshaw warned that FAIR is not going to be easy to accomplish. Even with a more assertive approach, she said, “We’re a long way away.”
Stanford’s Take
Since its research computing department was created 12 years ago, Stanford has made great strides in data development and management. The department provides centralized computing (e.g., a traditional condo model of shared computing ownership); system administration; consultations for students, postdocs, and faculty; storage services; onboarding services; and three data centers for research equipment. Stanford also holds talks and panels about leveraging AI technologies in medicine and healthcare, has established the Human-Centered AI Institute, and has AI scientists who focus on the ethical uses of AI.
“We built trust by not going out and advertising broadly until we could demonstrate success,” said Marinshaw. She also elaborated that, while success should be the aim, it is okay when things don’t work out. “You have to build slowly and demonstrate success but also understand you’re not always going to be successful, and that’s okay,” she added.
As data continues to develop, innovation and progress will follow, but only if the current obstacles are addressed first. Universities and other academic institutions provide crucial resources and teams that can advance AI and other technologies. If they are not given the same support and funding as corporations, there is a risk of gatekeeping and, potentially, a monopolization of research and future technology.