Genome Citation Service Tracks Genomes Across Research

By Allison Proffitt

March 28, 2023 | Kjiersten Fagnan is an applied mathematician enamored of biology’s messy datasets. Now, as Chief Informatics Officer at the Department of Energy Joint Genome Institute (JGI), located at Lawrence Berkeley National Laboratory, she finds the computational and data challenges that come up in bioinformatics to be very different from what you find in applied math or partial differential equations.

“Everything’s just a little bit fuzzier and more difficult and more statistical. I’ve had to embrace the fact that all the answers actually come from a distribution. You don’t have a single answer. You have to think about probabilities. It’s a lot of fun. There’s always something new I’m learning about biology,” she told Stan Gloss on the latest episode of Bio-IT World’s Trends from the Trenches podcast.

Fagnan is a return guest to Bio-IT World. Several years ago she and Gloss spoke about FAIR data and the JGI Archive and Metadata Organizer, and now she gives a sneak peek of her talk in May at the 2023 Bio-IT World Conference & Expo on JGI’s Genome Citation Service.

JGI’s data isn’t only messy, it’s big. The Institute has more than 14 petabytes of historic data on tape, which requires about 10 petabytes of spinning disk to support the Institute’s systems and enable restoration from tape, Fagnan explains. When the Department of Energy began pushing for digital object identifiers (DOIs), JGI’s total was more than a billion data objects.

Fagnan spent some time thinking about what globally unique identifiers could accomplish. Goals included knowing how and where data were used, tracking reuse, and identifying the value of particular data or calculating a return on investment for some datasets. Fagnan began talking to scientists about how they really used data. Would they add 1,000 DOIs to the bibliography of a paper if they gathered data from 1,000 datasets? Would that accomplish their goals?

“What we found was that a DOI isn’t necessarily the silver bullet… you might think it is,” Fagnan said.

Instead, she said, JGI began collaborating with a company called NamesforLife that was already under contract with the Department of Energy. For ten years, NamesforLife had been scraping the literature to gather data and build associations to manage the problem of evolving nomenclature in microbial research. The company created a globally unique identifier to help researchers link work on the same microbe even as naming conventions change over time. It was a targeted and appropriate use of DOIs, Fagnan thought.

“If we have these globally unique identifiers that have been associated with JGI datasets for more than a decade, that already include National Center for Biotechnology Information accession numbers and identifiers that have been issued by resources like IMG [JGI’s Integrated Microbial Genomes & Microbiomes (IMG/M) system] or GOLD [JGI’s Genomes OnLine Database],” Fagnan wondered, “how much are those identifiers being used already to cite datasets? Would new DOIs be duplicative?”

“This is a big, messy problem, and it takes this premise: ok, we have globally unique identifiers. Does that make it easier to find all of the associated data?”

We’re looking at the problem wrong, Fagnan insists. The goal isn’t a perfect identification system or a perfect ontology. Seeking that perfect solution is a distraction. Instead, a solution must acknowledge reality and create a system that can handle the inherent messiness of the data.

The Genome Citation Service is, she hopes, a structure for that sort of system. It takes globally unique identifiers, metadata, and dataset identifiers and returns publications that have a high likelihood of having used that data.

This could only have been built with NamesforLife, Fagnan emphasizes, and so last year, when the NamesforLife owner retired, JGI and Berkeley Lab acquired the intellectual property, services, and databases of the company. Once the final intellectual property agreements are in place, these services will be available through JGI. Further user interviews are planned to fine-tune the interfaces researchers most want for accessing these services and resources, she added.