Finding The Usable In FAIR Data At #BioIT18

By Benjamin Ross

June 6, 2018 | Last year at the Bio-IT World Conference & Expo, three teams competed in a Hackathon to see if they could make their datasets more Findable, Accessible, Interoperable, and Reusable (FAIR). This year, the organizers of the FAIR Hackathon wanted to see if they could take those principles to the next level.

“The approach we’ve been taking in these short hackathons is really a self-assessment of a resource against the 15 FAIR principles,” Erik Schultes, FAIR Data Scientific Projects Lead at the Dutch Techcentre for Life Sciences, told the Hackathon attendees in Boston last month. “Thinking back a year ago, the best we could do was basically hand over these principles, have people read them, think about their data resource, and then come up with a number between 1 and 5 as a qualitative assessment.”

While this was a great exercise in understanding the FAIR principles, Schultes said its qualitative, reflective nature made it difficult to reproduce.

To combat this, Schultes and the rest of the Hackathon organizers developed a set of core metrics and a questionnaire to go along with those metrics.

According to Schultes, these metrics help establish a before-and-after comparison in the Hackathon format. Participants come into the room and complete the self-assessment to get a sense of their dataset’s FAIR-ness at the outset. They then have 24 hours to hack their way to increased FAIR-ness before taking another snapshot of how their dataset stacks up against the FAIR principles.
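The before-and-after comparison can be pictured with a small sketch. The article does not describe the organizers' actual metrics, questionnaire, or weighting, so everything here is an illustrative assumption: the labels are the 15 published FAIR principle identifiers, each is answered yes or no, all count equally, and the `fair_score` and `improvement` helpers and the sample answers are hypothetical.

```python
# Hypothetical sketch of a before/after FAIR self-assessment.
# Assumptions (not from the Hackathon itself): one yes/no answer per
# FAIR principle, equal weighting, and a score expressed out of 100.

FAIR_PRINCIPLES = [
    "F1", "F2", "F3", "F4",          # Findable
    "A1", "A1.1", "A1.2", "A2",      # Accessible
    "I1", "I2", "I3",                # Interoperable
    "R1", "R1.1", "R1.2", "R1.3",    # Reusable
]


def fair_score(answers):
    """Return the share of principles answered 'yes', on a 0-100 scale."""
    passed = sum(1 for p in FAIR_PRINCIPLES if answers.get(p, False))
    return 100.0 * passed / len(FAIR_PRINCIPLES)


def improvement(before, after):
    """Return the change in score, in points, between two assessments."""
    return fair_score(after) - fair_score(before)


# Made-up answers for an imaginary resource, before and after hacking.
before = {"F1": True, "A1": True, "A1.1": True}
after = {**before, "F2": True, "F4": True, "R1": True}

print(f"before: {fair_score(before):.1f}")
print(f"after:  {fair_score(after):.1f}")
print(f"gain:   {improvement(before, after):.1f} points")
```

Scoring the same checklist twice is what makes the exercise repeatable: two people answering the same yes/no questions should land on the same number, which a 1-to-5 impression cannot guarantee.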

Four teams took the opportunity to analyze pre-established datasets from Memorial Sloan Kettering and Dana Farber Cancer Center, the Broad Institute, Collaborative Drug Discovery, and the Jackson Laboratory.

The Results Are In

When all was said and done, each of the teams greatly improved their overall FAIR scores, the average improvement being 24.5 points out of 100.

“The scoring for this Hackathon was not meant to be competitive,” Schultes said. “This was as much an assessment of the metrics and the ruling as it was the FAIR-ness of the resources themselves.”

Raising the bar is always the goal with these Hackathons, Schultes continued. “If everyone scored 100% on whatever metrics we had, the metrics would only get harder.”

For Schultes, the metrics are useful in prioritizing what is important in datasets, making “FAIR” a more tangible concept.

“True FAIR-ness I think is difficult,” said Schultes, “so it’s nice to know what are some of the easy things we can do now, and what are the harder things I can invest in from there.”

It’s The Little Things

“What we worked on in the Hackathon primarily is the awareness for the cBioPortal about FAIR [principles],” Kees van Bochove, CEO at The Hyve and team member of the Memorial Sloan Kettering and Dana Farber Cancer Center’s group, told the audience in his report. “One of the questions we had among our group was, ‘Why would we bother with FAIR when everyone in cancer genomics is already using cBioPortal?’ This was an interesting discussion, and one that still needs to take place.”

The team of representatives from Memorial Sloan Kettering, Dana Farber Cancer Center, and The Hyve worked with cBioPortal, an open source software platform that enables interactive, exploratory analysis of large-scale cancer genomics data sets.

“There were a lot of small things we could do to make cBioPortal a lot more FAIR,” van Bochove said. “For example we’re already working on a new open API, which is a small step to publish that link… the other interesting thing is that Memorial Sloan Kettering and others take a lot of time curating data sets from their sources. So there’re a lot of interoperability aspects there, just not according to the FAIR principles. This Hackathon has allowed us to see even more action items that will make the cBioPortal FAIR-er.”

By the end of the Hackathon, the cBioPortal team was able to improve the platform’s FAIR score from 39% to 59%.

The More The Merrier

The team from the Broad Institute chose to FAIRify their Single Cell Portal, a visualization portal for single cell RNA-seq data.

Eric Weitz from the Broad Institute told the audience that he was surprised by how much support his team received from outside the Broad Institute.

“We were hoping for at least 2 to 3 external team members for this Hackathon, but we wound up with close to 18 external members,” Weitz said. “That’s a great problem to have, but it also requires brainstorming what each member of the team’s role will be.”

After deliberation, the team divided its work into two tasks: improving the study metadata within the Single Cell Portal, and improving the analysis metadata. Weitz described the latter as conforming to community standards, but with restricted audience access and not easily machine-findable.

Together, these tasks led to increased interoperability and reusability on the study metadata side, and increased findability and accessibility on the analysis side. Overall, the Single Cell Portal’s FAIR score increased from 29% to 52%.

Tackling Assay Management FAIR-ly

The team representing Collaborative Drug Discovery (CDD) approached the Hackathon by asking direct questions.

“We asked [our team], ‘What issues have you encountered with assay management?’” Samantha Jeschonek, Research Consultant at CDD, said. “Anybody who’s been in science won’t be surprised by the results: everything.”

Citing issues such as firewalls and clinical data, Jeschonek laid out the many ways that obtaining scientific information has made FAIR-ification difficult. According to Jeschonek, CDD has been developing BioAssay Express, a human-guided machine learning tool for curating assay text. Jeschonek said BioAssay Express was designed, at its core, to be FAIR.

“BioAssay Express gives the user an interface to enter information about their protocol, and using the combination of NLP, machine learning, and data curation to not only pull from input protocol, but to also alter it,” said Jeschonek.

After receiving an initial FAIR score of 48%, the CDD team noticed a lack of documentation: features had been built into BioAssay Express but had never been written up in a Word document or published at a URL.

The team was also initially confused about how to measure their interface against the FAIR principles.

“One of the biggest problems people had was understanding what the terms of FAIR meant,” Jeschonek said. “We had a discussion and realized that in actuality we had met some of the requirements we previously believed we had missed.”

When the CDD team went back to the drawing board, they increased their FAIR score to 75%. However, the team was not the only one that had difficulty coming to terms with the FAIR principles.

Taking FAIR To Work

“We felt there should be a ‘U’ in the acronym. There should be a focus on the ‘usability’ of the data,” Anne Deslattes Mays, Principal Computational Scientist at the Jackson Laboratory and team leader of the Jackson Lab Hackathon team, said. The team was formed on the fly at the event to look at the problem of the sheer number of ontologies in their data sets.

“Our goal was to leverage the existing data sets to improve their FAIR-ness,” Mays said. “This involved learning what FAIR was, what its principles were.”

The Jackson Lab team started with a use case: identifying mutations that are lost through passages, and then associating those lost mutations with protein domains if they were present within the data set.

After receiving an initial FAIR score of 26%, the team decided to build a viable data model from the use case. Using that data model and the original data sources, the team then worked to create a Web Ontology Language (OWL) model so that anyone looking to use the data could do so by consulting the OWL definition. This helped boost the team’s FAIR score to 54%.

The Jackson Lab team wasn’t able to do as much as they wanted during the Hackathon, though Mays said the experience has left them with opportunities for improvement. For instance, the team is looking to develop and publish a metadata longevity plan.

“This [Hackathon] gave us the ability to set some goals,” Mays said. “Some we achieved during the Hackathon, some we’re looking forward to implementing moving forward.”