Text Mining Life Science Abstracts Vs. Full-Text Articles

December 20, 2017

Contributed Commentary by Mike Iarrobino

Life science researchers use advanced text mining techniques for the rapid review and analysis of large volumes of scientific literature, searching for patterns and connections to drive drug discovery and inform business decisions.

Because article abstracts are easily accessible through databases such as MEDLINE, many researchers rely on abstracts for text mining rather than sourcing full-text articles. Abstracts generally cover an article’s key points, and some researchers consider them good enough for their purposes. Here are some of the reasons we’ve heard for mining abstracts rather than full text:

  • “More text means more room for false positives.”
  • “Abstracts are more easily accessible via biomedical databases.”
  • “We don’t have the time or resources to spend on additional data cleansing and normalization work for unstructured content.”

Each of these concerns is valid to a certain extent, but new research from bioinformaticians at the University of Copenhagen and the Technical University of Denmark confirms that essential information remains hidden when only abstracts are mined.

The study, available on bioRxiv (DOI: 10.1101/162099), analyzed a corpus of more than 15 million full-text scientific articles published between 1823 and 2016, their matching abstracts, and the full set of 16.5 million MEDLINE abstracts. The full-text articles, mainly in PDF format, included articles published by Elsevier and Springer, as well as those in the open-access subset of PMC.

The team compared their findings from the corpus of full-text articles to the corresponding results from the matching set of abstracts, and from the full set of MEDLINE abstracts.

Here are some of the report’s main takeaways.

Full-Text Advantages

To explore the capability of text mining full-text articles, the team extracted protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system.
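To make the idea concrete, here is a minimal sketch of dictionary-based entity recognition followed by sentence-level co-occurrence counting. It is not the study's system: the tiny `GENES` and `DISEASES` lexicons, the function names, and the example sentences are all hypothetical, and a production pipeline would use curated dictionaries with many synonyms per entity.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy lexicons (assumption, not taken from the study).
GENES = {"BRCA1", "TP53", "CFTR"}
DISEASES = {"breast cancer", "cystic fibrosis"}

def find_entities(sentence, lexicon):
    """Return the lexicon terms that occur in the sentence as whole words."""
    found = set()
    for term in lexicon:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.add(term)
    return found

def cooccurrences(sentences):
    """Count gene-disease pairs mentioned together in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        genes = find_entities(sentence, GENES)
        diseases = find_entities(sentence, DISEASES)
        counts.update(product(sorted(genes), sorted(diseases)))
    return counts

# Invented example text: the "full text" contains the abstract plus body sentences.
abstract = ["BRCA1 mutations raise the risk of breast cancer."]
full_text = abstract + [
    "We also observed TP53 variants in several breast cancer samples.",
    "CFTR was sequenced as a control.",
]

pairs_abstract = cooccurrences(abstract)
pairs_full = cooccurrences(full_text)

# Associations that only surface when the body of the article is mined.
only_in_full_text = set(pairs_full) - set(pairs_abstract)
print(only_in_full_text)
```

In this toy example, the TP53 and breast cancer pairing appears only in the body text, so it is missed when the abstract alone is mined.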

The results showed that mining the full-text article corpus outperformed the same analysis using abstracts only in every single case.

“Through rigorous benchmarking and comparison of a variety of biologically relevant associations, we have demonstrated that a substantial amount of relevant information is only found in the full body of text,” the report authors wrote.

This finding isn’t the first of its kind. Back in 2010, a study published in the Journal of Biomedical Informatics (DOI: 10.1016/j.jbi.2009.11.001) found that only 8% of the scientific claims made in full-text articles were found in their abstracts.

We’re also aware of knowledge management teams who have conducted similar benchmark comparisons. In one such case, a life sciences knowledge management team extracted more entity relationships overall from the full MEDLINE database than from a comparatively smaller set of full-text articles, yet significantly more novel relationships came from the full-text corpus than from the abstracts corpus. This further illustrates the value of full-text literature to R&D organizations looking for an edge.

Mining full-text scientific articles yields a higher volume and wider diversity of information than mining abstracts alone, and it captures secondary findings. Logically, full-text articles include more named entities and more connections between those entities.

In the case of these 15 million scientific articles, the biggest performance gain from mining full-text articles was in the associations found between diseases and genes.

Mineable Formats

In an interview with Science, study co-author Lars Juhl Jensen said that converting full-text PDF articles into XML is one of the reasons full-text mining isn’t typically done at scale.

“We probably spent more computational resources teasing the text out of PDFs and beating it into shape than we spent on the actual text mining,” Jensen said.

Processing steps to enable mining of the full-text PDFs included removing the acknowledgements, references, or bibliography, and splitting the text into sentences and paragraphs. The study suggests that having all articles available in a structured XML format would have “no doubt produced a higher quality corpus.”
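The processing steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's pipeline: it assumes the text has already been extracted from the PDF, that back matter begins at a heading on its own line, and that everything after that heading can be dropped. The heading pattern, function names, and sample text are hypothetical, and the sentence splitter is deliberately naive.

```python
import re

# Assumed heading names that mark the start of back matter to drop.
BACK_MATTER = re.compile(
    r"^[ \t]*(acknowledge?ments?|references|bibliography)[ \t]*$",
    flags=re.IGNORECASE | re.MULTILINE,
)

def strip_back_matter(text):
    """Cut the text at the first back-matter heading, if one is found."""
    match = BACK_MATTER.search(text)
    return text[: match.start()] if match else text

def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Naive splitter: break after ., ! or ? when a capital letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)

# Invented sample standing in for text extracted from a PDF.
raw = (
    "Gene X is linked to disease Y. The effect was strong.\n\n"
    "A second cohort confirmed it.\n\n"
    "References\n\n"
    "1. Smith J. Some cited paper. 2010."
)

body = strip_back_matter(raw)
corpus = [split_sentences(p) for p in split_paragraphs(body)]
print(corpus)
```

Real PDF output is far messier than this sample (hyphenation, running headers, multi-column layouts), which is consistent with Jensen's remark about how much effort went into preparing the text.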

The study results confirm what we’ve heard anecdotally from researchers for a number of years: Insights extracted from full-text scientific literature are greater in quantity and higher in quality than those found in abstracts. Although challenges remain in processing the unstructured data in these articles, organizations that can unlock this value will have a distinct advantage over peers whose computational strategy continues to rely on more structured citation and abstract information, and on more manual processes.

Mike Iarrobino is CCC's product manager for the content and rights workflow solutions RightFind XML for Mining and RightFind Music. He has previously managed marketing technology and content discovery products at FreshAddress and HCPro. He can be reached at miarrobino@copyright.com.