From Black Hole To Data Big Bang: Open Patent Chemistry Pips 20 Million Structures

Contributed Commentary by Christopher Southan

April 6, 2017

For all the recent press coverage concerning Theranos, including references to how closely they had been holding their cards to their chest, there was hardly a mention, let alone a review, of their numerous published patents (e.g. WO2016161083; “Methods, devices, systems, and kits for automated blood collection by fingerstick”). This is in part because news pundits tend not to grapple with such technicalities, but it also reflects the traditional status of patents as “Cinderella” information sources, overlooked because they are difficult.

While companies have always been aware of the crucial importance of patent disclosures in the context of intellectual property (IP), appreciation of their value as a data source is slowly increasing in the academic biomedical sector, particularly for medicinal chemistry as applied to drug discovery. In fact, a report published last week in Science showed that 10% of NIH grants generate a patent directly and 30% generate articles that are subsequently cited by patents. As evidence of increasing availability in the public domain, patent-extracted chemistry can now be found in multiple sources, including most recently 7.5 million structures from the World Intellectual Property Organization (WIPO). PubChem now contains 20.9 million structures from patents (not necessarily “patented”), as extracted by automated chemical named entity recognition (CNER), representing 22% of PubChem’s 94 million total. Commercial resources such as SciFinder have manually extracted several-fold more compounds from the patent corpus.

Nonetheless, PubChem content extensively covers example structures, especially from the post-2002 USPTO XML documents, for which CNER works better than for older applications. This means the relatively recent patent “Big Bang” (starting with IBM’s first 2.5 million submissions to PubChem in 2012) has had important consequences. The first is that the open chemistry in PubChem and primary databases can now be linked to patent full text, freely available from the major patent office portals and other sources.

Another consequence is what could be termed the democratization of patent chemistry searching. This is exemplified both by SureChEMBL from the EBI (17.9 million structures in situ) and by its subsumption, along with other large CNER sources (e.g. IBM’s 10.7 million to date), into PubChem. Note also that SureChEMBL publishes extracted chemistry and searchable metadata (e.g. applicants, inventors and patent classifications), indexed against the full text, within a week of publication. This presents a paradox: those who could hitherto not afford commercial sources can now mine both old and new patent chemistry. However, those with access to licensed databases must also search the open sector in parallel because of the inevitable differences in extraction coverage.
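The case for searching in parallel is essentially set arithmetic: each source extracts a somewhat different subset of structures from the same documents. The following is a minimal Python sketch of that comparison, assuming the hits from each source for one patent have already been exported as sets of structure identifiers (InChIKeys, for instance). The function name `compare_coverage` and the identifiers are hypothetical placeholders, not real records from any database.

```python
def compare_coverage(open_hits, licensed_hits):
    """Summarize the overlap between two sets of structure identifiers."""
    open_hits, licensed_hits = set(open_hits), set(licensed_hits)
    return {
        "shared": open_hits & licensed_hits,         # found by both sources
        "open_only": open_hits - licensed_hits,      # missed by the licensed source
        "licensed_only": licensed_hits - open_hits,  # missed by the open source
        "union": open_hits | licensed_hits,          # what a parallel search yields
    }


# Placeholder identifiers standing in for InChIKeys (not real compounds).
open_hits = {"KEY-A", "KEY-B", "KEY-C"}
licensed_hits = {"KEY-B", "KEY-C", "KEY-D", "KEY-E"}

summary = compare_coverage(open_hits, licensed_hits)
for name, keys in summary.items():
    print(f"{name}: {len(keys)} {sorted(keys)}")
```

Any non-empty `open_only` set is the practical reason a licensed-database user would still consult the open sector.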

The same democratization of data access via public initiatives also holds for ChEMBL, which since 2009 has extracted 960,000 assayed compounds from 65,000 papers and is a submitter to PubChem. For patents there are caveats associated with exactly how bioactivity is defined and how much of it can be mapped to targets. However, data from a curated commercial source, Excelra, currently includes 3.3 million bioactivity-mapped structures from patents (most of them assigned to protein targets). This indicates not only that more bioactive compounds are now openly available from the patent corpus than from papers, but also that these can surface years earlier (and some may remain patent-unique).

Despite these advantages of exploiting patent office portals and open chemistry extraction sources, isolating specific entity relationships that are valuable in a research setting is still not for the faint-hearted. This is mainly because the data are swamped by hundreds of pages of turgid “patentese” of little interest to experimentalists. For example, those working on JAK1 kinase, in the commercial or academic sectors, could be interested in the granted Nissan Chemical Industries publication US9216999. This includes an SAR data set connecting no fewer than 445 examples and over 1800 IC50 values across four kinases. However, wading through the 558-page original PDF to manually map between the data tables, targets and structures would be tough going. Fortunately, open sources have done the work for us in this case, with SureChEMBL having extracted 3122 structures (2053 of which are in PubChem) and BindingDB having curated the mappings between the targets and activity values for the 445 examples (also submitted to PubChem linked to US9216999).
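To put the US9216999 figures in proportion, the short Python sketch below works through the arithmetic using only the numbers quoted above. The variable names are illustrative, and because the text gives the IC50 count as "over 1800", that ratio is a lower bound rather than an exact value.

```python
# Figures quoted in the text for US9216999.
surechembl_structures = 3122  # structures extracted by SureChEMBL
in_pubchem = 2053             # of those, the number present in PubChem
examples = 445                # examples curated by BindingDB
ic50_values_min = 1800        # "over 1800", so treated as a lower bound

pubchem_fraction = in_pubchem / surechembl_structures
not_in_pubchem = surechembl_structures - in_pubchem
examples_fraction = examples / surechembl_structures
ic50_per_example_min = ic50_values_min / examples

print(f"Extracted structures in PubChem: {pubchem_fraction:.1%}")
print(f"Extracted structures not in PubChem: {not_in_pubchem}")
print(f"Curated examples as a share of extracted structures: {examples_fraction:.1%}")
print(f"IC50 values per example (at least): {ic50_per_example_min:.2f}")
```

Roughly two-thirds of the extracted structures are in PubChem, and the figure of about four IC50 values per example is consistent with each example having been tested against the four kinases.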

Editor’s note: If you’re interested in an overview of this area, along with some tips and tricks for exploitation, consider Workshop 6 at the 2017 Bio-IT World Conference and Expo: “Digging bioactive chemistry out of patents using open resources,” Tuesday, May 23, 2017, 8:00 - 11:30 am. This will introduce open patent chemistry, cover selected tools, address target identification and SAR extraction from patents and papers in parallel, and include hands-on exercises. The presenting faculty will be Chris Southan from the IUPHAR/BPS Guide to Pharmacology, Daniel Lowe from NextMove Software, and Paul Thiessen from PubChem.