Scientific Data Doctoring: Could Blockchain Technology Help Stamp It Out?
Contributed Commentary by Richard Shute
February 7, 2018 | The drive for scientists in all disciplines to “publish or perish” seems to be as strong as ever. Whilst the vast majority of scientists have a genuine drive to advance knowledge and do well in their endeavours, this fear of “perishing” continues to tempt some “bad actor” scientists into falsifying their experimental data in hopes that it will give them a better chance of getting their paper(s) into the highest impact factor journals and so further their career. A few high-profile examples of data doctoring (perhaps most infamously the case of the non-existent link between the MMR vaccine and autism) have led to constant questioning of scientific research data, results, and conclusions, often at the highest levels of authority. This, combined with a seemingly growing rate of paper retractions, has led to diminishing trust in science and scientists.
Falsification of scientific data is one of three primary forms of scientific research misconduct defined by the US National Science Foundation (NSF); the other two are plagiarism and fabrication. There have been a number of papers, editorials, and comment articles highlighting this problem. Last June, a Nature editorial on image altering identified three main “stakeholders” who have a responsibility to ensure data falsification does not happen: the originating scientist, the senior researcher or principal investigator, and the journal publisher. Many journals and publishing groups have adopted clear guidelines on data integrity, with the Journal of Cell Biology leading the way over 15 years ago. However, it is mechanisms to reduce falsification by originating scientists that would seem to provide the best opportunity to eliminate this scientific scourge. So, is there a technology that could reduce data falsification in a publicly visible, traceable, perpetual and immutable fashion, and thus help to rebuild trust in science? I believe there is: blockchain technology.
Why Blockchain?
Deloitte succinctly defines blockchain as: “a digital, distributed transaction ledger, with identical copies maintained on multiple computer systems controlled by different entities. Anyone participating in a blockchain can review the entries in it; users can update the blockchain only by consensus of a majority of participants. Once entered into a blockchain, information can never be erased; ideally, a blockchain contains an accurate and verifiable record of every transaction ever made.” (For more background, this CB Insights research brief simply describes blockchain technology along with some of the better-known associated cryptocurrencies and protocols: Bitcoin and Ethereum.)
Critically, there are four absolutely fundamental aspects of “the blockchain”:
- Identity – proof of who created what data entry or transaction;
- Timestamping – proof of when an entry was created;
- Content – proof of what the entry contained on the date of creation; and
- Immutability – proof that the Content of the entry has not been altered since it was created.
Also key is a technique that underpins blockchain tech, and which sits at the heart of the proposal I shall make: cryptographic hashing. A hash is a kind of “signature” for a stream of data that represents its contents; even the slightest change to the original file will, for all practical purposes, produce a completely different hash. Hashing is a “one-way” process: it is computationally infeasible to get back to the original file from the hash. There are many freely available online sites that will generate the cryptographic hash of your choice from any file, be that an SHA256 hash, an MD5 hash, or another (although MD5 is no longer considered collision-resistant, so SHA256 is the safer choice for this purpose). Once a file has been hashed and the hash stored on the blockchain, there is then immutable, timestamped, publicly-accessible proof of the content of the file from which the hash originated. This is such a fundamental “use-case” of blockchain technology that services like ProofOfExistence.com, Tierion, Storj, MetroGnomo and others have been in existence for some years, offering file hashing, hash storage, and timestamping using the blockchain.
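To make the hashing step concrete, here is a minimal sketch using Python's standard `hashlib` module. The function names and the sample plate-reader values are illustrative assumptions, not part of any service mentioned above. It shows that changing a single digit in a data file yields an entirely different SHA256 hash.

```python
import hashlib


def sha256_of_bytes(data):
    """Return the hex SHA-256 digest of an in-memory byte string."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks so that
    large instrument files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical plate-reader output; the second version has one digit altered.
original = b"well A1: 0.482\nwell A2: 0.517\n"
doctored = b"well A1: 0.482\nwell A2: 0.617\n"

print(sha256_of_bytes(original))
print(sha256_of_bytes(doctored))  # bears no resemblance to the first digest
```

Anyone holding the same file can recompute the digest independently, which is what makes the comparison reproducible without having to trust the party who first published the hash.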
With these four main properties—Identity, Timestamping, Content and Immutability—along with the public, distributed, always-visible nature of the blockchain, and simple, reproducible-by-anyone, cryptographic file hashing, I believe it might be possible to significantly reduce data falsification at the originating scientist level, by making it much easier to identify files that have been doctored.
How? A Proposal
A typical scientific experiment consists (broadly) of the following process or workflow:
Falsification, if it happens, tends to occur in the final three stages. I shall focus on “Run” because this is where I believe blockchain tech could play a significant role, and be most effectively implemented.
In “Run” a scientist performs an experiment based on the design they developed in the previous stage. They will set up the experiment on a piece of equipment, capture observations, and perform intermediate and final analyses. ELNs (electronic laboratory notebooks) from companies such as Dassault, SciNote, Benchling and many others capture the method, components, observations, results, and conclusions from an experiment. Instrument software controls the equipment and turns the instrument-only-readable “raw” output data into human-readable “refined” data. Typical data-outputting instruments in Run include HPLCs, mass spectrometers, cellular imagers, plate readers, etc. Falsification of such instruments’ raw data and the resultant instrument-created refined files is not easy, and it is at this point where I believe blockchain technology could play an important role in the lab of the future.
Let us assume we have a cellular imaging instrument with software linked to a public blockchain. What if, every time the instrument generated an image file from the measurements it had taken, that file was immediately cryptographically hashed and the hash stored both in the ELN record for that experiment and on a public blockchain, along with some indication of which instrument the file came from? This could be achieved through a simple add-on service provided by the instrument manufacturer via already existing APIs into blockchain services like Tierion. The cryptographic hash proves the Content; deposition on the blockchain proves the Timestamp; information about the instrument proves Identity; and rehashing the file at any later date, then comparing the new hash to the old, proves Immutability. Later, when the originating scientist comes to publish their work, they are encouraged or mandated to deposit their data files on a site such as Figshare or Zenodo. If the original cryptographic hashes for these files had been deposited on a publicly-accessible blockchain as soon as the files came off the instrument, then re-hashing the public file and comparing the result to the hash on the blockchain would prove (or not) that the same file had been generated at some earlier time by a defined instrument. Identical hash equals identical file. Falsifying such an audit trail would be difficult, but could be made even more challenging if ELNs were also blockchain-enabled. ELN entries could be cryptographically hashed on closure, and the hashes then stored on a public blockchain. In addition to guarding against scientific misconduct, the record could help with intellectual property defence and patenting.
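The deposit step can be sketched as follows. This is only an illustration under stated assumptions: `ToyLedger` is an in-memory, append-only hash chain standing in for a real public blockchain, and `deposit_instrument_file`, the ELN record layout and the instrument ID are all hypothetical names, not any vendor's or blockchain service's actual API.

```python
import hashlib
import json
import time


class ToyLedger:
    """In-memory, append-only hash chain (a stand-in for a public blockchain)."""

    FIELDS = ("payload", "timestamp", "previous_hash")

    def __init__(self):
        self.blocks = []

    @staticmethod
    def _hash_body(body):
        # Hash a canonical JSON encoding of the block body.
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, payload):
        """Add a timestamped block that is chained to the previous block."""
        previous = self.blocks[-1]["block_hash"] if self.blocks else "0" * 64
        body = {"payload": payload, "timestamp": time.time(), "previous_hash": previous}
        block = dict(body, block_hash=self._hash_body(body))
        self.blocks.append(block)
        return block

    def is_intact(self):
        """Recompute every block hash; any edit to an earlier block breaks the chain."""
        previous = "0" * 64
        for block in self.blocks:
            body = {key: block[key] for key in self.FIELDS}
            if block["previous_hash"] != previous or block["block_hash"] != self._hash_body(body):
                return False
            previous = block["block_hash"]
        return True


def deposit_instrument_file(ledger, eln_record, instrument_id, file_bytes):
    """Hash a freshly generated file and record the hash in both places:
    on the ledger (with the instrument ID) and in the ELN record."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    block = ledger.append({"instrument_id": instrument_id, "file_hash": file_hash})
    eln_record.setdefault("file_hashes", []).append(file_hash)
    return block


ledger = ToyLedger()
eln_record = {"experiment": "EXP-001"}  # hypothetical ELN entry
block = deposit_instrument_file(ledger, eln_record, "imager-42", b"fake image bytes")
print(block["payload"], ledger.is_intact())
```

Only the hash and the instrument identifier leave the lab in this sketch; the data file itself stays private until the scientist chooses to publish it.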
In the future, a publisher receiving a journal submission, an independent scientist reading an article and wanting to convince themselves of the veracity of the data, or maybe even a federal regulator wanting to confirm that a data file in a submission had been generated by the right institution at the reported time, could check the files they had against the blockchain. Taking the files they have been sent or that they have downloaded from, say, Figshare, anyone could hash them independently and then look for those hashes on the appropriate blockchain. If the hashes are not there, that tells them something straight away. If they are found, that would immediately reveal each file’s timestamp, which itself would be informative. Furthermore, if all instruments had unique blockchain identities, they could confirm that the file was linked to a specific instrument, or instrument type. It might also be possible to link instruments to institutions (or groups). Connecting a file to an instrument and, ideally, to an institution would give further evidence that the file had been generated by the group publishing the data. I recognise that linking instruments to institutions in a publicly visible way might be a step too far (institutions will likely baulk at this level of revelation of what assets they have), but it is theoretically possible. Even without the connection to the institution, it would be a powerful weapon in the fight to counteract data falsification if one could prove that an undoctored file had come off a real instrument.
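From the verifier's side, the check might look like the sketch below. The ledger entries, timestamps and instrument ID are invented for illustration, and the plain list stands in for whatever lookup interface a real public blockchain would offer; `verify_file` is a hypothetical helper.

```python
import hashlib

# Hypothetical entries as a verifier might retrieve them from a public ledger.
LEDGER_ENTRIES = [
    {
        "file_hash": hashlib.sha256(b"raw image 1").hexdigest(),
        "timestamp": "2018-01-15T10:32:00Z",
        "instrument_id": "imager-42",
    },
    {
        "file_hash": hashlib.sha256(b"raw image 2").hexdigest(),
        "timestamp": "2018-01-15T10:47:00Z",
        "instrument_id": "imager-42",
    },
]


def verify_file(file_bytes, entries):
    """Hash the file independently and look for that hash in the ledger.

    Returns the matching entry (timestamp and instrument) or None if the
    file was never registered, or has been altered since registration."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    for entry in entries:
        if entry["file_hash"] == file_hash:
            return entry
    return None


# A file downloaded from a repository, unchanged since it left the instrument.
print(verify_file(b"raw image 1", LEDGER_ENTRIES))
# The same file after an edit: no matching hash is found.
print(verify_file(b"raw image 1 (edited)", LEDGER_ENTRIES))
```

A missing match does not say what was changed, only that the file in hand is not byte-for-byte the one that was registered, which is the prompt for further questions.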
In summary, I am suggesting that a “supply chain of information” starting as soon as a data file comes off a blockchain-identified instrument, facilitated by cryptographic hashing, plus easy access to a public blockchain storing the hashes, and supported by publication file storage services, could make scientific data falsification significantly more difficult and more easily identified if it were to occur.
There is one final caveat though. It is a truism that no matter how many barriers you put in front of a determined “bad actor”, or how high those barriers are, if they are dedicated and smart enough, they can get around or over them. But if the “cost” is too high, they will be inhibited. What I have proposed above is not a foolproof way to stop scientific data falsification, but it would make it much harder to succeed. With such a system in place, maybe then, the bad actor scientist will reform and just do real, genuine, reproducible experiments which justify publication; then science and society as a whole will benefit.
Richard Shute is an experienced medicinal chemist and informatics IS/IT manager. He worked for over 25 years in Big Pharma at ICI, Zeneca and AstraZeneca; half that time in chemistry, the other half in informatics. Richard has worked for Curlew Research as a consultant since 2015 and has been an advocate of blockchain technology since then. He can be reached at richard.shute@curlewresearch.com.