Rafael Irizarry Announced As 2017 Benjamin Franklin Award Winner

June 2, 2017

By Benjamin Ross

Rafael Irizarry, of Harvard University and the Dana-Farber Cancer Institute, was recently chosen by Bioinformatics.org as the 2017 laureate of the Benjamin Franklin Award for Open Access in the Life Sciences.

Irizarry’s experience in gene expression and applied statistics, and most recently his contributions to both the Gene Expression Barcode 3.0 and the preprocessing algorithm frozen RMA (fRMA), was among the work that the judges at Bioinformatics.org considered worthy of the prestigious award.

“What I’ve been doing for the last 17 years has been developing statistical methodology, statistical ideas,” Irizarry said during his lecture following his acceptance of the award at the 2017 Bio-IT World Conference & Expo. Irizarry and his team are “developing software that is open source, partly because we feel like it’s the right thing to do, and also because it helps us to have other people improve it. They expunge, they ask for changes. And also the fact that it’s free and we don’t ask for any kind of incentive makes it more appealing to people.”

The Benjamin Franklin Award for Open Access in the Life Sciences is a humanitarian bioethics award presented annually by Bioinformatics.org to an individual who has, in his or her practice, promoted free and open access to the materials and methods used in the life sciences.

“As a biostatistician, he has made fundamental advances in the science of analyzing large, noisy, and biased genomics datasets,” Irizarry’s nominators wrote. “His contributions are particularly crucial in an era where archives are filling with tens of thousands of large open datasets; to re-use and combine these in any effective way requires a careful approach that considers technical confounders, batch effects, and other issues.”

During his lecture, Irizarry spoke about his collaborative efforts in open source software and educational resources. For example, his paper in the journal Biostatistics, titled “Exploration, normalization, and summaries of high density oligonucleotide array probe level data” (DOI: 10.1093/biostatistics/4.2.249), discusses an open source project reporting the “exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip system with the objective of improving upon currently used measures of gene expression,” according to the abstract of the paper.

Irizarry was quick to point out that releasing his statistical ideas as open source ultimately leads to those ideas becoming part of workflows, and he believes researchers ought to think critically and technically about the right way to handle those workflows.

According to Irizarry, the typical workflow is highly structured: raw data passes through a preprocessing stage, then moves into analysis, which finally leads to knowledge. Irizarry argues that raw data from biomedical research, like the earliest microscope images, is out of focus, and that it is often hard to determine exactly what a given data set shows. Irizarry referred to this excess of data as “noise.”

A lot of what Irizarry does comes from looking at microarrays that don’t quite add up, which then requires him and his coworkers to go back to the full raw data to see what went wrong. Sometimes they are able to find the flaw, and sometimes they aren’t.

According to Irizarry, once you know what the model is, you can develop solutions that do not have barriers and do not have a sudden and drastic bend in the arrangement of raw data. Irizarry and his team developed a statistical idea that presents an alternative way of handling background correction. The statistical idea, which leans on a stochastic approach, is known as frozen robust multiarray analysis (fRMA).

Unlike RMA, which cannot be used in clinical settings where samples need to be processed individually and are not comparable with one another, Irizarry and company’s fRMA allows researchers to analyze microarrays in small batches or individually and also combine the data for analysis.
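
To make the batch-versus-individual distinction concrete, here is a minimal Python sketch. It is not the fRMA implementation and does not come from Irizarry's lecture. It assumes only that one ingredient of a "frozen" method is a reference intensity distribution estimated once from a training set and then reused. The function names `freeze_reference` and `normalize_to_reference` are hypothetical, and the intensities are simulated rather than real microarray data.

```python
import numpy as np


def freeze_reference(training):
    """Estimate a reference distribution once from training arrays.

    `training` has one row per probe and one column per array. Each column
    is sorted and the sorted values are averaged across arrays, giving one
    reference value per rank.
    """
    return np.sort(training, axis=0).mean(axis=1)


def normalize_to_reference(sample, reference):
    """Map a single array onto the frozen reference by rank.

    The result depends only on `sample` and `reference`, never on the
    other arrays that happen to be processed alongside it.
    """
    ranks = np.argsort(np.argsort(sample))  # rank of each probe, 0-based
    return reference[ranks]


# Simulated intensities (illustrative only, not real microarray data).
rng = np.random.default_rng(0)
training = rng.lognormal(mean=7.0, sigma=1.0, size=(1000, 20))
reference = freeze_reference(training)

# A new array processed entirely on its own.
new_array = rng.lognormal(mean=7.5, sigma=1.2, size=1000)
alone = normalize_to_reference(new_array, reference)

# The same array processed alongside a second, unrelated array.
other_array = rng.lognormal(mean=8.0, sigma=0.8, size=1000)
batch = np.column_stack([new_array, other_array])
in_batch = np.column_stack(
    [normalize_to_reference(batch[:, j], reference) for j in range(batch.shape[1])]
)

# True: the batch composition does not change the first array's values.
same = np.array_equal(alone, in_batch[:, 0])
print(same)
```

Because the reference is fixed ahead of time, an array gets the same normalized values whether it is processed alone or with others, which is the property the article attributes to fRMA. The published method involves more than this single step.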

Typically, according to Irizarry, if researchers wanted to pick out a particular protein or DNA fragment from the raw data of a microarray, they would have to start from the top of a chart mapping out the data and attempt to count the proteins they were looking for.

“You’d make about 500 mistakes before you got to the first real [desired protein],” Irizarry said. “With the new approach you find it right away. So there’s a relatively big difference between the standard workflow and our new approach.”

After Irizarry and his collaborators wrote code for their fRMA system, they realized that there was a lot of commonality between their project and the projects of other statisticians. The group decided to release their code as open source, allowing researchers to try out Irizarry’s new method and use it to find answers to their own statistical problems.

Teaching Old Dogs New Tricks

Irizarry’s paper detailing his fRMA system was published in the journal Biostatistics in 2010 (DOI: 10.1093/biostatistics/kxp059). A lot has changed since then, and technology has improved dramatically. With each new phase of technology come students who want to learn how to access and understand the concepts behind it.

“Another way we discover an idea is through teaching,” Irizarry said. “I’ve been teaching how to analyze data for a long time, and one of the things that I noticed very early on was that my course in statistics was taken mostly by students in biology.” Irizarry went on to say that he noticed a lot of postdocs in biology were becoming data analysts not out of choice but by necessity.

“One of the things [my colleagues and I] realized was that if this was the case [at Harvard], what about in other universities where the statisticians haven’t caught on to the new data type?” said Irizarry.

This realization prompted Irizarry to look into the online course movement. He began launching online statistics courses in 2014, starting with one course on genomics and expanding to seven courses in data analysis and the life sciences.

Irizarry concluded by saying that he was amazed by the number of students already working in statistics and data analytics who want to learn more about the evolving systems. “A lot of the people that are actually taking the course are teachers or course instructors,” he said. “They want to learn how to teach this stuff. So it has a very nice facility to express.”