Reporters’ Notebook: Clustal, FAIR Data, Rare Diseases, Data Science Community At #BioIT18
June 8, 2018 | When over 3,500 members of the biotech and life science industries spend three days together, there’s more good stuff than a writer can possibly fit into post-event coverage. Our Reporters’ Notebook comprises some of the bits and pieces that we collected over the three days in Boston. –Allison Proffitt, Editor of Bio-IT World; Benjamin Ross, Editorial Assistant; Joe Stanganelli, Writer
Clustal Honors
During his keynote presentation, 2018 Benjamin Franklin Award-winner Desmond Higgins, professor of biochemistry at the University College Dublin Conway Institute of Biomolecular & Biomedical Research, reflected on his experience with open science and its importance in molecular biology. Going back to the 1980s, Higgins said, if you wanted a particular genetic sequence for research, you had to look through a book listing the various sequences available, then write a letter to EMBL and GenBank to request them, writing out by hand the particular code you were interested in. “It was a broken system,” Higgins said. “One problem was how the sequences got in the book to begin with.” Someone from EMBL or GenBank would go through the library at the institute and copy all of the incoming articles that featured sequencing data. Those articles would then be processed into nucleotide sequence entries. This made the lists of sequences less reliable than they should have been, and the process of requesting those sequences limited open access to the data itself.
“The reason I’m here,” Higgins said, “is because I wrote [a software program] that enables open access.” That computer program, known as Clustal, was created by Higgins in 1988 and derives phylogenetic trees, or “guide trees,” from nucleotides for multiple sequence alignment. In 2011, Clustal Omega was released, which Higgins said allows researchers to align 100,000 sequences in 6 hours.
Budgeting for FAIR Data
Barend Mons from the Dutch Techcentre for Life Sciences spoke to attendees about the European Open Science Cloud (EOSC), the cloud for research data in Europe, and its application to FAIR (findable, accessible, interoperable, reusable) data. Mons, who is the Chair of the group that regulates the EOSC, said he considers the EOSC the European contribution to FAIR data. In his time working to improve the FAIR-ness of data, Mons said he has come to the conclusion that the lack of FAIR data is not so much a technological issue as it is a social issue. We often refuse to share our data with other disciplines because we’re afraid of how they will handle it, Mons said. This leads to a lack of investment in research infrastructure, or to data being created without a data stewardship plan. Without such stewardship, Mons said, data gets lost in supplementary papers, where it simply flows and pollutes and fails to become useful. Spending 5% of a research team’s budget on data stewardship will greatly improve the odds of data becoming FAIR, he said. These issues are at the core of the EOSC’s Global-Open FAIR (GO-FAIR) proposal, which calls for the inclusive, open, and practical implementation of FAIR principles on a global scale.
Correlation and Causation In Genetic Communication
“Heredity is so central to our own identity, but science shows it is so marvelously complex and still kind of magical,” said Carl Zimmer during his plenary session on Thursday morning. Zimmer, a New York Times bestselling science writer, stirred up quite a response with Golden Tickets hidden throughout the conference for early access to his new book, She Has Her Mother’s Laugh. Zimmer explored his own genome and along the way looked at our understanding of heredity. Height is a great example of how hard it is to get from knowing something is heritable to understanding how, he said. Intelligence raises the bar even further. Zimmer was challenged by the Bio-IT World audience to distinguish between “correlation” and “causation”. It’s a good reminder to speak carefully, he agreed.
Rare Diseases and Natural Blondes
“There are more people with rare diseases in the world than there are natural blondes,” Catherine Brownstein, Scientific Director for Gene Discovery at the Manton Center for Orphan Disease Research at Boston Children’s Hospital and Harvard Medical School, told attendees. “This is an important benchmark for calculating how common rare diseases actually are.” Diseases that affect fewer than 200,000 people nationwide, otherwise known as orphan diseases, are in desperate need of a global platform, Brownstein argued. Recent efforts such as the BabySeq Project, a study by Boston Children’s and Harvard to provide genomic sequencing for childhood risk and newborn illnesses, give Brownstein and her team of researchers the opportunity to find innovative approaches to genomic medicine. Other efforts to expand the scope of genomic medicine and diagnostics have come in the form of contests such as Foldit, an online puzzle video game designed to fold the structures of selected proteins as perfectly as possible, but Brownstein said the reality is that most times we can’t hold a contest. The risk in these broad, overarching projects is that they generalize rare diseases to the point where the diseases are unrecognizable. “We have to expand our scope globally,” Brownstein said. “We have to invest in infrastructure, and we have to keep the suspect list [for disease diagnosis] in mind while still branching out for new partnerships.”
Genomic Map-Territory Relation
A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness. — Alfred Korzybski, Science and Sanity (1933)
"Have you ever seen a map of the world from 1000 A.D.?" asked conference presenter Scott Jeschonek after referencing this Korzybski quote. "There's [just] Asia and Europe together, because that's how they understood the world to be… [Now you can] zoom in and see street names and current traffic conditions and all this current data." Jeschonek, Principal Program Manager at Microsoft Azure (formerly Director of Cloud Services at Avere Systems before Microsoft acquired Avere earlier this year), argued that, similarly, the history of medical understanding demonstrates how knowledge is an incomplete—and inaccurate—representation of reality.
Pointing to more recent follies, Jeschonek related that in 1900, most medicines in the US contained alcohol and turpentine, while the since-discredited pseudoscience of phrenology formed the basis for understanding how the brain worked. "It was entirely wrong," said Jeschonek. "And now here we are, 2018, datasets and the petabyte, -omes everywhere: genomes and exomes [and] 4D nucleome[s]." And so Jeschonek arrived at his point: change happens, regardless of what we "know" to be "true" today. "Even the stuff you're doing today is temporal," said Jeschonek. "This dataset that you have today… Just think what it's going to be in 10 years."
[Slack] Channeling the Data Science Community
The challenge of finding and training data-science talent was one of the most prominent topics at this year's conference, and budding bioinformaticians were similarly keen to learn how to advance in their data-science careers. (See Bio-IT World's coverage here.) Tanya Cashorali, Founder and CEO of Boston-based company TCB Analytics, came prepared to multiple sessions to talk up a mutually beneficial avenue for the data-science community.
It started on the day-two data-science panel, where Cashorali served as a panelist. During a debate on whether it was better to "buy" talent—by looking for ready-made candidate profiles with a spate of formal credentials—or "build" talent—by hiring candidates who demonstrate what Reynders would refer to as "learning agility"—Cashorali brought up a Slack channel of "200+ data scientists and tech nerds" that she runs.
"I asked them: 'What is the most important quality you're looking for when you're hiring a data scientist?'" said Cashorali. "The top two answers were 'problem-solving [ability]' and 'curiosity'. I think people are looking for self-starting, curious people who can get their hands on data."
Later, when Cashorali served as a panelist during the BioTeam town-hall session that closed the conference (see Bio-IT World's coverage here), a Bostonian audience member expressed an interest in learning more from resident experts while lamenting the lack of training dollars. Cashorali took the opportunity to pitch her Slack channel more fully as a learning solution.
"[The] Slack community [is] for people in tech and data—or people that want to get into tech or learn data science—because, as I think some of you know, the engineering community is not always the most welcoming and diverse," said Cashorali. "You won't get flamed for asking a question like you can on Stack Overflow."
The Slack channel, Cashorali explained, is invite-only for those who have been "vetted," but she invited those interested to connect with her (as she has previously done on Twitter).