A Bird's Eye View Of Data Consultancies
Editor's Note: Phrasing has been corrected in the last paragraph.
By Benjamin Ross
May 16, 2019 | BOSTON—During a panel discussion at the Bio-IT World Conference & Expo, Tanya Cashorali, CEO and Founder of TCB Analytics, Aaron Gardner, Director of Technology at BioTeam, and Eleanor Howe, Founder and CEO of Diamond Age Data Science—three data science consultants—came together to discuss the intricacies of the industry they so often work with.
Chris Dwan, moderator of the panel and a consultant himself, began by asking a question "near and dear" to him: What tool do you recommend for tagging metadata, and how do you get people to actually use that tool?
Shame is really the best motivator, Howe says. "I've had the best luck getting metadata tagged properly by just taking what's there and putting it up on a slide and showing it to people who own that data. They take one look and say, 'Oh my gosh, I can't believe my data looks like that.'"
Gardner says it's too early to say whether one specific tool can assist with tagging. "The field's still pretty open," he said. "It really depends on the application... I don't know about guilt and shame, but I do agree there has to be organizational value derived from the tagging exercise. Otherwise, the organization will not have the wherewithal to see [a project] through."
There's also a myth that data are always readily available to analyze, says Cashorali. "Typically, a lot of our clients have no idea what's in their data warehouse and what's possible. A lot of times we get asked to come in and assess lab and clinical data, and the company will ask, 'What can we do to add more value to this?'"
In those situations, Cashorali spends roughly 80% of her time simply cleaning up the data before analysis can begin. "I can think of a client right now where their facility names are not clean, so they'll have multiple names with different spellings that mean the same facility. That happens all the time. Data's almost always dirty and needs some amount of prep before we can do the fun stuff."
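The facility-name problem Cashorali describes is a common record-linkage chore. As a rough illustration only, and not a description of TCB Analytics' actual method, the sketch below groups spellings that look alike using Python's standard-library `difflib`. The function names, the sample facility names, and the 0.85 similarity threshold are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def group_facilities(names, threshold=0.85):
    """Greedily map each raw name to the first similar spelling seen.

    The threshold is an illustrative assumption; real data would need
    tuning and a human review of the proposed groups.
    """
    representatives = []  # (normalized, raw) pairs, one per group
    mapping = {}
    for raw in names:
        norm = normalize(raw)
        for rep_norm, rep_raw in representatives:
            if SequenceMatcher(None, norm, rep_norm).ratio() >= threshold:
                mapping[raw] = rep_raw
                break
        else:
            # No existing group is close enough, so start a new one.
            representatives.append((norm, raw))
            mapping[raw] = raw
    return mapping


# Hypothetical facility names, invented for this example.
names = [
    "Boston General Hospital",
    "boston general hospital.",
    "Boston Genral Hospital",
    "Cambridge Clinic",
]
for raw, canonical in group_facilities(names).items():
    print(f"{raw!r} -> {canonical!r}")
```

A greedy pass like this depends on input order and can merge names that are genuinely different, so in practice its output is better treated as a list of candidates for a person to confirm than as a finished cleanup.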
Gardner agreed that most of his time is spent cleaning data, adding that the crowd-sourcing approach to data management is an antiquated one. "[That method]'s broken. It's not an achievable way to manage data. We actually need better methods in analytics to actually achieve what we want on a platform level before we can even begin asking scientific questions."
Gardner advocates attaching tags at the moment data are created, so that baseline knowledge about the data is already in place by the time they reach the analysis workflow. "It really helps not having to rely on humans later to remember where things came from and have to establish provenance."
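One simple way to picture tagging at creation time is a writer that never saves a file without also saving a small provenance record beside it. The sketch below is an assumption-laden illustration, not a tool the panel named: the `write_with_tags` helper, the `.meta.json` sidecar convention, and the tag keys are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_with_tags(path, payload: bytes, tags: dict) -> Path:
    """Write payload to path plus a sidecar JSON file recording provenance.

    The sidecar stores the file name, a SHA-256 checksum, a UTC creation
    timestamp, and whatever tags the caller supplies.
    """
    path = Path(path)
    path.write_bytes(payload)
    record = {
        "file": path.name,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "tags": tags,
    }
    # Hypothetical convention: "<data file name>.meta.json" next to the data.
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

Because the checksum and timestamp are captured by the same call that writes the data, nobody has to reconstruct later where a file came from, which is the reliance on human memory Gardner wants to avoid.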
Silo Trouble
So when are data clean enough to start analyzing?
"Depends on what you're doing," Cashorali said. "If you want to do an unsupervised training or classification model and all the data are tabular and there doesn't seem to be much missing or any crazy outliers when you load it, maybe that's considered clean.
"However, if I'm doing something where I need to understand more about the business domain of that data, it's not so much about cleanliness; it's about sufficiency. Do I have a picture? Do I have all the data I need? You can't find causality if you're missing information."
Cashorali says silos are to blame for this misplacement of data. "I can't tell you how many companies don't have data dictionaries, don't understand all the datasets [they've given me]. Someone may say, 'We've given you all the data we have,' but they're all working in silos and it turns out we're missing data from another department."
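The first case Cashorali mentions, tabular data with little missing and no "crazy outliers," lends itself to a quick automated check on load. The sketch below is one possible version using only Python's standard library. The `profile_column` name, the use of `None` to mark missing values, and the interquartile-range fence with a factor of 3 are assumptions for illustration. It says nothing about her second case, whether the data are sufficient, which no per-column check can answer.

```python
import statistics


def profile_column(values, iqr_factor=3.0):
    """Return the missing fraction and extreme-outlier count for one column.

    None counts as missing. A value is flagged as an outlier when it lies
    more than iqr_factor times the interquartile range outside the
    quartiles (an illustrative rule, not a standard the panel endorsed).
    """
    present = [v for v in values if v is not None]
    missing_fraction = 1 - len(present) / len(values) if values else 0.0
    outliers = 0
    if len(present) >= 4:
        q1, _, q3 = statistics.quantiles(present, n=4)
        fence = iqr_factor * (q3 - q1)
        outliers = sum(1 for v in present if v < q1 - fence or v > q3 + fence)
    return {"missing_fraction": missing_fraction, "outliers": outliers}


# Hypothetical measurements with one gap and one suspicious reading.
print(profile_column([10, 11, 12, 13, 12, 11, None, 500]))
```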
Mom and Pop Data
With the diversity within the sciences, Dwan wondered, is it often hard to stand out as a specialized data consultant group amidst a sea of broad, national firms such as Accenture and Deloitte?
"I don't think we think of it that way," said Gardner. "There're so many problems out there that need to be solved, and it seems the need outweighs the available resources. So we often find ourselves working alongside those types of organizations. One of the difficulties in data science is that it cuts through so many disciplines, and it's very rare to find people that have been subjected to all those disciplines and can mold them all together. That's one thing we're able to do is kind of hold it all together."
"Deloitte doesn't have anyone like us," Howe said bluntly. "Fight me."