New Tool Helps Query Databases, Flag Anomalies in Data
By Allison Proffitt
October 8, 2024 | In a data science paper published in June, a team of researchers at MIT presents GenSQL, a system designed to make data science questions easier to answer. They published the work in Proceedings of the ACM on Programming Languages (DOI: 10.1145/3656409). GenSQL is written in Clojure and is available as open source on GitHub.
The goal of GenSQL, said first author Mathieu Huot, research scientist at MIT, was to make it easier for a broader scientific audience to query databases. SQL has done that for various programming languages, he said. “Even though Python is very popular now, back in the ’70s and ’80s, SQL kind of taught the business world how to use computers. But SQL is not great for data science questions,” Huot argues. “So the question was, ‘Can we leverage that ease of use and this large user base, and allow them to do more in a familiar environment?’”
GenSQL is a declarative extension to SQL, the authors write. It “seamlessly enables queries that integrate access to the tabular data with operations against the probabilistic model.”
“It should be easier for users to just learn a few new patterns,” Huot said. “But then behind it is this compiler converting those queries to actual more technical questions on [probabilistic] models.”
GenSQL is particularly good at predicting new data, detecting anomalies, imputing missing values, cleaning noisy entries, and generating synthetic observations, the authors note. And GenSQL is not designed as an AI black box.
Rather than relying on neural nets, GenSQL is built on Bayesian inference workflows, and it works with models written in a variety of probabilistic programming languages.
“Statisticians use much simpler models, like linear models, because they are fully interpretable,” Huot said. “We know exactly what the coefficients are. We know what it classifies. Then you can inspect what the model is doing. We also have a backend that synthesizes these models,” he added, “but again, these models look more like handwritten ones compared to neural nets, which only have weights.”
That’s not to suggest Bayesian models are simple, Huot said—“They can be quite complex!”—but they are still inspectable and can be edited and updated. “It’s not trained to be overconfident and to absolutely try to bullshit. We felt like that was super important to science.” People don’t trust neural nets, he said. They want to have more insight into how the model arrives at conclusions with their data.
Life Sciences Use Cases
The authors presented two case studies using real-world data: one in medicine with clinical trial data and one in synthetic biology with wet lab data. In the clinical trial example, the team used GenSQL to check a dataset for probable mislabeling. The team used data from the BEAT19 Covid-19 crowd-sourcing study and calculated conditional probabilities of BMI values for individuals based on other available data.
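To make the idea of flagging entries by conditional probability concrete, here is a minimal Python sketch. It is not GenSQL code and not the authors' method: it swaps in a simple linear-Gaussian model of BMI given body weight, and the toy rows, the function names, and the 0.01 density threshold are all hypothetical choices made for illustration. Entries whose recorded BMI is very unlikely given the weight are flagged for review.

```python
import math
from statistics import mean

# Hypothetical toy rows of (weight_kg, recorded_bmi); not BEAT19 data.
# The entry at index 4 is deliberately implausible for its weight.
ROWS = [
    (50, 17.3), (55, 19.0), (60, 20.8), (65, 22.5), (70, 42.0),
    (75, 26.0), (80, 27.7), (85, 29.4), (90, 31.1), (95, 32.9),
]


def fit_conditional_gaussian(xs, ys):
    """Fit y | x ~ Normal(a + b*x, sigma^2) by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return a, b, sigma


def conditional_density(y, x, a, b, sigma):
    """Density of y given x under the fitted linear-Gaussian model."""
    z = (y - (a + b * x)) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def flag_unlikely(rows, threshold):
    """Return indices of rows whose conditional density is below threshold."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    a, b, sigma = fit_conditional_gaussian(xs, ys)
    return [
        i for i, (x, y) in enumerate(rows)
        if conditional_density(y, x, a, b, sigma) < threshold
    ]


if __name__ == "__main__":
    # The threshold is an assumption; in practice it would need tuning.
    print(flag_unlikely(ROWS, threshold=0.01))
```

A flagged row is only a candidate for review, not proof of an error. A richer probabilistic model, such as the ones GenSQL queries, would condition on many more columns than this one-variable sketch does.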
In the synthetic biology example, GenSQL was used to generate accurate synthetic data, capturing the complex relationships between different host genes and experimental conditions and enabling researchers to model whether specific experimental conditions or genome modifications might have cascading downstream effects. The authors report that GenSQL’s predictions were more accurate than those of both linear models and conditional generative adversarial networks.
“Look how much better data science could be if it was easier to use!” Huot said. GenSQL represents one way to do that. He is excited to see how members of the community start using and further developing the tool. “It’s not perfect yet, but we believe it’s quite an improvement over other options.”