Tute Genomics Shares Genetic Variants Database on Google Genomics

By Allison Proffitt

March 12, 2015 | Google Genomics wants to play seriously in the genomics space, and the company is busy stocking its sandbox with all the best toys.

Google Genomics and Tute Genomics have announced that a Tute database of 8.5 billion annotations of genetic variants is publicly available through Google Genomics. The database will be hosted on Google Genomics, and can be queried at regular Google Genomics query rates. Other datasets hosted at Google Genomics currently include data from the 1000 Genomes Project; 1000 Genomes Project, Phase 3; Illumina Platinum Genomes; several reference genomes; and the MSSNG Database for Autism Researchers.

To distinguish itself as more than just data storage, Google Genomics is also applying Google’s search tools and computing prowess to processing and exploring genomic data, Jonathan Bingham, product manager, Google Genomics, explained.

Users can store data and share it with whomever they wish within Google Cloud. In addition, Google Genomics supports processing genomics data, including variant calling, tertiary analysis, and cohort comparison.

But it’s data exploration that really brings Google’s talents to bear on genomic data.

“Because [Google Genomics] is built on Google Cloud platform and Google’s infrastructure, one of the things that we wanted to solve for was storability,” Bingham explained. “As we move toward a world of millions of sequenced individuals, we want to make sure we don’t drown in data as a community. We want to make it easy for researchers to do analysis across all of that with the speed and power and flexibility that they’re interested in.”

Google BigQuery is a commercial analytics engine available for Google Cloud Platform that Bingham said has proven especially useful for genomic data exploration.

“It turns out that if you feed into [BigQuery] genetic variant calls from a cohort of patients, you can do queries against that, and in a matter of seconds, you can ask questions about allelic frequency, genome-wide association, linkage to phenotypic traits or drug treatments in a way that’s just kind of mind-blowingly fast.”

BigQuery was designed for use with unstructured data, but the Google Genomics team has tweaked the engine to work specifically with genomic data.

Users can load tab-delimited or CSV files, “pretty much as big as you want,” into BigQuery and immediately start querying the data. Without fine-tuning the input data, “We can look for things like transition/transversion ratios, allelic frequencies, and synonymous vs non-synonymous mutations,” Bingham said.
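To make one of those metrics concrete, here is a minimal sketch of a transition/transversion count over a tab-delimited variant file. It is plain Python rather than BigQuery, and the column names (chrom, pos, ref, alt) and sample rows are invented for illustration; they are not the schema Google Genomics or Tute uses.

```python
import csv
import io

# Hypothetical tab-delimited variant calls; columns and rows are illustrative only.
SAMPLE_TSV = (
    "chrom\tpos\tref\talt\n"
    "chr1\t100\tA\tG\n"
    "chr1\t200\tC\tT\n"
    "chr1\t300\tA\tC\n"
    "chr2\t150\tG\tA\n"
    "chr2\t400\tT\tG\n"
    "chr2\t500\tC\tG\n"
)

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_counts(tsv_text):
    """Return (transitions, transversions) for single-base substitutions."""
    transitions = 0
    transversions = 0
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        ref = row["ref"].upper()
        alt = row["alt"].upper()
        # Skip indels, non-ACGT alleles, and rows where ref equals alt.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


if __name__ == "__main__":
    ti, tv = titv_counts(SAMPLE_TSV)
    ratio = ti / tv if tv else float("nan")
    print(f"transitions={ti} transversions={tv} Ti/Tv={ratio:.2f}")
```

The same counting logic could in principle be written as a single aggregate query over a variants table, which is the kind of question Bingham describes asking interactively.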

Tute’s database, Bingham said, is “carefully curated,” which enables BigQuery to dig even deeper.

“They recognized that Google Genomics and BigQuery together make it possible to do some really interesting things with genetic variants and prior knowledge,” Bingham said. “If you’ve done a sequencing study, or you have new human genomes, you can then do joins with these [Tute] annotations. You can ask the question, ‘Given the patients that I’ve sequenced, what do we know about their variants? What are the ones that are most connected to disease? What do we know about drug response from them?’ So by joining against this Tute annotation set, you can ask those questions with interactive speed.”
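The join Bingham describes can be sketched in a few lines. The example below uses Python's built-in sqlite3 module as a small stand-in for BigQuery, and every table name, column name, gene label, and row in it is hypothetical; the actual Tute annotation schema is not described in this article.

```python
import sqlite3


def annotate_variants(sample_variants, annotations):
    """Join sequenced variants against an annotation table on (chrom, pos, ref, alt)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schemas, chosen only to illustrate the join.
    conn.execute(
        "CREATE TABLE sample_variants "
        "(patient TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    conn.execute(
        "CREATE TABLE annotations "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, gene TEXT, note TEXT)"
    )
    conn.executemany("INSERT INTO sample_variants VALUES (?, ?, ?, ?, ?)", sample_variants)
    conn.executemany("INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?)", annotations)
    rows = conn.execute(
        """
        SELECT s.patient, s.chrom, s.pos, a.gene, a.note
        FROM sample_variants AS s
        JOIN annotations AS a
          ON s.chrom = a.chrom AND s.pos = a.pos
         AND s.ref = a.ref AND s.alt = a.alt
        ORDER BY s.patient, s.chrom, s.pos
        """
    ).fetchall()
    conn.close()
    return rows


# Invented example data: two patients, three variant calls, two annotations.
SAMPLE_VARIANTS = [
    ("patient_1", "chr1", 100, "A", "G"),
    ("patient_1", "chr2", 500, "C", "G"),
    ("patient_2", "chr1", 100, "A", "G"),
]
ANNOTATIONS = [
    ("chr1", 100, "A", "G", "GENE_A", "placeholder disease association"),
    ("chr3", 900, "T", "C", "GENE_B", "placeholder drug response"),
]

if __name__ == "__main__":
    for row in annotate_variants(SAMPLE_VARIANTS, ANNOTATIONS):
        print(row)
```

Only variants with a matching annotation come back from the inner join; a left join would be the choice if unannotated variants should be kept in the result as well.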

Bingham gives an example for cost and speed: joining 88 GB of human genetic variants against the Tute dataset takes 30 seconds and costs less than $1.

Tute’s Annotation Database

And the Tute dataset is extensive. The 8.5 billion genetic variants include gene and transcript model annotations such as amino acid and protein substitutions and the functional consequence of exonic variants. The database includes conservation and evolutionary scores from SIFT, PolyPhen2, PhyloP, GERP++, MutationTaster, MutationAssessor, FATHMM, as well as MetaLR and MetaSVM, two ensemble scores recently developed by Dr. Kai Wang, Tute’s President and creator of ANNOVAR, and his collaborator Xiaoming Liu at the University of Texas. The database also contains Tute scores and Tute predictions, the company’s own scoring system to predict whether a SNP or indel is likely to be associated with Mendelian phenotypes.

The data are also enriched with public data. Population frequencies from public projects such as the 1000 Genomes Project and the NHLBI ESP-6500 exomes are included. The database contains clinical annotations from the NCBI’s ClinVar database and GWAS catalog, as well as regional annotations such as miRNA targeting site predictions.

Tute’s dataset is meant to be complementary to other variant databases, said David Mittelman, Tute’s CSO. For example: “ClinVar is a really good data source, but ClinVar’s source of data is classifications that are made by people. Clinical geneticists will review some content and upload their opinions to ClinVar when they discover variants. Generally those opinions are based on underlying annotations, and ClinVar is trying to capture people’s perceptions of what is clinically relevant. We’re trying to aggregate and organize the underlying data that would help people come to that decision.”

A relationship between Google and Tute started last year after Tute’s CEO, Reid Robison, met Google’s David Glazer at a conference. It was a good fit, said Mittelman.

“We’re pretty excited about Google Genomics, and basically re-purposing the Google Cloud to allow you to do storage, but also operations on genomic data.

“At Tute we’re working on the whole annotation layer. What’s behind your variants? How we can intersect that against everything that’s known? It screams ‘search engine’! It’s a search engine problem.”

Add search expertise to Google’s work with the Global Alliance on Genomics and Health on genomics standards and Mittelman sees much promise in Google’s approach.

“If you have a big name brand, and you’ve got great engineers, and you’re working on open standards, that’s a recipe for success. It makes a lot more sense than just dumping our data somewhere, building our own experience from scratch, or working with someone who is also starting from scratch.”

The genomics community is still deciding where to work and collaborate, Mittelman observes, but he’s impressed with the features of the community Google is building.

“Folks won’t want to just dump their data somewhere. They’re going to want to interact with it right there on the cloud… I think the jury’s out on where people are going to eventually congregate, but they’re going to congregate somewhere and they’re going to want to operate on their data. I think these kinds of initiatives are a good way to test the market and see if people will engage. And if they do, I think it’ll drive more innovation and development on those platforms.”

Bingham is open to more platforms. He stressed that Google Genomics is happy to host another annotation engine.

For their part, the Tute team is “committed to building a number of new tools and functionalities on top of Google Cloud,” Mittelman says. “This is just the beginning for us. We’re very eager to get integrated with them in the coming months.”