Broad's GATK Dominates as Genomics Go-To
By Joe Stanganelli
August 10, 2016 | It's difficult to operate in, or even casually observe, the life-sciences or medical research space without noticing that the Broad Institute's Genome Analysis Toolkit (GATK) seems to be everywhere, and that its reach is growing rapidly.
“We started developing the GATK [at the Broad Institute] in early 2009. It started taking off just as the 1000 Genomes Project was maturing,” said Eric Banks, a computational biologist and early developer of GATK who is now Senior Director of the Broad Institute's Data Sciences and Data Engineering Group, in an interview with Bio-IT World. “Other groups [and] consortia wanted to model their genomic processing efforts after the cutting-edge bioinformatics of 1000G.”
Notably, demand has continued to increase, garnering yet more attention for the non-profit organization. Recent and current examples of GATK's growing deployment and use include, to name a few:
- The Broad Institute's GATK forum presently boasts close to 36,000 registered users throughout the world, the majority in North America and Europe.
- The Cancer Genomics Cloud, the NCI-funded cancer cloud that was built by Seven Bridges Genomics and fully released in February of this year, uses GATK, just like the on-premises version of the Seven Bridges platform. This is despite the fact that the two organizations “compete” with each other, each having won a lucrative government contract for the National Cancer Institute's Cloud Pilot; the Broad Institute's counterpart is called FireCloud.
- In April, the Broad Institute partnered with Intel to make GATK available to Intel's research partners and users on the tech company's Collaborative Cancer Cloud. The Broad Institute further announced that it was partnering with Amazon Web Services, Google Cloud, IBM, Cloudera, and Microsoft to offer cloud-based access to GATK via those companies' cloud platforms.
- In May, Edico Genome announced that its for-lease DRAGEN processor for genomic analysis and variant calling will now come with an "Accelerated" version of GATK pre-installed. Ditto for IBM Power Systems S822LC for HPC—the product of a partnership between Edico Genome and IBM.
- Later this year, according to the Broad Institute, the organization will partner with Illumina in offering GATK-as-a-Service, granting some users of Illumina's cloud platform access to the latest version of GATK.
GATK4: What to Expect from the Upcoming Update
The Broad Institute is showing no signs of slowing down with GATK; the organization reports that it will release a supported version of GATK4 to the general public later this year. Technically, however, if you want GATK4 right now, you can get it, with an asterisk.
“GATK4 is already in alpha and is technically available, but not supported, to anyone who wants to use it,” Geraldine Van der Auwera, who runs the Broad Institute's GATK blog and leads user support as Group Leader of Data Science and Data Engineering, told Bio-IT World in an interview. “Its performance improvements [in] speed and memory [already] make it a big step forward from GATK3 for production needs.… Now we’re getting ready to launch GATK4 later this year.”
Van der Auwera went on to clarify that GATK4's Copy Number Variation (CNV) calling features—one of several entirely new methods in GATK4—are significantly further along than GATK4's other features, having already progressed beyond alpha and to the beta stage.
And so far, reports Van der Auwera, the feedback there has been quite positive. “External users have told us that they use [CNV] with good results,” she said.
Still, the rest of GATK4 has quite a way to go before it catches up to CNV and moves out of alpha, Van der Auwera pointed out. But she remained unshaken, even excited, in her conviction that the full release would happen before the end of calendar year 2016. Specifically, she noted, a number of tools not only still need to be migrated from GATK3 to GATK4, but also need to undergo “thorough and systematic performance testing to make sure that the tools are better and just as accurate.”
Some of these tools will take longer to make available for the Apache Spark distributed computing framework, which GATK4 relies upon to make GATK, in Van der Auwera's words in a recent Broad Institute blog post, “cloud friendly and more scalable.” Those tools, Van der Auwera conceded, will not be fully migrated from GATK3 until sometime over the course of next calendar year.
But she was quick to point out that even these lagging tools will still see improvement in time for the full launch later this year. “[These] tools will [not] be made available in Spark versions right away,” said Van der Auwera, “but instead will just be better versions of their GATK3 counterparts.”
The Broad Institute is relying on numerous other improvements in GATK4 to enhance data-processing speed and other areas of performance. Beyond the code-quality improvements expected of any meaningful software update (“in terms of readability, testing coverage, and simplified release engineering,” specified Van der Auwera), she reports that her organization is particularly excited about the simplification and overhaul of GATK's engine itself, intended to “remove unnecessary complexity” and “improve [researchers'] speed of innovation [while still] allow[ing] porting of existing tools with a manageable amount of effort for users.”
In this way, the Broad Institute certainly seems to understand its demographic for GATK; at the very least, where significant genomic efforts are happening, the Broad Institute is working to ensure that GATK is there. In addition to its enormous GATK push into the cloud via a plethora of platforms this year, the Broad Institute is also developing and leveraging “a common code base foundation for multiple genomics efforts,” reported Van der Auwera.
Achieving the Gold Standard by Supporting Standards and the Silver-Lined Cloud
In a related email statement, Andrew Hollinger, Associate Director of the Broad Institute’s Genomics Platform, told Bio-IT World that, with GATK, the Broad Institute seeks to achieve a mission of digital transformation by “improv[ing] human health[,] transform[ing] medicine through the application of genomic tools and… sharing… these tools and data.”
“We [aim to] build and offer gold-standard tools and services,” wrote Hollinger, “and to make them accessible to as large a community as we can.”
To these ends, Van der Auwera was especially eager to point out that GATK4 will offer full support for CRAM (a framework designed by the European Bioinformatics Institute, including its own file format and toolkit), Picard command-line tools, APIs from the Global Alliance for Genomics and Health, and other tools and standards in the genomics realm. Indeed, the Broad Institute appears to keenly grasp that GATK is not just a toolkit, but also a platform. Accordingly, it must be developed and marketed as such, replete with all of the necessary sharing-economy courting and support of developers and users—and, naturally, those stakeholders' preferred standards—that go with effective platform development.
“Our goal is to continue to expand the user base because it’s an excellent set of tools and pipelines, which helps research worldwide,” said Banks. “The tools are supported by people who are able to help right away, and who are very familiar with this type of research, because it’s what they are using every day to conduct their own research.”
Banks reports that this notion of going where the researchers already are feeds into the Broad Institute's strategy for enhancing GATK cloud accessibility. “We hear a lot about how hard it is for local users, for example at a small university, to set up the physical infrastructure that’s needed to undertake a large genomic study. The sizes of the databases are enormous; the computing power needed is expensive,” said Banks. “It’s great that several organizations are working with us to make GATK available in the cloud so these users can run their research without having to build and maintain a huge local data center, for example.”
Banks also offered a caveat: that accessibility issues notwithstanding, the Broad Institute is also pressing forward with its aggressive cloud strategy out of “pure necessity.”
“The expected future scale and computing power needed for massive research datasets is quickly outpacing anyone’s individual physical data infrastructure,” added Banks, “including what we deploy locally at the Broad.”
Banks allows, of course, that the cloud isn't for everybody.
“We also know that a lot of users want to build and maintain their own local data centers,” he continued, “so we will continue to offer GATK as a direct download just as we always have.”
As for the Broad Institute itself, however, it appears that the smart money is headed toward the cloud where the future of GATK is concerned. Says Van der Auwera: “Broad’s production cloud pipeline is actually already using GATK4!”