GA4GH Announces Workflow Execution Service API To Standardize Workflows Across Compute Environments
By Allison Proffitt
October 11, 2018 | Among the deliverables announced at the Global Alliance for Genomics and Health plenary meeting in Basel, Switzerland, was the Workflow Execution Service API developed by the Cloud Work Stream.
WES facilitates running a single workflow on multiple compute environments using either Common Workflow Language (CWL) or Workflow Description Language (WDL). A compute environment can stand up a WES endpoint, and researchers can write CWL or WDL workflows, encapsulate algorithms in Docker, and send that workflow as a run request to the WES service. The WES service runs the workflow and collects information about where the outputs go, the status of the workflow, any standard error, and so on, explains Brian O’Connor, Consulting Director for the Computational Genomics Platform at the University of California at Santa Cruz. O’Connor co-leads the GA4GH Cloud Work Stream with David Glazer of Verily.
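As a rough illustration of the run request described above, here is a minimal Python sketch. The endpoint path, form field names, and state names are assumptions modeled on the WES v1.0 specification and should be checked against the published schema; the helper functions and the example.org hosts are hypothetical and not part of any GA4GH library.

```python
import json

# Assumed path of the run-submission endpoint on a WES server.
WES_RUNS_PATH = "/ga4gh/wes/v1/runs"

# Assumed names of the states in which a run is no longer progressing.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def runs_endpoint(base_url):
    """Join a server base URL with the assumed run-submission path."""
    return base_url.rstrip("/") + WES_RUNS_PATH


def build_run_request(workflow_url, workflow_type, workflow_type_version, params):
    """Build the form fields for a run request (hypothetical helper).

    The workflow inputs are serialized to a JSON string because the
    request is assumed to be sent as form fields, not as a JSON body.
    """
    if workflow_type not in ("CWL", "WDL"):
        raise ValueError("workflow_type must be 'CWL' or 'WDL'")
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        "workflow_params": json.dumps(params, sort_keys=True),
    }


def is_finished(state):
    """Return True once a reported run state is terminal."""
    return state in TERMINAL_STATES
```

A client would send these fields to runs_endpoint(...) on the chosen compute environment, keep the run identifier the service returns, and poll the run's status until is_finished reports a terminal state. Pointing the same request at a different environment only changes the base URL, which is the portability O’Connor describes.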
WES was born, O’Connor said, out of scientists’ evolving needs over the past four to five years, as cloud computing gained traction and big, distributed genomics computational projects became possible. One example is the Pan-Cancer Analysis of Whole Genomes (PCAWG), an international collaboration to identify common patterns of mutation in more than 2,800 cancer whole genomes from the International Cancer Genome Consortium.
O’Connor worked on PCAWG while he was at the Ontario Institute for Cancer Research and experienced the challenges of large-scale, distributed compute projects firsthand. “Because data were right then in many different places around the world—I think we had seven or eight data repositories all around the world for that project—we needed to make our workflows extremely portable,” O’Connor explains.
The compute was also distributed. Fourteen different compute environments were used for PCAWG. Researchers used Azure, Google Cloud, and AWS, as well as HPC environments such as the Barcelona Supercomputing Center.
“We needed ways to move workflows around, to have that process be really reliable,” O’Connor says. “We needed ways to reference data and pull and push data into these data repositories in a way that was cloud agnostic.”
The PCAWG team adapted technologies to make it work, but the experience helped O’Connor clarify how GA4GH could contribute.
“Whether you're talking about private clouds, or whether you're talking about AWS and Google and being able to move your algorithms between them, it's far easier to send the algorithms around to where the data reside, and have those data live in a place where compute is available, than to be in a situation where you're trying to move petabytes of data around in order to do your analysis,” O’Connor said.
GA4GH’s WES provides a v1.0 solution. For the past six months, the Cloud Work Stream has been working with Driver Projects to build and refine what is needed. Australian Genomics; the European Genome-phenome Archive (EGA), European Variation Archive (EVA), and European Nucleotide Archive (ENA); Genomics England; the Human Cell Atlas; and TOPMed all worked with the Cloud Work Stream.
It’s a diverse group, O’Connor acknowledges. “They all have slightly different stories in terms of use cases and how they would leverage WES, but there is this general benefit in having a standardized way that researchers can bring algorithms into their system.”
For instance, the Human Cell Atlas Project is currently running its WDL-based workflows on top of the Cromwell workflow service at the Broad Institute, and has given the Work Stream feedback on how to interact with WDL workflows, O’Connor explains. “The Human Cell Atlas Project wants their workflows in addition to their data to be shared with the community; they’ve built a small testing platform that allows them to send their workflows out for not only the Broad to test them, but also to send them to other locations. For example, they built a WES bridge to DNAnexus.”
General workflow portability is the goal, O’Connor said. “Even if your use cases aren’t quite as distributed as PCAWG—which was super massively distributed—you’re still benefitting from having a standardized way to send your algorithms around to where the data reside. Really helping push forward workflow portability as a pleasant side effect of that process.”
There’s more on the GA4GH Roadmap for the Cloud Work Stream. While WES helps send an algorithm to the data, standard ways to reference the data in a cloud-agnostic way are still needed. The Work Stream plans to define standards for reading and writing data from different storage backends in different cloud environments, both as inputs and outputs of workflows, and to make that process as frictionless as possible. O’Connor says the Work Stream hopes to release a specification for data access next year.
WES, he said, is a big milestone for this year’s meeting. “For us this is really, really significant. Because this is the first time I can now point to WES and say, ‘It’s not just an API, but other people are using it. This is actually a standard endorsed by GA4GH,’ and that means a lot.”
And a powerful standard at that. “If we had had this when we did PCAWG it would have made the process much, much easier,” O’Connor says. “It would have given us a standardized way to interact with these remote data and compute environments without having to rewrite and retool our workflows or our submission process.” O’Connor predicts that the PCAWG workflow runs could have been completed in weeks using WES.