Genetic Sequencing Will Enable Us To Win The Global Battle Against COVID-19
Contributed Commentary by Evan Floden, CEO of Seqera Labs
November 5, 2021 | Using genetic sequencing to identify pathogens and respond to outbreaks is hardly new. Public health authorities routinely use genetic surveillance to monitor seasonal influenza and food-borne bacteria that can lead to illness. Genetic surveillance has also played a key role in helping contain recent viral outbreaks, including Ebola, Yellow Fever, and Zika.
The emergence of SARS-CoV-2 in late 2019 was a game-changer, however. While SARS-CoV-2 was identified and sequenced in record time, it proved impossible to contain. It quickly became apparent that existing surveillance systems were not up to the task of tracking a fast-moving virus that spread asymptomatically. The speed and scale of coronavirus spread was a wake-up call, highlighting both our vulnerability to a global pandemic and the crucial role bioinformatics must play in responding to one.
COVID-19 is arguably the first global pandemic to emerge in our modern bioinformatics era. To meet the threat, governments, health authorities, and scientists quickly mobilized, ratcheting up research and development efforts in a variety of areas, including:
- Sequencing the initial SARS-CoV-2 genome
- Developing vaccines and therapeutics
- Evaluating the efficacy of therapeutics and vaccines in the face of emerging variants
- Conducting genomic surveillance
All these activities rely to some degree on portable, repeatable, scalable bioinformatics pipelines. Genetic surveillance is particularly important. Without surveillance to understand viral transmission patterns and evolution, health authorities would be in the dark. But existing surveillance fell short.
While most developed countries have genetic surveillance capabilities, it became clear early on that existing approaches couldn't scale. Until the pandemic, surveillance typically relied on relatively small numbers of regional or national labs conducting sequencing.
Tracking transmission patterns and monitoring mutations within populations (viral phylodynamics) required a high percentage of positive test samples to be sequenced and analyzed. With some jurisdictions reporting tens of thousands of positive tests per day, facilities were quickly overwhelmed. Monitoring outbreaks and tracking a rapidly mutating virus demanded entirely new levels of speed, scale, and proactiveness, so it was all hands on deck to scale viral surveillance.
Given the scale of the challenge, governments needed to mobilize all available capacity, including regional labs, public health agencies, large hospitals, private labs, and universities. This decentralized approach came with its own technical and logistical challenges, however. Labs often had different levels of resources and capabilities, including:
- Different sequencing platforms
- Diverse software tools and pipelines to extract, analyze, and classify samples
- IT infrastructure ranging from no capacity to on-prem clusters to private or public clouds
- Varying levels of bioinformatics and IT expertise, making it challenging for some labs to implement and manage analysis pipelines themselves
Data harmonization was another challenge. Labs needed to gather and present genomic analysis and metadata in a standard way. Central health authorities needed automated methods to collect, cleanse, and aggregate data for downstream reporting, analysis, and sharing with international bodies.
Fortunately, techniques pioneered in modern software development, such as collaborative source code management systems (SCMs), container technologies, and CI/CD pipelines, helped address these challenges. By sharing open source pipelines in public repositories and encapsulating applications in containers, pipelines developed to sequence the virus and analyze variants could be distributed quickly and efficiently. This meant that participating labs with minimal in-house IT expertise could easily obtain and run curated COVID pipelines designed by expert scientists and bioinformaticians. Pipelines could essentially be treated as "black boxes" and run by people with minimal knowledge of the pipeline mechanics and the underlying IT infrastructure.
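To make this concrete, here is a minimal sketch of what that black-box experience looks like with Nextflow, the orchestrator behind the nf-core project mentioned below. The repository name, revision, and input parameter are placeholders invented for illustration, not a specific COVID pipeline:

```
# Illustrative only: "example-org/sars-cov-2-assembly" is a placeholder repository.
# One command pulls the pipeline from its public Git repository and runs every
# step inside the pre-built containers the pipeline declares, so the lab installs
# nothing beyond Nextflow and a container runtime.
nextflow run example-org/sars-cov-2-assembly -r v1.0 -profile docker --input samples.csv
```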
Workflow orchestrators that simplify writing and deploying compute- and data-intensive pipelines at scale on any infrastructure have played a key role in the battle against COVID-19. These orchestration platforms provide important capabilities required by pharmaceutical companies, research labs, and public health authorities, including:
- Pipeline portability – compute environments are abstracted from the pipeline code so that the same workflow can run unmodified across different IT environments.
- Container support – support for all major container standards and registries simplifies pipeline portability, deployment, and execution.
- SCM integrations – view, pull, and manage pipelines in shared repositories such as GitHub, GitLab, and Bitbucket.
- An expressive domain-specific language (DSL) – pipeline logic is expressed in a concise DSL, while the individual tasks it ties together can be written in the designer's scripting language of choice. Features such as workflow introspection can adjust workflow behavior at runtime, making pipelines more adaptable, reliable, and predictable (see the sketch after this list).
- Curated, shared pipelines – advanced platforms make it easy for organizations to publish and distribute shared pipelines to jumpstart analysis efforts. The nf-core project, for example, is a community effort to collect analysis pipelines for a variety of datasets and analysis types.
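As a rough illustration of how these capabilities come together, the sketch below shows a deliberately minimal Nextflow DSL2 pipeline: a single quality-control step pinned to a pre-built container image, with the input location exposed as a parameter. The tool, image tag, and file paths are examples chosen for clarity rather than recommendations.

```nextflow
// Minimal, illustrative DSL2 pipeline; the tool, container tag, and paths are
// placeholders chosen for the example rather than recommendations.
nextflow.enable.dsl = 2

params.reads = 'data/*.fastq.gz'          // input location, overridable at launch

process FASTQC {
    // The step runs inside a pinned, pre-built image, so the executing lab never
    // installs the underlying tool or its dependencies by hand.
    container 'quay.io/biocontainers/fastqc:0.11.9--0'

    input:
    path reads

    output:
    path '*_fastqc.zip'

    script:
    """
    fastqc ${reads}
    """
}

workflow {
    FASTQC(Channel.fromPath(params.reads))
}
```

Because the compute environment is described in configuration profiles rather than in the script itself, the same code can be pointed at a laptop, an on-premises cluster, or a public cloud batch service simply by switching the -profile option at launch.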
The ncov2019-artic-nf pipeline is a good example of how these orchestration platforms are enabling COVID surveillance efforts. Developed under the leadership of Matt Bull of the Pathogen Genomics Unit at Public Health Wales, the pipeline is used by the COVID-19 Genomics UK Consortium (COG-UK) at local sequencing sites. It automates the analysis of SARS-CoV-2 sequences from both Illumina and Nanopore platforms, generating consensus genomes, variant calls, quality control data, and various metadata. The pipeline then produces output files in a standard format so they can easily be shared, collected, and aggregated.
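For a local sequencing site, launching the pipeline looks much like the generic example shown earlier. The command below is a hedged sketch: the flag names are recalled from the project's documentation and may differ between releases, so treat them as assumptions and consult the repository README before use.

```
# Hedged sketch of launching ncov2019-artic-nf for an Illumina run; flag names
# are assumptions based on the project's documentation and may have changed.
#   --illumina    select the Illumina workflow (Nanopore modes also exist)
#   --prefix      run identifier used to name output files
#   --directory   folder containing the demultiplexed reads
nextflow run connor-lab/ncov2019-artic-nf -profile docker \
    --illumina --prefix run42 --directory /data/run42/fastq
```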
A separate pipeline called Elan (developed by Sam Nicholls of the University of Birmingham) runs centrally, processing data from the various COG-UK labs. Elan is the daily heartbeat of the COG-UK bioinformatics pipeline and is responsible for sanity checking uploaded data, performing quality control, and updating Majora (the COG-UK database) with processed file metadata. Elan is also responsible for publishing the daily COG-UK dataset for downstream analysis. One of the most important downstream analyses is tree building, handled by another pipeline, Phylopipe. Written by Rachel Colquhoun at the University of Edinburgh, Phylopipe builds phylogenetic trees from the COG-UK dataset.
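To show how an orchestrator expresses this kind of multi-stage central processing, here is a deliberately simplified sketch of a daily "gather, check, and publish" workflow. It is not Elan's actual code; the process names, checks, thresholds, and file layout are all invented for illustration.

```nextflow
// Purely illustrative sketch of a daily central pipeline; NOT the actual Elan code.
// Process names, checks, thresholds, and file layout are invented for this example.
nextflow.enable.dsl = 2

params.uploads = 'uploads/*.fasta'        // consensus genomes uploaded by member labs

process SANITY_CHECK {
    errorStrategy 'ignore'                // malformed uploads are dropped, not fatal

    input:
    path genome

    output:
    path 'checked_*'

    script:
    """
    grep -q '^>' ${genome}                # must at least look like FASTA
    cp ${genome} checked_${genome}
    """
}

process QUALITY_CONTROL {
    errorStrategy 'ignore'                // genomes failing QC are excluded

    input:
    path genome

    output:
    path 'qc_*'

    script:
    """
    # illustrative threshold: reject genomes with too many ambiguous bases
    test \$(grep -v '^>' ${genome} | tr -cd 'Nn' | wc -c) -lt 3000
    cp ${genome} qc_${genome}
    """
}

process PUBLISH_DATASET {
    publishDir 'daily_release', mode: 'copy'

    input:
    path genomes

    output:
    path 'all_genomes.fasta'

    script:
    """
    cat ${genomes} > all_genomes.fasta
    """
}

workflow {
    genomes = Channel.fromPath(params.uploads)
    passed  = QUALITY_CONTROL(SANITY_CHECK(genomes))
    PUBLISH_DATASET(passed.collect())
}
```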
A recent tweet shared by Elan's author illustrates the scale of this success: over 1,000,000 SARS-CoV-2 genomes processed since March 2020. Back in March 2021, it was observed that if the workload processed by Elan were run sequentially, processing the COG-UK dataset would have required 123 years, a timeframe that is obviously incompatible with surveilling a pandemic in real time!
Evan Floden is the CEO and co-founder of Seqera and a founding member of the Nextflow project. Evan has experience bringing healthcare solutions from ideation to market through roles in the biotechnology and medical device industries. He holds a PhD in biomedicine for work on large-scale sequence alignment. His broader interests encompass everything at the intersection of life sciences and cloud computing. He can be reached at Evan@seqera.io.