Building A Telomere-to-Telomere Human Pangenome Reference
By Allison Proffitt
March 2, 2021 | Our current version of the human reference genome is built primarily from one person. “This does not adequately represent genetic diversity in the human population,” Karen Miga told the virtual audience at this week’s AGBT General Meeting. “This is true of most of our genomic repositories that we’ve built and resources and tools we’ve built,” she continued. “We’re seeing a failing in recognizing the full and broad representation of sequence diversity in the human population.”
Miga is an Assistant Research Scientist at the University of California, Santa Cruz Genomics Institute and Director of the Data Production Center for the Human Pangenome Reference Consortium (HPRC). The goal of the HPRC, she explained, is to develop a better representation of sequence diversity in the human population, starting with 350 diverse humans. With that diverse input, the HPRC plans to develop a complete and comprehensive map of genome variation using haplotype-phased assemblies: two assemblies from each participant, one representing the maternally inherited chromosomes and one the paternally inherited chromosomes.
Enriched diversity in our upstream reference genomes will have longstanding downstream impacts on our mapping, alignment, and analyses, Miga argued. Additionally, a new reference data structure will foster a new ecosystem of tools.
But this presents a staggering technical challenge: creating hundreds (if not thousands) of complete reference-grade genomes. In fact, Miga added, the goal is even loftier. Through the Telomere-to-Telomere consortium, researchers hope to achieve complete genomes or at least complete assembled chromosomes.
“This has never been done before!” Miga said. “Sure, back in 2003 we reached 99% of the euchromatic regions; however, the highly repetitive heterochromatic regions were intentionally not included.” These regions are essential to understanding human biology, she emphasized, including mitosis and meiosis, protein formation, genome spatial organization, epigenetic profiles, genome instability, and gene families.
Add to the mix two further goals: an embedded ethics framework to address ethical and policy questions, and global genomics partnerships to ensure population representation. In fact, Miga acknowledged, even the idea of a “complete” human pangenome reference isn’t perfectly clear. “The reason ‘complete’ is in quotes is because we’re still trying to figure out what that means at the population genetics level.”
The effort will certainly require a team approach with a host of sequencing technologies and centers to handle production of the genomes.
One of the drivers for the project, Miga said, is PacBio’s HiFi long read data. “When you take this high-quality circular consensus—or CCS—read, you end up with an extremely high-quality consensus read of 99.9% read accuracy,” she said, reporting that her team is reaching 35-40x coverage.
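For readers who want to put those two figures in context, here is a minimal sketch of the arithmetic behind them. It converts a read accuracy into a Phred quality score and estimates how many sequenced bases a given coverage depth implies. The genome size of 3.1 billion bases is an assumed round number used for illustration, not a figure from the talk, and the function names are hypothetical.

```python
import math

# Assumed approximate size of a haploid human genome in base pairs.
# This is an illustrative round number, not a value quoted in the talk.
ASSUMED_GENOME_SIZE_BP = 3.1e9


def phred_from_accuracy(accuracy: float) -> float:
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred score.

    Phred quality is defined as Q = -10 * log10(error probability).
    """
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def bases_for_coverage(coverage: float, genome_size_bp: float) -> float:
    """Total sequenced bases needed to reach a mean coverage depth."""
    return coverage * genome_size_bp


if __name__ == "__main__":
    # 99.9% read accuracy corresponds to roughly Phred Q30.
    print(f"Phred score: Q{phred_from_accuracy(0.999):.0f}")
    # Sequenced bases implied by 35x and 40x coverage of the assumed genome.
    for depth in (35, 40):
        total = bases_for_coverage(depth, ASSUMED_GENOME_SIZE_BP)
        print(f"{depth}x coverage: about {total / 1e9:.1f} billion bases")
```

Under these assumptions, 99.9% accuracy works out to about one error per 1,000 bases, and 35-40x coverage means each position in the genome is covered by roughly 35 to 40 reads on average.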
The consortium is also using Oxford Nanopore’s ultra-long read data. “This has been tremendously useful in gaining data that can go up to 100kb plus,” Miga said. She highlighted the consortium’s partnership with Circulomics, which has helped develop its ultra-long sequencing dataset.
The HPRC just celebrated its year one data release, Miga said, which includes sequencing data and QC metrics from the first 30 samples and relied on technologies from PacBio, Oxford Nanopore, Dovetail Genomics, Bionano, and Illumina, as well as Strand-seq. The data are shared openly through a cloud-based data management approach and can be found in an AWS S3 bucket and on AnVIL, with workflows available on Dockstore and GitHub.
Meanwhile the Telomere-to-Telomere Consortium (T2T)—which Miga says has now really merged with HPRC’s work—launched at AGBT in 2019 with the telomere-to-telomere assembly of the X chromosome. Since then, the consortium has been busy using PacBio’s HiFi reads to construct string graphs from long perfect overlaps and is using ONT for the “hard” tangles, “to try to resolve some of the longer repeats,” Miga explained.
In September 2020, T2T released its first genome with 23 chromosomes. “We have no unplaced contigs anymore!” Miga said. “It’s been a tremendous joy to see the first high-resolution maps of every human pericentric and centromeric region in the human genome.”
The effort is revealing interesting genomic rearrangements, new repeat predictions, new satellite arrays and transposable elements, new tandem repeats, and new genes, she said.
While this is tremendous progress, Miga emphasized that we are “not yet to the finish line” of a truly complete genome. There are significant technological barriers between the haploid genomes that have been the focus thus far and a diploid telomere-to-telomere genome, plus further barriers to scaling T2T diploid genomes.
But the efforts to overcome these barriers are well worth it.
“Having this more complete and comprehensive human pangenome reference will absolutely revolutionize our understanding of genetics and epigenetics,” Miga said. “I think this is not only clearly in our understanding of how this could influence healthcare by the ability to study the complete genome for important clinical variants that might have been missed before, but also in our understanding of genome diversity, understanding how we vary as a species at the genetic level, and also understanding new and exciting cell biology that might have been missed just because of our incomplete maps in the past.”