GIAB Establishes New Benchmark for Several Medically Relevant, Difficult-to-Sequence Genes
By Kyle Proffitt
February 14, 2022 | The Genome in a Bottle (GIAB) Consortium has just reported a new, expanded benchmark for 273 medically relevant genes that are challenging to clinically assess with current sequencing methods. The work is published in Nature Biotechnology. Prior benchmarks released by the group excluded nearly 400 medically relevant genes due to repetitiveness or polymorphic complexity, but the new effort ultimately reports over 17,000 single nucleotide variations, 3,600 insertions and deletions, and 200 structural variations in the HG002 genome relative to human genome references GRCh37 and GRCh38.
Despite advances in sequencing technologies and refinements to the human reference genome, several medically relevant genes harbor pathogenic variants that remain difficult to identify with short-read and even long-read sequencing. As the report states, “The clinical tests for these genes often require locus-specific targeted designs and/or employ multiple technologies and are only applied when suspicion of a specific disorder is high.”
To identify such genes, the consortium first pulled lists of medically relevant genes from OMIM, HGMD, and ClinVar and combined these with a list from the COSMIC gene census, identifying 4,697 autosomal medically relevant genes. Of these, 395 genes showed less than 90% inclusion in GIAB’s prior HG002 v4.2.1 small variant benchmark and were chosen for further resolution.
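The 90% inclusion cutoff described above can be illustrated with a short sketch. This is not the consortium's code: the function names, the half-open coordinate convention, and the toy gene and region data are all assumptions made for illustration.

```python
# Hypothetical sketch: flag genes whose coverage by benchmark regions is below 90%.
# Coordinates are assumed to be half-open (start inclusive, end exclusive).

def covered_fraction(gene, regions):
    """Return the fraction of gene (start, end) covered by benchmark intervals."""
    g_start, g_end = gene
    covered = 0
    cursor = g_start  # everything left of the cursor has already been counted
    for start, end in sorted(regions):
        start = max(start, cursor)  # skip bases already counted or outside the gene
        end = min(end, g_end)
        if end > start:
            covered += end - start
            cursor = end
    return covered / (g_end - g_start)


def needs_resolution(genes, regions_by_chrom, threshold=0.90):
    """Return names of genes covered less than `threshold` by benchmark regions."""
    return [
        name
        for name, (chrom, start, end) in genes.items()
        if covered_fraction((start, end), regions_by_chrom.get(chrom, [])) < threshold
    ]


# Toy data, invented for illustration only.
genes = {"GENE_A": ("chr1", 100, 200), "GENE_B": ("chr1", 500, 600)}
regions_by_chrom = {"chr1": [(90, 150), (140, 160), (480, 620)]}
print(needs_resolution(genes, regions_by_chrom))
```

In this toy input, GENE_A is 60% covered and is flagged, while GENE_B is fully covered and is not.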
To resolve these genes, GIAB used PacBio HiFi reads to create a trio-based (HG002 with both parental HG003 and HG004 genomes) haplotype-resolved assembly using hifiasm v0.11. New small variant and structural variant benchmarks were established for 273 of these challenging, medically relevant genes (CMRGs) under the criterion that the entire gene, both 20-kb flanking regions, and any overlapping segmental duplications have “exactly one fully aligned contig from each haplotype with no breaks on GRCh37 and GRCh38.” In at least one case, SMN1, manual curation with ultralong Oxford Nanopore Technologies (ONT) and 10x Genomics reads allowed benchmarking where the PacBio reads did not provide complete coverage. The team also used Bionano Genomics optical mapping-based SV calling to confirm 50 identified structural variants of at least 500 bp.
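The inclusion criterion quoted above can be sketched as a simple check. This is a simplified illustration under stated assumptions, not the published pipeline: each alignment record is treated as one contig, haplotypes are labeled "maternal" and "paternal", and all function names and coordinates are hypothetical.

```python
# Hypothetical sketch of the inclusion criterion: the gene, its 20-kb flanks,
# and any overlapping segmental duplications must be spanned by exactly one
# contig alignment from each haplotype.

FLANK = 20_000


def target_region(gene_start, gene_end, segdups=()):
    """Gene plus 20-kb flanks, extended over any overlapping segmental duplications."""
    start = max(0, gene_start - FLANK)
    end = gene_end + FLANK
    changed = True
    while changed:  # repeat until no segdup extends the region further
        changed = False
        for s, e in segdups:
            if s < end and e > start and (s < start or e > end):
                start, end = min(start, s), max(end, e)
                changed = True
    return start, end


def passes_criterion(region, alignments):
    """alignments: list of (haplotype, ref_start, ref_end) contig alignments."""
    start, end = region
    for hap in ("maternal", "paternal"):
        hits = [(s, e) for h, s, e in alignments if h == hap and s < end and e > start]
        if len(hits) != 1:  # zero contigs, or a break producing several pieces
            return False
        s, e = hits[0]
        if s > start or e < end:  # the single contig must span the whole region
            return False
    return True


region = target_region(100_000, 150_000, segdups=[(165_000, 200_000)])
alignments = [("maternal", 0, 300_000), ("paternal", 50_000, 250_000)]
print(region, passes_criterion(region, alignments))
```

In practice this check would have to hold on both GRCh37 and GRCh38, so a gene would need to pass once per reference.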
The new benchmarks “identified variant-calling errors due to false duplications in GRCh37 or GRCh38 in several medically relevant genes.” As an example, the authors note that “PacBio HiFi and Illumina short-read coverage is low and missing one or both haplotypes for CBS, CRYAA and KCNE1 on GRCh38, because reads incorrectly align to distant incorrect copies of these genes (CBSL, CRYAA2 and KCNE1B, respectively).” This finding is supported by work from the Telomere-to-Telomere Consortium and the Genome Reference Consortium.
Working with the Genome Reference Consortium, GIAB researchers established a new masking file that replaces these falsely duplicated regions with N’s in the GRCh38 reference, allowing unambiguous variant calling in the correct genes. This masking “substantially improves recall and precision of variant calls in these genes for Illumina, PacBio HiFi and ONT mapping-based methods… without increasing errors in other regions.”
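The masking idea can be sketched in a few lines. This is a minimal illustration, not the GIAB masking file or tooling: the function name, the half-open intervals, and the toy sequence are assumptions.

```python
# Hypothetical sketch: replace falsely duplicated intervals with N's so that
# reads can no longer align to them. Intervals are half-open (start, end).

def mask_intervals(sequence, intervals):
    """Return `sequence` with every base inside `intervals` replaced by 'N'."""
    masked = list(sequence)
    for start, end in intervals:
        start, end = max(0, start), min(len(masked), end)  # clip to sequence bounds
        if end > start:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy sequence, invented for illustration only.
print(mask_intervals("ACGTACGTAC", [(2, 5)]))  # prints ACNNNCGTAC
```

Because the masked bases are replaced rather than deleted, the sequence length is unchanged and coordinates elsewhere on the chromosome stay the same.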
The team also evaluated the CMRG variant benchmarks by comparing variant callsets from short- and long-read technologies, including Illumina, PacBio, and ONT, using a variety of mapping and assembly-based variant calling methods. This analysis showed that the new benchmark reliably identified false positives and false negatives across these callsets, but it also revealed some errors in the haplotype-resolved assembly and enabled further benchmark refinement.
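For readers less familiar with the metrics, the relationship between false positives, false negatives, precision, and recall can be sketched as follows. This is a deliberate simplification that assumes variants match only when their (chromosome, position, ref, alt) tuples are identical; real benchmarking comparisons also have to reconcile different representations of the same variant. The function name and the toy variants are hypothetical.

```python
# Hypothetical sketch: score a callset against a benchmark by exact matching.

def precision_recall(calls, benchmark):
    """Return (precision, recall, false_positives, false_negatives)."""
    calls, benchmark = set(calls), set(benchmark)
    tp = len(calls & benchmark)   # called and in the benchmark
    fp = len(calls - benchmark)   # called but not in the benchmark
    fn = len(benchmark - calls)   # in the benchmark but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp, fn


# Toy variants, invented for illustration only.
benchmark = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr1", 300, "G", "A"), ("chr1", 410, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr1", 999, "A", "C")}
print(precision_recall(calls, benchmark))
```

Here two of the three calls are correct (precision 2/3) and two of the four benchmark variants are found (recall 1/2).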
Despite the revelations and improved benchmark, the authors are quick to remind readers that “another 122 autosomal genes covered <90% by v4.2.1 are still excluded from the CMRG benchmark.” These genes remain difficult to benchmark for various reasons elaborated in the paper, and the authors indicate that new methods will be needed for further resolution.