Illumina Launches DRAGEN V3.7, Talks ML, Truth Challenges, All Of Us
By Allison Proffitt
October 27, 2020 | Illumina is launching version 3.7 of its DRAGEN bio-IT platform this week, Rami Mehio, Vice President, Bioinformatics and Instrument Software at Illumina, told Bio-IT World last week. The newest release will include DRAGEN for Single Cell RNA, unique molecular identifier (UMI) support including UMI-aware small variant calling, CYP2D6 genotyping, and accuracy gains in both germline variant calling and tumor-only analysis.
These updates are the latest in the DRAGEN team’s effort to provide functionalities Illumina customers need, said Mehio. Previously VP of Engineering at Edico Genome, Mehio joined Illumina in the 2018 acquisition. DRAGEN has been rolling out updates every three months, he said, seeking to deliver the maximum information from sequencing.
These new features are, at least in part, being driven by requirements from the National Institutes of Health’s All of Us Research Program. The program has selected DRAGEN as its population genomics informatics software. (Illumina is also providing Infinium Global Diversity Arrays, at no charge, to process up to a million samples.)
The three All of Us genome analysis centers—the University of Washington, the Broad Institute and Baylor College of Medicine—have standardized on the DRAGEN platform, Mehio said. The Broad Institute and Baylor College of Medicine have been independent DRAGEN customers. The Broad and DRAGEN have had a joint partnership for methods development; DRAGEN and Baylor were in early talks on work together. The standardization was driven by what the centers wanted to accomplish.
“It’s the centers wanting to produce more variants out of the pipeline,” Mehio explained. “Initially they got the FDA approval for the small variants. Today, we discussed moving the centers to detecting CNVs, SVs, CYP2D6 for pharmacogenomics. Essentially, they would like to explore getting more variants for the All of Us program,” he said. “Now separately the Broad is working with us on liquid biopsy pipeline and tumor/normal pipelines, exomes” and more.
DRAGEN and ML
DRAGEN has more improvements in the works as well, Mehio said, and the team is also focused on improving its compute capabilities. “The volume of sequencing is growing. DRAGEN is very fast, but we’re confident that actually the volume of sequencing will keep growing and we’ll keep up with that.”
In the near future, Mehio added, annotation will be incorporated into DRAGEN, and the platform will soon support long read technologies. In July, Illumina acquired Enancio, a data compression company based in Cesson-Sévigné, France. Enancio’s lossless compression technology will be “very shortly” integrated into DRAGEN and Illumina’s cloud storage platform services, Mehio said.
Mehio is also focused on incorporating machine learning in DRAGEN, “both from the computation perspective and also from discovery perspective.” DRAGEN’s acceleration is driven by FPGAs, which Mehio says are particularly well-suited for inference in machine learning. “They’re not great for training, but luckily in the use cases of secondary analysis, it’s all about inference. You train offline in the cloud or something, and you implement an inference engine. We have actually already invested a lot of building computation capability in DRAGEN for machine learning. Now we’re in the process of basically applying specific algorithms that will yield [needed] accuracy improvements using machine learning.”
These accuracy improvements driven by ML are not included in the v3.7 release, Mehio said, but the technology and the team are “complete[ly] equipped” to take it on. “You’ll hear from us in the future on that.”
Illumina will be applying machine learning very selectively, Mehio said—“less of a hammer approach.” They will be using ML to learn patterns, Mehio said. For instance, “false positives appear to happen because of the following conditions. That gives us some insights on how to solve the problem in a really rigorous, Bayesian, mathematical way.” But machine learning can be problematic, Mehio noted; variant calling with machine learning can be very biased. “We will work very diligently to make sure that’s not the case.”
For discovery, Mehio said the DRAGEN team is applying machine learning to cohorts: “really in the space of mining variants, not just the secondary analysis.” The DRAGEN team is working closely with the BlueBee team, a Dutch bioinformatics company that Illumina acquired in June, to develop capabilities for mining variants for biomarker discovery and drug discovery, he said.
Advancements In A Bottle
Part of the way DRAGEN refines its capabilities is through competitions like the ones hosted by PrecisionFDA and the Genome in a Bottle (GIAB) consortium. Most recently, the organizers announced the results of the Truth Challenge V2: Calling Variants from Short and Long Reads in Difficult-to-Map Regions, such as segmental duplications and the Major Histocompatibility Complex (MHC).
The PrecisionFDA Truth Challenge ran from May 1 to June 15 and attracted 64 submissions from 20 teams using data from the Illumina NovaSeq, PacBio HiFi reads from the Sequel II System, and long reads from the Oxford Nanopore PromethION sequencing technologies. The DRAGEN team won two of the available best performance awards for Illumina data: difficult-to-map regions and all benchmark regions. Seven Bridges won the third performance award for accuracy on the MHC region.
“We have a history of participating in the PrecisionFDA challenges,” Mehio said. “This one, I think, in my opinion, is the most interesting one that they’ve organized. What they did is, they basically evaluated bioinformatics tools for their accuracy, taking into account a much more complete view of the genome. They did not exclude difficult regions. They expanded their evaluation using new truth sets.”
When the full performance data were shared with the participants, Mehio said, “We figured out that we were 28% better on the whole genome than the next best, and 38% better on the difficult regions. Really our margin of victory was significant.”
“The reason we did well is we basically built a graph genome or mapper and improved our variant calling,” he said. The team didn’t spend time building a graph for the MHC region, though Mehio said they are working on it now.
The Truth Challenge V2 gave competitors three truth sets. “But how can you check whether it’s applicable to the whole population?” Mehio asked. Simple: “You can actually measure your mismatch rate in mapping. Essentially you don’t really need to know the truth about variants. All you need to do is measure how well your reads from the samples mapped to your graph reference. If they map better than the linear reference, then that means you have actually fit a better reference that will be better variants.”
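The comparison Mehio describes can be illustrated with a minimal Python sketch. It is not DRAGEN code: the function names, the per-read `(aligned_bases, mismatched_bases)` representation, and the sample numbers are all hypothetical, chosen only to show the idea of comparing mapping mismatch rates between a linear and a graph reference without consulting a variant truth set.

```python
from typing import Iterable, Tuple

# Hypothetical per-read summary: (aligned_bases, mismatched_bases).
Alignment = Tuple[int, int]


def mismatch_rate(alignments: Iterable[Alignment]) -> float:
    """Fraction of aligned bases that disagree with the reference."""
    total_bases = 0
    total_mismatches = 0
    for aligned, mismatched in alignments:
        total_bases += aligned
        total_mismatches += mismatched
    if total_bases == 0:
        raise ValueError("no aligned bases to evaluate")
    return total_mismatches / total_bases


def graph_fits_better(linear: Iterable[Alignment],
                      graph: Iterable[Alignment]) -> bool:
    """True if the same reads mismatch less against the graph reference."""
    return mismatch_rate(graph) < mismatch_rate(linear)


# Made-up numbers for the same three reads mapped to each reference.
linear_alignments = [(150, 3), (150, 2), (100, 4)]
graph_alignments = [(150, 1), (150, 2), (100, 1)]

linear_rate = mismatch_rate(linear_alignments)
graph_rate = mismatch_rate(graph_alignments)
better = graph_fits_better(linear_alignments, graph_alignments)
```

In practice the per-read counts would come from the aligner's output rather than hand-written tuples, and the same comparison could be repeated for each sample or population group to check that the metric improves everywhere.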
Diversity is important, Mehio said: data from multiple ethnic groups should be included, augmenting the graph with population haplotypes if needed, so that mapping metrics “move in the right direction” across all the ethnic groups.