AI System For Diagnosing Rare Diseases and Solving Medical Cold Cases

By Deborah Borfitz

May 14, 2024 | Using an artificial intelligence (AI) model that was trained on millions of variants from identified genetic disorders and incorporates the complex decision-making process of human molecular scientists, investigators at Baylor College of Medicine are on track to automate the diagnosis of an enormous number of undiagnosed conditions. Their aim is to prioritize the genes and variants for Mendelian disorders based on the clinical features and sequencing profiles of patients, according to Pengfei Liu, Ph.D., associate professor of molecular and human genetics at Baylor College of Medicine and associate clinical director at Baylor Genetics.

It typically takes hours in a clinical lab to look at the sequencing data of just one patient, and hundreds if not thousands of cases are being processed monthly, he says. And the data needs to be reanalyzed “again and again” because hundreds of novel disease genes are being reported every year.

Diagnosing the pool of unsolved cases that accumulates over time in a timely and accurate manner has therefore been extraordinarily difficult. The Baylor team’s AI-MARRVEL (AIM) system is not the first AI tool meant to improve the diagnostic yield, but it has succeeded in doubling the number of solved cases relative to existing methods across three different real-world cohorts, as reported recently in NEJM AI (DOI: 10.1056/AIoa2300009).

AIM is built on the infrastructure of MARRVEL (model organism aggregated resources for rare variant exploration), previously described by investigators (AJHG, DOI: 10.1016/j.ajhg.2017.04.010). The training dataset contains more than 3.5 million variants from thousands of diagnosed cases.

Getting AIM to produce a ranking of the most likely gene candidates causing a rare disease requires only inputting a patient’s exome sequence data and symptoms, explains Zhandong Liu, Ph.D. (unrelated to Pengfei), associate professor of pediatrics at Baylor College of Medicine and an investigator at the Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital. A key benefit of the algorithm is that researchers can adjust the range and severity of phenotypes to see the impact on prediction outcomes.
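To make the idea of adjusting the range of phenotypes concrete, here is a minimal sketch in Python. It is not AIM's scoring method, which the article does not describe. It assumes a simple Jaccard overlap between a patient's symptom list and the symptoms linked to each gene, and the gene names, symptom sets, and function names are all hypothetical. Severity is not modeled here.

```python
def jaccard(a, b):
    """Overlap between two term sets: shared terms divided by all terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_genes(patient_terms, gene_terms):
    """Order genes from best to worst phenotype overlap (ties broken by name)."""
    scores = {gene: jaccard(patient_terms, terms) for gene, terms in gene_terms.items()}
    return sorted(scores, key=lambda gene: (-scores[gene], gene))


# Hypothetical gene-to-phenotype associations, for illustration only.
GENE_TERMS = {
    "GENE_A": {"seizures", "intellectual disability", "microcephaly"},
    "GENE_B": {"cardiomyopathy", "muscle weakness"},
    "GENE_C": {"seizures", "muscle weakness", "hearing loss"},
}

# The same patient described with a narrow and a broader symptom list.
narrow = {"seizures", "microcephaly"}
broad = narrow | {"muscle weakness", "hearing loss"}

narrow_ranking = rank_genes(narrow, GENE_TERMS)
broad_ranking = rank_genes(broad, GENE_TERMS)
print(narrow_ranking)
print(broad_ranking)
```

In this toy setup, widening the symptom list moves a different gene to the top of the ranking, which is the kind of sensitivity a user might explore by changing the phenotype inputs.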

For academic use, the AI system is free and available online for anyone to try, adds Zhandong. It is “just one milestone” in an anticipated string of ever more powerful AI tools to be based on clinical big data from Baylor Genetics, a joint venture of Baylor College of Medicine (top department of human genetics based on funding by the National Institutes of Health) and H.U. Group Holdings, Inc. (leading clinical lab in Japan).

Among the flagship solutions of the academic-commercial hybrid, founded in 2015, are rapid whole genome sequencing and whole exome sequencing for patients with rare diseases. Baylor Genetics is headquartered in Houston’s Texas Medical Center and serves customers in 50 states and 16 countries.

Knowledge Build

Mendelian diseases, by definition, are caused by one or a few genetic variants in a single gene. They may be seen on a scale of one in a million individuals, says Pengfei, but collectively represent thousands of diseases and tens of millions of individuals.

Efforts to provide a clinical diagnosis for these patients typically begin with sequencing their entire genome, Pengfei continues. Baylor set up its clinical sequencing program in 2011, and the initial offering was clinical exome sequencing tests for children with unexplained genetic disorders such as intellectual disability or autism.

The Herculean task has always been interpreting the relative importance of the millions of detected variants to find the one that would yield a diagnosis, as well as to facilitate ongoing reanalysis, says Pengfei. This is where AIM came in.

The Baylor team was aided in developing the data analysis tool by scientists with the Undiagnosed Disease Network (UDN), an ongoing research project funded by the National Institutes of Health. Baylor is one of 14 clinical sites in the U.S. where UDN participants are evaluated.

The UDN patient dataset was one of the three utilized in the latest study evaluating AIM, the other two coming from Baylor Genetics (Clinical Diagnostic Lab) and the Deciphering Developmental Disorders (DDD) study underway in the U.K. For the evaluation exercise, application scenarios included dominant, recessive, trio diagnosis, large scale reanalysis, and novel disease gene discovery.

AIM was trained on existing knowledge using real-world clinical data, says Pengfei, but there are of course genes that aren’t yet known to cause disease. However, the knowledge base in the algorithm equips a computer to discover the novel genes by looking at all the other phenotypic features.

The researchers tested AIM’s clinical exome reanalysis on a dataset of UDN and DDD cases and found that it was able to correctly identify 57% of diagnosable cases out of a collection of 871 cases. They also designed an algorithm used in conjunction with AIM to identify two new disease genes, one (MYCBP2) recently discovered in a cohort of eight patients with a neurodevelopmental disorder and another (TMEM161B) appearing to play a role in the developing central nervous system.

These data are being actively investigated by researchers and confidence is high that the disease genes will soon be entering the public domain, Pengfei says.

Starting Point

The MARRVEL engine launched in 2018, the result of a collaboration between investigators at the Baylor College of Medicine and the UDN, says Zhandong. Previously, processing an undiagnosed case was a labor-intensive undertaking that required sifting through an assortment of databases to synthesize information in hopes of understanding the nature of a mutation, the impacted organ, and the culprit gene and its molecular function.

All that information is now available via a single visit to MARRVEL and, importantly, the incoming data is systematically updated, he says. Using AI to enable prediction of the most likely problem-causing mutation behind patients’ symptoms was the next logical step in minimizing their diagnostic odyssey.

Development of AI-MARRVEL began by interviewing clinical geneticists at Baylor to better understand their thinking process and thus how a computer might be taught to mimic that human decision-making logic to render a prediction that is “as accurate as possible,” says Zhandong. The tool relies on a commonly used random forest machine learning method designed to combine different parameters to reach a single result.
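The general idea of a random forest, many small decision trees trained on resampled data whose votes are averaged into one score, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the AI-MARRVEL model: each tree is reduced to a single split, and the three features, the training values, and the candidate names are hypothetical.

```python
import random

# Hypothetical per-variant features, each scaled to 0..1 (not AIM's real inputs):
# [conservation, rarity in the population, match to the patient's phenotype]
TRAIN_X = [
    [0.90, 0.95, 0.80], [0.80, 0.90, 0.90], [0.85, 0.99, 0.70], [0.70, 0.92, 0.85],
    [0.20, 0.30, 0.10], [0.30, 0.10, 0.20], [0.10, 0.40, 0.30], [0.40, 0.20, 0.15],
]
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = diagnostic variant, 0 = not diagnostic


def train_stump(rows, feature):
    """Fit a one-split tree on one feature; each leaf stores its positive fraction."""
    values = sorted({x[feature] for x, _ in rows})
    base = sum(y for _, y in rows) / len(rows)
    # Fallback when the sample offers no split point: one leaf for everything.
    best = (feature, float("inf"), base, base)
    best_errors = len(rows) + 1
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        # Errors made if each side predicts its own majority label.
        errors = min(sum(left), len(left) - sum(left)) + min(sum(right), len(right) - sum(right))
        if errors < best_errors:
            best_errors = errors
            best = (feature, threshold, sum(left) / len(left), sum(right) / len(right))
    return best


def train_forest(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    rows = list(zip(X, y))
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(X[0]))
        forest.append(train_stump(sample, feature))
    return forest


def score(forest, x):
    """Average the leaf values of all trees into a single score between 0 and 1."""
    votes = [p_left if x[f] <= t else p_right for f, t, p_left, p_right in forest]
    return sum(votes) / len(votes)


forest = train_forest(TRAIN_X, TRAIN_Y)
CANDIDATES = {"variant-1": [0.15, 0.20, 0.10], "variant-2": [0.88, 0.97, 0.90]}
ranking = sorted(CANDIDATES, key=lambda name: (-score(forest, CANDIDATES[name]), name))
print(ranking)
```

A production system would normally use full-depth trees from an established library rather than single splits. The sketch only shows how several parameters can be combined into one ranked result.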

AI-MARRVEL was benchmarked against five existing algorithms frequently cited in the literature, Zhandong notes, most extensively Exomiser—a Java application that finds potential disease-causing variants from whole-exome or whole-genome sequencing data—and Genomiser, an extension of Exomiser that associates regulatory variants to specific Mendelian diseases.

Case for Reanalysis

The proportion of genetic disease cases that remain unresolved in the clinic after an initial round of clinical exome sequencing averages about 60%, says Pengfei. One major contributor to the cold case problem is that “the mutation we are trying to find is actually in the data but... we can’t analyze it well enough.”

The implication is that undiagnosed cases can be reanalyzed in subsequent years using the most up-to-date system, in which more variants have been annotated as pathogenic, Pengfei says. Periodic reanalysis, which is one of the use cases proposed for AIM, can result in new molecular diagnoses over time, as he and his colleagues pointed out in a 2019 article in the New England Journal of Medicine (DOI: 10.1056/NEJMc1812033).
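The logic of periodic reanalysis can be sketched as a comparison between two annotation snapshots: the stored variants of each case are checked again once the list of variants annotated as pathogenic has grown. The Python below is a simplified illustration, not the Baylor pipeline, and the case identifiers, variant identifiers, and function names are hypothetical.

```python
def solved_cases(cases, pathogenic):
    """Map each case to the variants it carries that are annotated as pathogenic."""
    result = {}
    for case_id, variants in cases.items():
        hits = sorted(set(variants) & pathogenic)
        if hits:
            result[case_id] = hits
    return result


def newly_solved(cases, old_annotations, new_annotations):
    """Return cases with no hit under the old snapshot but a hit under the new one."""
    before = solved_cases(cases, old_annotations)
    after = solved_cases(cases, new_annotations)
    return {case_id: hits for case_id, hits in after.items() if case_id not in before}


# Hypothetical stored sequencing results: case -> variants found at first analysis.
CASES = {
    "case-1": ["var-101", "var-205"],
    "case-2": ["var-309"],
    "case-3": ["var-412", "var-518"],
}
OLD_ANNOTATIONS = {"var-101"}                        # known at first analysis
NEW_ANNOTATIONS = {"var-101", "var-309", "var-777"}  # after later updates

cold_case_hits = newly_solved(CASES, OLD_ANNOTATIONS, NEW_ANNOTATIONS)
print(cold_case_hits)
```

Here the variant in the second case was already present in the stored data and only became reportable once the annotations caught up, which mirrors the cold case scenario described above.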

Unfortunately, locating patients to share the news with years after their genome was sequenced is not always easy, says Zhandong. “This is a separate challenge that we have to face after we use these sophisticated programs to ensure we find the diagnosis... in the most efficient manner.”

AIM could help with this downstream issue of delivering results back to patients, adds Pengfei. With just a few clicks, patients themselves could opt to upload their own genome sequencing files through the AI-MARRVEL website for analysis since the Baylor team has identified the handful of parameters yielding the highest performance in ranking diagnostic variants.

Zhandong says he has over the years been emailed questions from parents who were bravely searching for answers to their child’s mysterious illness on the original MARRVEL website. “I believe this new tool will give the power back to patients and their families... without having to completely rely on genome centers.”

Indeed, AIM is expanding access to people without the domain expertise of large diagnostic labs, including patients as well as researchers, physicians, and even some hospital-based geneticists, says Zhandong. Any stakeholder can run the online system with the “tuning parameters,” or control knobs, of their choosing.

Forward View

In a research capacity, current users of AIM include Michael Wangler, M.D., one of Pengfei’s colleagues at the Baylor College of Medicine, who is endeavoring to make genomic medicine more accessible and useful for underserved minority communities in Texas. Baylor Genetics is the sequencing partner for the project.

Caleb Bupp, M.D., division chief of medical genetics and genomics at Corewell Health (formerly Spectrum Health Helen DeVos Children's Hospital), has also been using AIM as part of his efforts to initiate rapid whole genome sequencing, improve the diagnostic yield of testing, and help facilitate increased access to testing in Michigan through Project Baby Deer.

Baylor Genetics is currently looking at the potential of moving AIM into clinical use, says Pengfei. The dataset from its Clinical Diagnostics Lab is by far the largest of the three training datasets used in the development of AIM, accounting for over 1,000 patients in the training and testing sets combined.

AIM could become a component of laboratory developed tests for rare disease diagnostics being used in clinical labs elsewhere once clinical validation work is completed in those settings, he adds. In this scenario, “we would analyze the [sequencing] data... and decide which is the more reportable variant in the latter part of the workflow.”

The “semi-automated” algorithm presented in the New England Journal of Medicine paper emphasizing the importance of reanalysis of clinical exome sequencing data was recognized by the National Human Genome Research Institute as one of the 10 most important findings in 2019 for genomic medicine, says Pengfei. In a subsequent interview about the approach, he said the next step is to fully automate the analysis process—an “exciting direction we are [quickly] moving toward.”