The State of Mutation Curation
April 6, 2011
By Kevin Davies
April 5, 2011 | It is a testament to David Cooper’s drive and perseverance that his Human Gene Mutation Database (HGMD; www.hgmd.org) is the most comprehensive source of human mutation data currently available. “I hope it is a tremendous resource, because we’re not aware of any direct competitors,” says Cooper, a human molecular geneticist at Cardiff University’s Institute of Medical Genetics in Wales.
But that rather depends on what one’s definition of ‘direct’ is. Two new gene mutation resources, the Human Variome Project (HVP) organized by Richard Cotton in Australia, and the MutaDATABASE, conceived by Patrick Willems in Belgium, could eventually provide open access to a trove of gene mutation data and complement, if not compete with, HGMD.
“Currently there is no open-access resource allowing researchers to easily compare their patient data with a comprehensive, up-to-date, readily accessible and mapped list of known disease mutations,” says Daniel MacArthur, a geneticist and author of an influential blog, Genetic Future. “That is, to put it mildly, a huge problem for researchers embarking on projects that will involve sequencing the entire exomes of hundreds or thousands of patients.”
HGMD aims to be a comprehensive inherited gene mutation database, from functional and disease-associated SNPs to gross gene deletions. It does not however include mitochondrial DNA mutations (which are handled by MITOMAP, the brainchild of Doug Wallace in Philadelphia) or somatic mutations (which are handled by COSMIC at the Sanger Institute). To date, HGMD contains more than 110,000 different disease-causing or disease-associated mutations in more than 4,000 different human genes.
In January, Cooper signed a new five-year deal with German biological databases and software company BIOBASE, making HGMD perhaps the first major genetics database to be self-sustaining. But Cooper’s reliance on private funding, which began because he was originally denied funding from the Wellcome Trust ten years ago, means that access to any HGMD data less than two and a half years old requires a subscription.
“I’m frequently criticized for our funding model, but we cannot obtain money from the public sector, and if there were to be none from the private sector, then there would be no HGMD,” says Cooper. “I’ve always wanted in my heart of hearts to stay in the public domain. But since someone has to foot the bill, we’d like industry and commerce to contribute the bulk of the upkeep costs so as to keep the costs of academic subscriptions at a minimal level. This is BIOBASE’s view as well.”
Prior to re-signing the agreement with BIOBASE, Cooper almost struck a deal for public funding from the National Center for Biotechnology Information (NCBI) at NIH. Cooper was hoping for a long-term contract but the NIH was unable to guarantee long-term funding. “We would only have been guaranteed a contract for three years, at which time we would be invited to re-apply for further funding (with absolutely no guarantees as to our likely success),” says Cooper. “As you can imagine, the HGMD group members were less than totally enthusiastic at this prospect.” Cooper had no interest in a repeat of his early experiences with the Wellcome Trust or potentially losing responsibility for the HGMD data once they were in the public domain. Cooper politely declined the NIH offer.
NCBI’s James Ostell would not comment on the negotiations, other than to note that discussions reached the highest levels of NIH. He offered a prepared statement to Bio-IT World: “NIH appreciates the value of having fully public data resources for the medical genetics community. We did discuss with HGMD the possibility of public funding in return for free public access, but unfortunately we were not able to conclude an agreement. The major obstacle was an expressed desire by HGMD for funding guarantees into the future. NIH itself does not have guaranteed funding levels, and cannot make such guarantees to any group it funds.”
Lingering questions over HGMD’s funding model and the confidence in the data annotation—evidenced in part by recent experiences of individuals such as Illumina CEO Jay Flatley, who have screened their genome sequences against HGMD—provide additional opportunities for new databases. Two such genome databases gaining traction are the Human Variome Project (HVP), which is seeking to develop thousands of locus-specific mutation databases (see, “Gene Data’s Aussie Rules”), and the MutaDATABASE (see, “The MutaDATABASE”).
Cooper is diplomatic in assessing the changing landscape: “Whether this [HVP] idea is viable or not remains to be seen. It is also unclear to me what it is that the HVP thinks it can add to what is already being provided by existing databases such as HGMD, OMIM, dbSNP, mitoMAP, COSMIC etc.”
Cooper providing a platform for HGMD subscribers to access HGMD data from within MutaDATABASE. “With HVP and mutaDATABASE apparently vying with each other to set up two very similar (and confederated) sets of mutation databases in parallel, the scientific community is likely to become at best confused, at worst turned off completely! In my opinion, both entities urgently need to attempt to sit down and work out ways in which they can be seen to be reading from the same hymn-sheet.”
In the Beginning
Cooper’s interest in cataloguing gene mutations dates back to the mid-1980s, when he performed meta-analyses with various researchers including medical geneticist Hagop Youssoufian (now the chief medical officer at ImClone). Cooper met him in 1986 while visiting the late Victor McKusick, the legendary creator of another key genetics resource, OMIM (Online Mendelian Inheritance in Man) at Johns Hopkins in Baltimore. “OMIM is a fantastic knowledge base for inherited diseases and traits and as a reference for genotype-phenotype relationships,” says Cooper. “But it’s a narrative database—a few noteworthy variants [per gene] that tell a nice story. It was never their intention to collect all mutations.”
Cooper and Youssoufian were both interested in genome methylation (Cooper had done his PhD with noted epigeneticist Adrian Bird). In 1988, the pair published a highly cited paper that, although based on a “ridiculously small” sample size, documented that about 30% of single base-pair substitutions fell within CpG dinucleotides. Together with Michael Krawczak, Cooper started performing meta-analyses on the frequency and spatial distribution of DNA mutations in the genome.
“I had been taught that mutation was essentially a random process—cosmic rays, bad luck etc.,” he says. “I didn’t actually believe it, especially when I started looking at real mutation data. We soon found deletion and indel hotspots in addition to the CpG hotspot. It was our fascination with the non-randomness of human gene mutation and what that could potentially tell us about underlying mutational mechanisms that propelled us into producing datasets that later became a conglomerate: HGMD.”
But Cooper’s initial efforts to obtain funding from the Medical Research Council and the Wellcome Trust (twice) were denied, even though there was an initiative at the time to facilitate the establishment of British-based genetics databases. “The suggestion of one referee was, ‘Try working with Aberystwyth!’” Cooper recalls. “We were quite capable of doing this on our own without any help from Aber, thanks very much.”
Cooper concluded that the public sector was, if anything, a more capricious funder than anything he was likely to find in the private domain. In 2000, he visited Craig Venter’s company, Celera Genomics, and returned home with “a large sum” that funded a 5-year partnership. Celera incorporated HGMD into its Celera Discovery System. But after Craig Venter left, Celera shifted focus and ceased to be a gene discovery operation.
“We scouted around for a new partner and teamed up with BIOBASE,” recalls Cooper. It was a much more interactive working relationship than with Celera. BIOBASE now pays the salaries of Cooper’s five HGMD curators, who are responsible for keeping up the data and developing proprietary software for data interrogation. In terms of its utility to industry, Cooper points to a deal with Knome, which is incorporating HGMD data into its personal genomics system.
The free public version of HGMD has some 35,000 registered users who are only required to provide their name and institutional e-mail address (to prevent folks from industry logging in on their home addresses). Of course, anyone can subscribe to the Professional version of the database, which provides the most up-to-date data and advanced search functionalities, and many academic groups already do. “For the genome projects, the cost is a drop in the ocean. Several thousand dollars a year doesn’t seem excessive,” says Cooper. “Alternatively, they can enter into an academic collaboration with us. For example, we’ve worked with Richard Gibbs on the chimpanzee and rat genomes and been included as co-authors on the papers as a result.”
HGMD Professional allows users to run some fairly sophisticated searches, for example, to obtain all known intron or exon splice enhancer mutations or all gain-of-phosphorylation mutations (both predicted and empirically demonstrated). Only the subscription-based HGMD Professional provides access to the most recent two and a half years’ worth of data, but about two thirds of the HGMD data entries are still available free to registered users, who cannot, however, download them en masse or repackage them on their own websites. The Professional version adds further options, including searches by chromosome or by disease. That version has also been merged with BIOBASE’s own TRANSFAC database, enabling any promoter mutation in a known TATA box, for example, to be returned.
Curation is what Cooper calls “a semi-manual process. It’s not something you can do exclusively by computer; you can’t automate it fully for a host of reasons.” OMIM, for example, contains examples of somatic mutations and neutral polymorphisms. “That’s the risk of automation.” HGMD scans about 150 journals manually, plus around 5,000 other journals indexed in PubMed. “The challenge for us is to include functional polymorphisms without including noise,” says Cooper. “Put it this way: I’ve never claimed that we’re 100% comprehensive, but we’re as near as damn it!”
Is It or Isn’t It?
The HGMD curators constantly grapple with the question: when is a mutation not a mutation? Cooper notes that there are a thousand different examples in the human genome of nonsense SNPs that are present in the genes of apparently healthy individuals. Some may have occurred in non-essential genes, whereas others could have been rescued by copy number variation.
As new data emerge, the HGMD curators are able to reassess early conclusions as to the pathological authenticity of previously catalogued mutations if, for example, they are subsequently found in the genomes of healthy individuals. “Sometimes we remove variants completely from HGMD if we conclude that the evidence for pathological involvement is not as convincing as first believed,” says Cooper. “Other times, we simply add secondary references that alert the user to differences of opinion concerning the pathological authenticity of a given variant.”
Establishing the pathogenicity of any given variant raises what Cooper calls the “extremely important—but scarcely addressed—questions of incomplete penetrance and variable expressivity.” Mutations may not always have the identical effect in different individuals. “What else, after all, is complex disease? In my opinion, it would be a serious error to exclude all mutations from HGMD simply because they were sometimes found in the genomes of healthy individuals.”
Cooper enjoys an excellent working relationship with BIOBASE, which appears to provide HGMD with a measure of long-term funding and spares Cooper from continually having to write reports and applications. At the end of each contract period, the mutation data remain the property of Cardiff University, which keeps Cooper’s options open. “Only the scientific community loses,” he says, “because access to the most up-to-date mutation data and search programs are only available to HGMD Professional subscribers.”
Cooper still keeps an open door for publicly funded support but can’t help but be a little distrustful. He admits to being intrigued still as to why the Wellcome Trust declined to offer funding ten years ago, an oversight he calls “a missed opportunity to keep us in the public domain. Bizarrely, we are now contributing mutation and polymorphism data to the 1000 Genomes Project, through a collaboration with the Wellcome Trust Sanger Institute. All’s well that ends well, I suppose.” •