Deanna Church on the Reference Genome: Past, Present and Future
April 22, 2013 | Of the many thousands of researchers who have contributed sequence or analysis to the hallowed Reference genome—the hybrid genome assembly, stitched together from some 50 anonymous DNA donors, that emerged from the Human Genome Project—few have spent as much time gazing upon the Reference as Deanna Church, from her office at the National Center for Biotechnology Information (NCBI), where she is currently Coordinator of Variation Resources.
After training as a gene mapper at UC Irvine, Church planned to become a mouse developmental biologist, but those plans suffered a setback when she developed “a pretty severe mouse allergy.” Figuring that computer mice were a healthier career path than live mice, she moved to NCBI 14 years ago to help manage data for the Mouse Genome Project. She later took the helm of the NCBI group within the Genome Reference Consortium (GRC), which assumed management of the Reference genome in the aftermath of the Human Genome Project (HGP). Leadership of the GRC has since passed to Valerie Schneider, while Church continues to work on the reference assembly and to improve genome variation resources.
With this year marking the 10th anniversary of the HGP, Bio-IT World editor Kevin Davies spoke with Church over Skype to hear her personal assessment of the current state of the Reference, the progress of the past decade, and priorities for the future.
Bio-IT World: Deanna, let’s start by asking what is the GRC?
Church: The GRC is the Genome Reference Consortium, which was formed a couple of years after the end of the Human Genome Project. Ewan Birney (EBI) and I were at a copy-number meeting at the Sanger Institute—all these groups were doing large-scale variation discovery but were finding problems with the reference assembly, and nobody knew whom to talk to about getting them fixed. There was really no mechanism for doing this, because the centers that had been funded to do the sequencing were no longer funded for that work.
And so the GRC was formed. It consists of Washington University in St. Louis and the Sanger Institute, the centers with a lot of knowledge of generating the assembly in the first place, as well as EBI (the European Bioinformatics Institute) and NCBI, which provide more of the bioinformatics support.
Because the HGP was a very distributed project—different labs were responsible for different chromosomes—the data was distributed around the world, a lot of it sitting in postdocs’ and grad students’ notebooks. So we’ve taken all this data and put it into a central repository. Anybody can go to the website and see the current state of the genome, the clone tiling path, and the regions we’re reviewing to see if they need improvement. Instead of the genome just being 24 sticks, we now have regions of the genome where we can represent more than one [sequence] path, because we know that there’s no way to make a coherent consensus sequence.
The MHC is the prototypical poster child for this. We have one allele of the MHC that is incorporated into the chromosome, but then we have seven other tiling paths available that are aligned back to the chromosome so you can understand the chromosome context. Really, what we’re providing you with is allelic diversity.
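For readers who work with the assembly programmatically, the alternate-locus model Church describes can be pictured as a primary chromosome path plus extra tiling paths aligned back to a span of that chromosome. Here is a minimal sketch in Python; the class names, alt-locus names, and coordinates are all invented for illustration, not real GRC identifiers:

```python
from dataclasses import dataclass, field

@dataclass
class AltLocus:
    """An alternate tiling path, aligned back to the primary chromosome."""
    name: str         # invented name; real alt loci carry GenBank accessions
    region: str       # the named region this path belongs to, e.g. "MHC"
    chrom: str        # chromosome providing the alignment context
    chrom_start: int  # start of the aligned span on the chromosome (placeholder)
    chrom_end: int    # end of the aligned span on the chromosome (placeholder)

@dataclass
class Region:
    """A curated region of the chromosome assembly carrying extra allelic diversity."""
    name: str
    chrom: str
    alt_loci: list = field(default_factory=list)

# The MHC: one allele is embedded in chr6 itself; seven other tiling
# paths are carried as alternate loci aligned back to the same span.
# Coordinates below are placeholders, not real GRC region boundaries.
mhc = Region(name="MHC", chrom="chr6")
for i in range(7):
    mhc.alt_loci.append(AltLocus(
        name=f"MHC_ALT_{i + 1}", region="MHC", chrom="chr6",
        chrom_start=28_000_000, chrom_end=34_000_000))
```

The key point is that each alternate path carries an alignment back to the chromosome, so anything annotated on an alternate locus can still be interpreted in chromosome context.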
Could you compare and contrast the state of the Reference back in 2003 to the state of the Reference today?
That is a really hard question, because what we’re learning is that we really don’t understand a lot of human diversity. One of the things we’re thinking about with the reference assembly is how we create this sort of human pan-genome, because while we want to represent chromosome assemblies, we also want to represent diversity in regions where we know there might be population-specific stratification, and those regions tend to be very hard to deal with. While we’re making some progress in that area, we have a long way to go.
With respect to the main chromosome assemblies, one of the things we have learned since 2003 is how complex genomes are. We know, for example, that a certain number of the gaps in the current reference are actually caused by mixed haplotypes in the reference assembly. Even within a single DNA library, if the donor is heterozygous for a structural variant, you can mix those haplotypes, and that often leads to gaps.
One of the prototypical examples of this is the MAPT region [on 17q21]. There’s a very large inversion polymorphism segregating through different populations. It turns out the [major Reference] donor RP11 is an H1/H2 heterozygote—he has one H1 allele and one H2 allele. In earlier versions of the reference assembly we mixed those two alleles. But Evan Eichler and Mike Zody did all of the hard work of haplotype-sorting the RP11 BAC clones, so we now have an H1 haplotype integrated into the chromosome as well as a representation of the H2. [See http://www.ncbi.nlm.nih.gov/pubmed/19165922]
Can you provide a quantitative or qualitative comparison of today’s assembly versus the Reference we had 10 years ago?
It’s really hard to do on a quantitative metric, because sometimes when we close a gap we open another one, when we find there was a bad join put together or something like that. In 2003, we had a very high-quality reference assembly, but each of the chromosomes was produced by a different lab; it was much more of a research project. One of the things the GRC has done is productize the reference assembly... We now have a single set of software that produces all of the alignments. We store these alignments in a database so that they’re readily available and curators can go in and curate them… We’ve taken the genome from being a research project to being a product that is very consistent and, we hope, reliable. One of the things we’re going to see with the upcoming release is the incorporation of a lot of the data we’ve gotten from the past 10 years of people analyzing the reference assembly.
When is the new Reference being released and how significant is the leap over the current version?
The release will be called GRCh38 and our plan is to have a data freeze in early August. We would anticipate that that assembly would be deposited to GenBank in early September. At that point it would be available for all of the genome browsers to pick up and start their annotation runs.
One of the things we’ve been doing since GRCh37 is a patch process. We do quarterly updates when we have regions where we know we’ve made an improvement, for instance, where we’ve fixed a gap. You can think of these as special alternate loci, or as a preview of what GRCh38 is going to look like. It’s not the whole picture, because we don’t release a patch for every fix, for many reasons. But there are over 111 of these regions right now. Some of them fix very small things, maybe a single base pair, whereas others are complete retilings of a region.
If you look at the alignments of those patches to the chromosomes, it looks like we’re adding at least 4 to 5 megabases of sequence that’s not in the current reference assembly. We expect that number to go up as we continue to work on adding sequence to the reference assembly. These sequences do include genes: if you take the sum of the fix patches plus the novel patches and alternate loci, there are about 180-190 genes unique to these sequences. They’re not on the chromosome assembly at all. When we do the update, all of the fix patches will be incorporated into the primary assembly, so some of those genes will find a home. But some of those genes will continue to live only on the alternate loci, because they’re in places where we really can’t generate a good consensus…
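The patch lifecycle Church outlines, with fix patches folded into the chromosomes at a major release while novel patches persist as alternate loci, can be summarized in a few lines. This is a conceptual sketch only, not GRC tooling, and the patch names are invented:

```python
from enum import Enum

class PatchType(Enum):
    FIX = "fix"      # corrects an error; folded into the chromosomes at the next major release
    NOVEL = "novel"  # adds an alternate representation; persists as an alternate locus

def major_release(patches):
    """Model of a major release: fix patches join the primary assembly,
    novel patches are carried forward as alternate loci."""
    merged_into_primary = [name for name, kind in patches if kind is PatchType.FIX]
    alternate_loci = [name for name, kind in patches if kind is PatchType.NOVEL]
    return merged_into_primary, alternate_loci

# Invented patch names, for illustration only.
patches = [("PATCH_A", PatchType.FIX), ("PATCH_B", PatchType.NOVEL)]
merged, alts = major_release(patches)
assert merged == ["PATCH_A"] and alts == ["PATCH_B"]
```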
One of the things everyone liked to talk about in 2003 was ‘the Golden Path,’ because we really thought the bulk of the variation was going to be SNPs… [We said] ‘Oh well, we can make one Golden Path and then just annotate the variation.’ And now we know that there are regions sufficiently diverse that we can’t make a Golden Path. Even though the reference assembly will continue to be a composite for the foreseeable future, at any given locus you want the allele to represent something that’s carried by at least some person on the planet. And if you make a consensus in these highly diverse regions, you just end up with a ‘franken-allele,’ as opposed to something that actually exists in the population.
The original reference genome was built largely from RP11, and I understand about 50 different individuals are represented in the current reference assembly. For the latest updates, are you retaining the same donors?
No, we have brought in some new donors. Probably the most significant one is a BAC [bacterial artificial chromosome] library that was made by Pieter de Jong, called CHORI-17. This is a library generated from a hydatidiform mole resource. (A hydatidiform mole is generated when you get a sperm that fertilizes an enucleated egg and the paternal genome duplicates. This source DNA is a single haplotype, so you can make a library that comes from one haplotype.)
For many of these regions, we know that there are large-scale duplications and structural variations, and the chances of finding a single individual with the same haplotype at both of those locations are slim. With this single-haplotype resource we can retile a lot of these complicated regions, and many of the very complex regions are being replaced by it.
Are some gaps in the Reference still proving more or less impossible to sequence because of their highly repetitive DNA?
Well, some of the [unsequenced] regions clearly are very repetitive, and it’s not as if the repetitive sequence and the structural variation are completely independent. There’s a very large overlap between regions of segmental duplication and regions where you see recurrent structural variation. So we know that a lot of the regions we’re working on require a lot of mapping, longer reads, and a lot of manual intervention.
One of the more interesting stories that’s going to be incorporated into GRCh38 concerns a family of genes on chromosome 1 called the SRGAPs. There were a couple of papers published in Cell in May 2012, including one from Eichler’s lab [DOIs: 10.1016/j.cell.2012.03.033 and 10.1016/j.cell.2012.03.034]. For this gene family, the grandmother gene sits at 1q32, and it’s not represented well: it’s only a partial gene, right next to a gap in GRCh37. This gene has duplicated twice, so one duplication went to 1q21 and the other went to 1p21. The duplications are also human specific, so you only see the 1q32 copy in other primates.
What’s interesting about this is that the [proteins encoded by the] SRGAP genes inhibit neuronal outgrowth, but the granddaughter gene seems to act as a dominant negative. So when the granddaughter gene is expressed, it allows neuronal outgrowths to grow longer. This duplication happened right around or shortly after the human-chimp divergence, so this might be a very interesting gene to look at with respect to neurocognitive development in humans...
All these people screening for mutations in autism or other disorders have likely been missing these genes because they haven’t been part of the reference assembly. They have been released as patches… but I think these genes are largely being ignored in a lot of the screening. So that’s one of the interesting stories that will be coming up.
So to be clear, there’s still information of medical utility to be gleaned from this process of patching and wrapping things up?
Absolutely. We know a lot of the segmental duplications are lineage specific, and that does make them interesting from the viewpoint of neurocognitive development… [Take] the chromosome 1q21 region, which is horribly misassembled right now; several neurocognitive phenotypes map to that region. Thanks largely to the hard work of Evan Eichler’s lab and Tina Graves at Wash U, we’ve retiled that region, so we think we have one complete representation of it. The problem is that the region is very polymorphic and prone to rearrangement, so we need many, many more representations to really understand it… The biology of this region is really fascinating and really hard to get at with current sequencing technology.
Is there any portion of the genome that is not amenable to current sequencing technologies, where, say, a long-read method might sequence some region that is really stymieing other approaches?
Let’s take centromeres and telomeres out of the mix, because they’re still very problematic to deal with… There’s a difference between being able to sequence something and being able to analyze something. There seems to be some evidence that [newer technologies] can sequence even the centromeres and the telomeres, but we can’t really analyze that data, because especially with the Illumina data you really need a reference assembly to get good analysis out of it. And even PacBio and some of these longer-read technologies are still not really long enough to deconvolute those repeats…
We’re working on getting some PacBio data soon. If you look at the strict accessibility mask that was developed by the 1000 Genomes Project, I believe maybe 70 to 75% of the genome was easily accessible to Illumina… If you loosen the criteria, you can get more of the genome; I believe Richard Durbin says about 85 to 90% of the genome was accessible. What you consider accessible depends on the algorithms you’re using. More of the genome becomes accessible as read lengths get longer, but it’s still a moving target.
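Accessibility figures like these are straightforward to recompute from a mask. The 1000 Genomes accessibility masks are distributed as BED-style interval files, and the accessible fraction is just the summed interval length divided by the genome length. A minimal sketch, with a hypothetical file name and an approximate genome length:

```python
def accessible_fraction(bed_path, genome_length):
    """Fraction of the genome covered by mask intervals.

    Assumes a BED-style file of half-open [start, end) intervals,
    one per line: chrom, start, end. Intervals are assumed
    non-overlapping, as accessibility masks typically are.
    """
    covered = 0
    with open(bed_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip headers or malformed lines
            covered += int(fields[2]) - int(fields[1])
    return covered / genome_length

# Hypothetical file name; ~3.1 Gb is a rough genome length.
# A strict mask should land near the 70-75% Church cites;
# looser masks push the fraction toward 85-90%.
# frac = accessible_fraction("strict_mask.bed", 3_100_000_000)
```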
Is there an endpoint in sight? Will there be a point where you say ‘That’s it, we’ve got the definitive sequence’?
I’m actually optimistic about the fact that, at some point, we’ll really be able to sequence through these structures. I wouldn’t want to make a bet on when that’s going to happen, but I think it will!
I think at some point we will get to [an endpoint], but I don’t think it’s GRCh38. Even though we’ve worked really hard to correct a lot of the problems we know about, most of the remaining errors that need to be corrected are in regions that require a lot of manual effort. Even when we’re sequencing with new technologies, we still have to tile clones so that we can limit the complexity of the sequence mixtures.
I can see at least one additional update to the reference assembly in which we’re still making corrections to the chromosome assemblies. But you can certainly imagine a point where the chromosome assembly is pretty good; then you could continue to add diversity by adding alternate alleles, so you wouldn’t have to disrupt the chromosome coordinates per se. You could just add sequence diversity from other populations as you needed it.
The majority of the Reference comes from an anonymous donor known as RP11. We hear investigators including Stephan Schuster, Pieter de Jong and David Reich refer to the likely African-American ancestry of RP11. Is this a widely known conclusion?
I’m not sure that it’s common knowledge. Most people who are aware of how the genome was put together are well aware that RP11 is of African-American ancestry. Stephan gave a talk at AGBT this year and presented some of the data supporting the notion that the RP11 donor was African-American, and it has been discussed since David Reich’s paper came out. But there are a lot of people who use the genome now who don’t even think about the fact that it’s clone-based, or about how the donors went into it. So I’m not sure it’s even something that a lot of people think about…
I think it’s great that we have a donor that’s admixed to a certain degree. Of course, this explains why some of these regions have been really hard to put together… Because of this admixture, RP11 is heterozygous at a lot of loci, which has complicated the assembly. But now that we know that, it’s a bit easier for us to deal with.
Stephan Schuster is resequencing RP11 using both 454 and Illumina platforms. Is this in conjunction with GRC to provide deeper data on RP11?
Valerie Schneider and I have been working with Stephan as well as Wash U… We’re looking at two things. The goal of the project is to have a complete, independent RP11 assembly that will live beside the reference, for a lot of reasons… It’s pretty clear from the preliminary assembly that there are places where they add sequence that we don’t have representation for, and we’re going to use that to fortify the current reference. We’ll have both assemblies available, and we’re planning on annotating them and making them available for public use.
What about the future of the Reference? As longer read technologies become available and de novo mapping becomes more accurate, will the use of the reference genome slowly decrease?
I firmly believe that there will always be utility in having a reference assembly, for many reasons, even though how people use that might change. I also believe that if we are truly going to be successful in having genomics affect clinical medicine and we want to understand variation within individuals, we have to have de novo assembly.
We know we’re missing too much variation the way we’re doing it now. One of the things my group has put a lot of effort into is assembly-to-assembly alignment for doing comparisons, as opposed to just read-to-assembly alignment. I think being able to transform coordinates between different assemblies is going to end up being important.
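Coordinate transformation between assemblies reduces to walking a set of alignment blocks that relate spans of one assembly to spans of another, the same idea behind liftover-style tools. Below is a minimal sketch under simplifying assumptions (colinear, same-strand blocks with no internal indels; the block coordinates are invented). Real remapping must also handle strand flips, gaps within alignments, and positions with no unique mapping:

```python
from bisect import bisect_right

# Each block maps a span of assembly A onto assembly B:
# (a_start, a_end, b_start), same strand, matched lengths.
BLOCKS = [
    (1_000, 2_000, 5_000),   # invented alignment blocks, for illustration
    (3_000, 4_500, 9_000),
]

def remap(pos, blocks=BLOCKS):
    """Map a position on assembly A to assembly B, or return None
    if it falls in an unaligned gap between blocks."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1   # rightmost block starting at or before pos
    if i < 0:
        return None
    a_start, a_end, b_start = blocks[i]
    if pos >= a_end:
        return None   # position lies beyond this block, in a gap
    return b_start + (pos - a_start)

assert remap(1_500) == 5_500   # inside the first block
assert remap(2_500) is None    # between blocks: no unique mapping
```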
But I could see a couple of different scenarios. You have a reference assembly that is sort of this pan-genome, and you might do some survey sequencing to understand the chromosome context of your sequence and possibly which population context your sequence belongs in. Then there’s another whole set of assemblies… to really understand your sequence, or your individual, in the context of those population references. So I see more references, in fact, in a more population-specific way. But I think this is also probably going to push us to develop more graph-based software to represent this group of assemblies in more compact, easy-to-use ways… I think in order to be successful in understanding variation and translating that to the clinic, we actually need both of them.
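The graph-based representation Church anticipates can be pictured as a sequence graph: sequence shared across assemblies is stored once as a node, and each assembly or haplotype is a path through the nodes. A toy illustration (the sequences and haplotype names are invented):

```python
# A toy sequence graph: nodes hold sequence, haplotypes are node paths.
nodes = {
    1: "ACGTACGT",   # flanking sequence shared by all haplotypes
    2: "TTTT",       # allele carried by one haplotype
    3: "GG",         # alternative allele at the same site
    4: "CCCCAAAA",   # shared flanking sequence
}

# Each haplotype (or population reference) is a walk through the graph,
# so shared sequence is stored once and divergent regions simply branch.
paths = {
    "hap1": [1, 2, 4],
    "hap2": [1, 3, 4],
}

def sequence(hap):
    """Reconstruct the linear sequence of one haplotype from its path."""
    return "".join(nodes[n] for n in paths[hap])

assert sequence("hap1") == "ACGTACGT" + "TTTT" + "CCCCAAAA"
assert sequence("hap2") == "ACGTACGT" + "GG" + "CCCCAAAA"
```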
Clearly the reference is a hybrid of multiple donors, all of whom by definition are carrying mutations in a whole bunch of genes. Is there any notion of producing a “healthy” Reference that somehow expunges all of the known deleterious variants?
Right. That’s a really difficult problem, because, you know, at the end of the day, everybody dies of something! I think it’s difficult even to define what a healthy genome would be, because our understanding of allelic combinations and phenotype is still pretty limited.
That being said, one of the things we have been trying to do is work with a lot of the clinical testing labs to make sure that we have the best possible allele at a given locus in the reference assembly. This is easy in some cases and harder in others. In the CYP gene family, for instance, there are variant alleles with phenotypic consequences at high frequency in the population, so it’s partly just working with the community to determine the best reference allele for that location. We are trying to do that, but will we ever get a reference assembly that’s free of any deleterious mutations? Probably not.
The original donors were anonymous and we have no medical history on them. Any regrets that we can’t trace the sequences of these individuals and tie certain variations to disease risk?
Well, hindsight’s 20/20!
I don’t know how many genome parties there’ve been going back to 2000. Any idea when the final champagne celebration will occur?
Yeah, I’m not the one organizing the parties, so I don’t really know the answer to that question! But that said, it’s worth celebrating to a certain extent and acknowledging the progress that we’ve made. There have been a lot of tweets and stories talking about celebrating the end of the Human Genome Project, even though it wasn’t the completion of the human reference assembly… It was just moving the reference assembly into a new phase.
Can anybody potentially contribute meaningful sequence data to the GRC?
Absolutely. Wash U and Sanger are both actively trying to sequence and solve problems in the current reference, and we are certainly happy to work with other groups. For example, Schuster and Roche approached us about the RP11 project and we were quite happy to work with them. We understand we cannot solve all of the problems of the genome, and that’s one of the reasons we put the website together: to make it easier both to let the community know which regions we’re reviewing and for the community to approach us. If they have something to contribute, we’re happy to work with them to try and incorporate that information into the assembly.