The Hunt for a New Human Reference Genome

By Aaron Krol

June 30, 2014 | The human reference genome is a linchpin of modern genetics, but it’s also a bit of an historical oddity. Currently known as GRCh38, or “build 38” for short, it is a direct descendant of the original Human Genome Project, and has touched almost every genomic study since. The reference genome acts as a template that makes it much cheaper and easier to assemble new human genomes: when a sequencing project breaks a subject’s DNA into millions of short fragments, those reads can be placed in their correct locations by matching them to build 38.

There’s no special reason that a certain genome has to be used as the reference, but build 38 has a lot going for it. It remains the most accurate and complete human genome ever assembled, and is regularly updated by the Genome Reference Consortium (GRC), made up of teams from the National Center for Biotechnology Information (NCBI), the Genome Institute at Washington University in St. Louis, the U.K.’s Wellcome Trust Sanger Institute, and the European Bioinformatics Institute — all key participants in the Human Genome Project. The GRC released its latest major overhaul of the reference genome last December, replacing build 37, and adds minor updates four times a year. (For more on the biggest changes that accompanied build 38, see “Getting to Know the New Reference Genome Assembly.”)

But build 38 also carries some baggage. Most importantly, a mix of several donor sources was used in the Human Genome Project, and more have been incorporated into the reference since. And while the reference genome is haploid — it features just one copy of each chromosome — almost all its sources are diploid, with two copies of each chromosome that may be dramatically different from each other in areas of heavy structural variation.

This confusion of sources can cause problems. The reference can end up with two versions of a structural variant mashed together, to create a new genotype that never occurs in nature. There may even be artificial gaps in the sequence, where two structural variants from different sources fail to meet in the middle. The GRC is constantly searching for and cleaning up these events, but they’re not always easy to find. Even harder to capture are haplotypes: sets of variants that tend to travel together, even if they occur tens of thousands of bases apart. With its motley heritage, the reference genome in many regions has no natural haplotype, just a patchwork of its various clone sources.

“The way the reference is currently created is that clone A and clone B and clone C can all be completely different haplotypes,” says Tina Graves-Lindsay, leader of the Reference Genomes Group at Washington University, and a contributor to the GRC. “So you have no linkage across any large region, for knowing your assembly aligns the same way to each haplotype.”

This can undermine the main purpose of the reference genome: placing DNA fragments in the right order. If two haplotypes differ by a large structural variation, like an inversion or duplication, where the same sequence appears in different places or in reverse order, reads can be aligned differently depending on the haplotype of the reference. Without the context of the whole haplotype, it becomes much harder to resolve this kind of error.
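A toy sketch can make the placement problem concrete. The sequences below are invented for illustration, and exact string matching stands in for a real aligner, which would tolerate mismatches and use an index; the function name `find_placements` is hypothetical. The same read has one possible home on a haplotype without the duplication, and two on a haplotype that carries it.

```python
def find_placements(read, reference):
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume the search one base further along so later copies are found too.
        start = reference.find(read, start + 1)
    return positions


# Invented sequences, not real genomic data.
# Haplotype B carries a second copy of the segment "GATTACA".
haplotype_a = "CCGATTACATTGGC"
haplotype_b = "CCGATTACATTGATTACAGGC"
read = "GATTACA"

print(find_placements(read, haplotype_a))  # one candidate location
print(find_placements(read, haplotype_b))  # two candidate locations
```

With only the read in hand, nothing distinguishes the two candidate locations on haplotype B; longer reads or surrounding haplotype context are what break the tie.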

To get a clearer picture of variation across large regions, it would help to have a naturally haploid reference, one that can show the real sequence of at least one set of actual human chromosomes. Graves-Lindsay is now part of a team at Washington University that is working on just that: a whole alternate reference genome, as accurate as build 38, but sequenced from just one sample with half the usual number of chromosomes. Its curators call it the platinum genome, and it relies on a very unusual donor source.

One Set of Chromosomes

In 2002, Evan Eichler, a geneticist then at Case Western Reserve University, wrote to the National Human Genome Research Institute to request the creation of a new BAC library — a sort of bacterial storage system for long DNA fragments, used to hold onto interesting genetic material for repeated sequencing. The Human Genome Project had not yet been completed, but Eichler was already finding gaps and errors that could be fixed by sequencing a haploid human genome. To get one, he recommended a BAC library covering the entire genome of a hydatidiform mole.

Hydatidiform moles are the result of a type of abnormal pregnancy, where an egg that by some accident has no nuclear DNA is fertilized by an ordinary sperm. The sperm then doubles its own DNA, resulting in two identical copies of each chromosome in every cell as the mole starts to divide. Hydatidiform moles are rare, but several have been isolated and turned into cell lines, and one of those, called CHM1, has become an industry standard.

Eichler’s proposed BAC library was eventually created from CHM1 — the library is called CHORI-17 — and Eichler, now at the University of Washington in Seattle, has been working with it for around ten years, in collaboration with the Genome Institute at Washington University. At first, says Graves-Lindsay, who has been regularly involved in the partnership, the goal was just to go back over the most confusing parts of the reference genome and repair them.

“We really started with the BAC sequencing, initially to fix regions,” she says. “There are definitely regions that cannot be sorted out without a single haplotype. And we actually found that there were a lot of gaps in the reference that are due to two different haplotypes on either side.”

Thanks to long work on CHORI-17, updates to builds 37 and 38 corrected several unresolved genes. These included SRGAP2, a very complex gene that is duplicated in three different places across the length of chromosome 1, and the immunoglobulin heavy locus, where several similar DNA segments are shuffled and reshuffled together to express a highly variable set of antibodies.

Success in patching up specific structural variants, however, soon underlined the need to show how these variants behave together. “The more we worked on it,” says Graves-Lindsay, “the more we realized that having the complete sequence would be good also.” In 2011, the Washington University/University of Washington team sequenced all of CHM1 on Illumina sequencers, creating their first assembly of a haploid human genome, which was made freely available in the NCBI’s GenBank database.

This assembly was a useful starting point, but it had some limitations. Like all whole genomes created with Illumina’s instruments, which are fast and highly accurate but split their samples into small fragments just one or two hundred bases long, the new CHM1 assembly had to be guided by build 37, the reference at the time. This meant it was vulnerable to the same confusions around large structural variants that a haploid reference genome was meant to overcome. Adding information from CHORI-17 could fix some of these problems, but not all, and not quickly.

The Illumina assembly was also far from complete, covering just over 92% of build 37 upon release. While that has since improved, today the assembly is still divided into more than 40,000 segments, or contigs, with gaps in between that cannot be resolved. The hardest work was still to come in bridging the distance from this first CHM1 assembly to the “platinum genome.”

No Reference Required

Elsewhere, however, a different assembly of CHM1 was in the works. The sequencing company Pacific Biosciences, based in Menlo Park, California, had struggled to carve out a market since the release of its first sequencer in 2010. The company’s technology was neither as cheap nor as fast as market leader Illumina, but it did have one noteworthy advantage. With the release of new chemistry in October 2013, PacBio was delivering half its reads in fragments of 8,000 bases or more, over an order of magnitude longer than any of its competitors.

Long reads make it far easier to put together whole genomes de novo, without using a reference genome, in part because there are fewer total fragments and less confusion about the order they belong in. Over the course of 2013, PacBio released a series of de novo genomes, starting with a few bacteria and building up to yeasts and fruit flies, to sell potential customers on its long-reading machines. But Jonas Korlach, the company’s CSO, wanted to tackle a human sample, and he naturally reached out to someone who worked with human genomes on a regular basis.

“I had asked Evan Eichler in the summer,” Korlach told Bio-IT World, “if we want to show that the long reads from PacBio can be really useful for getting an improved de novo assembly, what sample should we use? And Evan immediately said we should use the CHM1 sample.” (Eichler was also until recently a member of PacBio’s advisory board.)

By February 2014, PacBio was ready to release its own assembly of CHM1. It joined just a handful of de novo human assemblies ever performed, and thanks to the long read lengths, it had some properties that previous efforts couldn’t match. “Our assembly, straight out of the pipe, came to an N50 of 4.4 Mb,” says Korlach, meaning that half of the assembled sequence sits in contigs at least 4.4 million DNA bases long. By comparison, the Washington University assembly of CHM1 has a contig N50 of just 144,000 bases.
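For readers unfamiliar with the statistic, here is a minimal sketch of how a contig N50 is typically computed. The contig lengths are made up for illustration and are not CHM1 data, and the function name `n50` is our own. The idea is to sort contigs from longest to shortest and walk down the list until the running total reaches half of all assembled bases; the length of the contig where that happens is the N50.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together hold at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assembly with invented contig lengths (300 bases in total).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70: the 80- and 70-base contigs hold half the bases
```

Note that the N50 here (70) is well above the median contig length (40). The statistic is weighted by bases rather than by contig count, which is why a few very long contigs can raise it sharply.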

Longer contigs mean fewer contigs, and fewer gaps in between them. Overall, says Korlach, “the assembly was about forty times more contiguous than any of the previous approaches, except of course the very first Human Genome Project.” Intriguingly, the PacBio assembly is also longer overall than any previous human genome, by about 400 million bases. “We’re looking at that carefully now,” Korlach adds, “but we already see indications that it’s because you recover and resolve highly repetitive regions” — areas like the telomeres and centromeres, which aren’t fully represented in build 38 because they’re too repetitive to sequence with current technologies.

A great deal of validation still needs to be done on PacBio’s version of CHM1, which is now being carried out both within the company and at outside institutions that have downloaded the freely available data. But as the most complete human genome since the reference itself, this assembly looks like a much more secure model for the platinum genome, a project Korlach enthusiastically supports.

“The ultimate goal would be to get a human genome that goes from one telomere, through the centromere, to the other telomere — a chromosome represented by continuous sequence,” he says. “That would be a great advance for science, to really have a sense of completion, and to know all the bases in at least one human genome.”

PacBio has previously contributed in a small way to the GRC’s efforts; its assembly of the MUC5AC gene, a highly repetitive gene that may be involved in chronic obstructive lung disease, is the canonical sequence in build 38. Now, the company’s first whole human genome is playing a central part in the effort to add a second high-quality reference to human geneticists’ arsenals.

Toward a Platinum Genome

It will take a mix of many data sources, each with its own advantages and disadvantages, to piece a useful platinum genome together. “We’ve got the Illumina sequence, the PacBio sequence, we’ve got lots of clones sequenced,” says Graves-Lindsay. “So we plan to use all of those resources to check the accuracy of our final assembly.” The team is also referring to a third assembly of CHM1 on a different technology, an optical system from a company called BioNano, which is useful for ordering structurally similar regions.

The Genome Institute at Washington University was an early adopter of PacBio instruments, and the team is now using its PacBio sequencer on BAC clones from the CHORI-17 library. They’re still focusing mainly on the thorniest regions, so that improved sequence can be added to build 38 as quickly as possible. “As soon as we fix a region, it will be a part of the reference as a patch,” says Graves-Lindsay. “So that’s the piecemeal goal, to get the sequence out there as best we can.”

The major challenge is getting consistently high accuracy. Most high-throughput sequencing technologies are now in the region of 99.9% accurate for each DNA base call, but over the length of a whole genome, that still leaves a lot of room for error. The original Human Genome Project used the much more painstaking and expensive Sanger sequencing method, which is scrupulously accurate; by comparing different data sources against each other, Graves-Lindsay and her colleagues hope to achieve the same quality at a fraction of the cost.
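A rough back-of-the-envelope calculation shows why 99.9% is not enough on its own. The sketch below assumes a genome of about 3.1 billion bases (our round figure, not one given in this article) and treats every base as called exactly once. Real projects read each position many times and take a consensus, which lowers the final error count, so this is an intuition for scale rather than a prediction.

```python
from fractions import Fraction

# Assumed values for illustration only.
GENOME_LENGTH = 3_100_000_000            # approximate haploid human genome size
PER_BASE_ACCURACY = Fraction(999, 1000)  # the roughly 99.9% figure quoted above


def expected_errors(genome_length, accuracy):
    """Expected number of miscalled bases if each base is called once."""
    return genome_length * (1 - accuracy)


errors = expected_errors(GENOME_LENGTH, PER_BASE_ACCURACY)
print(int(errors))  # 3100000 miscalled bases under these assumptions
```

Exact fractions are used instead of floating-point numbers so the arithmetic comes out to a whole number of bases.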

In the medium term, a first full draft of the platinum genome is still in the making. The Genome Institute plans to deposit that resource in GenBank, just as it already has with its Illumina CHM1 assembly, so that researchers anywhere in the world can access it. At first, its most promising use will likely be in haplotype studies, helping to clarify which variants tend to be inherited together.

“I think the biggest utility for the single-allelic [haploid] representation is likely to be the context, that allele A or variant A always goes with this variant that’s down the road a little bit,” says Graves-Lindsay. “Especially if you want allelic context in a large region, if you’ve got a single allele, you’ll be able to figure out how things work together.”

In the longer term, she imagines the platinum genome could be curated to the same degree as build 38. One danger of any reference genome is that structural differences between the reference’s haplotype and a given sample will be too great to bridge, leading to regions that can’t be assembled. (This is especially relevant because build 38 is based almost entirely on DNA from U.S. donors. Korlach remembers speaking to Japanese customers at the Advances in Genome Biology & Technology conference: “they said the human reference genome is great, but it doesn’t really apply to the kinds of genomes that they’re interested in.”)

To get around this, the GRC has been diligently adding “alternate scaffolds” to builds 37 and 38, where highly variable regions can be represented in a number of different ways. Graves-Lindsay and her colleagues want to do the same for the platinum genome — possibly even with whole haplotypes, so that alternate sequences stretch great distances across chromosomes.

“Our intention is to continue to add additional sequences,” she says. “There will probably be a complete sequence, and then hopefully you’ll be able to layer these either on the reference, or on the single haplotype, to the point where you have all the different alleles layered on.” That would be the most powerful resource for human genetics, able to correctly assemble whole genomes from any human sample, as well as illuminate the way variants stay linked with one another across long stretches of the chromosomes.

Build 38 and its predecessors have been incredible tools for genetics, making it possible to sequence human genomes en masse, and collecting the highest quality sequence for nearly all regions of the genome in one place. But the reference remains bound to the unique circumstances of the Human Genome Project, the race to build a human genome as quickly as possible from whatever sources worked. Although it remains the best-curated genome available, it probably looks quite different from a genome built for reference-guided assembly from the ground up.

As the researchers at the Genome Institute and in Eichler’s lab continue to pore over CHM1’s DNA, this strange cell line may one day offer a new foundation for the daily work of human genetics.