Persephone, the Real-Time Genome Browser
By Aaron Krol
April 25, 2014 | Syngenta, one of the world’s largest crop engineering companies, was not in the market for a new genome browser. In 2008, Syngenta, like many companies, was using the open source program GBrowse, which is substantially similar to most commercial platforms. GBrowse has an intuitive interface for displaying genes and chromosomes, and all the basic tools, like BLAST, that users need to run in genome browsers. It didn’t seem like a particular pain point for research.
But then a group of Syngenta scientists did a site visit to Ceres, a smaller bioengineering company that specializes in creating crop strains for use as biofuels. There, they were introduced to an in-house browser called Persephone. “They were looking at something completely different,” says Eric Ganko, a computational biologist at Syngenta. “But they happened to see this software, and they were quite interested in how fast you were able to look at a whole chromosome view.”
Genome browsers have to render massive amounts of data in an interactive visual format, which makes heavy demands on memory. It can take ten or fifteen seconds to load a new file, or to zoom from a specific gene out to a whole chromosome. “If the research stream is 15 clicks, and every click is 10 to 15 seconds, it may not sound like a lot, but over the course of looking up your information it can be very time-consuming,” says Tim Swaller, Ceres’ Vice President of Genomic Technologies.
In fact, in subtle ways, it can begin to shape the kind of research that gets done. Scientists entering a genome browser tend to navigate straight to a single gene or region of interest, and not move far afield, or switch to another data source, without a specific reason.
Carl Meinhof, Ceres’ Manager of Research Informatics, likens the situation to the early days of Google Earth. “[The map] was all there,” he says, “but when you panned around, or zoomed in, you always had to wait for things to load. It was great that something like this existed in the first place, but in the end, it was really painful to use.”
Persephone, however, works essentially in real time. Users can open a 300 million-nucleotide chromosome and see it appear onscreen in less than a second, and immediately scale down to the level of introns and exons, or pan between genic regions. “You can scroll the mouse wheel, and things just react immediately, just like you would expect from a picture you zoom in and out of,” says Meinhof. “In other browsers, you deal with delays in reloading data when you go between regions.”
Related data – from gene expression studies, or large sets of SNPs – can then be quickly layered on top of the gene or chromosome of interest. “We’ve brought up 60 million SNPs in 30 seconds,” says Swaller.
The Seeds of a New Approach
In a way, it’s not surprising that a platform like Persephone should come out of a company in the field of plant genetics. While the issues Persephone addresses frustrate many geneticists, they are particularly acute for those working with crops.
“The funding for plant research is not as high as for humans, so a lot of the time we deal with less-than-full datasets,” says Swaller. “We end up working with a lot of scaffold maps. For example, with wheat there are scaffold maps with hundreds of thousands of scaffolds.” These genetic maps are much less complete than physical maps of the chromosomes, and so get split into many more pieces. Where a human geneticist can access the whole human reference genome in just 23 chromosomes, plant geneticists often deal with orders of magnitude more files, all of which have to be loaded individually in browsers.
Researchers in the plant field also work more often with homologous sequences from related species, meaning yet more genetic maps to load. At Ceres, one of the key crops is sorghum, an organism so incompletely sequenced that the company uses the corn genome as a reference. This reliance on cross-species data means that Persephone has to move rapidly between separate maps, rather than focusing on just one at a time.
For this reason, Persephone has the ability to display more than one chromosome or genetic map on the same screen, which is not a standard feature of genome browsers. If you want to match sequences from two species, says Swaller, “you can bring up syntenic chromosomes, with all the orthologous matches, in less than a second.” Visualizing these separate sequences side by side offers a more intuitive picture of genetic homology. Persephone can search for homologous or orthologous regions between two different datasets, and draw lines between the two maps to show where matches occur.
Human and mouse chromosomes viewed side by side in Persephone. The lines indicate areas of synteny between chromosomes. Image credit: Ceres
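The side-by-side view described above comes down to a simple mapping: each match pairs a position on one map with a position on the other, and each pair becomes a line between the two drawings. The sketch below illustrates that idea only. It is not Ceres' code, and the `Match` record, the function names, and the pixel coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One hypothetical orthologous match between two maps."""
    pos_a: int  # base-pair position on the first chromosome
    pos_b: int  # base-pair position on the second chromosome


def to_pixel(pos: int, length: int, height_px: int) -> int:
    """Scale a base-pair position to a pixel row on a map drawn top to bottom."""
    return round(pos / length * (height_px - 1))


def synteny_lines(matches, len_a, len_b, height_px, x_a=100, x_b=300):
    """Return one segment ((x1, y1), (x2, y2)) per match, joining the two vertical maps."""
    return [
        (
            (x_a, to_pixel(m.pos_a, len_a, height_px)),
            (x_b, to_pixel(m.pos_b, len_b, height_px)),
        )
        for m in matches
    ]


# Two chromosomes of different lengths drawn at the same on-screen height.
matches = [Match(0, 0), Match(50_000_000, 30_000_000), Match(100_000_000, 60_000_000)]
lines = synteny_lines(matches, len_a=100_000_000, len_b=60_000_000, height_px=601)
for segment in lines:
    print(segment)
```

Because each map is scaled to its own length, a match at the midpoint of both chromosomes is drawn as a horizontal line, while rearranged regions would show up as crossing lines.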
One quirk of Persephone’s origins in plant genetics is that it displays maps vertically, a perspective that Swaller says helps make visual sense of cross-chromosome comparisons. When users want to switch to a more targeted view, they can simply highlight a region or gene, and see that information in the more familiar horizontal display that most genome browsers use. In the horizontal mode, users can scale down to the level of single nucleotides, to better explore SNPs and other variants.
Viewing multiple maps in the same window can give users a more comprehensive view of polygenic traits. “Most traits, in general, are quantitative in nature – multiple genes that are additive in effect,” says Swaller. “What our scientists want is to be able to view two, three, four different genes at once. And if those are across different chromosomes, we still want to add expression data and see all the SNP data for those.”
This combination of speed and the ability to keep multiple datasets onscreen doesn’t just accelerate computational tasks. By making it more inviting to skim through the whole genome, it encourages users to do more exploratory, hypothesis-free research.
“We want to allow people to browse through the data, and explore the data, without a concept of what the results will be,” says Swaller. “If you want to see methylation patterns across the genome, you need to bring up multiple chromosomes at one time, see methylation patterns, maybe see gene annotations, without any kind of preconceived notion of where you want to end up.” This can lead to unexpected insights and connections that would not be made if researchers looked at only the genes they’re currently working on.
“At some point, speed is not just a quantitative improvement – it’s a qualitative jump,” says Meinhof. “It makes things possible that are not possible at the slower speed.”
From In-House Tool to Software-as-a-Service
Persephone’s speed is not a product of sophisticated hardware or parallel computing. At the Molecular Medicine Tri-Conference in San Francisco this February, Meinhof demonstrated the software on an ordinary laptop. “It’s fundamentally software design, and the data storage and transport” that accounts for the speed, he says.
Ceres’ lead software designer comes from the gaming industry, a perspective that Swaller says made him “very interested in applications that gave our researchers immediate gratification.” The designers borrowed game-development techniques for displaying elements onscreen and temporarily reducing them during navigation, which helps keep loading times under control.
“The other aspect is how data is transported between the back end and the front end, and the compression being used,” adds Meinhof. “We can load very large datasets with a relatively small memory footprint. We can load a few million SNPs, and still be under a gigabyte in memory usage. It can be done on a mediocre machine.”
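The article does not say how Persephone implements either idea, so the following is only a generic sketch of two common techniques that fit the description: keeping positions in a packed numeric array rather than as full objects, and drawing per-bin counts instead of individual markers while the view is zoomed out. The function names and sample data are assumptions made for illustration.

```python
from array import array
from bisect import bisect_left


def pack_positions(positions):
    """Store sorted SNP positions as unsigned integers (typically 4 bytes each)."""
    return array("I", sorted(positions))


def density(packed, start, end, n_bins):
    """Count SNPs per screen bin in [start, end).

    A zoomed-out view can draw n_bins bars instead of one marker per SNP.
    """
    counts = [0] * n_bins
    lo = bisect_left(packed, start)  # first SNP at or after start
    hi = bisect_left(packed, end)    # first SNP at or after end
    span = end - start
    for i in range(lo, hi):
        counts[(packed[i] - start) * n_bins // span] += 1
    return counts


packed = pack_positions([5, 15, 25, 95, 250, 999])

# Zoomed out: the whole 1,000-base region summarized in two bins.
print(density(packed, 0, 1000, 2))

# Zoomed in: the first 100 bases in four bins.
print(density(packed, 0, 100, 4))

# Rough footprint of the positions alone for five million SNPs.
print(5_000_000 * packed.itemsize, "bytes")
```

With four-byte entries, five million positions occupy about 20 MB before any annotation is attached, which shows how a compact layout can keep a dataset of that size well within a laptop's memory.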
Still, Persephone was originally built with in-house use at a medium-size company in mind. Ceres had no inclination to become a software provider, and only began licensing its genome browser when Syngenta happened to see the platform and expressed interest in adopting it.
In 2008, when Syngenta first saw Persephone in use, next-generation sequencing was also beginning to pile unprecedented amounts of data into companies’ servers. “There are genomes out there that have a million sequences,” says Ganko. “That’s not something Persephone was necessarily designed for originally, but it is something they’ve adjusted for over time.”
Ganko manages the reference genomes at Syngenta, and has made a gradual transition to using Persephone as the central repository for his data. Over the years that Syngenta has licensed the genome browser, Ceres has worked hard to adapt it to the needs of large clients in the era of high-throughput sequencing. “When next-generation sequencing came about, we really revamped the software to handle these very, very large datasets from all these dispersed databases,” says Swaller.
“It’s definitely come along in terms of speed under heavy loads,” adds Ganko. “We really have enjoyed working with Ceres’ developers. It’s not always easy to find people who are good and responsive at development, but also can understand the biology.”
A screenshot from Persephone showing both the vertical macro view of a chromosome and the horizontal micro view of a target gene. Image credit: Ceres
Syngenta has now fully transitioned to Persephone as both its primary genome browser and a database for genomic data and annotations. Ganko is encouraged by the results. “What it can open up is that ad hoc kind of discovery,” he says. “Most people are going to Persephone with a specific target in mind, a gene or a marker they’re interested in exploring… [but] when you are able to look at what’s around you much more easily, you might find out we actually have several other markers nearby that we could also try. My hope is always that, by making different types of data available, you might allow for chance discovery.”
Many of the features in Persephone appeal to Syngenta specifically as a company in plant engineering, including the ability to work more easily with genetic maps that fall short of whole physical chromosomes. “The genetic maps is a really big and needed feature,” says Ganko. “Some crops still don’t have a genome. Things like wheat, and sugarcane, are just so big that we don’t have good physical sequence, and probably won’t for some time.” The public databases Persephone draws on, too, were originally concentrated heavily in the plant space.
Yet Ceres has recognized a broader need for a real-time genome browser, and the company is now beginning to demo the platform for users in the human genome space.
Persephone is already available for large customers with the resources to license the platform and install it internally. However, Ceres is also working on a software-as-a-service model, where users can run Persephone through Amazon Web Services, and store their data in the cloud. This is the architecture that prospective customers are using to demo the software, and Ceres hopes to fully deploy it as a commercial solution by the end of the year.
“We realize there’s a big community of individual users – at companies, universities, institutes – that want an application with this performance and speed,” says Swaller. After years of development with its major client, Ceres now sees a large potential market for its first software platform.