Children’s Hospital Of Philadelphia, Edico Set World Record For Secondary Analysis Speed
By Allison Proffitt
October 23, 2017 | Using Edico Genome’s DRAGEN pipeline on 1,000 Amazon EC2 F1 instances, the Children’s Hospital of Philadelphia (CHOP) and Edico Genome set a new scientific world standard last week in rapidly processing whole human genomes. CHOP used the DRAGEN pipeline to process 1,000 whole pediatric genomes in two hours and twenty-five minutes. The feat set the Guinness World Records title for Fastest time to analyze 1,000 human genomes. The title was presented onsite by an official Guinness World Records adjudicator, and will be granted upon publication of the results in a peer-reviewed journal.
The demonstration used a pediatric cohort of 1,000 whole genomes from the Center for Applied Genomics (CAG), a specialized Center of Emphasis at CHOP. The de-identified samples were chosen to reflect the composition of the entire biobank and represent the most common complex disorders and rare single-gene diseases. FASTQ files were moved from Amazon S3 buckets into the EC2 F1.2xlarge instances, which use Xilinx Virtex UltraScale+ field programmable gate arrays (FPGAs). The DRAGEN pipeline consisted of mapping, aligning, sorting, duplicate marking, and haplotype variant calling and ended when a variant call format (VCF) file was delivered back to a secure Amazon S3 bucket.
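The reported figures (1,000 genomes, 1,000 instances, two hours and twenty-five minutes) imply an aggregate throughput that is easy to work out. The short Python sketch below does that arithmetic. The function name and the dictionary keys are illustrative, and the instance-hours figure assumes every instance ran for the full wall-clock time, which the article does not state, so it should be read as an upper bound rather than a measured value.

```python
# Figures reported in the article.
GENOMES = 1_000
INSTANCES = 1_000
WALL_CLOCK_MINUTES = 2 * 60 + 25  # two hours and twenty-five minutes


def throughput_summary(genomes, instances, minutes):
    """Derive aggregate throughput figures from the reported run.

    Assumption (not from the article): every instance was active for
    the whole wall-clock time, so max_instance_hours is an upper bound.
    """
    hours = minutes / 60
    return {
        "genomes_per_hour": genomes / hours,
        "seconds_per_genome": minutes * 60 / genomes,
        "max_instance_hours": instances * hours,
    }


summary = throughput_summary(GENOMES, INSTANCES, WALL_CLOCK_MINUTES)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the run averages roughly 414 genomes per hour across the fleet, or one finished genome every 8.7 seconds of wall-clock time.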
“Today’s speed test is a culmination of two years of collaboration between CAG and Edico Genome, including beta-testing their product in our center,” said Hakon Hakonarson, M.D., Ph.D., director of CAG at CHOP in a press release. “We utilize DRAGEN as part of our genomic workflow to achieve our mission of translating basic research findings to medical innovations. The speed of this technology in processing vast amounts of raw data in a matter of minutes will allow us to deliver actionable results in hours—an important capability as we go forward in realizing the benefits of precision medicine for children and families.”
“Now, really for the first time, you can take care of this piece,” Hakonarson told Bio-IT World after the trial. “Generating the sequencing is getting more and more trivial. It’s really the informatics process, translating the sequence into usable files so you can use [them] to do phenotype-genotype analysis, [that is challenging]. And that is the process we accomplished.”
Edico ported the DRAGEN pipeline to Amazon Web Services’ EC2 F1/FPGA instances last year. Previously, DRAGEN was only available installed on site. This August, Pieter van Rooyen, Edico’s president and CEO, told Bio-IT World that the DRAGEN pipeline had processed 12 petabytes of data so far this year in on-site deployments. He said then that he expected usage in the cloud to be quite a bit more. “There will definitely be more usage in the cloud going forward; there’s definitely a trend for genomic data to be processed in the cloud,” he said this summer. “Quite honestly I think the right solution is a hybrid solution, both on site and in the cloud.”
But Hakonarson argues for the flexibility of accessing DRAGEN in the cloud. “You have the Broad, and Baylor, and WashU, and Seattle, and these sorts of large scale sequencing centers that work with and process large amounts of data. They are obviously well-equipped to handle that. But any sort of medium-sized, company-sized institute that is dealing with sequencing today (and there are hundreds of them in that size range), they all struggle with this.” (See Shawn Levy’s initial impressions of DRAGEN installed at HudsonAlpha.)
While the Guinness World Records title is for speed, the process was cheap as well. “It’s actually not terribly expensive,” Hakonarson said. “Because it’s only two hours, the cost is actually relatively low.” He declined to give an exact number, but said, “we are talking dollars per sample.” Later he added: “Cost, of course, is more relevant in most instances… but if you have a sick baby in the NICU and you need to process that data rapidly for diagnostics purposes, you can do that.”
Edico’s pipeline has been used in clinical settings to set another Guinness World Record: the 26-hour diagnostic genome conducted by Stephen Kingsmore and his colleagues at the Center for Pediatric Genomic Medicine at Children’s Mercy in Kansas City and published in Genome Medicine in 2015.
The data used in the demonstration at ASHG yesterday were not clinical data. The samples came from the CAG biobank and are all de-identified research samples. More than 60% of the samples were from African Americans, making it one of the largest cohorts for this demographic that has been sequenced to date. Results from the rapid analysis will be utilized by the CAG with the hope of uncovering genetic links to common childhood diseases, including asthma, autism, diabetes, epilepsy, obesity, schizophrenia, pediatric cancer, and a range of rare diseases.
The CAG has a biobank of about 100,000 samples and has access to 350,000 collaborator samples. Essentially all of the samples have been genotyped, Hakonarson said. Several thousand have been exome sequenced and a couple of thousand samples have undergone whole genome sequencing.
The limiting step is the cost of whole genome sequencing, “about $1,500 per sample,” Hakonarson said. “If we could afford to do that for 100,000 samples we would do that immediately!” Instead, CAG is committed to conducting whole genome sequencing on about 10,000 samples by the end of the year.