Reliable Standards: A Necessity For Genomic Data
Contributed Commentary By Aaron M. Wenger
February 12, 2018 | As the field of genomics grows, it is imperative that we develop reliable, high-quality data standards with which to gauge accuracy and completeness. I am heartened to see major progress toward the goal of establishing genomic data standards coming from a number of dedicated groups. Scientists from around the world are contributing to initiatives such as the Genome in a Bottle Consortium, the Genome Reference Consortium, and the Global Alliance for Genomics and Health. My colleagues at PacBio, like researchers at many other genomics technology vendors, are committed to supporting this important work.
The more we learn about the human genome, the more needs we identify for data standards. For example, early efforts focused on ensuring that single nucleotide variant (SNV) calls could be tested for accuracy; today we know that structural variants, which are responsible for the vast majority of base pair differences between any two people, are just as critical to call with precision. We are reaching an inflection point beyond which the amount of DNA sequencing performed worldwide will increase dramatically. Now is the time to ensure that we have the right safeguards to confirm that the sequence data being generated is accurate and reliable for use in both research and clinical settings.
Standard Bearers
The Genome in a Bottle (GIAB) Consortium, launched by the National Institute of Standards and Technology with participants from a host of agencies and institutions, has done remarkable work to produce reliable reference materials that allow scientists to measure the confidence level of their own sequencing and variant-calling pipelines. Recent GIAB efforts have applied multiple technologies to sequence family trios to exceptional resolution. Samples have been chosen to represent different ethnic communities, such as Han Chinese and Ashkenazi Jews. These resources will be particularly helpful as whole genome sequencing moves into the clinic for more mainstream use; the truth sets established by GIAB sequencing initiatives will make it straightforward for labs to assess the quality of their own results and fine-tune operations to produce the most reliable data for patients.
The GIAB Consortium initially produced benchmarks for smaller variants, such as SNVs, and has more recently begun incorporating larger structural variants as technologies like long-read sequencing have allowed for effective detection. Structural variants were the prime focus of a GIAB workshop at Stanford last month. There, participants critically evaluated the consortium’s fifth iteration of a draft benchmark set and debated how to finalize an official benchmark planned for release this year. The consortium also discussed plans to push for additional reference samples, more variant types, and greater representation of difficult regions of the genome.
Another important driver for data quality standards is the Genome Reference Consortium (GRC), charged with generating and curating the reference assemblies used by the scientific community. Recently, the GRC has been hard at work improving representation of ethnic diversity in the human reference genome; this will be a critical component for establishing useful and relevant standards for genomic data that can be implemented around the world. To achieve this, GRC scientists are producing high-quality assemblies for genomes from various ethnic groups. The GRC has completed initial assemblies of Yoruban, Puerto Rican, Han Chinese, and Colombian individuals. Those assemblies are deeply valuable on their own, and genomic regions that differ from the current human reference are being added to that community resource through alternate loci scaffolds.
Perhaps the largest group working to set standards for genomic data, the Global Alliance for Genomics and Health (GA4GH) now has 500 member organizations from around the world and is partnering with sequencing mega-projects such as the All of Us research program. GA4GH working groups are tackling standards for data sharing, security, storage, and more. Among the most important roles of GA4GH is maintaining and extending the widely used SAM and VCF file formats.
Success Through Collaboration
While these efforts have already been beneficial, we run the risk that multiple data standards could emerge from different groups. It will be essential for these large consortia to collaborate with each other to ensure that appropriate data standards are available for various needs (such as supporting both SNVs and structural variants) and that the specific use for each is clearly defined.
I believe that the efforts described above will help the genomics community cross an important threshold, from the realm of pioneering tinkerers to a robust, reliable, and highly accurate science that can be readily applied both in research and in the clinic, no longer as a solution reserved only for the most complex cases. If you are not already involved in one of these initiatives, I strongly recommend finding a way to participate.
Aaron Wenger is a principal scientist at PacBio interested in the application of genome sequencing to improve human health. Aaron received a PhD from Stanford University, where he developed techniques to annotate regulatory elements in mammalian genomes. He now studies applications of PacBio sequencing to identify structural variation in human genomes. He can be reached at awenger@pacb.com.