We are thrilled to announce that the pilot GIAB/NIST Reference Material 8398, Human DNA for Whole-Genome Variant Assessment (Daughter of Utah/European Ancestry), is released! This is genomic DNA from a large batch of the same cell line as NA12878. It is available for sale from the NIST Standard Reference Material website at http://tinyurl.com/giabpilot. Thank you to all of you who have helped by sequencing and analyzing the genome and by giving feedback on our high-confidence calls and regions.
This is the first ever Reference Material of its kind, with orders of magnitude more characterized properties (all those variants and reference calls!) than any before it. This truly was a Consortium effort, and we are excited to continue working with you to characterize this genome and our next genomes more completely.
Excerpted from the Report of Investigation (http://tinyurl.com/giabpilotreport):
This Reference Material (RM) is intended to provide a whole human genome sample and accompanying reference values to assess performance of variant calling from genome sequencing. This RM contains human genomic deoxyribonucleic acid (DNA) extracted from a large growth of the human lymphoblastoid cell line GM12878. A unit of RM 8398 consists of a single vial containing approximately 10 μg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, pH 8.0, 1 mM EDTA, pH 8.0). This material is intended for assessing performance of human genome sequencing, including whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. Specifically, the material can be used to obtain estimates of true positives, false positives, true negatives, and false negatives for variant calls. This genomic DNA is to be analyzed as any other processed, extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does assess sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.
Reference values are provided for single nucleotide polymorphisms (SNPs), small indels (insertions and deletions), and homozygous reference genotypes for approximately 77 % of the genome (http://go.nature.com/utlGz6). This report contains variants with respect to the GRCh37 reference assembly. Reference values are non-certified values that are the best estimate of the true value; however, the values do not meet the NIST criteria for certification, which require that all biases be sufficiently understood. The reference values are given as a variant call format (vcf) file that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. The files referenced in this Report of Investigation are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the initial high-confidence vcf and high-confidence regions is:
As sequencing technologies and analysis methods improve, these high-confidence calls and regions will be updated with refined versions of the files in a different directory at the Genome in a Bottle ftp site:
It is important to recognize that there is currently no standardization of definitions for true positive, false positive, true negative, and false negative. For example, genotyping errors can be counted as true positives, false positives, or false negatives, and no-calls can be treated as uncertain or homozygous reference regions (see “Instructions for Storage and Use”).
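To make the point about definitions concrete, here is a minimal Python sketch (our illustration, not part of the Report of Investigation) of how a call set might be scored against the high-confidence calls inside the high-confidence regions. The function names, the `genotype_error_as` parameter, and the dictionary layout are all hypothetical, and the sketch assumes variants have already been parsed from the vcf files and are represented identically in both sets. Real comparisons also have to handle differing representations of the same indel, multiallelic sites, and no-calls, none of which is attempted here.

```python
from bisect import bisect_right


def load_bed(lines):
    """Parse BED lines into {chrom: sorted [(start, end), ...]}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    regions = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_regions(regions, chrom, pos):
    """True if a 1-based vcf position lies inside a BED interval.

    Assumes the intervals for each chromosome do not overlap.
    """
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def classify(truth, calls, regions, genotype_error_as="FP"):
    """Count TP/FP/FN inside the high-confidence regions.

    truth and calls map (chrom, pos, ref, alt) -> genotype string, e.g. "0/1".
    genotype_error_as (hypothetical knob) chooses how a matching variant
    with a mismatched genotype is counted: "TP", "FP", or "FN".
    """
    if genotype_error_as not in ("TP", "FP", "FN"):
        raise ValueError("genotype_error_as must be 'TP', 'FP', or 'FN'")
    truth = {k: v for k, v in truth.items() if in_regions(regions, k[0], k[1])}
    calls = {k: v for k, v in calls.items() if in_regions(regions, k[0], k[1])}
    counts = {"TP": 0, "FP": 0, "FN": 0}
    for key, genotype in calls.items():
        if key not in truth:
            counts["FP"] += 1                 # called, but not in the reference set
        elif genotype == truth[key]:
            counts["TP"] += 1                 # variant and genotype both match
        else:
            counts[genotype_error_as] += 1    # genotyping error: policy-dependent
    counts["FN"] += sum(1 for key in truth if key not in calls)  # missed variants
    return counts


# Toy data, invented for illustration only.
toy_regions = load_bed(["1\t100\t200\n", "1\t300\t400\n"])
toy_truth = {
    ("1", 150, "A", "G"): "0/1",
    ("1", 350, "C", "T"): "1/1",
    ("1", 380, "G", "A"): "0/1",
    ("1", 250, "T", "C"): "0/1",  # outside the regions, so ignored
}
toy_calls = {
    ("1", 150, "A", "G"): "0/1",  # exact match
    ("1", 350, "C", "T"): "0/1",  # right variant, wrong genotype
    ("1", 160, "G", "C"): "0/1",  # not in the reference set
    ("1", 250, "T", "C"): "0/1",  # outside the regions, so ignored
}
```

Running `classify(toy_truth, toy_calls, toy_regions, genotype_error_as=...)` with each of the three settings gives different counts from the same pair of call sets, which is why any reported sensitivity or precision should state the convention used.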
For more information: From NIST Tech Beat: February 25, 2014