Reference Materials and Data
The Genome in a Bottle Consortium has selected several genomes to produce and characterize as reference materials. The National Institute of Standards and Technology (NIST) is developing NIST Reference Materials from these genomes. Each Reference Material is DNA extracted from a large homogenized growth of B lymphoblastoid cell lines from the Coriell Institute for Medical Research. Note that there may be small differences between the NIST DNA and the Coriell DNA since they come from different growths of cells, though we expect these differences to be negligible for most applications.
The NIST Reference Materials available and planned are listed below, along with links to their data.
A description of data generated by GIAB for all the genomes below is published here, and characterization of small variants is published here. Ongoing work to characterize more difficult variants and regions is announced in the GIAB Analysis Team Google Group.
Genome Data and Resources
Pilot Genome (NA12878):
- NIST RM 8398 (HG001): Available at http://tinyurl.com/giabpilot
- NIST ID: HG001
- Link to NA12878 DNA and GM12878 cell line from Coriell: https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=NA12878&Product=DNA
- High-confidence variant calls: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/
- We have also uploaded fastq and bam files from ~300x total coverage of 150x150bp HiSeq2500 sequencing of NA12878/HG001 to: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x
- Mt. Sinai School of Medicine has also uploaded ~44x PacBio data for NA12878, including raw reads, error-corrected reads, and a merged SV vcf to: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai
- We published a paper in Nature Biotechnology describing the methods used to integrate datasets and form v2.18 of our high-confidence calls: nature.com/nbt/journal/v32/n3/full/nbt.2835.html
- Preliminary benchmark SV calls from svclassify (biorxiv.org/content/early/2015/05/16/019372) are available at: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/technical/svclassify_Manuscript/Supplementary_Information/Person
- Complex variants for testing variant comparison tools are available at: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/technical/complexVariants_NA12878/
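The FTP locations listed above can be browsed programmatically. The following is a minimal sketch using only the Python standard library, not an official GIAB tool. It assumes the server still accepts anonymous FTP logins and that the directory layout is unchanged; the function names `split_ftp_url` and `list_ftp_directory` are hypothetical helpers introduced here for illustration.

```python
from ftplib import FTP
from urllib.parse import urlparse

# URL copied from the high-confidence variant calls entry above.
HG001_RELEASE_URL = "ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory path)."""
    parsed = urlparse(url)
    if parsed.scheme != "ftp" or not parsed.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parsed.hostname, parsed.path or "/"


def list_ftp_directory(url, timeout=30):
    """Return the entry names in the directory at `url`.

    Assumption: the server allows anonymous login.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)    # move into the release directory
        return ftp.nlst()
```

Calling `list_ftp_directory(HG001_RELEASE_URL)` would return the names of the release subdirectories, from which a specific version of the calls can be chosen. The same helper should work for the other `ftp://` directories on this page, under the same assumptions.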
Ashkenazim Father-Mother-Son trio from Personal Genome Project:
- Candidate NIST RMs 8391 (son only) and 8392 (entire trio):
- NIST IDs: HG002/HG003/HG004 (Son/Father/Mother)
- PGP IDs: huAA53E0/hu6E4515/hu8E87A9 (Son/Father/Mother; for some phenotype information, see participant profiles at http://www.personalgenomes.org/harvard/data)
- Coriell IDs: GM24385/GM24149/GM24143 (Son/Father/Mother; cell lines and DNA available at https://catalog.coriell.org/)
- High-confidence variant calls: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio
- There are a variety of raw datasets and bam files under each individual, currently including 10X Genomics, BioNano, Complete Genomics regular and LFR, 300x Illumina paired-end, Illumina 6kb mate-pair, 1000x Ion exome, custom moleculo libraries, ~0.05x Oxford Nanopore, and 70x/30x/30x PacBio: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/
- There are also a variety of analyses of these data under: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/
- Very preliminary benchmark SVs (see the README for usage information) are under: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
- All of these data are made public for analysis without restrictions. Scientific papers that use them should cite the Scientific Data paper and the integration paper.
- The multiple IDs corresponding to each individual (NIST ID / Coriell ID / PGP ID) are:
  - HG002 / NA24385 / huAA53E0 (son)
  - HG003 / NA24149 / hu6E4515 (father)
  - HG004 / NA24143 / hu8E87A9 (mother)
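Because each individual in the Ashkenazim trio is known by three identifiers, a small lookup table can help when matching file names or metadata across sources. The sketch below encodes the correspondences listed above. The `Sample` type and the `find_sample` helper are hypothetical names, and treating a Coriell `GM` prefix as equivalent to `NA` is an assumption based on how this page pairs the NA12878 DNA with the GM12878 cell line.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    nist_id: str
    coriell_id: str
    pgp_id: str
    role: str


# Correspondences copied from the list above.
ASHKENAZIM_TRIO = [
    Sample("HG002", "NA24385", "huAA53E0", "son"),
    Sample("HG003", "NA24149", "hu6E4515", "father"),
    Sample("HG004", "NA24143", "hu8E87A9", "mother"),
]


def normalize(identifier: str) -> str:
    """Upper-case an ID and map a Coriell 'GM' prefix to 'NA'.

    Assumption: GM (cell line) and NA (DNA) IDs with the same
    number refer to the same individual.
    """
    ident = identifier.strip().upper()
    if ident.startswith("GM"):
        ident = "NA" + ident[2:]
    return ident


def find_sample(identifier: str, samples=ASHKENAZIM_TRIO) -> Optional[Sample]:
    """Return the Sample matching any of its three IDs, or None."""
    wanted = normalize(identifier)
    for sample in samples:
        ids = (sample.nist_id, sample.coriell_id, sample.pgp_id)
        if wanted in (normalize(i) for i in ids):
            return sample
    return None
```

For example, `find_sample("GM24385")` and `find_sample("huAA53E0")` both resolve to the HG002 (son) record, while an unknown ID returns `None`.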
Asian (Han Chinese) Father-Mother-Son trio from Personal Genome Project:
- Candidate NIST RM 8393 (son only):
- NIST ID: HG005 (son)
- PGP IDs: hu91BD69/huCA017E/hu38168C (Son/Father/Mother; for some phenotype information, see participant profiles at http://www.personalgenomes.org/harvard/data)
- Coriell IDs: GM24631/GM24694/GM24695 (Son/Father/Mother; cell lines and DNA available at https://catalog.coriell.org/)
- There are a variety of raw datasets and bam files under each individual, currently including BioNano, Complete Genomics regular and LFR, 300x/100x/100x Illumina paired-end, Illumina 6kb mate-pair, 1000x Ion exome, and custom moleculo libraries: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/
- There are also a variety of analyses of these data under: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/analysis/
All of these data are made public for analysis without restrictions. Scientific papers that use them should cite, as appropriate, nature.com/nbt/journal/v32/n3/full/nbt.2835.html, the newest integration paper, and/or http://www.nature.com/articles/sdata201625.