This repo contains the Platinum Genomes small variant truthset for samples NA12878 (also known as HG001) and NA12877. Platinum Genomes truthset variants were validated using haplotype inheritance information through a well-studied 17-member pedigree (CEPH 1463).
Truthsets are made up of a VCF of validated variant records and a BED file of confident regions. These files are not huge (hundreds of MB) but are too large to work well with Git and GitHub. Here are a few ways to download them:
Truthset files are stored in an AWS S3 bucket called `platinum-genomes`. One way to download them is via the AWS CLI:
aws s3 cp s3://platinum-genomes/2017-1.0 pg2017 --recursive
To download without AWS credentials, add the `--no-sign-request` flag. You can also explore the bucket and download individual files with this S3 bucket display.
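The anonymous download described above can be wrapped in a small shell function. This is a minimal sketch: `fetch_release` and the `DRY_RUN` variable are hypothetical names introduced here, not part of this repo. The bucket, release name and flags are the ones given above.

```shell
# Download one release from the public platinum-genomes bucket
# without AWS credentials.
# fetch_release and DRY_RUN are hypothetical names for this sketch.
fetch_release() {
  local release=$1   # e.g. 2017-1.0
  local dest=$2      # local output directory, e.g. pg2017
  # If DRY_RUN is set and non-empty, print the command instead of running it.
  ${DRY_RUN:+echo} aws s3 cp "s3://platinum-genomes/${release}" "$dest" \
    --recursive --no-sign-request
}
```

Calling `fetch_release 2017-1.0 pg2017` runs the same command as above with `--no-sign-request` added. Setting `DRY_RUN=1` first prints the command so you can check it before transferring any data.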
Alternatively, use `wget` or similar with the file URIs in this repo, e.g.:
wget -xi files/2017-1.0.files
You can then use the relevant md5 checksum in each release to validate data integrity.
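One way to do this check is sketched below. It assumes GNU `md5sum` is installed and that you have copied the expected checksum for a given file out of the release. The layout of the checksum listing is not assumed here, and `verify_md5` is a hypothetical helper name.

```shell
# Compare a file's MD5 digest with an expected value.
# verify_md5 is a hypothetical helper; md5sum is from GNU coreutils.
verify_md5() {
  local file=$1       # path to the downloaded file
  local expected=$2   # 32-character hex digest taken from the release
  local actual
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual=$(md5sum "$file" | cut -d' ' -f1)
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got $actual)" >&2
    return 1
  fi
}
```

The function returns a non-zero status on a mismatch, so it can be used in a script that stops when a download is corrupted.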
Finally, truthset files can also be downloaded via FTP, e.g.:
wget ftp://platgene_ro:''@ussd-ftp.illumina.com/2017-1.0/hg38/small_variants/NA12878/NA12878.vcf.gz
To compare a VCF against these truthsets, we recommend using hap.py, which performs a sophisticated haplotype comparison, rather than a simpler tool such as `bcftools isec`.
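A comparison of this kind might be invoked as in the sketch below. It assumes hap.py is installed and on your `PATH`, and that the reference FASTA matches the genome build of the truthset you downloaded. `run_happy` and all file arguments are hypothetical placeholders. The hap.py options used are `-f` (confident regions), `-r` (reference) and `-o` (output prefix).

```shell
# Benchmark a query VCF against a truthset VCF and confident-region BED.
# run_happy is a hypothetical wrapper; all paths are supplied by the caller.
run_happy() {
  local truth_vcf=$1 conf_bed=$2 query_vcf=$3 ref_fa=$4 out_prefix=$5
  local f
  # Fail early if any input file is missing or unreadable.
  for f in "$truth_vcf" "$conf_bed" "$query_vcf" "$ref_fa"; do
    if [ ! -r "$f" ]; then
      echo "missing input: $f" >&2
      return 1
    fi
  done
  # The truth VCF comes first, then the query VCF.
  hap.py "$truth_vcf" "$query_vcf" -f "$conf_bed" -r "$ref_fa" -o "$out_prefix"
}
```

The input check runs before hap.py is called, so a wrong path is reported immediately rather than partway through a run.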
Applications wrapping hap.py and containing these truthsets are available on the following platforms:
- BaseSpace Sequence Hub (Hap.py Benchmarking and VCAT)
- PrecisionFDA (GA4GH Benchmarking)
See the attached wiki for technical information.
Sequencing data for NA12878, NA12877 and samples NA12889-NA12892 (grandparents) are available through the ENA:
BaseSpace users can access this data via a shared BaseSpace project:
Sequencing data for the remaining pedigree members are not consented for public release and so are made available through the dbGaP database:
Please open an issue for comments, issues and other feedback.
For further information or to cite Platinum Genomes resources, see:
- Eberle, MA et al. (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research, 27:157-164. doi:10.1101/gr.210500.116