Data file descriptions

This document describes all data files associated with this project. Each file has the following associated information:

File type will be one of:
- Reference file: Obtained from an external source/database. When known, the obtained data and a link to the external source are included.
- Modified reference file: Obtained from an external source/database but modified for OpenPBTA use.
- PBTA data file: Pediatric Brain Tumor Atlas data that are processed upstream of the OpenPBTA project, e.g., the output of a somatic single nucleotide variant method. Links to the relevant D3B Center or Kids First workflow (and version where applicable) are included in Origin.
- Analysis file: Any file created by a script in analyses/*.
Origin
- For PBTA data files, a link to the relevant D3B Center or Kids First workflow (and version where applicable).
- When applicable, a link to the specific module, workflow, or resource that produced (or modified, for Modified reference file types) the data.
File description
- A brief one-sentence description of what the file contains (e.g., bed files contain coordinates for features XYZ).

current release (release-v22-20220505)

File name	File Type	Origin	File Description
`fusion_summary_embryonal_foi.tsv`	Analysis file	`fusion-summary`	Summary file for presence of embryonal tumor fusions of interest
`fusion_summary_ependymoma_foi.tsv`	Analysis file	`fusion-summary`	Summary file for presence of ependymal tumor fusions of interest
`fusion_summary_ewings_foi.tsv`	Analysis file	`fusion-summary`	Summary file for presence of Ewing's sarcoma fusions of interest
`gencode.v27.primary_assembly.annotation.gtf.gz`	Reference file	GENCODE v27	hg38 gene annotation on primary assembly (reference chromosomes and scaffolds)
`GRCh38.primary_assembly.genome.fa.gz`	Reference file	GENCODE v27	hg38 primary assembly genome sequence FASTA file
`independent-specimens.wgs.primary-plus.tsv`	Analysis file	`independent-samples`	Independent specimens list for WGS samples, primary + non-primary when no primary sample is available
`independent-specimens.wgs.primary.tsv`	Analysis file	`independent-samples`	Independent specimens list for WGS samples, primary only
`independent-specimens.wgswxs.primary-plus.tsv`	Analysis file	`independent-samples`	Independent specimens list for WGS and WXS samples, primary + non-primary when no primary sample is available
`independent-specimens.wgswxs.primary.tsv`	Analysis file	`independent-samples`	Independent specimens list for WGS and WXS samples, primary only
`intersect_cds_lancet.bed`	Analysis file	`snv-callers`	Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with WXS 100bp padded BED regions and Lancet's WXS regions
`intersect_cds_lancet_strelka_mutect_WGS.bed`	Analysis file	`snv-callers`	Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with Lancet, Strelka2, Mutect2 regions
`intersect_strelka_mutect_WGS.bed`	Analysis file	`snv-callers`	Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with Strelka2 and Mutect2 regions called
`pbta-cnv-cnvkit-gistic.zip`	PBTA data file	Workflow	Somatic CNV - GISTIC 2.0 output using `pbta-cnv-cnvkit.seg` file input (WGS samples only)
`pbta-cnv-consensus-gistic.zip`	Analysis file	Workflow	Somatic CNV - GISTIC 2.0 output using `pbta-cnv-consensus.seg` file input (WGS samples only)
`pbta-cnv-cnvkit.seg.gz`	PBTA data file	Copy number variant calling Workflow	Somatic Copy Number Variant - CNVkit SEG file (WGS samples only)
`pbta-cnv-consensus.seg.gz`	Analysis file	`copy_number_consensus_call`	Somatic Copy Number Variant - consensus SEG file (WGS samples only)
`pbta-cnv-controlfreec.tsv.gz`	PBTA data file	Copy number variant calling Workflow	Somatic Copy Number Variant - TSV file that is a merge of ControlFreeC `*_CNVs` files (WGS samples only)
`consensus_seg_annotated_cn_autosomes.tsv.gz`	Analysis file	`focal-cn-file-preparation`	TSV file containing genes with copy number changes per biospecimen; autosomes only
`consensus_seg_annotated_cn_x_and_y.tsv.gz`	Analysis file	`focal-cn-file-preparation`	TSV file containing genes with copy number changes per biospecimen; sex chromosomes only
`consensus_seg_with_status.tsv.tsv`	Analysis file	`focal-cn-file-preparation`	TSV file containing chromosome locations with copy number changes and ploidy per biospecimen
`pbta-fusion-arriba.tsv.gz`	PBTA data file	Gene fusion detection Workflow	Fusion - Arriba TSV, annotated with FusionAnnotator
`pbta-fusion-putative-oncogenic.tsv`	Analysis file	`fusion_filtering`	Filtered and prioritized fusions
`pbta-fusion-recurrently-fused-genes-byhistology.tsv`	Analysis file	`fusion_filtering`	Recurrently-fused genes tabulated by broad histology
`pbta-fusion-recurrently-fused-genes-bysample.tsv`	Analysis file	`fusion_filtering`	Binary matrix that denotes the presence or absence of a recurrently fused gene in an individual RNA-seq specimen
`pbta-fusion-starfusion.tsv.gz`	PBTA data file	Gene fusion detection Workflow	Fusion - STARFusion TSV
`pbta-gene-counts-rsem-expected_count.polya.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM expected counts for poly-A samples (gene-level)
`pbta-gene-counts-rsem-expected_count.stranded.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM expected counts for stranded samples (gene-level)
`pbta-gene-expression-kallisto.polya.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - kallisto TPM for poly-A samples (transcript-level)
`pbta-gene-expression-kallisto.stranded.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - kallisto TPM for stranded samples (transcript-level)
`pbta-gene-expression-rsem-fpkm-collapsed.polya.rds`	Analysis file	`collapse-rnaseq`	Gene expression - RSEM FPKM for poly-A samples collapsed to gene symbol (gene-level)
`pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds`	Analysis file	`collapse-rnaseq`	Gene expression - RSEM FPKM for stranded samples collapsed to gene symbol (gene-level)
`pbta-gene-expression-rsem-fpkm.polya.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM FPKM for poly-A samples (gene-level)
`pbta-gene-expression-rsem-fpkm.stranded.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM FPKM for stranded samples (gene-level)
`pbta-gene-expression-rsem-tpm.polya.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM TPM for poly-A samples (gene-level)
`pbta-gene-expression-rsem-tpm.stranded.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM TPM for stranded samples (gene-level)
`pbta-histologies.tsv`	Analysis file	`molecular-subtype-integrate`	Harmonized clinical metadata file plus biospecimen molecular subtypes
`pbta-isoform-counts-rsem-expected_count.polya.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM expected counts for poly-A samples (transcript-level)
`pbta-isoform-counts-rsem-expected_count.stranded.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM expected counts for stranded samples (transcript-level)
`pbta-isoform-expression-rsem-tpm.polya.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM TPM for poly-A samples (transcript-level)
`pbta-isoform-expression-rsem-tpm.stranded.rds`	PBTA data file	Gene expression abundance estimation Workflow	Gene expression - RSEM TPM for stranded samples (transcript-level)
`pbta-mend-qc-manifest.tsv`	PBTA data file	MendQC Workflow	File to map MendQC output to biospecimen IDs
`pbta-mend-qc-results.tar.gz`	PBTA data file	MendQC Workflow	MendQC output files
`pbta-snv-consensus-mutation.maf.tsv.gz`	Analysis file	`snv-callers`	Consensus calls for SNVs and small indels; columns in the included file are derived from Strelka2.
`pbta-snv-scavenged-hotspots.maf.tsv.gz`	Analysis file	`hotspots-detection`	MAF of SNVs overlapping MSKCC hotspots database
`pbta-snv-consensus-mutation-tmb-all.tsv`	Analysis file	`snv-callers`	Tumor mutation burden statistics calculated from the Strelka2 and Mutect2 SNV consensus and the size of the intersection of the Strelka2 and Mutect2 BED windows.
`pbta-snv-consensus-mutation-tmb-coding.tsv`	Analysis file	`snv-callers`	Coding-only tumor mutation burden statistics calculated from the number of coding sequence Strelka2, Mutect2, and Lancet consensus SNVs and the size of the intersection of all three callers' BED windows and the Gencode v27 coding sequences.
`pbta-snv-lancet.vep.maf.gz`	PBTA data file	Somatic mutation calling Workflow	Somatic SNV - Lancet annotated MAF file
`pbta-snv-mutect2.vep.maf.gz`	PBTA data file	Somatic mutation calling Workflow	Somatic SNV - Mutect2 annotated MAF file
`pbta-snv-strelka2.vep.maf.gz`	PBTA data file	Somatic mutation calling Workflow	Somatic SNV - Strelka2 annotated MAF file
`pbta-snv-vardict.vep.maf.gz`	PBTA data file	Somatic mutation calling Workflow	Somatic SNV - VarDict annotated MAF file
`tcga-snv-consensus-snv.maf.tsv.gz`	Analysis file	`snv-callers`	TCGA Consensus calls for SNVs and small indels made from Mutect2, Strelka2, and Lancet.
`tcga-snv-mutation-tmb-all.tsv`	Analysis file	`snv-callers`	Tumor mutation burden calculations using all mutations identified by both Mutect2 and Strelka2 throughout the genome.
`tcga-snv-mutation-tmb-coding.tsv`	Analysis file	`snv-callers`	Tumor mutation burden calculations using only coding mutations identified by both Mutect2 and Strelka2 within coding sequence regions of the genome.
`pbta-star-log-final.tar.gz`	PBTA data file	Gene expression abundance estimation Workflow	STAR log final output files
`pbta-star-log-manifest.tsv`	PBTA data file	Gene expression abundance estimation Workflow	File to map STAR output to biospecimen IDs
`pbta-sv-manta.tsv.gz`	PBTA data file	Structural variant calling Workflow	Somatic Structural Variant - Manta output, annotated with AnnotSV (WGS samples only)
`pbta-tcga-manifest.tsv`	PBTA data file	Retrieved from GDC website API endpoint	Manifest of TCGA tumor/normal BAMs used for SNV calling, Tumor_Sample_Barcodes, and histologies
`pbta-tcga-snv-lancet.vep.maf.gz`	PBTA/TCGA data file	Somatic mutation calling Workflow	Somatic SNV - Lancet annotated MAF file
`pbta-tcga-snv-mutect2.vep.maf.gz`	PBTA data file	Somatic mutation calling Workflow	Somatic SNV - Mutect2 annotated MAF file
`pbta-tcga-snv-strelka2.vep.maf.gz`	PBTA data file	Somatic mutation calling Workflow	Somatic SNV - Strelka2 annotated MAF file
`StrexomeLite_hg38_liftover_100bp_padded.bed`	Modified Reference File	Somatic mutation calling Workflow	hg38 targeted panel regions used for all variant callers, each region padded by 100 bp
`StrexomeLite_Targets_CrossMap_hg38_filtered_chr_prefixed.bed`	Reference File	Somatic mutation calling; Link to file	hg38 lifted over targeted DNA panel bait capture regions provided by the kit manufacturer
`WGS.hg38.lancet.300bp_padded.bed`	Modified Reference File	Somatic mutation calling Workflow	WGS.hg38.lancet.unpadded.bed file with each region padded by 300 bp
`WGS.hg38.lancet.unpadded.bed`	Modified Reference File	Somatic mutation calling Workflow	hg38 WGS regions created using UTR, exome, and start/stop codon features of the GENCODE 31 reference, augmented with PASS variant calls from Strelka2 and Mutect2
`WGS.hg38.mutect2.vardict.unpadded.bed`	Modified Reference File	Somatic mutation calling Workflow	hg38 BROAD Institute interval calling list (restricted to Chr1-22,X,Y,M and non-N regions) used for Mutect2 and VarDict variant callers
`WGS.hg38.strelka2.unpadded.bed`	Modified Reference File	Somatic mutation calling Workflow	hg38 BROAD Institute interval calling list (restricted to Chr1-22,X,Y,M) used for Strelka2 variant caller
`WGS.hg38.vardict.100bp_padded.bed`	Modified Reference File	Somatic mutation calling Workflow	`WGS.hg38.mutect2.vardict.unpadded.bed` with each region padded by 100 bp used for VarDict variant caller
`WXS.hg38.100bp_padded.bed`	Modified Reference File	Somatic mutation calling Workflow	hg38 WXS regions provided by the kit manufacturer used for Strelka2, Mutect2, and VarDict variant callers with each region padded by 100 bp
`WXS.hg38.lancet.400bp_padded.bed`	Modified Reference File	Somatic mutation calling Workflow	hg38 WXS regions provided by the kit manufacturer used for the Lancet variant caller with each region padded by 400 bp
`intersected_whole_exome_agilent_designed_120_AND_tcga_6k_genes.Gh38.bed`	Modified Reference File	`tcga-capture-kit-investigation`	Generated using bedtools intersect from `tcga_6k_genes.targetIntervals.Gh38.bed` and `whole_exome_agilent_designed_120.targetIntervals.Gh38.bed`
`intersected_whole_exome_agilent_plus_tcga_6k_AND_tcga_6k_genes.Gh38.bed`	Modified Reference File	`tcga-capture-kit-investigation`	Generated using bedtools intersect from `tcga_6k_genes.targetIntervals.Gh38.bed` and `whole_exome_agilent_plus_tcga_6k.targetIntervals.Gh38.bed`
`tcga_6k_genes.targetIntervals.Gh38.bed`	Modified Reference File	`tcga-capture-kit-investigation`	hg38 version of `tcga_6k_genes.targetIntervals.bed` generated using CrossMap and bedtools sort and merge
`tcga_6k_genes.targetIntervals.bed`	Reference File	Downloaded via `tcga-capture-kit-investigation`	hg19 WXS target capture regions downloaded from GDC website API endpoint
`whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.Gh38.bed`	Modified Reference File	`tcga-capture-kit-investigation`	hg38 version of `whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed` generated using CrossMap and bedtools sort and merge
`whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed`	Reference File	Downloaded via `tcga-capture-kit-investigation`	hg19 WXS target capture regions downloaded from GDC website API endpoint
`whole_exome_agilent_designed_120.targetIntervals.Gh38.bed`	Modified Reference File	`tcga-capture-kit-investigation`	hg38 version of `whole_exome_agilent_designed_120.targetIntervals.bed` generated using CrossMap and bedtools sort and merge
`whole_exome_agilent_designed_120.targetIntervals.bed`	Reference File	Downloaded via `tcga-capture-kit-investigation`	hg19 WXS target capture regions downloaded from GDC website API endpoint
`whole_exome_agilent_plus_tcga_6k.targetIntervals.Gh38.bed`	Modified Reference File	`tcga-capture-kit-investigation`	hg38 version of `whole_exome_agilent_plus_tcga_6k.targetIntervals.bed` generated using CrossMap and bedtools sort and merge
`whole_exome_agilent_plus_tcga_6k.targetIntervals.bed`	Reference File	Downloaded via `tcga-capture-kit-investigation`	hg19 WXS target capture regions downloaded from GDC website API endpoint
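
Several descriptions above refer to BED regions that are padded, intersected, and then used as the denominator for tumor mutation burden. The following is a minimal Python sketch of those three operations, included only to illustrate the idea. It is not the project's implementation: the function names are hypothetical, the intervals are made up, and reporting burden as mutations per megabase is an assumption that the table does not state.

```python
from typing import List, Tuple

# A BED-style region: (chromosome, start, end), 0-based and half-open.
Region = Tuple[str, int, int]


def pad_regions(regions: List[Region], pad: int) -> List[Region]:
    """Extend each region by `pad` bp on both sides, clamping the start at 0."""
    return [(chrom, max(0, start - pad), end + pad) for chrom, start, end in regions]


def merge_regions(regions: List[Region]) -> List[Region]:
    """Sort regions and merge those that overlap or touch on the same chromosome."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect_regions(a: List[Region], b: List[Region]) -> List[Region]:
    """Return the overlapping parts of two region lists (quadratic, for clarity)."""
    out: List[Region] = []
    for chrom_a, start_a, end_a in merge_regions(a):
        for chrom_b, start_b, end_b in merge_regions(b):
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # keep only non-empty overlaps
                out.append((chrom_a, start, end))
    return sorted(out)


def total_size(regions: List[Region]) -> int:
    """Total number of bases covered, counting overlapping bases once."""
    return sum(end - start for _, start, end in merge_regions(regions))


def tmb_per_mb(n_mutations: int, regions: List[Region]) -> float:
    """Mutations per megabase of the surveyed window (assumed normalization)."""
    return n_mutations * 1_000_000 / total_size(regions)


# Made-up example: coding regions intersected with padded caller regions.
cds = [("chr1", 100, 200), ("chr1", 500, 600)]
caller = pad_regions([("chr1", 150, 160), ("chr1", 700, 710)], 100)
window = intersect_regions(cds, caller)
```

In this toy example `window` covers only the first coding region, because the second padded caller region begins exactly where the second coding region ends and half-open intervals that merely touch do not overlap.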
