Test data
Data for the TOBIAS test commands found in this wiki can be obtained using TOBIAS DownloadData:
$ TOBIAS DownloadData --bucket data-tobias-2020
$ mv data-tobias-2020/ test_data/
This downloads the test data (~700 MB) from the loosolab S3 storage server and moves it to the test_data/ directory.
The source of the test data is the paper "Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position", Buenrostro et al. 2013, Nature Methods (link). This paper applied ATAC-seq to the GM12878 lymphoblastoid cell line (derived from B cells) and to CD4+ T cells at three time points. The raw data from the study (study accession PRJNA207663) were downloaded as .fastq files from the following URLs:
sample_title | experiment_accession | fastq files |
---|---|---|
GM12878_ATACseq_50k_Rep1 | SRX298000 | read1,read2 |
GM12878_ATACseq_50k_Rep2 | SRX298001 | read1,read2 |
GM12878_ATACseq_50k_Rep3 | SRX298002 | read1,read2 |
GM12878_ATACseq_50k_Rep4 | SRX298003 | read1,read2 |
CD4+_ATACseq_Day1_Rep1 | SRX298007 | read1,read2 |
CD4+_ATACseq_Day1_Rep2 | SRX298008 | read1,read2 |
CD4+_ATACseq_Day2_Rep1 | SRX298009 | read1,read2 |
CD4+_ATACseq_Day2_Rep2 | SRX298010 | read1,read2 |
CD4+_ATACseq_Day3_Rep1 | SRX298011 | read1,read2 |
CD4+_ATACseq_Day3_Rep2 | SRX298012 | read1,read2 |
All samples were mapped using STAR. Replicates were merged per condition using samtools merge, yielding the .bam-files Bcell.bam, Tcell_day1.bam, Tcell_day2.bam and Tcell_day3.bam. To keep file sizes small, a random subset of reads was chosen for each replicate using samtools view -s <fraction>. For the sake of the examples, the Tcell samples were further merged into a single .bam-file, Tcell.bam.
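The subsampling and merging step can be sketched as a small bash function. This is an illustration only, not the exact commands used to build the test data: the function name, the seed (42), the replicate file names in the usage line, and the order of operations (subsample first, then merge) are assumptions. The fraction is left as an argument because the value used for the test data is not stated.

```bash
# Sketch: subsample each replicate, then merge the subsets into one BAM.
# Arguments: <output.bam> <fraction, e.g. 0.1> <replicate.bam> [...]
subsample_and_merge() {
    local out="$1" fraction="$2"
    shift 2
    local seed=42   # assumed seed; any fixed integer makes the subset reproducible
    local tmpdir rep sub
    local subsets=()
    tmpdir=$(mktemp -d)
    for rep in "$@"; do
        sub="$tmpdir/$(basename "$rep" .bam).sub.bam"
        # samtools view -s takes SEED.FRACTION: the integer part seeds the
        # sampler, the decimal part is the fraction of reads kept.
        # "${fraction#0}" strips the leading zero, so 0.1 becomes 42.1.
        samtools view -b -s "${seed}${fraction#0}" -o "$sub" "$rep"
        subsets+=("$sub")
    done
    # -f overwrites the output file if it already exists
    samtools merge -f "$out" "${subsets[@]}"
    rm -r "$tmpdir"
}
```

With hypothetical replicate files, a call could look like `subsample_and_merge Bcell.bam 0.1 rep1.bam rep2.bam`.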
Peak-calling was performed per replicate using MACS2 with parameters --nomodel --shift -100 --extsize 200 --broad. The file merged_peaks.bed represents peaks merged across the Bcell and Tcell conditions.
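The tool used to merge the peaks is not stated here. As an illustration of what merging across conditions means, the following sketch combines any number of peak files and collapses overlapping or directly adjacent intervals using only sort and awk. The function name and the input file names in the usage line are hypothetical, and a dedicated interval tool may well have been used instead.

```bash
# Sketch: merge overlapping peaks from several BED-like files.
# Only the first three columns (chrom, start, end) are kept.
merge_peaks() {
    cat "$@" | cut -f 1-3 | sort -k1,1 -k2,2n | awk 'BEGIN { FS = OFS = "\t" }
        # First interval: start a new merged region
        NR == 1 { chrom = $1; start = $2 + 0; end = $3 + 0; next }
        # Same chromosome and overlapping/adjacent: extend the current region
        $1 == chrom && $2 + 0 <= end { if ($3 + 0 > end) end = $3 + 0; next }
        # Otherwise: emit the finished region and start a new one
        { print chrom, start, end; chrom = $1; start = $2 + 0; end = $3 + 0 }
        END { if (NR > 0) print chrom, start, end }'
}
```

A call such as `merge_peaks Bcell_peaks.broadPeak Tcell_peaks.broadPeak > merged_peaks.bed` would then write one three-column interval per merged region.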
The .gtf-file used for annotation was downloaded from Ensembl (link). The chromosome prefix "chr" was added, and the file was then subset to chr4.
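The prefixing and subsetting can be reproduced with a short awk filter. This is a minimal sketch, assuming a tab-separated GTF whose first column holds bare chromosome names such as 4; the function name and the input file name in the usage line are hypothetical.

```bash
# Sketch: add a "chr" prefix to the sequence names of a GTF read from
# stdin and keep only records on the requested chromosome.
# Header/comment lines (starting with "#") are passed through unchanged.
add_chr_and_subset() {
    local chrom="$1"
    awk -v chrom="$chrom" 'BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }
        { $1 = "chr" $1 }
        $1 == chrom { print }'
}
```

For example, `add_chr_and_subset chr4 < ensembl_annotation.gtf > transcripts_chr4.gtf` would produce the chr4 annotation file passed to UROPA.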
Annotation of peaks in merged_peaks.bed was performed using UROPA as shown here:
$ uropa --bed merged_peaks.bed --gtf transcripts_chr4.gtf --show_attributes gene_id gene_name --feature_anchor start --distance 20000 10000 --feature gene
The test files are obtained with:
$ cut -f 1-6,16-17 merged_peaks_finalhits.txt | head -n 1 > merged_peaks_annotated_header.txt
$ cut -f 1-6,16-17 merged_peaks_finalhits.txt | tail -n +2 > merged_peaks_annotated.bed
The file motifs.jaspar contains 83 motifs from the JASPAR 2020 vertebrate database (download here). The motifs found in test_data/individual_motifs/ were obtained using TOBIAS FormatMotifs --task split.
The file blacklist.bed is a subset of the Boyle-lab blacklist (available here) containing only chr4 regions.
Additional files are obtained using the test commands throughout this wiki.