-
Notifications
You must be signed in to change notification settings - Fork 0
Preprocessing
samtools merge -f merged.bam rep1.bam rep2.bam
samtools index merged.bam
We will also index the control BAM file
samtools index control.bam
In addition to creating the bigwig files, at this step we will filter the bam files
to keep only the chromosomes that we want to use in the model. In the example shown below,
we do this using samtools view
and the hg38.chrom.sizes
reference file.
Experiment
# get coverage of 5’ positions of the plus strand
samtools view -b merged.bam $(cut -f 1 hg38.chrom.sizes) | \
bedtools genomecov -5 -bg -strand + -ibam stdin | \
sort -k1,1 -k2,2n > plus.bedGraph
# get coverage of 5’ positions of the minus strand
samtools view -b merged.bam $(cut -f 1 hg38.chrom.sizes) | \
bedtools genomecov -5 -bg -strand - -ibam stdin | \
sort -k1,1 -k2,2n > minus.bedGraph
# Convert bedGraph files to bigWig files
bedGraphToBigWig plus.bedGraph hg38.chrom.sizes plus.bw
bedGraphToBigWig minus.bedGraph hg38.chrom.sizes minus.bw
Control
# get coverage of 5’ positions of the control plus strand
samtools view -b control.bam $(cut -f 1 hg38.chrom.sizes) | \
bedtools genomecov -5 -bg -strand + -ibam stdin | \
sort -k1,1 -k2,2n > control_plus.bedGraph
# get coverage of 5' positions of the control minus strand
samtools view -b control.bam $(cut -f 1 hg38.chrom.sizes) | \
bedtools genomecov -5 -bg -strand - -ibam stdin | \
sort -k1,1 -k2,2n > control_minus.bedGraph
# Convert bedGraph files to bigWig files
bedGraphToBigWig control_plus.bedGraph hg38.chrom.sizes control_plus.bw
bedGraphToBigWig control_minus.bedGraph hg38.chrom.sizes control_minus.bw
For the purposes of this tutorial we will use the optimal IDR thresholded peaks that are already available in the ENCODE data portal. We will use the the narrowPeak files that are in BED6+4 format. Explanation of what each of the 10 fields means can be found here. Currently, only this format is supported but in the future support for more formats will be added.
See image below that shows the file listed in the ENCODE data portal
Download the file:
wget https://www.encodeproject.org/files/ENCFF396BZQ/@@download/ENCFF396BZQ.bed.gz -O peaks.bed.gz
gunzip peaks.bed.gz
At this point, we suggest creating a directory structure to store the data, models, predictions, metrics, importance scores, discovered motifs, plots & visualizations etc. that will make it easier for you to organize and maintain your work. Let's start by creating a parent directory for the experiment and moving the bigwig files and peaks file from section 1.1 & 1.2 to a data directory
mkdir ENCSR000EGM
mkdir ENCSR000EGM/data
mv *.bw ENCSR000EGM/data
mv peaks.bed ENCSR000EGM/data
Once this is done, your directory hierarchy should resemble this
Note that the relative paths in subsequent pipeline steps assume that the current working directory is the one
immediately above the project directory. For example, if you are in the ENCSR000EGM
folder, navigate up one level with cd ..
.
Make a reference
directory at the same level as the ENCSR000EGM
experiment directory. In the reference
directory we will place 4 files: the genome reference hg38.genome.fa
, the indexed reference hg38.genome.fai
, the chromosome sizes file hg38.chrom.sizes
and one text file that contains a list of chromosomes we care about chroms.txt
. blacklist.bed
contains all the blacklist regions with artifact peaks.
mkdir ENCSR000EGM/reference
mv hg38.genome.fa* ENCSR000EGM/reference
mv hg38.chrom.sizes ENCSR000EGM/refence
mv chroms.txt ENCSR000EGM/reference
mv blacklist.bed ENCSR000EGM/reference
The directory structure should look like this (TODO: update this image):