Skip to content

Preprocessing

Vivekanandan Ramalingam edited this page Mar 19, 2024 · 12 revisions

Preprocessing steps to generate bigwig counts tracks

Merge the two replicates and create and index

samtools merge -f merged.bam rep1.bam rep2.bam
samtools index merged.bam

We will also index the control BAM file

samtools index control.bam

Create bigwig files using bedtools via intermediate bedGraph files

In addition to creating the bigwig files, at this step we will filter the bam files to keep only the chromosomes that we want to use in the model. In the example shown below, we do this using samtools view and the hg38.chrom.sizes reference file.

Experiment

# get coverage of 5’ positions of the plus strand
samtools view -b merged.bam $(cut -f 1 hg38.chrom.sizes) | \
	bedtools genomecov -5 -bg -strand + -ibam stdin | \
	sort -k1,1 -k2,2n > plus.bedGraph

# get coverage of 5’ positions of the minus strand
samtools view -b merged.bam $(cut -f 1 hg38.chrom.sizes) | \
        bedtools genomecov -5 -bg -strand - -ibam stdin | \
        sort -k1,1 -k2,2n > minus.bedGraph

# Convert bedGraph files to bigWig files
bedGraphToBigWig plus.bedGraph hg38.chrom.sizes plus.bw
bedGraphToBigWig minus.bedGraph hg38.chrom.sizes minus.bw

Control

# get coverage of 5’ positions of the control plus strand
samtools view -b control.bam $(cut -f 1 hg38.chrom.sizes) | \
        bedtools genomecov -5 -bg -strand + -ibam stdin | \
        sort -k1,1 -k2,2n > control_plus.bedGraph

# get coverage of 5' positions of the control minus strand
samtools view -b control.bam $(cut -f 1 hg38.chrom.sizes) | \
        bedtools genomecov -5 -bg -strand - -ibam stdin | \
        sort -k1,1 -k2,2n > control_minus.bedGraph

# Convert bedGraph files to bigWig files
bedGraphToBigWig control_plus.bedGraph hg38.chrom.sizes control_plus.bw
bedGraphToBigWig control_minus.bedGraph hg38.chrom.sizes control_minus.bw

Identify peaks

For the purposes of this tutorial we will use the optimal IDR thresholded peaks that are already available in the ENCODE data portal. We will use the the narrowPeak files that are in BED6+4 format. Explanation of what each of the 10 fields means can be found here. Currently, only this format is supported but in the future support for more formats will be added.

See image below that shows the file listed in the ENCODE data portal

Download the file:

wget https://www.encodeproject.org/files/ENCFF396BZQ/@@download/ENCFF396BZQ.bed.gz -O peaks.bed.gz
gunzip peaks.bed.gz

Organize you data

At this point, we suggest creating a directory structure to store the data, models, predictions, metrics, importance scores, discovered motifs, plots & visualizations etc. that will make it easier for you to organize and maintain your work. Let's start by creating a parent directory for the experiment and moving the bigwig files and peaks file from section 1.1 & 1.2 to a data directory

mkdir ENCSR000EGM
mkdir ENCSR000EGM/data
mv *.bw ENCSR000EGM/data
mv peaks.bed ENCSR000EGM/data

Once this is done, your directory hierarchy should resemble this

Note that the relative paths in subsequent pipeline steps assume that the current working directory is the one immediately above the project directory. For example, if you are in the ENCSR000EGM folder, navigate up one level with cd ...

Make a reference directory at the same level as the ENCSR000EGM experiment directory. In the reference directory we will place 4 files: the genome reference hg38.genome.fa, the indexed reference hg38.genome.fai, the chromosome sizes file hg38.chrom.sizes and one text file that contains a list of chromosomes we care about chroms.txt. blacklist.bed contains all the blacklist regions with artifact peaks.

mkdir ENCSR000EGM/reference 
mv hg38.genome.fa* ENCSR000EGM/reference
mv hg38.chrom.sizes ENCSR000EGM/refence
mv chroms.txt ENCSR000EGM/reference
mv blacklist.bed ENCSR000EGM/reference

The directory structure should look like this (TODO: update this image):