This document covers the mapping, filtering, peak calling, and plotting of whole-animal ATAC-seq and Cut&Tag data to identify cis-regulatory elements in the strain AEP H.vulgaris genome.
[TOC]
To analyze our whole-animal ATAC-seq data for the AEP strain of H. vulgaris, we begun by aligning each of the three biological replicates to the AEP assembly. The mapping process we used was as follows: we first used Trimmomatic (v0.36) to remove low quality base calls and contaminating adapter sequences. We also generated FASTQC (v0.11.4) reports before and after the Trimmomatic step to ensure the filtering was successful.
We next mapped the filtered reads to both the AEP assembly as well as the Hydra mitochondrial genome (downloaded from here and here) using bowtie 2 (v2.2.6). The ATAC-seq protocol targets all DNA, including mitochondrial DNA. This means that often a sizable portion of ATAC-seq reads are mitochondrial reads that provide no insight into chromatin accessibility in the nuclear genome. We therefore removed all reads from our AEP assembly-mapped alignment file that aligned to the mitochondrial genome.
We then performed some filtering of the aligned reads to remove unmapped or ambiguously mapped reads using samtools (v1.12) and Picard Tools (v2.17.8). Finally, we removed PCR duplicate sequences to prevent PCR biases introduced when amplifying the ATAC-seq libraries from interfering with the quantitative analysis of read counts.
The final output from this step was three bam files (AEP1_final.bam
, AEP2_final.bam
, and AEP3_final.bam
), one for each biological replicate, containing all non-mitochondrial dedupliated uniquely mapped ATAC-seq reads.
Prior to running the mapping pipeline we first had to prepare bowtie references for both the mitochondrial and AEP assembly fasta files:
bowtie2-build 105_mitochondrial_genome.fa hydra_mito
bowtie2-build HAVEP1_genome.fa hydra_genome
We then used the following script for the alignment pipeline:
(01_atacPeaks/ATAC_Mapping_Pipeline.sh)
#! /bin/bash
# This code takes raw ATAC-seq reads in a fastq.gz format and generates a BAM file
# This BAM file contains only unambiguously mapped, unique, non-mitochondrial reads
# These BAM files are then used for downstream peak calling
########################
# The only argument needed for this script is the unique sample ID
prefix="$1"
# create an initial fastqc report of both forward and reverse reads
# we'll compare this to the filtered reads to ensure that trimming
# was effective
echo "initial fastqc report"
fastqc -o . -t 2 resources/"$prefix"_ATAC_R*.fastq.gz
mv "$prefix"_ATAC_R1_fastqc.html "$prefix"_ATAC_R1_unfiltered_fastqc.html
mv "$prefix"_ATAC_R2_fastqc.html "$prefix"_ATAC_R2_unfiltered_fastqc.html
mv "$prefix"_ATAC_R1_fastqc.zip "$prefix"_ATAC_R1_unfiltered_fastqc.zip
mv "$prefix"_ATAC_R2_fastqc.zip "$prefix"_ATAC_R2_unfiltered_fastqc.zip
# use trimmomatic to remove adapter sequences and
# stretches of low quality base calls.
# It also sets the minimum read size to 32
echo "filtering reads"
java -jar resources/trimmomatic-0.36.jar PE -threads 16 -phred33 \
resources/"$prefix"_ATAC_R1.fastq.gz resources/"$prefix"_ATAC_R2.fastq.gz \
"$prefix"_ATAC_R1_trim_P.fastq.gz "$prefix"_ATAC_R1_trim_UP.fastq.gz \
"$prefix"_ATAC_R2_trim_P.fastq.gz "$prefix"_ATAC_R2_trim_UP.fastq.gz \
ILLUMINACLIP:resources/NexteraPE-PE.fa:2:30:10:2:true TRAILING:3 \
SLIDINGWINDOW:4:15 MINLEN:32
rm "$prefix"*trim_UP.fastq.gz
# create a fastqc report of filtered reads
# this ensures filtering worked well
# you should see: high phred scores across the entire
# read length, a read length distribution that reflects
# expectations for ATAC-seq data, and no overrepresented sequences
# looking at the GC distribution can also give you a
# sense of bacterial contamination (curvibacter GC% is around 60)
echo "post-filtering fastqc report"
fastqc -o . -t 2 "$prefix"_ATAC_R*_trim_P.fastq.gz
# map the filtered reads to the H. mag 105 genome assembly (dovetail 1.0)
# here I'm opting for quite sensitive, but relatively slow mapping
# This will require that you generate a bowtie2 reference in the
# resources folder using the H. mag 105 reference genome
# That can be found here:
# research.nhgri.nih.gov/hydra/download/?dl=asl
echo "mapping to genome reference"
bowtie2 -X 1000 --very-sensitive-local --mm --threads 16 -S "$prefix"_genome \
-x resources/hydra_genome \
-1 "$prefix"_ATAC_R1_trim_P.fastq.gz -2 "$prefix"_ATAC_R2_trim_P.fastq.gz
cat "$prefix".log "$prefix"_genome.metfile
rm "$prefix"_genome.metfile
# because the H. mag genome has mitochondrial sequence contamination,
# we need to map reads separately to the mitochondrial reference
# this let's us identify which reads are mitochondrial, so they
# can be removed downstream
# we only need the read IDs, so we just pull the mapped read
# IDs and disregard all other output
echo "mapping to mitochondrial reference"
bowtie2 -X 1000 --very-sensitive-local --mm --threads 12 -x resources/hydra_mito \
-1 "$prefix"_ATAC_R1_trim_P.fastq.gz -2 "$prefix"_ATAC_R2_trim_P.fastq.gz | \
samtools view -S -F 4 - | cut -f 1 | sort | uniq > "$prefix"_mito_IDs.txt
cat "$prefix".log "$prefix"_mito.metfile
rm "$prefix"_mito.metfile
# We then filter out mitochondrial reads from the genome mapped reads
# using the read IDs we generated above
echo "removing mitochondrial reads"
java -Xmx16g -jar resources/picard.jar FilterSamReads \
I="$prefix"_genome \
O="$prefix"_genome_MF.sam \
READ_LIST_FILE="$prefix"_mito_IDs.txt \
FILTER=excludeReadList
# Here we filter reads to only include mapped read pairs with a
# mapping quality of 3 or greater
# We then sort them by name, which is required for the fixmate command
# we're performing next
echo "filtering ambiguous mappers and incoherent read pairs"
samtools view -Su -F 524 -q 3 "$prefix"_genome_MF.sam | samtools sort -n -T "$prefix" -o "$prefix".bam
# fixmate fills in information about read pairs, such as the distance
# between R1 and R2
samtools fixmate -r "$prefix".bam "$prefix".fix.tmp.bam
# We'll again filter, this time removing improperly paired reads
# We then sort by coordinate
samtools view -u -F 524 -f 2 "$prefix".fix.tmp.bam | samtools sort -T "$prefix".fix -o "$prefix".fix.bam
rm "$prefix".fix.tmp.bam "$prefix".bam "$prefix"_genome \
"$prefix"_mito_IDs.txt "$prefix"_genome_MF.sam
# We'll then use picard to mark perfectly duplicated read pairs
# so that they can be removed later
echo "marking PCR duplicates"
java -Xmx50g -jar resources/picard.jar MarkDuplicates \
I="$prefix".fix.bam \
O="$prefix"_markedDup.bam \
M="$prefix"_markedDup_metrics.txt \
ASSUME_SORT_ORDER=coordinate
# the code below creates a report giving you a sense of how many PCR duplicates
# were found in the data
# these metrics were established by ENCODE and give a sense of how
# much bottlenecking occured during the library prep
# this code is taken from the ENCODE ATAC pipeline (v1)
# it can be found here:
# encodeproject.org/pipelines/ENCPL792NWO/
# the table is formated as such:
# TotalReadPairs [tab] DistinctReadPairs [tab] OneReadPair [tab]
# TwoReadPairs [tab] NRF=Distinct/Total [tab] PBC1=OnePair/Distinct [tab]
# PBC2=OnePair/TwoPair
# for more info on these metrics, see:
# encodeproject.org/atac-seq/#standards
echo "generating PBC report"
bedtools bamtobed -i "$prefix"_markedDup.bam \
| awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$6}' \
| sort \
| uniq -c \
| awk 'BEGIN{mt=0;m0=0;m1=0;m2=0} ($1==1){m1=m1+1} ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1} END{printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n",mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}' > "$prefix"_PBC.txt
# We then create the final bam for downstream analysis by removing PCR duplicates
echo "removing PCR duplicates"
samtools view -F 1804 -b -o "$prefix"_final.bam "$prefix"_markedDup.bam
rm "$prefix"_markedDup.bam "$prefix".fix.bam
echo "done"
We executed the above pipeline on a computing cluster using this script:
(01_atacPeaks/slurmMap.sh)
#!/bin/bash -l
#SBATCH --job-name=ATAC_Pipeline
#SBATCH -p med
#SBATCH -c 16
#SBATCH -t 7-0
#SBATCH --array=0-2
#SBATCH --mem=0
#SBATCH --error=ATAC_Pipeline_%a.err
#SBATCH --output=ATAC_Pipeline_%a.out
module load bowtie2
module load samtools
module load fastqc/0.11.4
module load bedtools
module load java/1.8
array=(AEP1 AEP2 AEP3)
./resources/ATAC_Mapping_Pipeline.sh ${array[$SLURM_ARRAY_TASK_ID]}
To identify candidate CRE coordinates from our aligned ATAC-seq data. We used Macs2 (v2.2.7.1) to call peaks for each of our ATAC-seq biological replicates. We then used the IDR python package (v2.0.3) to make pairwise comparisons between the peak calling results of different biological replicates to calculate three sets of irreproducible discovery rate (IDR) scores, which reflect the likelihood that a given peak was reproducible across two biological replicates. We use these IDR scores in a subsequent step to generate our consensus peak set.
As part of our peak calling pipeline, we also generated a merged version of the data that combined all three biological replicates. We called peaks on this merged dataset as well, which served as the starting point for generating our consensus peak set (see below). Finally, we used deepTools (v3.5.0) generated bigwig files to facilitate viewing ATAC-seq read density in a genome browser.
(01_atacPeaks/ATAC_Peak_Pipeline.sh)
#! /bin/bash -l
shopt -s nullglob
# The only argument needed for this script is the unique sample ID
prefix="$1"
conda activate deepEnv
#index bam files
echo "indexing bam files"
for arg in "$prefix"[^_]*_final.bam;
do
echo "$arg"
samtools index "$arg"
done
echo "done"
#namesort bam files
echo "shifting reads"
for arg in "$prefix"[^_]*_final.bam;
do
rep="${arg/_final.bam/}"
echo "$rep"
alignmentSieve -b "$arg" -o "$rep"_final_shift.bam -p 16 --ATACshift
echo "resorting shifted reads"
samtools sort -T "$rep".sort -o "$rep"_final_shift.sort.bam "$rep"_final_shift.bam
mv "$rep"_final_shift.sort.bam "$rep"_final_shift.bam
samtools index "$rep"_final_shift.bam
done
echo "done"
#make self pseudoreplicates
echo "making self-pseudoreplicates"
for arg in "$prefix"[^_]*_final_shift.bam;
do
rep="${arg/_final_shift.bam/}"
echo "$rep"
echo "generating first psuedoreplicate"
samtools view -s 1234.5 -b -@ 16 -o "$rep"_PR1_final_shift.bam "$arg"
echo "generating second psuedoreplicate"
samtools view "$rep"_PR1_final_shift.bam | cut -f 1 > "$rep"_PR1_qname.txt
java -Xmx32g -jar resources/picard.jar FilterSamReads \
I="$arg" \
O="$rep"_PR2_final_shift.bam \
READ_LIST_FILE="$rep"_PR1_qname.txt \
VALIDATION_STRINGENCY=SILENT \
FILTER=excludeReadList \
QUIET=true
rm "$rep"_PR1_qname.txt
done
#pool replicates
echo "pooling replicates"
samtools merge "$prefix"_MG_final_shift.bam "$prefix"[1-9]_final_shift.bam
samtools index "$prefix"_MG_final_shift.bam
#split pooled rep into psuedoreps (same number as total reps)
echo "generating pseudoreplicates from pooled counts"
numReps=("$prefix"[^_]*_final.bam)
numReps=${#numReps[@]}
echo "splitting into $numReps files"
count=1
cp "$prefix"_MG_final_shift.bam "$prefix"_MG_final_shift.sub.bam
while [ $numReps -gt 1 ]
do
subSampleValue="$(Rscript resources/generateSubsampleValue.R $numReps | cut -f 2)"
echo "$numReps"
echo "$count"
echo "$subSampleValue"
samtools view -s "$subSampleValue" -b -@ 16 \
-o "$prefix"_MG_PR"$count"_final_shift.bam "$prefix"_MG_final_shift.sub.bam
samtools view "$prefix"_MG_PR"$count"_final_shift.bam | cut -f 1 > "$prefix"_PR_qname.txt
java -Xmx32g -jar resources/picard.jar FilterSamReads \
I= "$prefix"_MG_final_shift.sub.bam \
O= "$prefix"_MG_final_shift.sub.tmp.bam \
READ_LIST_FILE="$prefix"_PR_qname.txt \
VALIDATION_STRINGENCY=SILENT \
FILTER=excludeReadList \
QUIET=true
rm "$prefix"_MG_final_shift.sub.bam
mv "$prefix"_MG_final_shift.sub.tmp.bam "$prefix"_MG_final_shift.sub.bam
count=$(( $count + 1 ))
numReps=$(( $numReps - 1 ))
done
mv "$prefix"_MG_final_shift.sub.bam "$prefix"_MG_PR"$count"_final_shift.bam
rm "$prefix"_PR_qname.txt
echo "Calling peaks"
for arg in "$prefix"*_final_shift.bam;
do
rep="${arg/_final_shift.bam/}"
echo "$rep"
macs2 callpeak \
-t "$arg" -f BAMPE -n "$rep" -g 9e8 -p 0.1 \
--nomodel --keep-dup all
sort -k 8gr,8gr "$rep"_peaks.narrowPeak \
| awk 'BEGIN{OFS="\t"}{$4="Peak_"NR ; print $0}' > "$rep".narrowPeak
rm -f "$rep"_peaks.narrowPeak "$rep"_peaks.xls
done
rm *summits*
echo "done"
#perform idr for all reps
echo "Performing IDR on biological replicates"
for i in "$prefix"[1-9].narrowPeak
do
for j in "$prefix"[1-9].narrowPeak
do
if [[ "$i" < "$j" ]]; then
echo "$i"
echo "$j"
inputFile1="$i"
prefix1="${inputFile1/.narrowPeak/}"
inputFile2="$j"
prefix2="${inputFile2/.narrowPeak/}"
idr --samples "$i" "$j" \
--peak-list "$prefix"_MG.narrowPeak --input-file-type narrowPeak \
--output-file "$prefix1"_"$prefix2".idr --rank p.value --soft-idr-threshold 0.1
IDR_THRESH_TRANSFORMED=$(awk -v p=0.1 'BEGIN{print -log(p)/log(10)}')
awk 'BEGIN{OFS="\t"} $12>='"${IDR_THRESH_TRANSFORMED}"' \
{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' "$prefix1"_"$prefix2".idr | \
sort | \
uniq | \
sort -k7n,7n > "$prefix1"_"$prefix2".IDR.narrowPeak
fi
done
done
echo "done"
#perform idr for all self-pseudoreps
echo "Perfoming IDR on self-pseudoreplicates"
for arg in "$prefix"[1-9]_final_shift.bam
do
rep="${arg/_final_shift.bam/}"
echo "$rep"
inputFile1="$rep"_PR1.narrowPeak
prefix1="${inputFile1/.narrowPeak/}"
inputFile2="$rep"_PR2.narrowPeak
prefix2="${inputFile2/.narrowPeak/}"
idr --samples "$inputFile1" "$inputFile2" \
--peak-list "$rep".narrowPeak --input-file-type narrowPeak \
--output-file "$prefix1"_"$prefix2".idr --rank p.value --soft-idr-threshold 0.1
IDR_THRESH_TRANSFORMED=$(awk -v p=0.1 'BEGIN{print -log(p)/log(10)}')
awk 'BEGIN{OFS="\t"} $12>='"${IDR_THRESH_TRANSFORMED}"' \
{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' "$prefix1"_"$prefix2".idr | \
sort | \
uniq | \
sort -k7n,7n > "$prefix1"_"$prefix2".IDR.narrowPeak
done
echo "done"
#perform idr for all pooled pseudoreps
echo "Perfoming IDR on pooled pseudoreplicates"
for i in "$prefix"_MG_PR*.narrowPeak
do
for j in "$prefix"_MG_PR*.narrowPeak
do
if [[ "$i" < "$j" ]]; then
echo "$i"
echo "$j"
inputFile1="$i"
prefix1="${inputFile1/.narrowPeak/}"
inputFile2="$j"
prefix2="${inputFile2/.narrowPeak/}"
idr --samples "$i" "$j" \
--peak-list "$prefix"_MG.narrowPeak --input-file-type narrowPeak \
--output-file "$prefix1"_"$prefix2".idr --rank p.value --soft-idr-threshold 0.1
IDR_THRESH_TRANSFORMED=$(awk -v p=0.1 'BEGIN{print -log(p)/log(10)}')
awk 'BEGIN{OFS="\t"} $12>='"${IDR_THRESH_TRANSFORMED}"' \
{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' "$prefix1"_"$prefix2".idr | \
sort | \
uniq | \
sort -k7n,7n > "$prefix1"_"$prefix2".IDR.narrowPeak
fi
done
done
echo "done"
conda activate deepEnv
#generate bigwig files
echo "Generating bigwig tracks"
for arg in "$prefix"[1-9]_final_shift.bam
do
echo "$arg"
bamCoverage -b "$arg" -o "${arg/.bam/.bw}" \
-of "bigwig" -bs 10 -p 16 --normalizeUsing "CPM"
done
echo "$prefix"_MG_final_shift.bam
bamCoverage -b "$prefix"_MG_final_shift.bam -o "$prefix"_MG_final_shift.bw \
-of "bigwig" -bs 10 -p 16 --normalizeUsing "CPM"
echo "done"
echo "cleaning up"
rm regions_"$prefix"*ATAC.bed Values_"$prefix"*.txt "$prefix"*matrix.gz "$prefix"*.idr
echo "done"
The peak pipeline script using the following accessory R script to generate seed values for subsetting bam files:
(01_atacPeaks/generateSubsampleValue.R)
options(digits=10)
args <- commandArgs(trailingOnly=TRUE)
subSampleValue <-sample(1:9999,1)
subSampleValue <- subSampleValue + (1/as.numeric(args[1]))
cat(subSampleValue)
We executed the peak calling pipeline on a computing cluster using the following script:
(01_atacPeaks/slurmPeak.sh)
#! /bin/bash -l
#SBATCH --job-name=ATAC_Peaks
#SBATCH -p med
#SBATCH -c 16
#SBATCH -t 7-0
#SBATCH --mem=0
#SBATCH --error=ATAC_Peaks.err
#SBATCH --output=ATAC_Peaks.out
conda activate deepEnv
resources/ATAC_Peak_Pipeline.sh AEP
When calculating the IDR score, we removed any peaks that had a score > 0.1. To generate our consensus set of reproducible peaks, we used BedTools (v2.29.2) to identify peaks from the merged ATAC-seq data set (AEP_MG.narrowPeak
) that overlapped with peaks from each of our pairwise IDR comparisons.
bedtools intersect -e -f 0.5 -F 0.5 -c -a AEP_MG.narrowPeak -b AEP1_AEP2.IDR.narrowPeak > 12Int.bed
bedtools intersect -e -f 0.5 -F 0.5 -c -a AEP_MG.narrowPeak -b AEP1_AEP3.IDR.narrowPeak > 13Int.bed
bedtools intersect -e -f 0.5 -F 0.5 -c -a AEP_MG.narrowPeak -b AEP2_AEP3.IDR.narrowPeak > 23Int.bed
We then used the following R script to subset the AEP_MG.narrowPeak
peak list to include only those peaks that intersected with a peak from at least two of the IDR peak sets. The output from this script (consensusAEP.bed
) was our consensus ATAC-seq peak set that we used for subsequent analyses.
(01_atacPeaks/getCon.R)
int12 <- read.delim("/Volumes/Data/Cut&Stuff/ATAC/12Int.bed",header = F, sep = "\t")
int13 <- read.delim("/Volumes/Data/Cut&Stuff/ATAC/13Int.bed",header = F, sep = "\t")
int23 <- read.delim("/Volumes/Data/Cut&Stuff/ATAC/23Int.bed",header = F, sep = "\t")
consensus <- int12[(int12[,11] + int13[,11] + int23[,11]) > 1,1:6]
write.table(consensus, file = "/Volumes/Data/Cut&Stuff/ATAC/consensusAEP.bed", sep = '\t', quote = F, col.names = F, row.names = F)
Our CUT&Tag mapping approach was quite similar to the one we used for our ATAC-seq data. We filtered reads using Trimmomatic and verified the output using FASTQC. We then mapped the filtered reads to the AEP assembly. Finally, we removed duplicate reads from the resulting alignment file.
(02_cutAndTagPeaks/mapPipe.sh)
#!/bin/bash
source ~/.bash_profile
prefix="$1"
echo "$prefix"
[ ! -d "fastqc_raw" ] && mkdir fastqc_raw
echo "initial fastqc report"
fastqc -o fastqc_raw -t 2 resources/reads/raw/"$prefix"_R*.fastq.gz
echo "filtering reads"
java -jar resources/trimmomatic-0.36.jar PE -threads 24 -phred33 \
resources/reads/raw/"$prefix"_R1.fastq.gz \
resources/reads/raw/"$prefix"_R2.fastq.gz \
resources/reads/trim/"$prefix"_R1_trim_P.fastq.gz \
resources/reads/trim/"$prefix"_R1_trim_UP.fastq.gz \
resources/reads/trim/"$prefix"_R2_trim_P.fastq.gz \
resources/reads/trim/"$prefix"_R2_trim_UP.fastq.gz \
ILLUMINACLIP:resources/NexteraPE-PE.fa:2:30:10:2:keepBothReads \
TRAILING:3 \
LEADING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:32
rm resources/reads/trim/"$prefix"*trim_UP.fastq.gz
[ ! -d "fastqc_filt" ] && mkdir fastqc_filt
echo "post-filtering fastqc report"
fastqc -o fastqc_filt -t 2 resources/reads/trim/"$prefix"_R*_trim_P.fastq.gz
echo "mapping to genome reference"
bowtie2 -X 700 \
-I 10 \
--very-sensitive-local \
--threads 22 \
--no-mixed \
--no-discordant \
-x resources/hydra_genome \
-1 resources/reads/trim/"$prefix"_R1_trim_P.fastq.gz \
-2 resources/reads/trim/"$prefix"_R2_trim_P.fastq.gz | samtools view -b -o "$prefix".gen.bam -
samtools view -Su -F 524 -q 2 "$prefix".gen.bam |
samtools sort -n -T "$prefix" -o "$prefix"_allMapped.tmp.bam
echo "removing duplicates"
samtools fixmate -@ 24 -r -m "$prefix"_allMapped.tmp.bam "$prefix".fix.tmp.bam
samtools sort -T "$prefix" -@ 24 -o "$prefix".fix.sort.tmp.bam "$prefix".fix.tmp.bam
samtools markdup -@ 24 -T "$prefix" -r -s "$prefix".fix.sort.tmp.bam "$prefix".final.bam > "$prefix".dupStats.txt
samtools index "$prefix".final.bam
rm "$prefix"*.tmp.bam
echo "done"
We executed this mapping pipeline on a computing cluster using the following script:
(02_cutAndTagPeaks/slurmMapPipe.sh)
#!/bin/bash
#SBATCH --job-name=CnT_Map
#SBATCH -c 24
#SBATCH -p med
#SBATCH -t 60-0
#SBATCH --array=0-11
#SBATCH --mem=0
#SBATCH --error=CnT_Map_%a.err
#SBATCH --output=CnT_Map_%a.out
module load fastqc/0.11.4
module load java/1.8
array=( H273_1 H273_2 H273_3 \
H41_1 H41_2 H41_3 \
H43_1 H43_2 H43_3 \
IGG_1 IGG_2 IGG_3 )
./resources/mapPipe.sh ${array[$SLURM_ARRAY_TASK_ID]}
Our peak calling approach with CUT&Tag differed from our ATAC-seq approach because our CUT&Tag data included negative controls that could be used as a baseline. Instead of Macs2, we opted to use SEACR (v1.3) to call peaks on our CUT&Tag data, as it was specifically designed to work well with CUT&Tag data.
SEACR requires read depth data in a bedgraph format. Because we needed to prep the negative control IgG samples before we could process any other samples, we used the following script to generate the necessary negative control files prior to running the main peak calling script.
(02_cutAndTagPeaks/iggProcessing.sh)
#! /bin/bash -l
shopt -s nullglob
echo "pooling replicates"
samtools merge -f IGG_MG.final.bam IGG_[1-9].final.bam
samtools index IGG_MG.final.bam
for arg in 1 2 3 "MG"
do
samtools sort -T IGG.sort -n -o IGG_"$arg".final.ns.bam IGG_"$arg".final.bam
bedtools bamtobed -bedpe -i IGG_"$arg".final.ns.bam > IGG.bed
cut -f 1,2,6 IGG.bed | \
sort -k1,1 -k2,2n -k3,3n > IGG.fragments.bed
bedtools genomecov -bg -i IGG.fragments.bed -g resources/aep.genome > IGG_"$arg".bg
rm IGG.bed IGG.fragments.bed IGG_"$arg".final.ns.bam
done
For the peak calling, we used SEACR to call peaks on each bam file generated in the previous section, as well as on merged replicate bam files generated. Each SEACR command used the sample-matched negative control IgG data (e.g., IgG replicate 1 is derived from the same experiment as H3K4me1 replicate 1).
We opted to run SEACR in permissive mode, and then use the same IDR approach we used for our ATAC-seq data to ensure that the resulting consensus peak sets contained reproducible peaks.
(02_cutAndTagPeaks/peakPipe.sh)
#! /bin/bash -l
shopt -s nullglob
# The only argument needed for this script is the unique sample ID
prefix="$1"
#pool replicates
echo "pooling replicates"
samtools merge -f "$prefix"_MG.final.bam "$prefix"_[1-9].final.bam
samtools index "$prefix"_MG.final.bam
echo "generating bigwigs"
conda activate deepEnv
for arg in "$prefix"_[0-9].final.bam
do
rep="${arg/.final.bam/}"
echo "$rep"
bamCoverage -b "$arg" \
-o "$rep".bw \
-of "bigwig" \
-bs 10 \
-p 24 \
--normalizeUsing CPM \
--exactScaling
done
bamCoverage -b "$prefix"_MG.final.bam \
-o "$prefix"_MG.bw \
-of "bigwig" \
-bs 10 \
-p 24 \
--normalizeUsing CPM \
--exactScaling \
conda deactivate
echo "Calling peaks"
for arg in 1 2 3 "MG"
do
rep="$prefix"_"$arg"
echo "$rep"
samtools sort -T $rep.sort -n -o $rep.ns.bam "$rep".final.bam
bedtools bamtobed -bedpe -i $rep.ns.bam > $rep.bed
cut -f 1,2,6 $rep.bed | \
sort -k1,1 -k2,2n -k3,3n > $rep.fragments.bed
bedtools genomecov -bg -i $rep.fragments.bed -g resources/aep.genome > $rep.bg
rm $rep.bed $rep.fragments.bed $rep.ns.bam
bash resources/SEACR/SEACR_1.3.sh $rep.bg IGG_"$arg".bg norm relaxed $rep
bash resources/SEACR/SEACR_1.3.sh $rep.bg IGG_"$arg".bg norm stringent $rep
Rscript resources/refBed.R "$rep".relaxed.bed
Rscript resources/refBed.R "$rep".stringent.bed
rm "$rep".relaxed.bed "$rep".stringent.bed
mv "$rep".relaxed.rfmt.bed "$rep".relaxed.bed
mv "$rep".stringent.rfmt.bed "$rep".stringent.bed
done
echo "done"
conda activate deepEnv
#perform idr for all reps
echo "Performing IDR on biological replicates"
for i in "$prefix"_[0-9].relaxed.bed
do
for j in "$prefix"_[0-9].relaxed.bed
do
if [[ "$i" < "$j" ]]; then
echo "$i"
echo "$j"
inputFile1="$i"
prefix1="${inputFile1/.relaxed.bed/}"
inputFile2="$j"
prefix2="${inputFile2/.relaxed.bed/}"
idr --samples "$i" "$j" \
--peak-list "$prefix"_MG.relaxed.bed --input-file-type bed \
--output-file "$prefix1"_"$prefix2".idr \
-i 0.1 --rank 5
bedtools intersect -e -f 0.25 -F 0.25 -c \
-a "$prefix"_MG.relaxed.bed \
-b "$prefix1"_"$prefix2".idr > "$prefix1"_"$prefix2".int.bed
fi
done
done
echo "done"
conda deactivate
#generating consensus peaklists
Rscript resources/getCon.R "$prefix" "$prefix"*_[0-9].int.bed
echo "# of peaks in biologically reproducible peak set:"
wc -l consensus"$prefix".bed
echo "done"
Below is an auxiliary R script used within the peak calling pipeline. It reformats SEACR's output into a standard bed format.
(02_cutAndTagPeaks/refBed.R)
args = commandArgs(trailingOnly=TRUE)
inBed <- read.delim(args[1],sep = "\t",header = F)
inBed[,6] <- "."
inBed[,5] <- inBed[,4]
inBed[,4] <- 1:nrow(inBed)
newCols <- data.frame(V1 = rep(0,nrow(inBed)), V2 = rep(0,nrow(inBed)), V3 = rep(0,nrow(inBed)))
outBed <- cbind(inBed,newCols)
write.table(outBed, file = gsub(".bed",".rfmt.bed",args[1]), quote = F, row.names = F, col.names = F, sep = "\t")
Below is an auxiliary R script used within the peak calling pipeline. It generates a consensus peak set by determining which peaks in the merged replicate peak set passed the IDR threshold in at least two pairwise comparisons of biological replicates.
(02_cutAndTagPeaks/getCon.R)
args <- commandArgs(trailingOnly=TRUE)
prefix <- args[1]
intRes <- lapply(args[-1],read.delim,header=F,sep='\t')
intRes.sub <- lapply(intRes,function(x) x[,ncol(x)])
intRes.sub <- do.call(cbind,intRes.sub)
intRes.sub <- apply(intRes.sub,1,sum)
consensus <- intRes[[1]][intRes.sub > 1,1:6]
# consensus$V4 <- gsub('.*_MG_','',consensus$V4)
write.table(consensus, file = paste0("consensus",prefix,".bed"), sep = '\t', quote = F, col.names = F, row.names = F)
Note that the CUT&Tag peak pipeline also uses the generateSubsampleValue.R
script first mentioned in the ATAC-seq peak calling pipeline.
The peak pipeline was run on a computing cluster using this script:
(02_cutAndTagPeaks/slurmPeakPipe.sh)
#!/bin/bash
#SBATCH --job-name=CnT_Peak
#SBATCH -c 24
#SBATCH -p med
#SBATCH -t 60-0
#SBATCH --mem=0
#SBATCH --error=CnT_Peak.err
#SBATCH --output=CnT_Peak.out
echo "processing IGG samples"
./resources/iggProcessing.sh
echo "running peak pipeline"
./resources/peakPipe.sh H41
./resources/peakPipe.sh H43
./resources/peakPipe.sh H273
To link peaks to their potential target genes, we used UROPA (v4.0.2) to identify the nearest gene model to each peak in our H3K4me1, H3K4me3, and ATAC-seq peak sets (i.e., peaks associated with active regulatory elements), with the maximum allowed distance being 100 Kb.
UROPA uses a config file in JSON format to set relevant parameters. This was the parameter file for annotating our ATAC-seq peaks:
(03_characterizeCREs/peakAnnotATAC.json)
{
"queries":[
{"distance": [100000, 100000], "feature": ["transcript"], "feature_anchor": ["start"], "name": "query_1", "relative_location": ["Downstream", "Upstream", "OverlapStart", "FeatureInsidePeak", "PeakInsideFeature", "OverlapEnd"], "strand": "ignore"}
],
"show_attributes": ["gene_id"],
"priority": "False",
"gtf": "HVAEP1.GeneModels.gtf",
"bed": "../ATAC/consensusAEP.bed",
"prefix": "ATAC",
"outdir": "uropaOut",
"threads": "1",
"output_by_query": "False"
}
The parameter file for the other two peak sets were largely identical to the one above. For the H3K4me1 peak set annotation, we made the following changes (text is output from diff
command comparing ATAC-seq and H3K4me1 config files)
8,9c8,9
< "bed": "../ATAC/consensusAEP.bed",
< "prefix": "ATAC",
---
> "bed": "consensusH41.bed",
> "prefix": "H41",
For the H3K4me3 peak set annotation, we made the following changes (text is output from diff
command comparing ATAC-seq and H3K4me3 config files)
8,9c8,9
< "bed": "../ATAC/consensusAEP.bed",
< "prefix": "ATAC",
---
> "bed": "consensusH43.bed",
> "prefix": "H43",
We then ran UROPA using the following commands:
uropa -i peakAnnotH43.json #H3K4me3 config
uropa -i peakAnnotH41.json #H3K4me1 config
uropa -i peakAnnotATAC.json #ATAC-seq config
In order to evaluate the reproducibility of our ATAC-seq and CUT&Tag data, and to determine if the different types of data had the expected distribution across the genome relative to each other (e.g., open chromatin should be positively correlated with activating histone modifications), we calculated global correlation scores amongst all of our ATAC-seq and CUT&Tag replicates.
We used the DeepTools multiBigwigSummary
function to calculate the correlation scores:
(03_characterizeCREs/multiBWComp.sh)
#!/bin/bash
multiBigwigSummary bins -b H41_1.bw H41_2.bw H41_3.bw \
H43_1.bw H43_2.bw H43_3.bw \
H273_1.bw H273_2.bw H273_3.bw \
IGG_1.bw IGG_2.bw IGG_3.bw \
../ATAC/AEP1_final_shift.bw \
../ATAC/AEP2_final_shift.bw \
../ATAC/AEP3_final_shift.bw \
-o corplot.npz \
-l H41_1 H41_2 H41_3 H43_1 H43_2 H43_3 \
H273_1 H273_2 H273_3 IGG_1 IGG_2 IGG_3 \
ATAC_1 ATAC_2 ATAC_3 \
-p 6
We then plotted the results using the following script:
(03_characterizeCREs/plotCorrelation.sh)
#!/bin/bash
plotCorrelation -in corplot.npz \
-c spearman \
-p heatmap \
-o corPlot.pdf \
-min -0.05 \
--plotNumbers \
--colorMap magma
The hierarchical clustering grouped together biological replicates, a good indication of reproducibility. We also saw positive correlation between datasets associated with activating marks (ATAC-seq, H3K4me1, and H3K4me3), whereas there was little to no correlation between activating marks and the repressive mark H3K27me3 or the negative controls.
Because chromatin accessibility and certain histone modifications have clear expected distribution patterns around transcription start sites, we visualized the read distribution of our CUT&Tag and ATAC-seq data around the AEP gene models to further validate the data.
H3K4me3 should be strongly enriched at transcription start sites and H3K4me1 should be be found around expressed genes, but should not be strongly associated with transcription start sites. ATAC-seq is expected to be associated both with transcription start sites and more distal regulatory elements. H3K27me3 should be depleted near actively transcribed genes, but it can be found in silenced genes.
We supplemented our analysis with RNA-seq data, allowing us to verify which genes were actively transcribed. For this we used the file NS_RNA.bw
, which is a bigwig file containing read depth information for genome-mapped RNA-seq data from whole adult Hydra polyps. Specifically, the alignment data was generated by merging the bam files NS1_RNA.genome.bam
, NS2_RNA.genome.bam
, and NS3_RNA.genome.bam
from the alignment benchmarking analysis described in 03_aepGenomeAnnotation.md
.
We also included data on sequence conservation using the file aepCon.bw
, which was generated by the analysis described in 07_genomeConservation.md
.
We combined all of these data and calculated their average profiles across the AEP assembly gene models using the DeepTools computeMatrix
function:
(03_characterizeCREs/compGeneMatrix.sh)
#!/bin/bash
computeMatrix scale-regions \
-R HVAEP1.GeneModels.gtf \
-S aepCon.bw NS_RNA.bw ../ATAC/AEP_MG_final_shift.bw IGG_MG.bw H43_MG.bw H41_MG.bw H273_MG.bw \
-o geneMatrix.txt \
--outFileNameMatrix geneMatrix.names.txt \
--outFileSortedRegions geneMatrix.regions.txt \
-m 500 \
-b 5000 \
-a 5000 \
--missingDataAsZero \
-p 6 \
--metagene
We then plotted the results using this script:
(03_characterizeCREs/plotGeneHeatmap.sh)
#!/bin/bash
plotHeatmap -m geneMatrix.txt \
-o geneBodyHeatmap.pdf \
--colorList "white,darkblue" \
-max 1.5 1 0.6 0.6 0.6 0.6 0.6 \
--yMax 1.5 4.5 1.75 1.75 1.75 1.75 1.75
Generating Representative Plots for ATAC-seq, CUT&Tag, and Sequence Conservation Centered on Individual Genes
To show individual examples of the correspondence between our various methods of identifying regulatory elements, we plotted each of the genome-wide data tracks we generated for ATAC-seq, CUT&Tag, and sequence conservation centered on specific genes in the genome using the R script below.
However, prior to running the plotting script, we had to fix an issue where CDS phases were improperly formatted in the AEP gene model gff3 file that prevented the gene models from being properly imported. We did this using the following command:
agat_sp_fix_cds_phases.pl -g HVAEP1.GeneModels.gff3 -fa HVAEP1_genome.fa -o HVAEP1.GeneModels.pfix.gff3
(03_characterizeCREs/plotTrackData.R)
library(Gviz)
library(rstudioapi)
library(GenomicRanges)
library(GenomicFeatures)
library(rtracklayer)
library(grDevices)
setwd(dirname(getActiveDocumentContext()$path))
genomeInfo <- read.table("../../genome/aep.final.genome.fa.fai", stringsAsFactors = F)[,1:2]
chrominfo <- data.frame(chrom=genomeInfo$V1, length=genomeInfo$V2, is_circular=rep(FALSE,length(genomeInfo$V1)))
txDb.mi <- makeTxDbFromGFF(file="HVAEP1.GeneModels.pfix.gff3", format="gff", dataSource="Gene Models", organism ="Hydra vulgaris", chrominfo=chrominfo)
options(ucscChromosomeNames=F)
pal <- colorRampPalette(c('#3FB8AF','#7FC7AF','#DAD8A7','#FF9E9D','#FF3D7F'),bias=1)
palCols <- pal(8)
AtacPlot <- function(chr, height, left, right, buffer=0,fpath = "plot.pdf", minCon=0) {
chrID <- chr
#This specifies the maximum value on the y-axis
z <- height
#chromosome map
gtrack <- GenomeAxisTrack(name = chrID,add53=T, add35 = T, fontsize = 13, fontcolor.title = "black",
fontsize.title = 13, showTitle = F, rotation.title = 0, grid = T,
cex = 0.6, labelPos = "below")
#gene models
grtrack <- GeneRegionTrack(txDb.mi, chromosome=chrID, name="Genes", transcriptAnnotation = "gene",
col = "black", fontsize.group = 13, fontcolor.group = "black", fill = "black",
fontsize=25, rotation.title = 0, background.title = "white", col.line = "black",
just.group = "below", collapseTranscripts = "longest")
reptrack <- AnnotationTrack(range = '../../genome/repeats/bothMaskFull.out.sorted.gff', name = 'repeats',
chromosome = chrID,shape = "box",col = "black",background.title = "white",
fill = "black",cex=0)
magCon <- DataTrack(range = "105.rolling100.cactus.bw",
type = "polygon", col = "black", lwd.mountain = 0, fill.mountain = c(palCols[5],palCols[5]),
chromosome = chrID, name = "Control", fontsize = 13, showAxis = T, ylim=c(minCon,1),
background.title = "white", fontcolor.title = "black", col.axis = "black",
span = 0.5, lwd = 0, showTitle = F, cex.axis = 0.6)
oligCon <- DataTrack(range = "olig.rolling100.cactus.bw",
type = "polygon", col = "black", lwd.mountain = 0, fill.mountain = c(palCols[6],palCols[6]),
chromosome = chrID, name = "Control", fontsize = 13, showAxis = T, ylim=c(minCon,1),
background.title = "white", fontcolor.title = "black", col.axis = "black",
span = 0.5, lwd = 0, showTitle = F, cex.axis = 0.6)
viridCon <- DataTrack(range = "virid.rolling100.cactus.bw",
type = "polygon", col = "black", lwd.mountain = 0, fill.mountain = c(palCols[7],palCols[7]),
chromosome = chrID, name = "Control", fontsize = 13, showAxis = T, ylim=c(minCon,1),
background.title = "white", fontcolor.title = "black", col.axis = "black",
span = 0.5, lwd = 0, showTitle = F, cex.axis = 0.6)
clytiaCon <- DataTrack(range = "clytia.rolling100.cactus.bw",
type = "polygon", col = "black", lwd.mountain = 0, fill.mountain = c(palCols[8],palCols[8]),
chromosome = chrID, name = "Control", fontsize = 13, showAxis = T, ylim=c(minCon,1),
background.title = "white", fontcolor.title = "black", col.axis = "black",
span = 0.5, lwd = 0, showTitle = F, cex.axis = 0.6)
ATAC <- DataTrack(range = "../ATAC/AEP_MG_final_shift.bw",
type = "polygon", col = "black", lwd.mountain = 0, fill.mountain = c(palCols[1],palCols[1]),
chromosome = chrID, name = "Control", fontsize = 13, showAxis = T, ylim=c(0,z[1]),
background.title = "white", fontcolor.title = "black", col.axis = "black",
span = 0.5, lwd = 0, showTitle = F, cex.axis = 0.6)
H41 <- DataTrack(range = "H41_MG.bw",
type = "polygon", col = "black", lwd.mountain = 0, fill.mountain = c(palCols[2],palCols[2]),
chromosome = chrID, name = "Control", fontsize = 13, showAxis = T, ylim=c(0,z[2]),
background.title = "white", fontcolor.title = "black", col.axis = "black",
span = 0.5, lwd = 0, showTitle = F, cex.axis = 0.6)
H43 <- DataTrack(range = "H43_MG.bw",
type = "polygon", col = "black", lwd.mountain = 0, fill.mountain = c(palCols[3],palCols[3]),
chromosome = chrID, name = "Control", fontsize = 13, showAxis = T, ylim=c(0,z[3]),
background.title = "white", fontcolor.title = "black", col.axis = "black",
span = 0.5, lwd = 0, showTitle = F, cex.axis = 0.6)
H273 <- DataTrack(range = "H273_MG.bw",
type = "polygon", col = "black", lwd.mountain = 0, fill.mountain = c(palCols[4],palCols[4]),
chromosome = chrID, name = "Control", fontsize = 13, showAxis = T, ylim=c(0,z[4]),
background.title = "white", fontcolor.title = "black", col.axis = "black",
span = 0.5, lwd = 0, showTitle = F, cex.axis = 0.6)
pdf(file = fpath, width=7, height=7)
plotTracks(list(ATAC,H41,H43,H273,
grtrack,gtrack,reptrack,magCon,
oligCon,viridCon,clytiaCon),
from=(left - buffer), to=(right + buffer), title.width = 0.7, margin = 0, sizes = c(rep(5,4),3,3,1,rep(5,4)))
dev.off()
}
AtacPlot("chr-3",c(3,7,5,2),40525583,40580186,0, "brachyurySample.pdf")
AtacPlot("chr-4",c(5,16,3,4),17489011,17570761,0, "bmp5-8cSample.pdf")
#conservation plots
AtacPlot("chr-6",c(3,7,5,2),9544624,9546772,0, "wnt3ConSample.pdf",minCon = 0.5)
AtacPlot("chr-6",c(5,16,3,4),24222683,24225675,0, "sp5ConSample.pdf",minCon=0.5)
The following two plots were subsetted to just focus on the viridissima and oligactis conservation tracks for comparing our systematic genome alignment to the results of manual alignments done by Vogg et al. (2019).
When we tried to use the above script to generate example plots that showed the individual biological replicates we found that Gviz didn't do a good job of handling the individual CUT&Tag replicates. It's unclear what caused the issue, but the resulting plots looked block-y and didn't accurately represent the data.
We therefore used pyGenomeTracks (v3.6) to plot the data instead. Below is the config file used to generate the plot of individual replicates for the CUT&Tag and ATAC-seq data centered around the brachyury1 gene:
(03_characterizeCREs/repTracks.ini)
[bigwig file test]
file = ../ATAC/AEP1_final_shift.bw
# height of the track in cm (optional value)
height = 2
title = ATAC-seq Rep 1
min_value = 0
max_value = 4
color = #8FD744FF
[bigwig file test]
file = ../ATAC/AEP2_final_shift.bw
# height of the track in cm (optional value)
height = 2
title = ATAC-seq Rep 2
min_value = 0
max_value = 4
color = #8FD744FF
[bigwig file test]
file = ../ATAC/AEP2_final_shift.bw
# height of the track in cm (optional value)
height = 2
title = ATAC-seq Rep 3
min_value = 0
max_value = 4
color = #8FD744FF
[spacer]
[bigwig file test]
file = h41_1.bw
# height of the track in cm (optional value)
height = 2
title = H3K4me1 CUT&Tag Rep 1
min_value = 0
max_value = 10
color = #35B779FF
[bigwig file test]
file = h41_2.bw
# height of the track in cm (optional value)
height = 2
title = H3K4me1 CUT&Tag Rep 2
min_value = 0
max_value = 10
color = #35B779FF
[bigwig file test]
file = h41_3.bw
# height of the track in cm (optional value)
height = 2
title = H3K4me1 CUT&Tag Rep 3
min_value = 0
max_value = 10
color = #35B779FF
[spacer]
[bigwig file test]
file = h43_1.bw
# height of the track in cm (optional value)
height = 2
title = H3K4me3 CUT&Tag Rep 1
min_value = 0
max_value = 10
color = #21908CFF
[bigwig file test]
file = h43_2.bw
# height of the track in cm (optional value)
height = 2
title = H3K4me3 CUT&Tag Rep 2
min_value = 0
max_value = 10
color = #21908CFF
[bigwig file test]
file = h43_3.bw
# height of the track in cm (optional value)
height = 2
title = H3K4me3 CUT&Tag Rep 3
min_value = 0
max_value = 10
color = #21908CFF
[spacer]
[bigwig file test]
file = h273_1.bw
# height of the track in cm (optional value)
height = 2
title = H3K27me3 CUT&Tag Rep 1
min_value = 0
max_value = 2
color = #31688EFF
[bigwig file test]
file = h273_2.bw
# height of the track in cm (optional value)
height = 2
title = H3K27me3 CUT&Tag Rep 2
min_value = 0
max_value = 2
color = #31688EFF
[bigwig file test]
file = h273_3.bw
# height of the track in cm (optional value)
height = 2
title = H3K27me3 CUT&Tag Rep 3
min_value = 0
max_value = 2
color = #31688EFF
[spacer]
[bigwig file test]
file = IGG_1.bw
# height of the track in cm (optional value)
height = 2
title = IgG CUT&Tag Rep 1
min_value = 0
max_value = 7
color = #443A83FF
[bigwig file test]
file = IGG_2.bw
# height of the track in cm (optional value)
height = 2
title = IgG CUT&Tag Rep 2
min_value = 0
max_value = 7
color = #443A83FF
[bigwig file test]
file = IGG_3.bw
# height of the track in cm (optional value)
height = 2
title = IgG CUT&Tag Rep 3
min_value = 0
max_value = 7
color = #443A83FF
[spacer]
[test gtf]
file = HVAEP1.GeneModels.rmod.gtf
height = 1
title = genes
fontsize = 12
file_type = gtf
color=#440154FF
[spacer]
[x-axis]
where = bottom
The plot was generated using the script below:
(03_characterizeCREs/makeRepPlot.sh)
#!/bin/bash
pyGenomeTracks --tracks repTracks.ini --region chr-3:40498685-40606817 --outFileName repPlot.pdf --dpi 300
In order to determine if out CUT&Tag data for H3K4me1 and H3K4me3 were indeed enriched at CREs, we performed transcription factor footprinting using our ATAC-seq data to look for an enrichment of predicted transcription factor binding sites in H3K4me1 and H3K4me3 peaks.
We used the TOBIAS pipeline to perform the footprinting analysis. We first used the ATACorrect
function to pre-process the mapped ATAC-seq reads (AEP_MG_final.bam
generated by the 01_atacPeaks/ATAC_Mapping_Pipeline.sh script). Input to this function also included the consensus ATAC-seq peak file (consensusATAC.bed, renamed from the file consensusAEP.bed generated by the 01_atacPeaks/getCon.R
script).
Following pre-processing, we then generated TF footprint scores using the FootprintScores
function.
(03_characterizeCREs/runToby.sh)
#! /bin/bash -l
#SBATCH --job-name=toby
#SBATCH -c 24
#SBATCH -t 60-0
#SBATCH --mem=36G
#SBATCH --error=tobias.err
#SBATCH --output=tobias.out
#SBATCH -p med
conda activate tobiasEnv
TOBIAS ATACorrect --bam AEP_MG_final.bam \
--genome HVAEP1_genome.fa \
--peaks consensusATAC.bed \
--outdir corrected \
--cores 24
TOBIAS FootprintScores --signal corrected/AEP_MG_final_corrected.bw \
--regions consensusATAC.bed \
--output AEP_footprints.bw \
--cores 24
To visualize the distribution of the predicted TF footprints (stored in the AEP_footprints.bw
file) around H3K4me1, H3K4me3, and H3K27me3 peaks, we used the DeepTools computeMatrix
and plotHeatmap
functions. 5 kb of flanking sequence on either side of predicted peaks was included in the final plots
(03_characterizeCREs/hisMarksFootprint.sh)
#!/bin/bash
computeMatrix scale-regions \
-R consensusH41.bed consensusH43.bed consensusH273.bed \
-S AEP_footprints.bw \
-o histoneFootprints.txt \
--outFileNameMatrix histoneFootprints.names.txt \
--outFileSortedRegions histoneFootprints.regions.txt \
-m 1000 \
-b 5000 \
-a 5000 \
--missingDataAsZero \
-p 6
(03_characterizeCREs/plotHisFootHeatmap.sh)
#!/bin/bash
plotHeatmap -m histoneFootprints.txt \
-o histoneFootprints.txt.pdf \
--colorList 'white,darkblue'
We observed a clear enrichment of predicted TF binding sites in the H3K4me1 and H3K4me3 peaks, while the H3K27me3 peaks showed no enrichment. This finding is consistent with H3K4me1 and H3K4me3 peaks occurring within active CREs and H3K27me3 peaks occurring within regions of repressed heterochromatin.
08_creIdentification/
├── 01_atacPeaks
│ ├── 12Int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP ATAC-seq biological
replicates 1 and 2.
│ ├── 13Int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP ATAC-seq biological
replicates 1 and 3.
│ ├── 23Int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP ATAC-seq biological
replicates 2 and 3.
│ ├── ATAC_Mapping_Pipeline.sh
Shell script containing the pipeline used to map the AEP ATAC-seq
data. This pipeline trims and filters the reads using trimmomatic
and maps the reads using Bowtie2. The pipeline also removes all
mitochondrial reads and PCR duplicates from the final bam file.
│ ├── ATAC_Peak_Pipeline.sh
Shell script used to identify biologically reproducible peaks in
the AEP ATAC-seq data. This pipeline calls peaks using MACS2 and
assesses variability using IDR. Peaks are defined as reproducible
if they pass an IDR threshold of 0.1 in at least 2 pairwise
comparisons of biological replicates. This pipeline also generates
bigwig files for easy visualization of the data using a genome
browser.
│ ├── bigwigs
Folder containing the bigwig genome browser tracks of ATAC-seq
read density for the strain AEP H. vulgaris genome.
│ │ ├── AEP1_final_shift.bw
Bigwig genome browser track quantifying read density for biological
replicate 1 of the AEP ATAC-seq data.
│ │ ├── AEP2_final_shift.bw
Bigwig genome browser track quantifying read density for biological
replicate 2 of the AEP ATAC-seq data.
│ │ ├── AEP3_final_shift.bw
Bigwig genome browser track quantifying read density for biological
replicate 3 of the AEP ATAC-seq data.
│ │ └── AEP_MG_final_shift.bw
Bigwig genome browser track quantifying the combined read density
for all three AEP ATAC-seq biological replicates.
│ ├── consensusAEP.bed
Bed genome coordinate file containing the consensus peak set of all
biologically reproducible ATAC-seq peaks in the AEP genome.
│ ├── generateSubsampleValue.R
Utility R script that generates a random seed value for subsetting bam
files using samtools.
│ ├── getCon.R
R script that identifies ATAC-seq peaks that passed an IDR threshold
of 0.1 in at least two pairwise comparisons and generates a final
consensus peak set.
│ ├── slurmMap.sh
Shell script that executes ATAC_Mapping_Pipeline.sh on a slurm-based
computing cluster.
│ └── slurmPeak.sh
Shell script that executes ATAC_Peak_Pipeline.sh on a slurm-based
computing cluster.
├── 02_cutAndTagPeaks
│ ├── bigwigs
Folder containing the bigwig genome browser tracks of CUT&Tag
read density for the strain AEP H. vulgaris genome.
│ │ ├── H273_1.bw
Bigwig genome browser track quantifying read density for biological
replicate 1 of the AEP H3K27me3 CUT&Tag data.
│ │ ├── H273_2.bw
Bigwig genome browser track quantifying read density for biological
replicate 2 of the AEP H3K27me3 CUT&Tag data.
│ │ ├── H273_3.bw
Bigwig genome browser track quantifying read density for biological
replicate 3 of the AEP H3K27me3 CUT&Tag data.
│ │ ├── H273_MG.bw
Bigwig genome browser track quantifying the combined read density
for all three AEP H3K27me3 CUT&Tag biological replicates.
│ │ ├── H41_1.bw
Bigwig genome browser track quantifying read density for biological
replicate 1 of the AEP H3K4me1 CUT&Tag data.
│ │ ├── H41_2.bw
Bigwig genome browser track quantifying read density for biological
replicate 2 of the AEP H3K4me1 CUT&Tag data.
│ │ ├── H41_3.bw
Bigwig genome browser track quantifying read density for biological
replicate 3 of the AEP H3K4me1 CUT&Tag data.
│ │ ├── H41_MG.bw
Bigwig genome browser track quantifying the combined read density
for all three AEP H3K4me1 CUT&Tag biological replicates.
│ │ ├── H43_1.bw
Bigwig genome browser track quantifying read density for biological
replicate 1 of the AEP H3K4me3 CUT&Tag data.
│ │ ├── H43_2.bw
Bigwig genome browser track quantifying read density for biological
replicate 2 of the AEP H3K4me3 CUT&Tag data.
│ │ ├── H43_3.bw
Bigwig genome browser track quantifying read density for biological
replicate 3 of the AEP H3K4me3 CUT&Tag data.
│ │ ├── H43_MG.bw
Bigwig genome browser track quantifying the combined read density
for all three AEP H3K4me3 CUT&Tag biological replicates.
│ │ ├── IGG_1.bw
Bigwig genome browser track quantifying read density for biological
replicate 1 of the AEP IgG CUT&Tag data.
│ │ ├── IGG_2.bw
Bigwig genome browser track quantifying read density for biological
replicate 2 of the AEP IgG CUT&Tag data.
│ │ ├── IGG_3.bw
Bigwig genome browser track quantifying read density for biological
replicate 3 of the AEP IgG CUT&Tag data.
│ │ ├── IGG_MG.bw
Bigwig genome browser track quantifying the combined read density
for all three AEP IgG CUT&Tag biological replicates.
│ │ └── NS_RNA.bw
Bigwig genome browser track quantifying the combined read density
for three whole-animal strain AEP H. vulgaris RNA-seq data aligned
to the AEP genome.
│ ├── consensusH273.bed
Bed genome coordinate file containing the consensus peak set of all
biologically reproducible CUT&Tag H3K27me3 peaks in the AEP genome.
│ ├── consensusH41.bed
Bed genome coordinate file containing the consensus peak set of all
biologically reproducible CUT&Tag H3K4me1 peaks in the AEP genome.
│ ├── consensusH43.bed
Bed genome coordinate file containing the consensus peak set of all
biologically reproducible CUT&Tag H3K4me3 peaks in the AEP genome.
│ ├── getCon.R
R script that identifies ATAC-seq peaks that passed an IDR threshold
of 0.1 in at least two pairwise comparisons and generates a final
consensus peak set.
│ ├── H273_1_H273_2.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K27me3 CUT&Tag biological
replicates 1 and 2.
│ ├── H273_1_H273_3.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K27me3 CUT&Tag biological
replicates 1 and 3.
│ ├── H273_2_H273_3.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K27me3 CUT&Tag biological
replicates 2 and 3.
│ ├── H41_1_H41_2.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K4me1 CUT&Tag biological
replicates 1 and 2.
│ ├── H41_1_H41_3.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K4me1 CUT&Tag biological
replicates 1 and 3.
│ ├── H41_2_H41_3.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K4me1 CUT&Tag biological
replicates 2 and 3.
│ ├── H43_1_H43_2.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K4me3 CUT&Tag biological
replicates 1 and 2.
│ ├── H43_1_H43_3.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K4me3 CUT&Tag biological
replicates 1 and 3.
│ ├── H43_2_H43_3.int.bed
Bed genome coordinate file containing peaks that passed an IDR
threshold of 0.1 when comparing the AEP H3K4me3 CUT&Tag biological
replicates 2 and 3.
│ ├── iggProcessing.sh
Shell script that calls peaks for the control CUT&Tag IgG samples.
│ ├── mapPipe.sh
Shell script that trims and filters CUT&Tag reads using trimmomatic
and aligns reads to the AEP genome using bowtie2. It also removes
PCR duplicates following mapping.
│ ├── peakPipe.sh
Shell script that calls peaks for the non-IgG CUT&Tag data and performs
IDR to identify biologically reproducible peaks across samples. Peaks
are defined as reproducible if they pass an IDR threshold of 0.1 in
at least 2 pairwise comparisons of biological replicates. The script
also generates read coverage bigwig files.
│ ├── refBed.R
R script that reformats the output from the peak calling software
(SEACR) into standard bed format.
│ ├── slurmMapPipe.sh
Shell script that executes mapPipe.sh on a slurm-based computing cluster.
│ └── slurmPeakPipe.sh
Shell script that executes iggProcessing.sh and peakPipe.sh on a slurm-
based computing cluster.
└── 03_characterizeCREs
├── AEP_MG_final_corrected.bw
Bigwig file generated by the TOBIAS ATACorrect function as part of the
runToby.sh script that contains the bias-corrected insertion frequencies
across the AEP assembly.
├── AEP_footprints.bw
Bigwig file generated by the TOBIAS FootprintScores function as part of
the runToby.sh script that contains protein binding footprint scores
across the AEP assembly.
├── aep.final.genome.fa.fai
Fasta index file for the AEP genome assembly generated by samtools.
Used when generating plots of ATAC-seq, CUT&Tag, and sequence
conservation data in plotTrackData.R.
├── ATAC_finalhits.txt
Table containing the UROPA peak annotation results for the peaks in
consensusAEP.bed. Identifies the nearest gene to each peak.
├── bothMaskFull.out.sorted.gff
Sorted version of bothMaskFull.out.sorted.gff (described in
02_repeatMasking.md). Used when generating plots of ATAC-seq, CUT&Tag,
and sequence conservation data in plotTrackData.R.
├── compGeneMatrix.sh
Shell script that uses the deeptools calculate matrix to quantify RNA-seq,
CUT&Tag, and ATAC-seq read density as well as sequence conservation in and
around genes in the AEP assembly.
├── corplot.npz
Compressed correlation matrix results written as output by the deeptools
multiBigwigSummary function for all AEP CUT&Tag and ATAC-seq samples.
Generated by multiBWComp.sh.
├── hisMarksFootprint.sh
Shell script that uses the deeptools calculate matrix to quantify the density
of predicted ATAC-seq protein binding footprints around H3K4me1, H3K4me3, and
H3K27me3 CUT&Tag peaks in the AEP assembly.
├── geneMatrix.txt
Results file written as output by the deeptools computeMatrix function.
Quatifies RNA-seq, CUT&Tag, and ATAC-seq read density as well as sequence
conservation in and around genes in the AEP assembly. Generated by
compGeneMatrix.sh.
├── H41_finalhits.txt
Table containing the UROPA peak annotation results for the peaks in
consensusH41.bed. Identifies the nearest gene to each peak.
├── H43_finalhits.txt
Table containing the UROPA peak annotation results for the peaks in
consensusH43.bed. Identifies the nearest gene to each peak.
├── HVAEP1.GeneModels.pfix.gff3
Modified version of HVAEP1.GeneModels.gff3 where CDS phase issues have
been corrected using agat_sp_fix_cds_phases. Used when generating plots
of ATAC-seq, CUT&Tag, and sequence conservation data in plotTrackData.R.
├── makeRepPlot.sh
Shell script that uses pyGenomeTracks to generate a plot depicting mapped
read densities for each of the individual CUT&Tag and ATAC-seq centered
around the brachyury1 gene locus.
├── multiBWComp.sh
Shell script that uses the deeptools multiBigwigSummary function to calculate
global read density correlation scores for all ATAC-seq and CUT&Tag libraries.
├── peakAnnotATAC.json
JSON-formated configuration file for the UROPA peak annotation analysis
of consensusAEP.bed.
├── peakAnnotH41.json
JSON-formated configuration file for the UROPA peak annotation analysis
of consensusH41.bed.
├── peakAnnotH43.json
JSON-formated configuration file for the UROPA peak annotation analysis
of consensusH43.bed.
├── plotCorrelation.sh
Shell script that uses the deeptools plotCorrelation function to generate
a heatmap visualizing global read density correlation scores for all ATAC-seq
and CUT&Tag libraries.
├── plotGeneHeatmap.sh
Shell script that uses the deeptools plotHeatmap function to visualize
RNA-seq, CUT&Tag, and ATAC-seq read density as well as sequence conservation
in and around genes in the AEP assembly.
├── plotHisFootHeatmap.sh
Shell script that uses the deeptools plotHeatmap function to visualize
the density of predicted ATAC-seq protein binding footprints around H3K4me1,
H3K4me3, and H3K27me3 CUT&Tag peaks in the AEP assembly.
├── plotTrackData.R
R script that plots ATAC-seq, CUT&Tag, and sequence conservation data for
individual loci of interest within the AEP genome.
├── repTracks.ini
pyGenomeTracks config file used by makeRepPlot.sh to generate a plot
depicting mapped read densities for each of the individual CUT&Tag and
ATAC-seq centered around the brachyury1 gene locus.
└── runToby.sh
Shell script that runs both the ATACorrect and FootprintScores functions from
the TOBIAS package to predict protein binding footprints in the AEP assembly
using ATAC-seq data.