PeakSegJoint: supervised joint peak detection (arXiv:1506.01286)
There are two major differences between PeakSegJoint and all of the other peak detection algorithms for ChIP-seq data analysis:
- Supervised machine learning: PeakSegJoint is trained by providing labels (example) that indicate regions with and without peaks. So if you see false positives (a peak called where there is clearly only background noise) or false negatives (no peak called where there should be one) you can add labels to correct PeakSegJoint, and it learns and gets more accurate as more labels are added. In contrast, other peak detection methods are unsupervised, meaning that they usually have 10-20 parameters, and no obvious way to train them, yielding arbitrarily inaccurate peaks that can NOT be corrected using labels.
- Joint peak detection in any number of samples or cell types so the model can be easily interpreted to find similarities or differences between samples (PeakSegJoint outputs a binary matrix, samples x peaks). In contrast, it is not easy to find similarities and differences between samples using single-sample peak detection methods (e.g. MACS), and other multi-sample peak detection methods are limited to one (e.g. JAMM) or two (e.g. PePr) cell types (assuming all samples of the same cell type are replicates with the same peak pattern).
Train error data viz, 8 samples of H3K4me3 data.
Test error data viz, 8 samples of H3K4me3 data.
This R package contains C code that implements the fast heuristic JointZoom algorithm described in the PeakSegJoint paper [source]. It also contains an R implementation of a Fast Iterative Shinkage-Thresholding Algorithm (FISTA) for L1-regularized interval regression with the squared hinge loss.
install.packages("devtools")
devtools::install_github("tdhock/PeakSegJoint")
library(PeakSegJoint)
example(PeakSegJointSeveral) # JointZoom approximate segmentation algo.
example(PeakSegJointError) # Number of incorrect labels.
example(IntervalRegressionProblems) # FISTA convex optimization algo.
First, plot the coverage of several samples of the same experiment type in a genome browser (IGV or UCSC). To compute the coverage you can use bedtools genomecov -bg to get a bedGraph file and then use bedGraphToBigWig to convert it to a bigwig file that can be viewed on a genome browser. The coverage data can be either normalized (non-integer values) or raw (integer valued count data), with or without lines for zero coverage. Create a track hub to visualize them on UCSC (example).
Then, visually identify several regions with and without peaks, saving those labels to a file. Example labels file: there are two cell types, bcell and tcell. Note that for the labels file, you may use only peakStart, peakEnd, noPeaks, and peaks labels.
Then, put that label file in the same directory as several sub-directories that have names that match the labels file. Example data set: the tcell subdirectory contains two bigwig files, and the bcell subdirectory contains two bigwig files. To download the bigwig files listed in a track hub, save the track hub to the same directory as the labels file (example), and then run downloadTrackDbBigWigs.R:
Rscript downloadTrackDbBigWigs.R path/to/trackDb.txt
Then, run 00_AllSteps_qsub.R which will run the analysis in parallel on a cluster with qsub:
Rscript 00_AllSteps_qsub.R path/to/labels.txt
Note that you should copy 00_AllSteps_qsub.R and change the PBS headers to reflect your cluster configuration.
If you don’t have access to a cluster with qsub, you can still run the steps in sequence on one computer. For example test-bash-noinput.R uses bash instead of qsub, via
QSUB='echo INTERACTIVE && bash' JOBS=2 Rscript 00_AllSteps_qsub.R path/to/labels.txt