GitHub - jingydz/piPipes: piRNA pipeline collection developed in the Zamore Lab and ZLab in UMass Med School

piPipes
=====

A set of pipelines developed in the Zamore Lab and ZLab to analyze piRNAs and transposons from different Next Generation Sequencing libraries (small RNA-seq, RNA-seq, Genome-seq, ChIP-seq, CAGE/Degradome-seq).

To provide a generic interface across the genome assemblies it supports, piPipes provides an installation pipeline that downloads ready-to-use genome annotation packages from Illumina iGenome as well as the UCSC Genome Browser.

For the small RNA-seq, RNA-seq and ChIP-seq pipelines, piPipes provides two modes: single-sample mode, to analyze a single library, and dual-sample mode, to perform a pair-wise comparison between two samples. For degradome-seq, piPipes provides options to perform Ping-Pong analysis between degradome reads and small RNA reads.

Visit our Wiki Page for more details on how to install the genome, run each pipeline, and interpret the output.

##INSTALL
piPipes is written in Bash, C/C++, Perl, Python, HTML/JavaScript and R. It currently works only under Linux.

C/C++

piPipes comes with statically compiled Linux x86_64 binaries for its own C++ programs and for the other tools written in C/C++. Ideally, users do not need to compile anything. If the static binaries do not work on your system (for example, you see the error message "kernel too old"), please compile them from src and move the binaries to bin, or simply email us or file an issue on GitHub.

If you need to compile from source code:

Please install BEDtools using the source code in the third_party directory and rename the binary to bedtools_piPipes in the bin directory of piPipes. It contains a small modification that makes our self-defined format more efficient to process.
Please install bowtie from https://github.com/bowhan/bowtie , where we have added native gzip/bzip2 support, which is required to run zipped, paired-end samples in the ChIP-seq pipeline.
Most of piPipes's C++ code uses C++11 features and the Boost library, so a relatively recent GCC and Boost are recommended for compiling it. If you don't have them, we recommend using brew to install them automatically.
Some of the code requires htslib to be installed first.

Python/Cython

For MACS2 and HTSeq-count, users need to install them and make them available in their $PATH.
We have not found a good way to ship ready-to-use Cython code. Without htseq-count, piPipes rna/deg/cage cannot count transcripts/transposons using genomic coordinates, but it will still perform the other functions of the pipeline, including quantification using Cufflinks and eXpress. Without macs2, piPipes chip/chip2 will not work at all.
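
Before running the pipelines, it can help to confirm that both tools are visible on `$PATH`. The following is a minimal sketch using the POSIX `command -v` builtin; the function name `check_tools` is hypothetical and not part of piPipes.

```shell
# Hypothetical helper (not part of piPipes): report which of the given
# executables cannot be found on $PATH. Returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example: the two external Python tools named above.
# check_tools macs2 htseq-count
```

A non-zero return value means at least one tool still needs to be installed or added to `$PATH`.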

R

For R packages that are unavailable in the user's system, the installation is performed during the piPipes install process. They will be installed in the same directory as the pipeline in case the user doesn't have write permission in the R installation directory. Please keep the version of R constant.

Genome Annotation

Because of GitHub's file size limits, the genome sequence and most annotation files have to be downloaded from elsewhere and reformatted for the pipeline. piPipes uses iGenome and provides piPipes install to download iGenome genomes and organize the files for use by the pipeline (see below).

For the recently released (07/2014) Drosophila melanogaster BDGP release 6, we obtain the data directly from FlyBase.

piPipes uses the following public tools:

For alignment, piPipes uses Bowtie, Bowtie2, BWA, STAR and mrFast for different purposes.
For transcripts/transposons quantification, piPipes uses Cufflinks, HTSeq and eXpress under different circumstances.
For transposon mobilization as well as other structural variants discovery, piPipes uses TEMP, BreakDancer, RetroSeq and VariationHunter.
For ChIP-Seq reads allocation, piPipes uses CSEM; for peaks calling, piPipes uses MACS2. For TSS/TES/metagene analysis, piPipes uses bwtool.
Additionally, piPipes uses many tools from the Kent Tools, like faSize, bedGraphToBigWig.
To wrap bash scripts for multi-threading, piPipes uses ParaFly from Trinity. piPipes also borrows the touch trick for job resuming from Trinity.
To determine the version of FastQ, piPipes uses SolexaQA.pl from SolexaQA. We have modified it so that the program exits as soon as the version of FastQ has been determined. The modified code can be found in the bin directory.
piPipes uses BEDtools to assign alignments to different genomic annotations (gene, transposon, piRNA cluster, etc.).
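
The touch trick for job resuming mentioned in the list above can be sketched as follows: each step creates an empty marker file with `touch` only after it succeeds, and a rerun skips any step whose marker already exists. This is a generic illustration of the idea, not the actual piPipes or Trinity code; the function name `run_step` and the marker file naming are assumptions.

```shell
# Generic sketch of resuming via marker files (names are hypothetical).
# Usage: run_step <marker_dir> <step_name> <command> [args...]
run_step() {
    marker_dir=$1
    step_name=$2
    shift 2
    marker="$marker_dir/.status.$step_name.done"

    if [ -f "$marker" ]; then
        echo "skip: $step_name (already finished)"
        return 0
    fi

    # Run the step; create the marker only if it exits successfully,
    # so a failed or interrupted step is repeated on the next run.
    "$@" && touch "$marker"
}

# Example with a placeholder command standing in for a real pipeline step:
# mkdir -p output_dir
# run_step output_dir align sh -c 'echo aligning...'
```

Deleting a marker file forces the corresponding step to run again on the next invocation.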

##USAGE
The pipeline finds almost everything under its own directory, so please do not move the piPipes script. Use ln -s $ABSOLUTE_PATH_TO_piPipes/piPipes $HOME/bin/piPipes to create a symbolic link in your $HOME/bin, or add /path/to/piPipes to your $PATH. But please do NOT add /path/to/piPipes/bin to your $PATH.
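
As a concrete version of the symbolic-link setup, the sketch below wraps the `ln -s` command from the paragraph above in a small function. The function name `link_pipipes` and the example install location are hypothetical; only the `ln -s` call itself comes from the text.

```shell
# Hypothetical helper: link the piPipes launcher into a bin directory
# without moving the original script.
# Usage: link_pipipes <absolute path to piPipes dir> [bin dir, default $HOME/bin]
link_pipipes() {
    pipipes_dir=$1
    bin_dir=${2:-$HOME/bin}

    case $pipipes_dir in
        /*) ;;  # absolute path: fine
        *)  echo "error: please give an absolute path" >&2; return 1 ;;
    esac
    if [ ! -f "$pipipes_dir/piPipes" ]; then
        echo "error: $pipipes_dir/piPipes not found" >&2
        return 1
    fi

    mkdir -p "$bin_dir"
    ln -s "$pipipes_dir/piPipes" "$bin_dir/piPipes"
}

# Example (the install location is an assumption):
# link_pipipes /opt/piPipes
```

The absolute-path check matters because a relative link target would be resolved from the bin directory rather than from where the command was typed.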

Call different pipelines using:

# This is a very brief introduction. For more details on usage and output interpretation, please visit our Wiki or the manual in the package.

# ===== Genome installation pipeline =====
 # 1. to install genome and R packages in one step
 # the assembly that piPipes supports can be found in the common/iGenome_UTL.txt file
$PATH_TO_piPipes/piPipes	install -g dm3|mm9|hg19...
 # 2. to only download the genome and R packages (if the machine/node is not appropriate for heavy computing tasks, like building indexes); then run (1) on a powerful machine/node.
$PATH_TO_piPipes/piPipes	install -g dm3|mm9|hg19 -D
 # 3. to download the iGenome from another explicitly specified location
$PATH_TO_piPipes/piPipes	install -g hg18 -l ftp://igenome:[email protected]/Homo_sapiens/UCSC/hg18/Homo_sapiens_UCSC_hg18.tar.gz

# ===== Small RNA-seq pipeline =====
# to run small RNA pipeline in single sample mode; input fastq can be gzipped
$PATH_TO_piPipes/piPipes	small -i input.trimmed.fq[.gz] -g dm3 -c 24
# to run small RNA pipeline in single sample mode; full options
$PATH_TO_piPipes/piPipes	small -i input.trimmed.fq[.gz] -g dm3 -N miRNA -o output_dir -F virus.fa -P mini_white.fa -O gfp.fa

# to run small RNA pipeline in dual library mode (need single sample mode output for each sample first)
$PATH_TO_piPipes/piPipes	small2 -a directory_A -b directory_B -g dm3 -c 24
# to run small RNA pipeline in dual library mode, normalized to miRNA, for unoxidized library
$PATH_TO_piPipes/piPipes	small2 -a directory_A -b directory_B -g dm3 -c 24 -N miRNA
# to run small RNA pipeline in dual library mode, normalized to siRNA (structural loci and cis-NATs), for oxidation sample of -fruitfly only-
$PATH_TO_piPipes/piPipes	small2 -a directory_A -b directory_B -g dm3 -c 24 -N siRNA

# ===== RNA-seq pipeline =====
# to run RNASeq pipeline in single sample mode, dUTP based method
$PATH_TO_piPipes/piPipes	rnaseq -l left.fq -r right.fq -g mm9 -c 8 -o output_dir
# to run RNASeq pipeline in single sample mode, ligation based method
$PATH_TO_piPipes/piPipes	rnaseq -l left.fq -r right.fq -g mm9 -c 8 -o output_dir -L

# to run RNASeq pipeline in dual library mode (single sample mode must have been run for each sample first)
$PATH_TO_piPipes/piPipes	rnaseq2 -a directory_A -b directory_B -g mm9 -c 8 -o output_dir -A w1 -B piwi
# to run RNASeq pipeline in dual library mode with replicates
$PATH_TO_piPipes/piPipes	rnaseq2 -a directory_A_rep1,directory_A_rep2,directory_A_rep3 -b directory_B_rep1,directory_B_rep2 -g mm9 -c 8 -o output_dir -A w1 -B piwi

# ===== Degradome/RACE/CAGE-seq pipeline =====
# to run Degradome/RACE/CAGE-Seq library
$PATH_TO_piPipes/piPipes	deg -l left.fq -r right.fq -g dm3 -c 12 -o output_dir

# to run Degradome library to check ping-pong signature with a small RNA library (the small RNA library must have been run first)
$PATH_TO_piPipes/piPipes	deg -l left.fq -r right.fq -g dm3 -c 12 -o output_dir -s /path/to/small_RNA_library_output

# ===== ChIP-seq pipeline =====
# to run ChIP Seq library in single sample mode, for narrow peak, like transcriptional factor
$PATH_TO_piPipes/piPipes	chip -l left.IP.fq -r right.IP.fq -L left.INPUT.fq -R right.INPUT.fq -g mm9 -c 8 -o output_dir
# to run ChIP Seq library in single sample mode, for broad peak, like H3K9me3
$PATH_TO_piPipes/piPipes	chip -l left.IP.fq -r right.IP.fq -L left.INPUT.fq -R right.INPUT.fq -g mm9 -c 8 -o output_dir -B
# to run ChIP Seq library in single sample mode with Single-End library
$PATH_TO_piPipes/piPipes	chip -i IP.fq  -I input.fq  -g dm3
# to run ChIP Seq library in single sample mode, only use unique mappers reported by Bowtie2 (default)
$PATH_TO_piPipes/piPipes	chip -l left.IP.fq -r right.IP.fq -L left.INPUT.fq -R right.INPUT.fq -g mm9 -c 8 -o output_dir -u
# to run ChIP Seq library in single sample mode, for multi-mappers, let Bowtie2 randomly assign it to ONE of the best loci
$PATH_TO_piPipes/piPipes	chip -l left.IP.fq -r right.IP.fq -L left.INPUT.fq -R right.INPUT.fq -g mm9 -c 8 -o output_dir -m
# to run ChIP Seq library in single sample mode, for multi-mappers, let Bowtie (not Bowtie2) report all the best alignments; then apply the EM algorithm, using CSEM, to allocate each read to one locus with >0.5 CSEM posterior
$PATH_TO_piPipes/piPipes	chip -l left.IP.fq -r right.IP.fq.gz -L left.INPUT.fq.bz2 -R right.INPUT.fq -g mm9 -c 8 -o output_dir -e

# to run ChIP Seq library in dual library mode (single sample mode must have been run for each sample first)
$PATH_TO_piPipes/piPipes	chip2 -a directory_A -b directory_B -g mm9 -c 8 -o output_dir
# to run ChIP Seq library in dual sample mode, extend up/down stream 5000 bp for TSS/TES/meta analysis (for bwtool)
$PATH_TO_piPipes/piPipes	chip2 -a directory_A -b directory_B -g mm9 -c 8 -o output_dir -x 5000

# ===== Genomic-seq pipeline =====
# to run Genome Seq library
$PATH_TO_piPipes/piPipes	dna -l left.fq -r right.fq -g dm3 -c 24 -D 100

Find more detailed information on the Wiki.
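
To show how the single-sample and dual-sample modes fit together, the sketch below prints (rather than runs) the commands for two small RNA libraries followed by their pairwise comparison. It only uses options that appear in the examples above (`-i`, `-g`, `-c`, `-o` for `small`; `-a`, `-b`, `-g`, `-c` for `small2`); the function name, file names, and directory names are placeholders.

```shell
# Dry-run sketch: print the piPipes commands for a two-sample small RNA
# comparison. File and directory names are placeholders.
print_small_rna_plan() {
    genome=$1    # e.g. dm3
    threads=$2   # e.g. 24
    fq_a=$3      # trimmed FASTQ for sample A
    fq_b=$4      # trimmed FASTQ for sample B

    # Step 1: single-sample mode, one output directory per library.
    echo "piPipes small -i $fq_a -g $genome -c $threads -o out_A"
    echo "piPipes small -i $fq_b -g $genome -c $threads -o out_B"

    # Step 2: dual-sample mode on the two single-sample output directories.
    echo "piPipes small2 -a out_A -b out_B -g $genome -c $threads"
}

print_small_rna_plan dm3 24 A.trimmed.fq.gz B.trimmed.fq.gz
```

Printing the plan first makes it easy to review the order of the steps, since the dual-sample command depends on both single-sample runs having finished.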

###install : to install genome assembly
Because of GitHub's file size limits, piPipes doesn't ship with the genome sequences and annotation. Instead, we provide scripts to download genome assembly files from the iGenome project of Illumina. Please make sure internet access is available during this process. piPipes provides an option to separate downloading from other processes, in case the machine/node with internet access is not appropriate for building indexes and other heavy work.
Besides the genome, this pipeline also installs any unavailable R packages under the pipeline directory. Downloading and installation can be separated using the -D option, in case the head node is not supposed to be used for heavy computational work, like building indexes.
Currently, piPipes comes with annotation files for Drosophila melanogaster (dm3 and BDGP6), Mus musculus (mm9), Homo sapiens (hg19), Danio rerio (danRer7), Rattus norvegicus (rn5), and Bos taurus (bosTau7). Arabidopsis thaliana (TAIR10) is also included (but not rigorously tested), though no piRNA has been described in plants.
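
Since the supported assemblies are listed in common/iGenome_UTL.txt, a quick pre-flight check before calling `piPipes install` might look like the sketch below. The layout of that file is not described here, so the sketch only assumes that the assembly name appears in it as a whole word; the function name `assembly_listed` is hypothetical.

```shell
# Hypothetical pre-flight check: is the assembly name mentioned in the list file?
# Assumes the name appears as a whole word somewhere in the file.
# Usage: assembly_listed <list file> <assembly>
assembly_listed() {
    grep -qwF -- "$2" "$1"
}

# Example (PATH_TO_piPipes must point at the piPipes directory):
# if assembly_listed "$PATH_TO_piPipes/common/iGenome_UTL.txt" dm3; then
#     "$PATH_TO_piPipes/piPipes" install -g dm3
# else
#     echo "dm3 is not listed; consider the -l option to give a location" >&2
# fi
```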