installation
This document explains how to obtain piPipes from GitHub and how to install genome files.
To clone the repository from GitHub, you will need to have git installed on your system.
If it is not installed, please download and install git first.
# The genome sequence and annotations will be stored under the piPipes directory
# so allow extra ~8.5 G for dm3 (fly), ~90 G for mm9 (mouse), ~131 G for hg19 (human)
git clone https://github.com/bowhan/piPipes.git
# To update an existing clone, enter the piPipes directory and then type:
git pull
# Occasionally, you might get an error message like:
git pull
Updating 42bf792..fe137aa
error: Untracked working tree file 'common/dm3/rRNA.fa' would be overwritten by merge. Aborting
# this issue originated from the explicit inclusion of rRNA.fa file in piPipes, which conflicts
# with the same file extracted from iGenome when you install the genome
# to solve "Untracked working tree file":
rm -f common/dm3/rRNA.fa && git pull
# you might also get an error like:
warning: Cannot merge binary files: common/dm3/structured_loci.bed.gz
(HEAD vs. fe137aab4c81c6b0ff3f66cef68e3b7e396aba15)
Auto-merging common/dm3/structured_loci.bed.gz
CONFLICT (content): Merge conflict in common/dm3/structured_loci.bed.gz
Automatic merge failed; fix conflicts and then commit the result.
# this issue originates from a forced update of a file that was listed in .gitignore
# to solve "Merge conflict":
git checkout -- common/dm3/structured_loci.bed.gz
git pull
# To discard all local changes and match origin/master, enter the piPipes directory and then type:
git reset --hard origin/master
# to re-install a genome in a clean background, enter the common/ directory and do:
rm -rf bosTau7
git checkout -- bosTau7
piPipes install -g bosTau7
Alternatively, you can obtain piPipes from its release page.
Note that you will not be able to upgrade without git.
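Because the genome files are stored inside the piPipes directory, it can help to confirm the free space before cloning and installing. The following is a minimal sketch, not part of piPipes: `has_free_space` is a hypothetical helper, and the 9 GB threshold is simply the ~8.5 G quoted above for dm3, rounded up.

```shell
# Hypothetical helper, not part of piPipes: succeed if the file system that
# holds directory $1 has at least $2 GB available (whole GB only).
has_free_space() {
    dir=$1
    need_gb=$2
    # df -Pk prints one POSIX-format line per file system; column 4 is available KB
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 {print $4}')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example: check the current directory before installing the fly genome (~8.5 G)
if has_free_space . 9; then
    echo "enough space for dm3"
else
    echo "not enough space for dm3"
fi
```

For mm9 or hg19, the same call would use roughly 90 or 131 as the second argument.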
Make symbolic links to the piPipes scripts so that you can run piPipes without typing its absolute path:
# Enter the piPipes directory
ln -s $PWD/piPipes $HOME/bin/piPipes
ln -s $PWD/piPipes_debug $HOME/bin/piPipes_debug
# If successfully done:
$ which piPipes
~/bin/piPipes
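The links above only help if `$HOME/bin` exists and is listed in `$PATH`; `ln -s` fails when the target directory is missing, and `which piPipes` prints nothing when the directory is not on the search path. A small sketch for checking this follows. `dir_in_path` is a hypothetical helper, and the suggested `export` line is an assumption about how your shell profile is set up.

```shell
# Hypothetical helper: succeed if directory $1 is one of the entries of $PATH
dir_in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if dir_in_path "$HOME/bin"; then
    echo "$HOME/bin is in PATH"
else
    # Printed as a suggestion only; nothing is modified
    echo "consider adding to your shell profile: export PATH=\"\$HOME/bin:\$PATH\""
fi
```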
piPipes has most of the third-party tools pre-compiled and included in the bin directory.
They will be found automatically when you run piPipes.
To avoid mixing them with your own versions, we do not recommend adding /piPipes/bin to the $PATH.
However, some tools are hard to ship, so you will need to install them yourself if you have not already done so.
# 1. R
# Please follow instructions on http://www.r-project.org/ to install R
#! if successfully installed:
$ which Rscript
~/bin/Rscript
# ! Note that R 3.1.0 has a different behavior for read.table(); see
# http://stackoverflow.com/questions/22962917/barplot-failure-in-r-3-1-0-read-csv-converting-what-should-be-numerics-to-facto/23225932#23225932
# This has been fixed in R 3.1.1.
# Also try to keep only one version of R in your system or $PATH
# Many of the "bugs" reported by our users were caused by multiple versions of R!
# FYI: in the installation pipeline, piPipes will try to install the following packages.
# It would be nice if they are manually installed and confirmed.
## from CRAN
RColorBrewer
ggplot2
ggthemes
gplots
parallel
scales
reshape
gridExtra
gdata
RCircos
## from Bioconductor
cummeRbund
# 2. HTSeq-count
# Please follow instructions on http://www-huber.embl.de/users/anders/HTSeq/doc/install.html
# to install HTSeq-count
# or if you have pip set up
pip install HTSeq
#! if successfully installed:
$ which htseq-count
~/bin/htseq-count
# HTSeq-count is used in RNA-seq pipeline; if you are not planning to use RNA-seq pipeline, you
# might not need it
# 3. MACS2
# Please follow instructions on https://github.com/taoliu/MACS/blob/master/INSTALL.rst
# to install MACS2
# or if you have pip
pip install macs2 # please run "macs2 callpeak -h" to see if the option --outdir is included...; if not, install it from github
#! if successfully installed:
$ which macs2
~/bin/macs2
# MACS2 is used in ChIP-seq pipeline; if you are not planning to use ChIP-seq pipeline, you
# might not need it
# 4. Perl Module Statistics::Descriptive; install it through
cpan Statistics::Descriptive
#! if successfully installed:
$ perl -MStatistics::Descriptive -e "print \"Installed.\\n\";"
Installed.
# Bio::Seq
# please follow the instructions here
# http://www.bioperl.org/wiki/Installing_BioPerl_on_Unix
# These two modules are only used in the genome-seq pipeline; if you are not planning to use the
# genome-seq pipeline, you might not need them
# 5. GNU awk
# GNU awk is heavily used in piPipes. But some versions of awk do not have the GNU extensions,
# for example, the variable ARGIND; to test it:
$ echo 1 | awk '{print ARGIND}'
# if it prints nothing, your awk does not define the ARGIND variable, which will cause
# issues when you run piPipes
# the easiest way to install gawk is to use linuxbrew
# https://github.com/Homebrew/linuxbrew
# Please follow their instruction to install linuxbrew and install gawk with the following:
$ brew install gawk
# then make a symbolic link in the bin directory of piPipes so that piPipes uses it as "awk"
$ ln -s $HOME/.linuxbrew/bin/gawk /path/to/piPipes/bin/awk
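The individual `which` checks and the ARGIND test above can be gathered into one pass. This is a sketch under assumptions: `check_tool` and `awk_has_argind` are hypothetical helpers that are not shipped with piPipes, and the tool list is only the externally installed programs named in this section.

```shell
# Hypothetical helper: report whether a command can be found on $PATH
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Hypothetical helper: succeed only if the given awk prints something for
# ARGIND, which is the same test as `echo 1 | awk '{print ARGIND}'`
awk_has_argind() {
    [ -n "$(echo 1 | "${1:-awk}" '{print ARGIND}' 2>/dev/null)" ]
}

for tool in Rscript htseq-count macs2 perl; do
    check_tool "$tool" || true   # keep going so that every tool is reported
done

if awk_has_argind awk; then
    echo "found:   awk with ARGIND"
else
    echo "missing: awk with ARGIND (consider installing gawk)"
fi
```

The sketch only reports; it does not install anything or change `$PATH`.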
piPipes provides a uniform interface for different organisms/genomes. Due to Github's limit on the size of a single file, genome sequences and annotations are downloaded separately. The user will need to perform an installation to download the files and prepare them for other pipelines to use.
To install a specific genome in one step:
piPipes install -g dm3 # fly genome dm3
piPipes install -g dm6 # fly genome new release, BDGP6
piPipes install -g mm9 # mouse genome mm9
piPipes install -g hg19 # human genome hg19
Many computing clusters only have internet access on the 'head node', which should only be used to submit jobs but not to run jobs. To separate downloading and preparation steps:
# under the "head" node: with internet access but no computing power
piPipes install -g dm3 -D
# finish the work under a computing node
piPipes install -g dm3
# Some steps take advantage of multiple CPUs, so providing more than one CPU using `-c`
# accelerates the installation process.
piPipes install -g dm3 -c 8
Notes:
- piPipes uses wget --continue, so downloading will resume if the installation is disrupted. piPipes also only runs steps that haven't succeeded.
- During the installation, the user will be prompted to define the length of siRNAs and piRNAs for the genome to be installed. Our lab uses 20–22 nt for fly/mouse siRNA, 23–29 nt for fly piRNA and 23–35 nt for mouse piRNA. This information is stored in files such as common/dm3/variables, and users can change the values manually later.
- The installation of R packages is NOT multi-threading safe, so please install each genome separately.
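Since genomes should not be installed concurrently, one way to queue several of them is a plain loop that runs one installation at a time. The wrapper below is a sketch: `install_genomes` and the `DRY_RUN` switch are hypothetical names, and only the `-g` and `-c` options shown above are used. With `DRY_RUN=1` (the default in this sketch) it prints the commands instead of running them.

```shell
# Hypothetical wrapper: install genomes one after another, never in parallel.
# Usage: install_genomes <cpus> <genome> [<genome> ...]
install_genomes() {
    cpus=$1
    shift
    for genome in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            # Dry run: only show what would be executed
            echo "piPipes install -g $genome -c $cpus"
        else
            # Stop at the first failure so that later genomes are not started
            piPipes install -g "$genome" -c "$cpus" || return 1
        fi
    done
}

install_genomes 8 dm3 mm9
```

Running with `DRY_RUN=0` would call piPipes for real; the prompts for siRNA and piRNA lengths would still appear for each genome.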
Currently, Drosophila melanogaster and Mus musculus piRNAs are the most well studied. piPipes is optimized for those two species (assembly versions dm3 and mm9 from UCSC). For other organisms, some functions in the pipelines may not be performed because the piRNA cluster annotation is relatively immature. Please contact us if you would like to contribute to the annotations of organisms that are poorly supported by piPipes.
All the files for a specific genome are stored under /path/to/piPipes/common/.
For example, fly files are stored under /path/to/piPipes/common/dm3.
Most of them are in gzipped BED format.
piPipes downloads the annotation from iGenome, which lacks chrU and X-TAS. piPipes thus downloads chrU.fa from UCSC and keeps X-TAS.fa in the GitHub repository.
For piRNA cluster annotation, piPipes uses the one from Brennecke, et al., Cell, 2007.
For transposons, piPipes uses two different annotations: the transposon sequences are from flyBase and the repBase sequences are from repBase. The transposon annotation has been used in the Zamore Lab since Li, et al., Cell, 2009. The repBase annotation separates the Long Terminal Repeat (LTR) of a retrotransposon from the middle part, so LTR-derived sequences do not become multi-mappers simply because a transposon sequence contains two LTRs.
piPipes has incorporated the new assembly of fruitfly genome release 6.
# To install the new release, type:
piPipes install -g dm6
Since it was just released (July 2014), neither iGenome nor UCSC has incorporated it yet. We used most of the annotation files from flyBase. Several notes:
1.piRNA cluster
Using the converter tool provided by flyBase, we tried to convert the piRNA cluster coordinates to the new assembly. However, 46 clusters could not be found in the new assembly, mostly due to "maps to more than one scaffold".
We now keep only the 96 that were successfully mapped, but we plan to use new data with higher depth, and possibly new algorithms, to annotate new clusters.
For more information, please read file common/dm6/Brennecke.piRNAcluster.bed6.converted.failed
2.Repeat Masker
We ran RepeatMasker using the following parameters to identify transposon sites in BDGP6.
Note that by providing -species drosophila, we were using the transposon sequences from repBase instead of the sequences from flyBase.
# Using flyBase transposon sequences
RepeatMasker \
-pa 24 \
-s \
-low \
-lib dmel-all-transposon-r6.01.fasta \
-gff dmel-all-chromosome-r6.01.fasta \
1> flyBase.stdout \
2> flyBase.stderr
# Using repBase
RepeatMasker \
-pa 24 \
-s \
-low \
-species drosophila \
-gff dmel-all-chromosome-r6.01.fasta \
1> repBase.stdout \
2> repBase.stderr
3.GTF file
The gtf file obtained from flyBase ftp://ftp.flybase.net/releases/FB2014_04/dmel_r6.01/gtf/dmel-all-r6.01.gtf.gz cannot be correctly processed by gtfToGenePred from kent tools, due to the presence of "trans-splicing" of mdg4.
invalid gffGroup detected on line: 3R FlyBase CDS 21375060 21375912 3.000000 - 0 gene_id "FBgn0002781"; transcript_id "FBtr0084081";
GFF/GTF group FBtr0084081 on 3R+, this line is on 3R-, all group members must be on same seq and strand
# the trans-spliced transcripts include
FBtr0084079
FBtr0084080
FBtr0084081
FBtr0084082
FBtr0084083
FBtr0084084
FBtr0084085
FBtr0307759
FBtr0307760
We thus removed all the mdg4 annotations:
grep -v mdg4
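To make the effect of that filter concrete, here is a small sketch that applies it to two made-up, GTF-like lines. The coordinates, the second gene ID and the `gene_symbol` attribute are illustrative assumptions rather than lines copied from the flyBase file, and `remove_mdg4` is a hypothetical wrapper around the command above.

```shell
# Hypothetical wrapper: drop every line that mentions mdg4 (stdin to stdout)
remove_mdg4() {
    grep -v mdg4
}

# Two made-up lines; only the second one should survive the filter
{
    printf '3R\tFlyBase\tCDS\t100\t200\t.\t-\t0\tgene_id "FBgn0002781"; gene_symbol "mod(mdg4)";\n'
    printf '2L\tFlyBase\texon\t300\t400\t.\t+\t.\tgene_id "FBgn_example"; gene_symbol "exampleGene";\n'
} | remove_mdg4
```

Because the match is a plain substring search, any line containing the text mdg4 is removed, whichever column it appears in.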
piPipes downloads the annotation from iGenome.
piPipes uses the piRNA cluster annotation from Li, et al., Mol Cell, 2013 and transposon annotation from repBase.
piPipes downloads the annotation from iGenome.
piPipes uses the piRNA cluster annotation from Rosenkranz, et al., BMC Bioinformatics, 2013 and transposon annotation from repBase.
In order for piPipes to perform its full function on other genomes, the following steps should be completed:
1.Annotate piRNA clusters and provide them in BED format. Please also provide the sequences in a file named ${GENOME}.piRNAcluster.fa.
Run proTRAC or piClust to produce the piRNA cluster annotation.
Rosenkranz D and Zischler H. 2012. proTRAC--a software for probabilistic piRNA cluster detection,
visualization and analysis. BMC Bioinformatics 13: 5.
Jung, I., Park, J. C. & Kim, S. piClust: A density based piRNA clustering algorithm.
Comput Biol Chem (2014).
2.Get gene structure annotations from the UCSC table browser or through the MySQL interface.
We have already included those files for many organisms in the common folder.
If the folder already exists, there is no need to do this step.
We provide an option -C to install genomes that are not currently supported by iGenome:
-C Custom genome installation. The user will need to create a folder
$PIPELINE_DIRECTORY/common/MY_GENOME and provide the following files:
$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.fa --> genome sequence in fasta format
$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.transposon.fa --> transposon sequence in fasta format
$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.piRNAcluster.bed --> piRNA cluster in bed format
$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.genes.gtf --> genes annotation in gtf format
$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.hairpin.fa --> miRNA hairpin sequence in fasta format
$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.mature.fa --> miRNA sequence in fasta format
*Note that if you obtain hairpin and mature sequences from miRBase, you can extract the sequences
corresponding to your genome using $PIPELINE_DIRECTORY/bin/piPipes_extract_organiam_from_fa.py:
$PIPELINE_DIRECTORY/bin/piPipes_extract_organiam_from_fa.py hairpin.fa dme > \
$PIPELINE_DIRECTORY/common/dm3/dm3.hairpin.fa
$PIPELINE_DIRECTORY/bin/piPipes_extract_organiam_from_fa.py mature.fa dme > \
$PIPELINE_DIRECTORY/common/dm3/dm3.mature.fa
Then run:
piPipes install -g MY_GENOME -C
It is better to name the genome using only lowercase a-z and underscores. Avoid all-upper-case names such as "GENOME".
Then please create the genomic_features file according to the instructions at the end of this document.
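Before running the custom installation, it may be worth confirming that all six files listed for `-C` are in place. The check below is a sketch and not part of piPipes: `missing_custom_genome_files` is a hypothetical helper, and it only tests that the files exist, not that their contents are valid.

```shell
# Hypothetical pre-flight check: print every file that the -C option expects
# but that is absent from <pipeline_dir>/common/<genome>.
missing_custom_genome_files() {
    pipeline_dir=$1
    genome=$2
    for suffix in fa transposon.fa piRNAcluster.bed genes.gtf hairpin.fa mature.fa; do
        f="$pipeline_dir/common/$genome/$genome.$suffix"
        [ -f "$f" ] || echo "$f"
    done
}

# Self-contained demonstration in a throwaway directory: two of the six
# files are created, so the other four are reported as missing.
demo=$(mktemp -d)
mkdir -p "$demo/common/my_genome"
touch "$demo/common/my_genome/my_genome.fa" "$demo/common/my_genome/my_genome.genes.gtf"
missing_custom_genome_files "$demo" my_genome
rm -rf "$demo"
```

In real use the first argument would be the piPipes directory and the second the name of your genome folder; empty output means that all six files were found.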
bosTau7
rn5
danRer7
TARI10
hg19
mm9
dm3
3.Edit the genomic_features file under the genome folder. See the next section.
4.The genome sequences should be provided in a file named $GENOME.fa.
piPipes builds a bowtie index of the genome sequence for the small RNA pipeline, a STAR index for the RNA-seq and degradome pipelines, and a Bowtie2 index for the Genome-seq pipeline.
5.The rRNA sequence should be provided in a file named rRNA.fa.
piPipes builds a bowtie index of the rRNA for small RNA and a bowtie2 index for normal RNA.
6.The transposon consensus sequences should be provided in a file named ${GENOME}.repBase.fa.
piPipes builds a bowtie index of the repBase/transposon/piRNA cluster sequences for small RNA.
|-- piPipes/ # top directory
| |-- piPipes # main bash script to run
| |-- piPipes_debug # main bash script to run, debug mode
| |-- bin/ # binary executables
| |-- piPipes_smallRNA.sh # smallRNA seq pipeline, single sample mode
| |-- piPipes_smallRNA2.sh # smallRNA seq pipeline, dual sample mode
| |-- piPipes_RNASeq.sh # RNA-seq pipeline, single sample mode
| |-- piPipes_RNASeq2.sh # RNA-seq pipeline, dual sample mode
| |-- piPipes_DegradomeSeq.sh # Degradome-seq pipeline
| |-- piPipes_ChIPSeq.sh # ChIP-seq pipeline, single sample mode
| |-- piPipes_ChIPSeq2.sh # ChIP-seq pipeline, dual sample mode
| |-- piPipes_GenomeSeq.sh # Genomic Seq pipeline
| |-- ... # binaries like bowtie, STAR, cufflinks ...
| |-- src/ # source codes
| |-- bed2_to_bedGraph.cpp # piPipes source codes
| |-- third_party/ # source codes of other tools; use this if the precompiled ones don't work
| |-- ...
| |-- common/ # where annotations and sequences are stored
| |-- mm9/
| |-- dm3/
| |-- dm3.fa # genome sequence
| |-- genomic_features # very important configuration file, see below
| |-- Brennecke.piRNAcluster.bed6.gz # one of the annotation files, in bed format
| |-- BowtieIndex/
| |-- ...
| |-- dm6/
| |-- hg19/
| |-- genome_supported.txt # stores the names of genomes that have been installed
| |-- RepBase19.02.fasta.tar.gz # transposon consensus sequences from repBase
| |-- reformat_repBase_for_eXpress.sh # eXpress only takes the first token of Fasta name...
piPipes downloads annotations from iGenome (UCSC version), which usually includes genomic sequence (fasta), rRNA (fasta), transcriptome (gtf) to be used by piPipes.
piPipes includes the repBase (fasta) in the GitHub repository for dm3 and mm9. For other genomes, please retrieve the repBase.fa and name it ${GENOME}.repBase.fa in the common/${GENOME} directory.
For example, run:
# Enter the directory unarchived from RepBase19.02.fasta.tar.gz
$ cat humrep.ref humsub.ref > ../hg19/hg19.repBase.fa
# for hg19 genome, please then run
bash reformat_repBase_for_eXpress.sh hg19/hg19.repBase.fa > hg19/hg19.repBase.fa.1 && \
mv hg19/hg19.repBase.fa.1 hg19/hg19.repBase.fa
# it replaces spaces in the fasta header with underscores
# this step is essential since eXpress only uses the first token as name
# and some transposon sequences share same name
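The header rewrite described in the comments above can be illustrated with a short awk filter. This is a stand-in written for illustration, not the contents of reformat_repBase_for_eXpress.sh, and the sample header is made up. It shows why the step matters: once spaces become underscores, the whole header is a single token.

```shell
# Illustrative stand-in, not the shipped script: replace spaces with
# underscores on FASTA header lines only; sequence lines pass through unchanged.
underscore_fasta_headers() {
    awk '/^>/ { gsub(/ /, "_") } { print }'
}

# Made-up record: the header becomes ">L1 example element" -> ">L1_example_element"
printf '>L1 example element\nACGT\n' | underscore_fasta_headers
```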
piPipes includes a number of genomic features (bed) in the genomic_features file under the directory of each genome.
To add your own features, place the bed files in the common/${GENOME} directory and add them to the TARGETS array in common/${GENOME}/genomic_features.
Use the following example as a guide:
# variables for small RNA pipeline intersecting
MASK=$COMMON_FOLDER/UCSC.rRNA+tRNA+nonCoding.bed6.gz
# tRNA, rRNA, nonCoding RNA (flyBase) from UCSC table browser
piRNA_Cluster=$COMMON_FOLDER/Brennecke.piRNAcluster.bed6.gz
# piRNA cluster defined in Brennecke, et al., Cell, 2007; no strand information
piRNA_Cluster_42AB=$COMMON_FOLDER/Brennecke.piRNAcluster.42AB.bed6.gz
# 42AB
piRNA_Cluster_20A=$COMMON_FOLDER/Brennecke.piRNAcluster.20A.bed6.gz
# 20A
piRNA_Cluster_flam=$COMMON_FOLDER/Brennecke.piRNAcluster.flam.bed6.gz
# flam
repeatMasker=$COMMON_FOLDER/UCSC.RepeatMask.bed
# repeatMasker obtained from UCSC
repeatMasker_IN_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.inCluster.bed.gz
# repeat masker identified region that fall into piRNA cluster
repeatMasker_OUT_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.outCluster.bed.gz
# repeat masker identified region that fall outside piRNA cluster
Trn=$COMMON_FOLDER/Zamore.transposon.bed.gz
# transposon region used in Li, et al., Cell, 2009. More conserved than repeat masker
Trn_IN_Cluster=$COMMON_FOLDER/Zamore.transposon.inCluster.bed.gz
# transposon region in cluster
Trn_OUT_Cluster=$COMMON_FOLDER/Zamore.transposon.outCluster.bed.gz
# transposon region out cluster
Trn_GROUP0=$COMMON_FOLDER/Zamore.transposon.group0.bed.gz
# transposons that failed to pass threshold in Li, et al., Cell, 2009.
# More conserved than repeat masker
Trn_GROUP1=$COMMON_FOLDER/Zamore.transposon.group1.bed.gz
# group 1 transposon in Li, et al., Cell, 2009, mainly germline
Trn_GROUP2=$COMMON_FOLDER/Zamore.transposon.group2.bed.gz
# group 2 transposon in Li, et al., Cell, 2009
Trn_GROUP3=$COMMON_FOLDER/Zamore.transposon.group3.bed.gz
# group 3 transposon in Li, et al., Cell, 2009, mainly somatic
flyBase_Gene=$COMMON_FOLDER/UCSC.flyBase.Genes.bed12.gz
# flyBase gene
flyBase_Exon=$COMMON_FOLDER/UCSC.flyBase.Exons.bed.gz
# flyBase exons
flyBase_Intron=$COMMON_FOLDER/UCSC.flyBase.Introns.bed.gz
# flyBase introns
flyBase_Intron_xRM=$COMMON_FOLDER/UCSC.flyBase.Introns_xRM.bed.gz
# flyBase introns that subtract repeatMasker
flyBase_5UTR=$COMMON_FOLDER/UCSC.flyBase.5UTR.bed.gz
# flyBase 5' UTR
flyBase_CDS=$COMMON_FOLDER/UCSC.flyBase.CDS.bed.gz
# flyBase CDS
flyBase_3UTR=$COMMON_FOLDER/UCSC.flyBase.3UTR.bed.gz
# flyBase 3' UTR
cisNATs=$COMMON_FOLDER/cisNATs.bed.gz
# cis-NATs
structural_loci=$COMMON_FOLDER/structured_loci.bed.gz
# structural loci
lincRNA=$COMMON_FOLDER/lincRNA.Young.bed6.gz
# linc RNA identified in 'Identification and properties of 1,119 candidate lincRNA loci in the
# Drosophila melanogaster genome. Genome Biol Evol. 2012;4(4):427-42.'
unannotated=$COMMON_FOLDER/unannotated_genome.bed.gz
# unannotated region, basically all the genome segments between annotations defined above
# TARGETS is used in small RNA-seq and degradome-seq pipeline
declare -a TARGETS=( \
"piRNA_Cluster" \
"piRNA_Cluster_42AB" \
"piRNA_Cluster_20A" \
"piRNA_Cluster_flam" \
"repeatMasker" \
"repeatMasker_IN_Cluster" \
"repeatMasker_OUT_Cluster" \
"Trn" \
"Trn_IN_Cluster" \
"Trn_OUT_Cluster" \
"Trn_GROUP1" \
"Trn_GROUP2" \
"Trn_GROUP3" \
"Trn_GROUP0" \
"flyBase_Gene" \
"flyBase_Exon" \
"flyBase_Intron" \
"flyBase_Intron_xRM" \
"flyBase_5UTR" \
"flyBase_CDS" \
"flyBase_3UTR" \
"cisNATs" \
"structural_loci" \
"lincRNA" \
"unannotated" )
# TARGETS_SHORT is used for "cis-Ping-Pong" analysis between degradome/small RNA.
# Since this step uses multi-threading itself, we are not able to run each feature simultaneously
# thus a few less important ones have been removed
declare -a TARGETS_SHORT=( \
"piRNA_Cluster" \
"piRNA_Cluster_42AB" \
"piRNA_Cluster_20A" \
"piRNA_Cluster_flam" \
"repeatMasker" \
"Trn" \
"Trn_GROUP1" \
"Trn_GROUP2" \
"Trn_GROUP3" \
"Trn_GROUP0" \
"flyBase_Gene" \
"flyBase_Exon" \
"flyBase_Intron_xRM" \
"flyBase_5UTR" \
"flyBase_3UTR" \
"lincRNA" )
# The following variables are for the pie chart, which gives reads information for genomic
# features that are mostly exclusive to each other. Different from the genomic feature count
# using TARGETS, reads mappable to genomic features in TARGETS_EXCLUSIVE will be partitioned.
# For example, if a read overlaps with a region annotated as both piRNA_Cluster and Repeats,
# piRNA_Cluster and Repeats will each get half of the reads.
# Please see small RNA-seq pipeline document for more information.
FivePrimeUTR=$flyBase_5UTR
ThreePrimeUTR=$flyBase_3UTR
CDS=$flyBase_CDS
Intron=$flyBase_Intron_xRM
Repeats=$repeatMasker
tRNA_NonCoding=$COMMON_FOLDER/UCSC.rRNA+tRNA+nonCoding.bed6.gz
declare -a TARGETS_EXCLUSIVE=(\
"piRNA_Cluster" \
"CDS" \
"FivePrimeUTR" \
"ThreePrimeUTR" \
"Intron" \
"Repeats" \
"tRNA_NonCoding" \
)
# variables for small RNA direct mapping
declare -a DIRECT_MAPPING=( "transposon" "repBase" "piRNAcluster" )
# gtf files for rnaseq/deg/cage htseq-count
Genes_transposon_Cluster=$COMMON_FOLDER/dm3.genes+transposon+piRNACluster.gtf
Genes_repBase_Cluster=$COMMON_FOLDER/dm3.genes+repBase+piRNACluster.gtf
declare -a HTSEQ_TARGETS=( "Genes_transposon_Cluster" "Genes_repBase_Cluster" )
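The partitioning rule described for TARGETS_EXCLUSIVE (a read that overlaps two features contributes half to each) can be worked through on a toy input. The sketch below is not how piPipes implements it. The three-column input format (read name, read count, comma-separated overlapping features) and the name `partition_reads` are assumptions made only for this illustration.

```shell
# Illustrative only: split each read's count equally among the features it
# overlaps, then print the per-feature totals in a fixed (byte) order.
partition_reads() {
    awk '{
        n = split($3, feature, ",")
        for (i = 1; i <= n; i++) total[feature[i]] += $2 / n
    }
    END { for (f in total) printf "%s\t%.1f\n", f, total[f] }' | LC_ALL=C sort
}

# read1 overlaps two features, so each receives 0.5 of its single read
printf '%s\n' \
    'read1 1 piRNA_Cluster,Repeats' \
    'read2 1 CDS' \
    'read3 2 piRNA_Cluster' |
    partition_reads
```

Here piRNA_Cluster ends with 2.5 (2 from read3 plus 0.5 from read1), Repeats with 0.5 and CDS with 1.0, so the totals still add up to the four input reads.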
For example:
# put the bed files under the common/xxx folder
#MASK is used to mask regions
MASK=$COMMON_FOLDER/region_I_want_to_mask.bed
# some regions of interest
piRNACluster=$COMMON_FOLDER/piRNAcluster.bed
myGene=$COMMON_FOLDER/myGene.bed
regionOfInterest=$COMMON_FOLDER/region1.bed
# put them in an array in this way
declare -a TARGETS=( \
"piRNACluster" \
"myGene" \
"regionOfInterest" \
)
# The following variables are for the pie chart, which gives reads information for genomic
# features that are mostly exclusive to each other. Different from the genomic feature count
# using TARGETS, reads mappable to genomic features in TARGETS_EXCLUSIVE will be partitioned.
# For example, if a read overlaps with a region annotated as both piRNA_Cluster and Repeats,
# piRNA_Cluster and Repeats will each get half of the reads.
# Please see small RNA-seq pipeline document for more information.
declare -a TARGETS_EXCLUSIVE=(\
"piRNACluster" \
"myGene" \
"regionOfInterest" \
)
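A typo in an array entry is easy to miss, because each entry is only the name of a variable defined earlier in the file. The sketch below checks that every name resolves to a path. It assumes that the pipelines look up each entry as a shell variable of the same name, which the layout of the file suggests but this page does not state. `resolve_targets` is a hypothetical helper, the stand-in paths mirror the example above, and the indirect expansion requires bash.

```shell
#!/usr/bin/env bash
# Stand-in values so that the sketch runs on its own; in practice they would
# come from common/${GENOME}/genomic_features
COMMON_FOLDER=/path/to/piPipes/common/my_genome
piRNACluster=$COMMON_FOLDER/piRNAcluster.bed
myGene=$COMMON_FOLDER/myGene.bed
regionOfInterest=$COMMON_FOLDER/region1.bed
declare -a TARGETS=( "piRNACluster" "myGene" "regionOfInterest" )

# Hypothetical helper: print "name<TAB>path" for every entry, and fail if a
# name has no matching (non-empty) variable
resolve_targets() {
    local name status=0
    for name in "$@"; do
        if [ -n "${!name:-}" ]; then
            # ${!name} is bash indirect expansion: the value of the variable
            # whose name is stored in $name
            printf '%s\t%s\n' "$name" "${!name}"
        else
            echo "undefined feature variable: $name" >&2
            status=1
        fi
    done
    return $status
}

resolve_targets "${TARGETS[@]}"
```

The same call could be repeated with `"${TARGETS_EXCLUSIVE[@]}"` once that array is defined.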