-
Notifications
You must be signed in to change notification settings - Fork 40
Genome seq
This document explains how to run piPipes Genome-seq pipeline and how to interpret the output.
This pipeline provides analysis on transposon insertion/deletion as well as general structural variation (SV) events using paired-end Genome-seq reads generated by Next Generation Sequencing.
# require internet access
piPipes install -g dm3
# use fastq-dump from SRATools (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software)
# to download data and convert to fastq; require internet access
fastq-dump -F --split-3 --gzip -A Theurkauf.GSQ.w1xHar.21day.ovary SRR333512
piPipes dna
# or
piPipes dna -h
# -l: left read of Genome-seq
# -r: right read of Genome-seq
# -g: genome to use
# -o: output directory
piPipes dna \
-l Theurkauf.GSQ.w1xHar.21day.ovary_1.fastq.gz \
-r Theurkauf.GSQ.w1xHar.21day.ovary_2.fastq.gz \
-g dm3 \
-o Theurkauf.GSQ.w1xHar.21day.ovary.piPipes_out \
1> Theurkauf.GSQ.w1xHar.21day.ovary.piPipes.stdout \
2> Theurkauf.GSQ.w1xHar.21day.ovary.piPipes.stderr
# -l: left read of Genome-seq
# -r: right read of Genome-seq
# -g: genome to use
# -o: output directory
# -d: VCF filtering depth, passed to "vcfutils.pl varFilter -D" and "retroseq.pl -call -depth"
# -e: Transgene sequence in fasta format. piPipes calls TEMP to identify new transposon
# insertion site
# -M: Use mrFast and VariationHunter. mrFast requires large amount of memory so
# it's off by default.
piPipes dna \
-l Theurkauf.GSQ.w1xHar.21day.ovary_1.fastq.gz \
-r Theurkauf.GSQ.w1xHar.21day.ovary_2.fastq.gz \
-g dm3 \
-d 200 \
-e P.fa \
-M \
-o Theurkauf.GSQ.w1xHar.21day.ovary.piPipes_out \
1> Theurkauf.GSQ.w1xHar.21day.ovary.piPipes.stdout \
2> Theurkauf.GSQ.w1xHar.21day.ovary.piPipes.stderr
The output should contain the following directories:
mrFast_VariationHunter_output/
bigWig/
TEMP_output/
break_dancer_out/
bwa_bcftools_output/
retroSeq_discovering/
pdfs/
mrFast_VariationHunter_output/
contains the output of mrFast
and Variation Hunter
.
By default, mrFast
and Variation Hunder
are not ran because mrFast
uses memory that is proportional to the input Fastq.
In most cases, we were unable to finish running.
bwa_bcftools_output/
contains the genome alignments as well as SNP calling using bcftools
.
# genome alignment by BWA MEM algorithm and sorted by coordinates
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.log
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.bam
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam.bai
# SNP calling output
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.var.raw.bcf
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.var.flt.vcf
# bedGraph, used to produce the bigWig file
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bdg
# genome alignment by BWA ALN algorithm, only used by TEMP
Theurkauf.GSQ.w1xHar.21day.1_sequence.sai
Theurkauf.GSQ.w1xHar.21day.2_sequence.sai
Theurkauf.GSQ.w1xHar.21day.bwa-aln.sorted.bam
Theurkauf.GSQ.w1xHar.21day.bwa-aln.sorted.bam.bai
bigWig/
directory contains files that can be used by UCSC genome browser.
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bigWig
TEMP_output/
contains output by TEMP.
This is the algorithm used in:
Adaptation to P element transposon invasion in Drosophila melanogaster. Cell. 2011 Dec 23;147(7):1551-63.
If feeding this pipeline with -e
option and a Fasta file, TEMP
will be used to locate the insertion of this sequence in the genome.
Our lab constantly uses this function to map transgene insertion site.
# intermediates files
dm3.repBase.fa
/project/umw_phil_zamore/hanb/tmp/GENOME/Theurkauf.GSQ.w1xHar.21day.ovary_1.piPipes_out/bwa_bcftools_output/Theurkauf.GSQ.w1xHar.21day.bwa-aln.sorted.bam
/project/umw_phil_zamore/hanb/tmp/GENOME/Theurkauf.GSQ.w1xHar.21day.ovary_1.piPipes_out/bwa_bcftools_output/Theurkauf.GSQ.w1xHar.21day.bwa-aln.sorted.bam.bai
dm3.repBase.fa.bwt
dm3.repBase.fa.pac
dm3.repBase.fa.ann
dm3.repBase.fa.amb
dm3.repBase.fa.sa
Theurkauf.GSQ.w1xHar.21day.bwa-aln.unpair.uniq.transposons.bed
Theurkauf.GSQ.w1xHar.21day.bwa-aln.unpair.uniq.transposons.filtered.bed
Theurkauf.GSQ.w1xHar.21day.bwa-aln.uniq.transposons.filtered.wGap.class.bed
Theurkauf.GSQ.w1xHar.21day.bwa-aln.insertion.bp.bed
Theurkauf.GSQ.w1xHar.21day.bwa-aln.clipped.reads.aln
Theurkauf.GSQ.w1xHar.21day.bwa-aln.insertion.refined.bp
# files with transposon insertion events
Theurkauf.GSQ.w1xHar.21day.bwa-aln.insertion.refined.bp.summary
$ head Theurkauf.GSQ.w1xHar.21day.bwa-aln.insertion.refined.bp.summary
Chr Start End TransposonName TransposonDirection Class VariantSupport Frequency Junction1 Junction1Support Junction2 Junction2Support 5'_Support 3'_Support
chr2L 27827 28327 DM176 sense singleton 1 0.0222 28077 0 28077 0 1 0
chr2L 44874 45374 BLOOD_I sense singleton 1 0.0294 45124 0 45124 0 1 0
chr2L 290721 291221 POGON1 antisense singleton 3 0.2500 291182 1 291185 1 0 1
chr2L 290721 291221 POGO antisense singleton 3 0.2500 291182 1 291185 1 0 1
chr2L 362397 362897 ROO_LTR sense singleton 1 0.0370 362647 0 362647 0 0 1
chr2L 493282 493782 R1_DM antisense singleton 1 0.0909 493532 0 493532 0 0 1
chr2L 512615 512722 LTRMDG3_DM sense 1p1 8 0.3200 512702 3 512702 0 1 4
chr2L 639122 639211 BEL_LTR antisense 1p1 12 0.4800 639191 2 639195 2 3 5
chr2L 735384 735884 ROO_LTR antisense 2p 7 0.3500 735883 2 735888 3 0 2
retroSeq_discovering/
contains output of retroSeq, which is another algorithm seeking transposon movement
# intermediate file
exd.file
Theurkauf.GSQ.w1xHar.21day.dm3.bwa.sorted.retroSeq
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam.vcf.PE.vcf.candidates
# this two files contain the identified loci
# see https://github.com/tk2/RetroSeq/wiki/RetroSeq-Tutorial for more information
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam.vcf.PE.vcf
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam.vcf.PE.vcf.bed
break_dancer_out/
contains outputs of Break Dancer for discovering general structural variation in the genome.
It is NOT specialized in transposon related insertion/deletion.
# intermediate files
Theurkauf.GSQ.w1xHar.21day.breakdancer
Theurkauf.GSQ.w1xHar.21day.breakdancer.NA.2.fastq
Theurkauf.GSQ.w1xHar.21day.breakdancer.NA.1.fastq
Theurkauf.GSQ.w1xHar.21day.breakdancer.SV.bed
# outputs
# please see https://github.com/kenchen/breakdancer for detailed explanation
Theurkauf.GSQ.w1xHar.21day.breakdancer.out
Theurkauf.GSQ.w1xHar.21day.breakdancer.out.for_Circos
pdfs
contains the output pdf Circos plot representing overall looking of TEMP
retroSeq
and Break Dancer
.
# from innert to outer: break dancer, retroSeq and TEMP
# piRNA cluster loci has been marked
Theurkauf.GSQ.w1xHar.21day.summary.circos.pdf
Please see example figures from the Github Wiki page or our manuscript.
##Flowchart and example figures from our manuscript