GitHub - jstjohn/SimSeq: An illumina paired-end and mate-pair short read simulator. This project attempts to model as many of the quirks that exist in Illumina data as possible. Some of these quirks include the potential for chimeric reads, and non-biotinylated fragment pull down in mate-pair libraries . Additionally the program provides the ability to model both site and base specific error, and scripts are provided to train this error model on real datasets. My hope in creating this program is to generate as realistic data as possible to assist in assessing the accuracy of genome assembly tools.

jstjohn / SimSeq Public

Notifications You must be signed in to change notification settings
Fork 12
Star 66

An illumina paired-end and mate-pair short read simulator. This project attempts to model as many of the quirks that exist in Illumina data as possible. Some of these quirks include the potential for chimeric reads, and non-biotinylated fragment pull down in mate-pair libraries . Additionally the program provides the ability to model both site a…

View license

66 stars 12 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 80 Commits
SimSeqNBProject		SimSeqNBProject
ThirdParty		ThirdParty
cUtils		cUtils
examples		examples
inc		inc
lib		lib
profiles		profiles
scripts		scripts
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README		README

Repository files navigation

SimSeqNBProject currently contains the net beans project with the initial source

getErrorProfile contains the c source for the program that generates the
error model for the sequence simulator. This program utilizes a subset
of the kent source library (http://hgwdev.cse.ucsc.edu/~kent/src/) which
is freely available for both commercial and non-commercial use
(see README in "ThirdParty/kent" for more information)

examples contains some things to run this on. Right now there is
an example mitochondrial sequence, and an example error profile
from a 100 bp illumina GAIIx dataset. You can simulate any read
less than or equal to 100bp in length using that error profile.

usage: java -jar -Xmx2048m SimSeq.jar [required options] [options]
Last Updated: 4.12.2011
-1,--read1_length <arg> Integer length of first read. Default:
100
-2,--read2_length <arg> Integer length of second read.
Default: 100
--adapter1 <arg> The first read illumina adapter
sequence to use when the insert size
is less than the read length.
Currently only works on paired end
simulation.
Defalt:AGATCGGAAGAGCGGTTCAGCAGGAATGCCG
AGACCG
--adapter2 <arg> The second read illumina adapter
sequence to use when the insert size
is less than the read length.
Currently only works on paired end
simulation.
Defalt:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
GT
-d,--dip <arg> If diploid data desired, path to
diploid file. (format: chrom pos(0
based) altChar
--debug Write debug info to stderr.
-e,--error <arg> If simulated read error desired, path
to read error file.
--error2 <arg> If you desire a seperate error
distribution to be applied to the
second read, then provide a path to
that error profile with this option
-h,--help Print this usage message.
--inf_id Output location information in the
read ID. Currently this is only
informative for paired-end data and
note mate paired.
-l,--insert_size <arg> mean library insert size for either
mate-paired or paired-end. Default:
200
-m,--mate_pair Perform mate-pair rather than paired
end run.
--mate_frag <arg> If using a mate-pair library, what is
your desired loop fragmentation size?
Default: 500
--mate_frag_stdev <arg> If using a mate-pair library, what is
your desired loop fragmentation size
standard deviation? Default: 50
--mate_pulldown_error_p <arg> If using a mate-pair library, what is
the probability that a read does not
include the biotin marker? Default:
0.3
-n,--read_number <arg> Integer number of reads you would like
to sample. Default: 1000000
-o,--out <arg> Filename for output sam file
(REQUIRED)
-p,--read_prefix <arg> Prefix for simulated reads. Default:
SimSeq_
--phred64 Output phred+64 quality string rather
than phred+33
-r,--reference <arg> Reference genome sequence in
uncompressed fasta format (REQUIRED)
-s,--insert_stdev <arg> mean library insert stdev for either
mate-paired or paired-end. Default: 20
-u,--duplicate_probability <arg> probability of generating a duplicate.
Default: 0.0

===Quick Start Example:===
==Prerequisits==
0.0. Download and install samtools: http://samtools.sourceforge.net/ and add executable to your $PATH
0.1. Download latest Picard jar file: http://sourceforge.net/projects/picard/files/, and extract .jar files somewhere that makes sense like $HOME/jars/, for the rest of this I will assume that all jar files are located in $HOME/jars/
==Generate and check simulated reads==
1. Download SimSeq.jar, error_profile.txt, and AlligatorMito.fa
2. put SImSeq.jar in your jar directory ($HOME/jars/ or something) and put error_profile.txt and AlligatorMito.fa in a folder where you want to work
3. Run `java -jar -Xmx2048m $HOME/jars/SimSeq.jar -1 100 -2 100 \
--error hiseq_mito_default_bwa_mapping_mq10_1.txt \
--error2 hiseq_mito_default_bwa_mapping_mq10_2.txt \
--insert_size 3000 --insert_stdev 300 \
--mate_pair --mate_frag 500 --mate_frag_stdev 50 \
--mate_pulldown_error_p 0.3 --read_number 10000 \
--read_prefix AMito_ --reference AlligatorMito.fa \
--duplicate_probability 0.01 --out out.sam`
4. Convert sam output to bam: `samtools view -bS -T AlligatorMito.fa -t AlligatorMito.size -o out.bam out.sam`
5. Sort the bam file: `samtools sort out.bam out.sorted`
6. Index the sorted bam file: `samtools index out.sorted.bam`
7. Check the alignment in tview: `samtools tview out.sorted.bam AlligatorMito.fa`
==Convert Sam output to Fastq==
run `java -jar -Xmx2048 $HOME/jars/SamToFastq.jar INPUT=out.sorted.bam FASTQ=library.1.fastq SECOND_END_FASTQ=library.2.fastq INCLUDE_NON_PF_READS=true VALIDATION_STRINGENCY=SILENT`

Note: setting the VALIDATION_STRINGENCY=SILENT is probably not ideal, but for now it is necessary because samtools isn't putting the length of the reference sequence into the header, which is expected by Picard. You get an error about the first read having a coordinate outside of the range of the reference sequence which has a length of 0. If you use Picard to somehow fix the header given the reference sequence this should work without telling it not to validate your sequence.

Also Note: The current sam alignment for a chimeric read shows the fragment starting at the left-most coordinate. Since the rest of the sequence occurs before the left most coordinate I use a soft clipping for the remainder of the sequence. This does not effect the output fastq file generated by the above method, but when you are looking at the alignment in samtools tview you will see some sequences as short as 1 nucleotide. This is normal, and doesn't mean that the actual sam records are that short.

When producing chimeric reads, samtools doesn't currently allow you to view the backwards jumps that occur. Doing so would involve a '-xN' operation on a cigar string. I added a custom tag ('YC:Z:[CIGAR]') at the end of each sam string that shows either a normal cigar string if the read can be represented with one, or a custom cigar string that allows for negative jumps needed to represent chimeric reads if the read is chimeric. Here is an example of what a chimeric read's tag would look like `YC:Z:97M-4996N3M` while a non-chimeric read would have a normal cigar string in its tag like: `YC:Z:100M`. To reconstruct a chimeric read, you start at the start coordinate for the read which corresponds to the left most coordinate, then you proceed for the number of matches 'M', then you jump the specified number of positions from your current location, in this case -4996, and finally proceed with the remaining CIGAR string. Right now since I don't add indels to the data, this string will either be a match or one of these chimeric jumps.