ADEPT, a dynamic next generation sequencing data error-detection program with trimming

======= ADEPT is a program that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution for the dataset being analyzed.

PREREQUISITES

The main program is developed in Perl v 5.8.8.
Parallel::ForkManager module from CPAN
(http://search.cpan.org/~dlux/Parallel-ForkManager-0.7.9/lib/Parallel/ForkManager.pm)
String::Approx module from CPAN
(http://search.cpan.org/~jhi/String-Approx-3.27/Approx.pm)
R for ploting
(http://www.r-project.org/)

BASIC USAGE

Trimming by quality 5 and filtering reads with any ambiguous base or low complexity.

$ perl ADEPT.pl -p 'reads1.fastq reads2.fastq' -d out_directory

Full USAGE

 perl ADEPT.pl [options] [-u unpaired.fastq] -p 'reads1.fastq reads2.fastq' -d out_directory
 
 Input File: (can use more than once fastq file)
        -u             Unpaired reads
        
        -p             Paired reads in two files and separate by space in quote
 Trim:
        -qE            5" and 3" ends triming # as quality level (0-40) (default 5) for trimming
        -qC            threhold to call a base to be correct (0-1.0) (default = 0.25, higher quality
                       than 25% the nucleotides at that position within the sampled run )
        -qW            threhold to identifying a nucleotide as an error if it falls below a defined 
                       percentage of the quality scores for that position (0-1.0) (default = 0)
        -qMN           ratio of the of the base quality to the qualities of upstream and downstream positions (0-1.0)
                       By default, all qIN ratios must be at least 0.4 to be considered as a potential erroneous base 
                       (i.e. all adjacent qualities must be at least 2.5 times higher than the quality of the position being investigated).
        -qNS           threhold to identify a nucleotide as an potential error if its neighbors' quality falls below a defined 
                       percentage of the quality scores for that neighbors' position within the sampled run (0-1.0) (default = 0.3)

 Filters:
        -min_L         Trimmed sequence length will have at least minimum length (default:50)
        

        					
 Q_Format:
        -ascii         Encoding type: 33 or 64 or autoCheck (default)
                       Type of ASCII encoding: 33 (standard) or 64 (illumina 1.3+)

        -out_ascii     Output encoding. (default: 33)
 Output:
        -prefix        Output file prefix. (default: QC)

        -stats         Statistical numbers output file (default: prefix.stats.txt)

        -d             Output directory.
 Options:
        -t             # of CPUs to run the script (default:2 )

        -split_size    Split the input file into several sub files by sequence number (default: 1000000) 

        -out_non_trim_reads      <bool> Output not trimmed reads to prefix.discard.fastq (default: 0, not output)

        -debug         keep intermediate files

VERSION HISTORY

======== Version 1.1 Stable function release. Features:

assesses errors within reads by comparing position-specific and neighboring ba se quality scores with the distribution for the dataset being analyzed.
autocheck quality encoding and quality encoding coversion
multi-threads (required Parallel::ForkManager)
input paired end reads aware

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

PREREQUISITES

BASIC USAGE

Full USAGE

VERSION HISTORY

CITATION

Files

README.md

Latest commit

History

README.md

File metadata and controls

PREREQUISITES

BASIC USAGE

Full USAGE

VERSION HISTORY

CITATION