ECTyper (an easy typer)

ECTyper is a standalone, versatile serotyping module for Escherichia coli. It supports both FASTA (assembled) and FASTQ (raw reads) file formats. The tool couples species identification with a quality control module to produce a complete, transparent E. coli serotyping report suitable for reference laboratories.

Dependencies:

python >= 3.5
bcftools >= 1.8
blast == 2.7.1
seqtk >= 1.2
samtools >= 1.8
bowtie2 >= 2.3.4.1
mash >= 2.0

Python packages:

biopython >= 1.70
pandas >= 0.23.1
requests >= 2.0
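
Before installing from source, it can help to confirm that the external tools listed above are reachable on the PATH. The snippet below is a minimal sketch using only the Python standard library. The executable names are assumptions about a typical installation (for example, BLAST is checked through `blastn`), and `find_missing_tools` is a hypothetical helper, not part of ECTyper. It checks presence only, not versions.

```python
import shutil

# Executable names are assumptions; BLAST is checked through `blastn`.
REQUIRED_TOOLS = ["bcftools", "blastn", "seqtk", "samtools", "bowtie2", "mash"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external dependencies were found on the PATH.")
```

The `which` argument is injectable so the check can be exercised without the tools being installed.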

Installation

Option 1: As a conda package

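If you do not have a conda environment, download the Miniconda (or Anaconda) installer and install it. The commands below assume the Miniconda installer was saved as miniconda.sh:

bash miniconda.sh -b -p $HOME/miniconda
echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc
source ~/.bashrc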

Install the conda package from the bioconda channel:

conda install -c bioconda ectyper

Option 2: From the source directly

Alternatively, install ECTyper from the source code.

Install the dependencies. On Ubuntu, run:

apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk

Install the Python dependencies via pip:

pip3 install pandas biopython requests

Clone the repository and, optionally, check out a particular release (e.g. v1.0.0):

git clone https://github.com/phac-nml/ecoli_serotyping.git
cd ecoli_serotyping
git checkout v1.0.0 # optional: check out a release version

Install ectyper:

python3 setup.py install

Basic Usage

Put the FASTA/FASTQ files for serotyping in one folder (concatenate paired raw read files if you would like them to be treated as a single entity).
Run ectyper -i [file path] -o [output_dir]
View the results on the console or with cat [output_dir]/output.tsv

Example Usage

ectyper -i ecoliA.fasta for a single file
ectyper -i ecoliA.fasta -o output_dir for a single file, results stored in output_dir
ectyper -i ecoliA.fasta,ecoliB.fastq,ecoliC.fna for multiple files (comma-delimited)
ectyper -i ecoli_folder for a folder (all files in the folder will be checked by the tool)
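
When ECTyper is called from a script rather than typed by hand, the same invocations can be assembled programmatically. The following is a sketch, assuming `ectyper` is installed and on the PATH. `build_ectyper_command` and `run_ectyper` are hypothetical helper names, and only flags documented in this README (`-i`, `-o`, `-c`, `--verify`) are used.

```python
import subprocess


def build_ectyper_command(inputs, output_dir=None, cores=None, verify=False):
    """Build an argument list for the ectyper CLI from flags shown in this README."""
    if not inputs:
        raise ValueError("at least one input file or folder is required")
    # Multiple inputs are passed as one comma-delimited value to -i.
    command = ["ectyper", "-i", ",".join(inputs)]
    if output_dir is not None:
        command += ["-o", output_dir]
    if cores is not None:
        command += ["-c", str(cores)]
    if verify:
        command.append("--verify")
    return command


def run_ectyper(inputs, **options):
    """Run ectyper; raises subprocess.CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_ectyper_command(inputs, **options), check=True)
```

For example, `build_ectyper_command(["ecoliA.fasta", "ecoliB.fastq"], output_dir="output_dir")` yields the argument list for the comma-delimited multi-file form shown above.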

Advanced Usage

usage: ectyper [-h] [-V] -i INPUT [-c CORES] [-opid PERCENTIDENTITYOTYPE] [-hpid PERCENTIDENTITYHTYPE] [-opcov PERCENTCOVERAGEOTYPE] [-hpcov PERCENTCOVERAGEHTYPE] [--verify] [-o OUTPUT] [-r REFSEQ] [-s] [--debug] [--dbpath DBPATH]

ectyper v1.0.0 database v1.0 Prediction of Escherichia coli serotype from raw reads or assembled genome sequences. The default settings are recommended.

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -i INPUT, --input INPUT
                        Location of E. coli genome file(s). Can be a single file, a comma-separated list of files, or a directory
  -c CORES, --cores CORES
                        The number of cores to run ectyper with
  -opid PERCENTIDENTITYOTYPE, --percentIdentityOtype PERCENTIDENTITYOTYPE
                        Percent identity required for an O antigen allele match [default 90]
  -hpid PERCENTIDENTITYHTYPE, --percentIdentityHtype PERCENTIDENTITYHTYPE
                        Percent identity required for an H antigen allele match [default 95]
  -opcov PERCENTCOVERAGEOTYPE, --percentCoverageOtype PERCENTCOVERAGEOTYPE
                        Minumum percent coverage required for an O antigen allele match [default 90]
  -hpcov PERCENTCOVERAGEHTYPE, --percentCoverageHtype PERCENTCOVERAGEHTYPE
                        Minumum percent coverage required for an H antigen allele match [default 50]
  --verify              Enable E. coli species verification
  -o OUTPUT, --output OUTPUT
                        Directory location of output files
  -r REFSEQ, --refseq REFSEQ
                        Location of pre-computed MASH RefSeq sketch. If provided, genomes identified as non-E. coli will have their species identified using MASH. For best results the pre-sketched RefSeq archive https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh is recommended
  -s, --sequence        Prints the allele sequences if enabled as the final columns of the output
  --debug               Print more detailed log including debug messages
  --dbpath DBPATH       Path to a custom database of O and H antigen alleles in JSON format. Check Data/ectyper_database.json for more information

Fine-tuning parameters

ECTyper requires only minimal options to run (-i, usually together with -o) but allows extensive configuration to accommodate a wide variety of typing scenarios.

Parameter	Explanation	Usage scenario
`-opid`	Specify minimum `%identity` threshold just for O antigen match	Poor coverage of O antigen genes or for exploratory work (recommended value is 90)
`-opcov`	Minimum `%coverage` threshold for a valid match against reference O antigen alleles	Poor coverage of O antigen genes and a user wants to get an O antigen call regardless (recommended value is 95)
`-hpid`	Specify minimum `%identity` threshold just for H antigen match	Poor coverage of H antigen genes or for exploratory work (recommended value is 95)
`-hpcov`	Minimum `%coverage` threshold for a valid match against reference H antigen alleles	Poor coverage of H antigen genes and a user wants to get an H antigen call regardless (recommended value is 95)
`--verify`	Verify species of the input and run the QC module, providing information on the reliability of the result and any typing issues	User is not sure whether the sample is E.coli and wants to know whether the serotype prediction is of sufficient quality for reporting purposes
`-r`	Specify custom MASH sketch of reference genomes that will be used for species inference	User has a newly assembled genome that is not available in the NCBI RefSeq database. Make sure to add metadata to `assembly_summary_refseq.txt` and provide a custom accession number that starts with the `GCF_` prefix
`--dbpath`	Provide custom appended database of O and H antigen reference alleles in JSON format following structure and field names as default database `ectyper_alleles_db.json`	User wants to add new alleles to the alleles database to improve typing performance
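
To make the effect of the identity and coverage thresholds concrete, here is an illustrative sketch of how a candidate allele hit could be accepted or rejected. It is not ECTyper's actual implementation: `hit_passes` is a hypothetical function, and the only values taken from the source are the defaults printed in the help text (`-opid 90`, `-opcov 90`, `-hpid 95`, `-hpcov 50`). The sketch assumes a hit must satisfy both thresholds for its antigen.

```python
# Defaults taken from the ectyper help text; the logic itself is an assumption.
DEFAULT_THRESHOLDS = {
    "O": {"identity": 90.0, "coverage": 90.0},  # -opid / -opcov
    "H": {"identity": 95.0, "coverage": 50.0},  # -hpid / -hpcov
}


def hit_passes(antigen, identity, coverage, thresholds=DEFAULT_THRESHOLDS):
    """Return True if a hit meets both minimum thresholds for its antigen type."""
    limits = thresholds[antigen]
    return identity >= limits["identity"] and coverage >= limits["coverage"]


# An O antigen hit with high identity but only 85% coverage is rejected
# under the defaults, and accepted once the coverage threshold is lowered.
relaxed = {**DEFAULT_THRESHOLDS, "O": {"identity": 90.0, "coverage": 80.0}}
print(hit_passes("O", 99.9, 85.0))           # False
print(hit_passes("O", 99.9, 85.0, relaxed))  # True
```

Lowering a threshold in this way trades specificity for sensitivity, which is why the relaxed settings are described above as suited to poor coverage or exploratory work.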

Quality Control (QC) module

To make the results and typing metrics easier to interpret, the following QC codes were developed. These codes make it possible to quickly separate "reportable" from "non-reportable" samples. The QC module is tightly linked to the ECTyper allele database, specifically to the MinPident and MinPcov fields. For each reference allele, minimum %identity and %coverage values were determined as a function of potential "cross-talk" between antigens (i.e. multiple potential antigen calls at a given setting). The QC module covers the following serotyping scenarios. More scenarios might be added in future versions depending on user needs.

QC flag	Explanation
PASS (REPORTABLE)	Both O and H antigen alleles meet min `%identity` or `%coverage` thresholds (ensuring no antigen cross-talk) and single antigen predicted for O and H
FAIL (-:- TYPING)	Sample is E.coli and O and H antigens are not typed. Serotype: -:-
WARNING MIXED O-TYPE	A mixed O antigen call is predicted requiring wet-lab confirmation
WARNING (WRONG SPECIES)	A sample is non-E.coli (e.g. E.albertii, Shigella, etc.) based on RefSeq assemblies
WARNING (-:H TYPING)	A sample is E.coli and O antigen is not predicted (e.g. -:H18)
WARNING (O:- TYPING)	A sample is E.coli and H antigen is not predicted (e.g. O17:-)
WARNING (O NON-REPORT)	O antigen alleles do not meet min %identity or %coverage thresholds
WARNING (H NON-REPORT)	H antigen alleles do not meet min %identity or %coverage thresholds
WARNING (O and H NON-REPORT)	Both O and H antigen alleles do not meet min %identity or %coverage thresholds
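
The QC flags lend themselves to simple automated filtering of a report. The sketch below uses only the Python standard library and assumes a tab-delimited report with `Name` and `QC` columns as described in this README. `split_by_qc` is a hypothetical helper, and the two sample rows are made up purely for illustration.

```python
import csv
import io

REPORTABLE_FLAG = "PASS (REPORTABLE)"


def split_by_qc(handle):
    """Split report rows into (reportable, needs_review) lists of sample names."""
    reportable, needs_review = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        target = reportable if row["QC"] == REPORTABLE_FLAG else needs_review
        target.append(row["Name"])
    return reportable, needs_review


# Tiny in-memory report with made-up sample names, for illustration only.
example = io.StringIO(
    "Name\tSerotype\tQC\n"
    "sampleA\tO157:H7\tPASS (REPORTABLE)\n"
    "sampleB\t-:H18\tWARNING (-:H TYPING)\n"
)
print(split_by_qc(example))  # (['sampleA'], ['sampleB'])
```

With a real report, the same function can be given an open file handle, for example `open("output_dir/output.tsv", newline="")`, where the path is an assumption about where the results were written.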

Report format

ECTyper produces a concise report designed for easy interpretation and reporting. ECTyper v1.0 serotyping results are available in a tab-delimited output.tsv file consisting of the 16 columns listed below:

Name: Sample name (usually a unique identifier)
Species: Species identification, useful for detecting inadvertent sample contamination or mislabelling events
O-type: O antigen
H-type: H antigen
Serotype: Predicted O and H antigen(s)
QC: The Quality Control value summarizing the overall quality of prediction
Evidence: The total number of alleles used to call the O and H antigens
GeneScores: ECTyper O and H antigen gene scores in 0 to 1 range
AllelesKeys: Best matching ECTyper database allele keys used to call the serotype
GeneIdentities(%): %identity values of the query alleles
GeneCoverages(%): %coverage values of the query alleles
GeneContigNames: the contig names where the query alleles were found
GeneRanges: genomic coordinates of the query alleles
GeneLengths: allele lengths of the query alleles
Database: database release version and date
Warnings: any additional warnings linked to the quality control status or any other error message(s).

Selected columns from the ECTyper typical report are shown below.

Name	Species	Serotype	Evidence	QC	GeneScores	AlleleKeys	GeneIdentities(%)	GeneCoverages(%)	GeneContigNames	GeneRanges	GeneLengths	Database	Warnings
15-520	Escherichia coli	O174:H21	Based on 3 allele(s)	PASS (REPORTABLE)	wzx:1; wzy:1; fliC:1;	O104-5-wzx-origin;O104-13-wzy;H7-6-fliC-origin;	100;100;100;	100;100;100;	contig00049;contig00001;contig00019;	22302-23492;178-1290;6507-8264;	1191;1113;1758;	v1.0 (2020-05-07)	-
EC20151709	Escherichia coli	O157:H43	Based on 3 allele(s)	PASS (REPORTABLE)	wzx:1;wzy:0.999;fliC:1	O157-5-wzx-origin;O157-9-wzy-origin;H43-1-fliC-origin;	100;99.916;99.934;	100;100;100;	contig00002;contig00002;contig00003;	62558-63949;64651-65835;59962-61467;	1392;1185;1506;	v1.0 (2020-05-07)	-
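
Several report columns pack one value per gene into a semicolon-delimited string. The following sketch shows one way to unpack them. It assumes the items in `GeneScores`, `AlleleKeys`, `GeneIdentities(%)` and `GeneCoverages(%)` line up position by position, as they appear to in the example rows above, and it uses the column names from the example header. All function names are hypothetical.

```python
def split_field(value):
    """Split a semicolon-delimited report field, ignoring empty items."""
    return [item.strip() for item in value.split(";") if item.strip()]


def parse_gene_scores(value):
    """Turn 'wzx:1;wzy:0.999;fliC:1' into a list of (gene, score) pairs."""
    pairs = []
    for item in split_field(value):
        gene, score = item.rsplit(":", 1)
        pairs.append((gene.strip(), float(score)))
    return pairs


def alleles_per_gene(row):
    """Pair each gene with its allele key, %identity and %coverage.

    Assumes the fields are positionally aligned; zip() silently truncates
    to the shortest field if they are not.
    """
    genes = [gene for gene, _ in parse_gene_scores(row["GeneScores"])]
    keys = split_field(row["AlleleKeys"])
    identities = [float(v) for v in split_field(row["GeneIdentities(%)"])]
    coverages = [float(v) for v in split_field(row["GeneCoverages(%)"])]
    return list(zip(genes, keys, identities, coverages))
```

The trailing semicolons and the spaces after separators seen in the first example row are both tolerated by `split_field`.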

Availability

Resource	Description	Type
PyPI	PyPI package that can be installed via the `pip` utility	Terminal
Conda	Conda package available from BioConda channel	Terminal
Docker	Images containing completely initialized ECTyper with all dependencies	Terminal
Singularity	Images containing completely initialized ECTyper with all dependencies	Terminal
GitHub	Install source code as any Python package	Terminal
Galaxy ToolShed	Galaxy wrapper available for installation on a private/public instance	Web-based
Galaxy Europe	Galaxy public server to execute your analysis from anywhere	Web-based
IRIDA plugin	Pipeline plugin that can be installed on IRIDA instances	Web-based
