Skip to content

Latest commit

 

History

History
259 lines (203 loc) · 13.2 KB

README.md

File metadata and controls

259 lines (203 loc) · 13.2 KB

Phylogenetic Tree Inference Nextflow Workflow

This script provides a convenient way to run a Nextflow workflow with various configurations and parameters.

Requirements

  • Nextflow DSL2
  • Python >= 3.10

Quick Start

Single inputs

python ete_build.py --mode local --input data/NUP62.aa.fa --output result/ --aligner mafft --trimmer trimal --tree_builder fasttree  

Multitple inputs, which will fetch all *fa, *faa, *fasta files.

python ete_build.py --mode local --input data/ --output result/ --aligner mafft --trimmer trimal --tree_builder fasttree

Arguments:

  • --mode: Execution mode. Choose between local and slurm. This is a required argument.
  • --input: Path to the input fasta file or directory. This is a required argument.
  • --output: Path to the output directory where results will be saved. This is a required argument.
  • --aligner: Alignment tool to use. Default is mafft.
  • --trimmer: Trimming tool to use. Default is trimal.
  • --tree_builder: Tree building tool to use. Default is fasttree.

SLURM-specific arguments:

If you choose slurm as the execution mode, you need to provide additional SLURM-specific arguments:

  • --slurm-partition: SLURM partition name. This is required if --mode is slurm.
  • --time: Time limit for SLURM jobs. Default is 1h.
  • --memory: Memory allocation for SLURM jobs. Default is 4GB.
  • --cpus: Number of CPUs for SLURM jobs. Default is 4.

Tools Overview

The pipeline utilizes several bioinformatics tools categorized into three main groups: aligners, trimmers, and tree builders. Below are the tools used, and you can check their versions with the conda list command.

1. Aligners

Aligners are used to align sequences before trimming or building trees. The following aligners are supported:

  • MAFFT: Multiple sequence alignment tool for amino acid or nucleotide sequences.
  • MUSCLE: Multiple sequence alignment with high accuracy and throughput.
  • T-Coffee: A sequence alignment package that computes multiple alignments of protein sequences.
  • Clustal Omega (Clustalo): A fast and scalable multiple sequence alignment program.
  • FAMSA: Fast and accurate multiple sequence alignment.

2. Trimmers

Trimmers are used to trim alignments, removing low-quality or ambiguous regions:

  • TrimAl: A tool for automated trimming of multiple sequence alignments.
  • ClipKit: A tool that enables automated selection of conserved sites in alignments.
  • TrimAlgV2: A custom alignment trimming tool in ./bin/.

3.Tree Builders

Tree builders are used to infer phylogenetic trees from sequence alignments:

  • FastTree: A tool to compute approximately maximum likelihood phylogenetic trees from alignments.
  • PhyML: A software package that estimates maximum likelihood phylogenies from nucleotide or amino acid sequences.
  • RAxML: A tool for maximum likelihood-based inference of large phylogenies.
  • IQ-TREE: A fast and effective stochastic algorithm for inferring maximum likelihood trees.
  • MrBayes: A program for Bayesian inference of phylogeny using Markov chain Monte Carlo methods.

Functionality

The script first generates a configuration file for Nextflow based on the provided execution mode and other parameters. It then runs the specified Nextflow workflow script (ete_build_dsl2.nf by default) with the given parameters.

The workflow processes the input fasta files to perform alignment, trimming, and tree building based on the specified tools.

Gene tree mode

Basic usage

In general, given the input fasta files the pipeline will output the alignment, trimmed alignment and phylogeneies in the given output folder. By default, program will generate phylogenetic tree for each given alignment. For example

  1. Build gene tree of NUP62
python ete_build.py --mode local --input data/NUP62.aa.fa --output result/ --aligner mafft --trimmer trimal --tree_builder fasttree  
  1. check result once it finished, we can check in the output folder, here is in result/:
ll -a result/
-rw-r--r--  1 deng huerta 12149 Sep 29 17:21 .nextflow.log
drwxr-xr-x  2 deng huerta     9 Sep 29 17:21 NUP62.aa-mafft-trimal-fasttree
-rw-r--r--  1 deng huerta   262 Sep 29 17:21 used_config.cfg
-rw-r--r--  1 deng huerta   171 Sep 29 17:21 used_config.json
drwxr-xr-x  7 deng huerta     5 Sep 29 17:21 work
  • .nextflow.log file contains detailed logs generated by Nextflow during the execution of the pipeline
  • used_config.cfg file contains the configurations used for the pipeline run, specifying the alignment, trimming, and tree-building tools with their associated parameters. It can be used to replicate the analysis or for transparency about which p arameters were applied during the run.
  • used_config.json This file is a JSON representation of the same configurations present in used_config.cfg. It provides a structured format that is easy to parse programmatically, documenting the tools and settings used in the pipeline run.
  • work/ intermediate files created during the execution of each process in the pipeline. It includes working data for alignment, trimming, and tree-building steps, as well as temporary files used by Nextflow.
  • NUP62.aa-mafft-trimal-fasttree/ output directory contains the output files related to the analysis of the NUP62 gene using the selected tools: MAFFT for alignment, Trimal for trimming, and FastTree for tree building. It holds all intermediate and final results of the pipeline for this gene, including alignment files, trimmed alignments, and the final phylogenetic tree.

Check output files to fetch target alignment NUP62.aa.aln.faa, trimmed alignment NUP62.aa.clean.alg.faa and tree NUP62.aa.output.tree

ls NUP62.aa-mafft-trimal-fasttree/
total 38
-rw-r--r-- 1 deng huerta  2830 Sep 29 17:21 align.err
-rw-r--r-- 1 deng huerta  1677 Sep 29 17:21 build.err
-rw-r--r-- 1 deng huerta 15526 Sep 29 17:21 NUP62.aa.aln.faa
-rw-r--r-- 1 deng huerta    38 Sep 29 17:21 NUP62.aa_aln_timing.log
-rw-r--r-- 1 deng huerta 14626 Sep 29 17:21 NUP62.aa.clean.alg.faa
-rw-r--r-- 1 deng huerta  1147 Sep 29 17:21 NUP62.aa.output.tree
-rw-r--r-- 1 deng huerta    39 Sep 29 17:21 NUP62.aa_treebuild_timing.log
-rw-r--r-- 1 deng huerta     0 Sep 29 17:21 trim.err
-rw-r--r-- 1 deng huerta     0 Sep 29 17:21 trim.out

Compose workflow with different tools

You can easily compose your own workflow by specifying the desired tools for each step. For instance, if you want to use muscle for alignment, clipkit for trimming, and iqtree for tree building, you would run:

python ete_build.py --mode local --input data/NUP62.aa.fa --output result/ --aligner muscle --trimmer clipkit --tree_builder iqtree  

check in result, now result folder will be stored in NUP62.aa-muscle-clipkit-iqtree/

ll result/
drwxr-xr-x  2 deng huerta  16 Sep 29 18:05 NUP62.aa-muscle-clipkit-iqtree
drwxr-xr-x 11 deng huerta   9 Sep 29 18:04 work
-rw-r--r--  1 deng huerta 313 Sep 29 18:03 used_config.cfg
-rw-r--r--  1 deng huerta 226 Sep 29 18:03 used_config.json
drwxr-xr-x  2 deng huerta   9 Sep 29 17:21 NUP62.aa-mafft-trimal-fasttree

Compose workflow with different tools with custom configuration

In addition to specifying aligners, trimmers, and tree builders on the command line, you can also customize your workflow further by providing a configuration file, such as nf_build.cfg. This file allows you to define tool-specific parameters and options, giving you more control over how each tool operates within your workflow.

Customizing Workflow Using nf_build.cfg

The nf_build.cfg file contains settings for each tool used in the workflow. These settings override default values, allowing you to fine-tune the behavior of the tools (e.g., alignment precision, trimming thresholds, or tree-building models).

Example nf_build.cfg Format

Here is an example of how the configuration file (nf_build.cfg) may look:

[mafft_linsi]
_desc = 'Mafft with linsi method'
_app = mafft
op = 1.53  # Gap opening penalty at group-to-group alignment. Default: 1.53
ep = 0.123 # Offset value, which works like gap extension penalty, for group-to-group alignment. Default: 0.123
maxiterate = 1000 # number cycles of iterative refinement are performed. Default: 0
matrix = "" # "", blosum or pam
blosum_coefficient = 62 # Default: 62
pam_coefficient = 80
localpair = True 
retree = 2 # Number of iterative refinement is performed. Default: 2

[trimal_default]
_desc = 'trimal alignment cleaning with default parameters'
_app = trimal
gt = 0.01

[fasttree_full]
_desc = 'Fasttree with full mode'
_app = fasttree
aa_model = JTT # Evolutionary model LG, WAG or JTT for amino acid sequences
nt_model = JC # Evolutionary model GTR or JC for nucleotide sequences
gamma = False # Non-uniformity of evolutionary rates among sites may be modeled by using a discrete Gamma distribution
bootstrap = 1000 # Number of bootstrap replicates, 0 to disable support values
pseudo = True # Pseudo-likelihood support values
spr = 4 # the number of rounds of minimum-evolution SPR moves
mlacc = 2 # the number of rate categories for the ML model of rate heterogeneity
slownni = True # Use slow NNI moves
....

Using the Custom Configuration File

To apply a custom configuration in your workflow, you can pass the path to the configuration file (nf_build.cfg) using the --config argument. The pipeline will then apply the settings specified in this file for the selected tools. Noted if you using custom config, the tool's name should be the same as written in [] from nf_build.cfg

python ete_build.py --mode local --input data/NUP62.aa.fa --output result/ --aligner mafft_linsi --trimmer trimal_default --tree_builder fasttree_full  --config nf_build.cfg

Species Tree mode

Species Tree Mode

The workflow can also be configured to infer a species tree from individual gene alignments using a supermatrix approach or a coalescent-based method, such as ASTRAL. The species tree mode allows you to summarize phylogenetic relationships across multiple genes into a single species tree, providing a comprehensive view of evolutionary relationships.

Enabling Species Tree Mode

To enable species tree mode, you can use the --supermatrix_mode or --coalescent_mode flags depending on the approach you wish to take:

  1. Supermatrix Mode: This method concatenates all individual gene alignments into a supermatrix, which is then used to build the species tree.
  2. Coalescent Mode: This method constructs individual gene trees and then combines them to infer the species tree using a coalescent-based approach (e.g., ASTRAL).

Running Supermatrix Mode

In supermatrix mode, the workflow concatenates multiple gene alignments into a single large alignment file (supermatrix) before performing tree inference. To enable this mode, you need to provide a list of target species in a separate file. Here we use example in data/sp_tree

ls data/sp_tree
-rw-r--r-- 1 deng huerta     8401 Sep 19 16:36 gene1.fa
-rw-r--r-- 1 deng huerta     6147 Sep 19 16:36 gene2.fa
-rw-r--r-- 1 deng huerta     5002 Sep 19 16:36 gene3.fa
-rw-r--r-- 1 deng huerta     7219 Sep 19 16:36 gene4.fa
-rw-r--r-- 1 deng huerta     3671 Sep 19 16:36 gene5.fa
-rw-r--r-- 1 deng huerta       62 Sep 19 16:36 target_species.txt

grep ">" data/sp_tree/gene1.fa|head -5
>1073090_gene1
>105351_gene1
>1073089_gene1
>1034303_gene1
>1034304_gene1

head data/sp_tree/target_species.txt
1073090
105351
1073089
1034303

Command Example:

python ete_build.py --mode local --input data/sp_tree/ --output result/ --aligner mafft --trimmer none --tree_builder fasttree --supermatrix --target-species data/sp_tree/target_species.txt --spname-delimiter _ --spname-field 0

check result of each gene and the final species tree from supermatrix

ll result/
drwxr-xr-x  2 deng huerta   4 Sep 29 19:44 gene1-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   4 Sep 29 19:44 gene2-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   4 Sep 29 19:44 gene3-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   4 Sep 29 19:44 gene4-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   4 Sep 29 19:44 gene5-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   5 Sep 29 19:44 supermatrix-mafft-none-fasttree
-rw-r--r--  1 deng huerta 182 Sep 29 19:44 used_config.cfg
-rw-r--r--  1 deng huerta 123 Sep 29 19:44 used_config.json
drwxr-xr-x 77 deng huerta  75 Sep 29 19:44 work

Running Coalescent Mode

In coalescent mode, individual gene trees are first built, and then a species tree is inferred using the coalescent-based method (e.g., ASTRAL). This mode is particularly useful when different genes may have conflicting phylogenies, as it attempts to infer the species tree by accounting for such conflicts.

python ete_build.py --mode local --input data/sp_tree/ --output result/ --aligner mafft --trimmer none --tree_builder fasttree --spname-delimiter _ --spname-field 0  --coalescent 

show result

ll result/
total 6
drwxr-xr-x  2 deng huerta   3 Sep 29 19:56 astral-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   7 Sep 29 19:56 gene1-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   7 Sep 29 19:56 gene2-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   7 Sep 29 19:56 gene3-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   7 Sep 29 19:56 gene4-mafft-none-fasttree
drwxr-xr-x  2 deng huerta   7 Sep 29 19:56 gene5-mafft-none-fasttree