- Repository containing workflows for IMP3 downstream analyses
- Related project(s): NOMIS
# install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod u+x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh # follow the instructions
Getting the repository including sub-modules
git clone --recurse-submodules ssh://[email protected]:8022/susheel.busi/nomis_pipeline.git
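If the repository was cloned without `--recurse-submodules`, the submodule folders will be empty. The following is a minimal sketch for detecting that case; the function name `has_uninitialised_submodules` is hypothetical and not part of this repository. It relies on `git submodule status` prefixing uninitialised submodules with `-`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): reads the output of
# `git submodule status` on stdin and succeeds if at least one submodule
# is uninitialised, i.e. its status line starts with "-".
has_uninitialised_submodules() {
    grep -q '^-'
}

# Possible usage from inside the cloned repository:
#   if git submodule status --recursive | has_uninitialised_submodules; then
#       git submodule update --init --recursive
#   fi
```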
Create the main snakemake environment
# create venv
conda env create -f requirements.yml -n "snakemake"
Successful completion requires tools created by others.
Notes:
- Dependencies are included as submodules where possible
- However, installation issues may persist
- If so, check the respective repositories listed
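The workflow can be launched using one of the following options. Note that the -n flag in the snakemake commands makes them dry-runs; remove it to execute the workflow.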
./config/sbatch.sh
(or)
CORES=48 snakemake -s workflow/Snakefile --configfile config/config.yaml --use-conda --conda-prefix ${CONDA_PREFIX}/pipeline --cores $CORES -rpn
(or)
Note: For running on esb-compute-01 or litcrit, adjust CORES as needed to prevent MANTIS from spawning too many workers, and launch as below
CORES=24 snakemake -s workflow/Snakefile --configfile config/config.yaml --use-conda --conda-prefix ${CONDA_PREFIX}/pipeline --cores $CORES -rpn
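The host-specific core counts above could also be selected automatically. This is a minimal sketch, not part of the repository: the function name `pick_cores` is hypothetical, and matching on the host name reported by `uname -n` is an assumption about how these machines identify themselves.

```shell
#!/bin/sh
# Hypothetical helper: choose a core count from the host name.
# 24 and 48 are the values used in the commands above; the host-name
# patterns are assumptions and may need adjusting.
pick_cores() {
    case "$1" in
        esb-compute-01*|litcrit*) echo 24 ;;
        *) echo 48 ;;
    esac
}

# uname -n prints the network node (host) name
CORES=$(pick_cores "$(uname -n)")
echo "Using ${CORES} cores"
```

The resulting `CORES` value can then be passed to `--cores` in the snakemake command shown above.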
All config files are stored in the folder config/.
- imp workflow: sets up the folders required for running IMP3 on each sample
- viruses workflow: runs VIBRANT and vCONTACT2 on assemblies, including CheckV on vibrant_output
- eukaryotes workflow: runs EUKUlele on assemblies
- bins workflow: collects all bins together for taxonomy analyses
- taxonomy workflow: runs GTDBtk and CheckM on the bins
- functions workflow: runs METABOLIC, MAGICCAVE and FUNCS analyses
- mantis workflow: runs MANTIS on the bins explicitly
- euk_bin workflow: performs coassembly specifically for eukaryotes (EukRep) and runs binning with CONCOCT
- coassembly_binning: performs coassembly for all samples and subsequent binning
- misc workflow: runs gRodon and antismash. To be implemented: PopCOGent, and potentially anvi'o coassembly/binning.
Relevant parameters that have to be changed are listed for each workflow and config file. Parameters defining system-relevant settings (e.g. the number of threads used by certain tools) are not listed but should also be changed if required.
The workflow is set up in multiple steps. Prior to running, change the following:
- config: config/
  - config.yaml:
    - change steps
Options:
- imp
- viruses
- eukaryotes
- bins
- taxonomy
- functions
- mantis
- euk_bin
- coassembly_binning
- misc
IMPORTANT NOTE: run only the imp step first, then launch IMP3 outside of this pipeline. The other steps can be run afterwards.
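Because a misspelled step name is easy to miss, the names listed under Options can be checked before editing config.yaml. The sketch below is illustrative only: `is_valid_step` is a hypothetical helper, and how the pipeline itself validates the steps setting is not shown in this document.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the argument is one of the
# step names listed under Options above.
is_valid_step() {
    case "$1" in
        imp|viruses|eukaryotes|bins|taxonomy|functions|mantis|euk_bin|coassembly_binning|misc)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Example: "binning" is deliberately not a valid step name
for step in imp taxonomy binning; do
    if is_valid_step "$step"; then
        echo "$step: ok"
    else
        echo "$step: not a known step"
    fi
done
```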
Per-sample IMP3 can be launched as follows:
chmod -R 775 ${SAMPLE} # adding permissions
cd ${SAMPLE}
sbatch ./launchIMP.sh # on IRIS
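When many samples need to be launched, the three commands above can be generated per sample. This is a minimal sketch under stated assumptions: `print_imp_launch` is a hypothetical helper, the sample names are made up, and sample directory names are assumed to contain no spaces. It only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the per-sample launch commands shown above
# for each sample directory passed as an argument.
# Nothing is executed; pipe the output to `sh` to actually run it.
print_imp_launch() {
    for sample in "$@"; do
        printf 'chmod -R 775 %s\n' "$sample"
        # the subshell keeps the cd from affecting later samples
        printf '(cd %s && sbatch ./launchIMP.sh)\n' "$sample"
    done
}

# Example with two made-up sample names
print_imp_launch sampleA sampleB
```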
Download raw data required for the analysis.
- config: config/
  - config.yaml:
    - change work_dir
  - sbatch.sh
    - change SMK_ENV
    - if not using slurm to submit jobs, remove --cluster-config, --cluster from the snakemake CMD
  - slurm.yaml (only relevant if using slurm for job submission)
- workflow: workflow/
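The effect of removing --cluster-config and --cluster from the snakemake CMD can be illustrated as follows. This is only a sketch: the actual contents of sbatch.sh are not shown in this document, `build_cmd` is a hypothetical function, and the plain `sbatch` submission string is a placeholder for whatever options the real script passes.

```shell
#!/bin/sh
# Hypothetical sketch of how the snakemake CMD might be assembled.
# The real config/sbatch.sh may differ; "sbatch" below is a placeholder
# for the actual submission string.
build_cmd() {
    use_slurm=$1
    cmd="snakemake -s workflow/Snakefile --configfile config/config.yaml --use-conda"
    if [ "$use_slurm" = "yes" ]; then
        # cluster options are only added when jobs are submitted via slurm
        cmd="$cmd --cluster-config config/slurm.yaml --cluster sbatch"
    fi
    echo "$cmd"
}

build_cmd yes   # with slurm submission
build_cmd no    # local execution, cluster options removed
```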
Prior to running the imp workflow, make the following adjustments.
- IMP_config.yaml: workflow/notes/IMP_config.yaml
  - change Metagenomics
- run_threads: workflow/notes/runIMP.sh
  - change: threads
- launch_threads: workflow/notes/runIMP.sh
  - change: -n8
IMPORTANT NOTE: The above workflow should be run first, followed by launching IMP3 outside of this pipeline. The subsequent steps can then be run.
Main analysis workflow: given SR FASTQ files, run all the steps to generate required output. This includes:
- setting up folders for IMP
- viral and eukaryotic annotations
- functional analyses and
- taxonomic analyses (optional)
The workflow is run per sample and might require a couple of days to run, depending on the sample, the configuration used and the available computational resources. Note that the workflow will create additional output files not necessarily required to re-create the figures shown in the manuscript.
- config:
  - per sample
    - config/<sample>/config.yaml
      - change all path parameters (not all databases are required, see above)
    - config/<sample>/sbatch.yaml
      - change SMK_ENV
      - if not using slurm to submit jobs, remove --cluster-config, --cluster from the snakemake CMD
    - config/<sample>/slurm.yaml (only relevant if using slurm for job submission)
- workflow: workflow/
This workflow creates various summary files, plots and an HTML report for a sample using the output of the main workflow.
Note: How the metaP peptide/protein reports were generated from raw metaP data is described in notes/gdb_metap.md.
- config:
  - sample configs used for the main workflow
- workflow: workflow_report/
To execute this workflow for all samples:
./config/reports.sh "YourEnvName" "WhereToCreateCondEnvs"
Re-create figures (and tables) used in the manuscript. This workflow should only be run after running the main workflow and the report workflow for all samples.
- config: config/fig.yaml
  - change work_dir
  - change paths for all samples in samples
- workflow: workflow_figures/
conda activate "YourEnvName"
snakemake -s workflow_figures/Snakefile --cores 1 --configfile config/fig.yaml --use-conda --conda-prefix "WhereToCreateCondEnvs" -rpn # dry-run
Notes for manual/additional analyses done using the generated data.