Merge pull request #753 from d4straub/add-ancombc

Add ANCOMBC
nf-core · Jun 25, 2024 · 19f49e6 · 19f49e6
2 parents a02190d + 3932f8e
commit 19f49e6
Show file tree

Hide file tree

Showing 24 changed files with 481 additions and 57 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,10 +9,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 - [#751](https://github.com/nf-core/ampliseq/pull/751) - Added version R08-RS214 of curated GTDB 16S taxonomy: `sbdi-gtdb=R08-RS214-1` or `sbdi-gtdb` as parameter to `--dada_ref_taxonomy`
 - [#752](https://github.com/nf-core/ampliseq/pull/752) - Added version R09-RS220 of GTDB 16S taxonomy: `gtdb=R09-RS220` or `gtdb` as parameter to `--dada_ref_taxonomy`
+- [#753](https://github.com/nf-core/ampliseq/pull/753) - ANCOM-BC via QIIME2 can be used with `--ancombc`, `--ancombc_formula`, and `--ancombc_formula_reflvl`, plotting can be modified with thresholds `--ancombc_effect_size` and `--ancombc_significance`
 
 ### `Changed`
 
 - [#749](https://github.com/nf-core/ampliseq/pull/749) - Create barplot also when no metadata is given
+- [#753](https://github.com/nf-core/ampliseq/pull/753) - ANCOM via QIIME2 is not run anymore by default but on request whith `--ancom`, therefore `--skip_ancom` was removed
 
 ### `Fixed`
 
@@ -24,6 +26,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### `Removed`
 
+- [#753](https://github.com/nf-core/ampliseq/pull/753) - `--skip_ancom` was removed
+
 ## nf-core/ampliseq version 2.9.0 - 2024-04-03
 
 ### `Added`

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -135,6 +135,10 @@
 
   > Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R, Peddada SD. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb Ecol Health Dis. 2015 May 29;26:27663. doi: 10.3402/mehd.v26.27663. PMID: 26028277; PMCID: PMC4450248.
 
+- [ANCOM-BC](https://pubmed.ncbi.nlm.nih.gov/32665548/)
+
+  > Lin H, Peddada SD. Analysis of compositions of microbiomes with bias correction. Nat Commun. 2020 Jul 14;11(1):3514. doi: 10.1038/s41467-020-17041-7. PMID: 32665548; PMCID: PMC7360769.
+
 - [Adonis](https://doi.org/10.1111/j.1442-9993.2001.01070.pp.x) and [VEGAN](https://CRAN.R-project.org/package=vegan)
 
   > Marti J Anderson. A new method for non-parametric multivariate analysis of variance. Austral ecology, 26(1):32–46, 2001.

diff --git a/README.md b/README.md
@@ -44,7 +44,6 @@ By default, the pipeline currently performs the following:
 - Phylogenetic placement ([EPA-NG](https://github.com/Pbdas/epa-ng))
 - Taxonomical classification using DADA2; alternatives are [SINTAX](https://doi.org/10.1101/074161), [Kraken2](https://doi.org/10.1186/s13059-019-1891-0), and [QIIME2](https://www.nature.com/articles/s41587-019-0209-9)
 - Excludes unwanted taxa, produces absolute and relative feature/taxa count tables and plots, plots alpha rarefaction curves, computes alpha and beta diversity indices and plots thereof ([QIIME2](https://www.nature.com/articles/s41587-019-0209-9))
-- Calls differentially abundant taxa ([ANCOM](https://www.ncbi.nlm.nih.gov/pubmed/26028277))
 - Creates phyloseq R objects ([Phyloseq](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html))
 - Pipeline QC summaries ([MultiQC](https://multiqc.info/))
 - Pipeline summary report ([R Markdown](https://github.com/rstudio/rmarkdown))
@@ -74,6 +73,7 @@ nextflow run nf-core/ampliseq \
 
 > [!TIP]
 > By default the taxonomic assignment will be performed with DADA2 on SILVA database, but there are various tools and databases readily available, see [taxonomic classification documentation](https://nf-co.re/ampliseq/usage#taxonomic-classification).
+> Differential abundance testing with ([ANCOM](https://www.ncbi.nlm.nih.gov/pubmed/26028277)) or ([ANCOM-BC](https://www.ncbi.nlm.nih.gov/pubmed/32665548)) when opting in.
 
 > [!WARNING]
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;

diff --git a/assets/report_template.Rmd b/assets/report_template.Rmd
@@ -30,7 +30,6 @@ params:
     barplot: FALSE
     abundance_tables: FALSE
     alpha_rarefaction: FALSE
-    ancom: FALSE
     trunclenf: ""
     trunclenr: ""
     max_ee: ""
@@ -102,6 +101,9 @@ params:
     diversity_indices_alpha: FALSE
     diversity_indices_beta: FALSE
     diversity_indices_adonis: ""
+    ancom: FALSE
+    ancombc: FALSE
+    ancombc_formula: FALSE
     picrust_pathways: FALSE
     sbdi: FALSE
     phyloseq: FALSE
@@ -1531,11 +1533,57 @@ Test results are in separate folders following the scheme `Category-{treatment}-
 
 ancom <- sort( unlist( strsplit( params$ancom,"," ) ) )
 for (folder in ancom) {
-    ancom_path <- paste0("qiime2/ancom/",folder)
+    ancom_path <- paste0("qiime2/",folder)
     cat("\n- [",ancom_path,"/index.html](../",ancom_path,"/index.html)\n", sep="")
 }
 ```
 
+<!-- Subsection on ANCOMBC results -->
+
+```{r, results='asis'}
+any_ancombc <- !isFALSE(params$ancombc) || !isFALSE(params$ancombc_formula)
+```
+
+```{r, eval = !isFALSE(params$any_ancombc), results='asis'}
+cat(paste0("
+## ANCOM-BC
+
+[Analysis of Composition of Microbiomes with Bias Correction (ANCOM-BC)](https://www.ncbi.nlm.nih.gov/pubmed/32665548)
+is applied to identify features that are differentially
+abundant across sample groups.
+Comparisons between groups of samples is performed for specific metadata that can be found in folder
+"))
+
+if ( !isFALSE(params$ancombc) && !isFALSE(params$ancombc_formula) ) {
+    cat("[qiime2/ancombc/](../qiime2/ancombc/) and [qiime2/ancombc_formula/](../qiime2/ancombc_formula/)")
+} else if ( !isFALSE(params$ancombc) ) {
+    cat("[qiime2/ancombc/](../qiime2/ancombc/)")
+} else if ( !isFALSE(params$ancombc_formula) ) {
+    cat("[qiime2/ancombc_formula/](../qiime2/ancombc_formula/)")
+}
+cat(".")
+
+cat(paste0("
+Test results are in separate folders following the scheme `Category-{treatment}-{taxonomic level}`:
+"))
+```
+
+```{r, eval = !isFALSE(params$ancombc), results='asis'}
+ancombc <- sort( unlist( strsplit( params$ancombc,"," ) ) )
+for (folder in ancombc) {
+    ancombc_path <- paste0("qiime2/",folder)
+    cat("\n- [",ancombc_path,"/index.html](../",ancombc_path,"/index.html)\n", sep="")
+}
+```
+
+```{r, eval = !isFALSE(params$ancombc_formula), results='asis'}
+ancombc_formula <- sort( unlist( strsplit( params$ancombc_formula,"," ) ) )
+for (folder in ancombc_formula) {
+    ancombc_formula_path <- paste0("qiime2/",folder)
+    cat("\n- [",ancombc_formula_path,"/index.html](../",ancombc_formula_path,"/index.html)\n", sep="")
+}
+```
+
 <!-- Section on PICRUSt2 results -->
 
 ```{r, eval = !isFALSE(params$picrust_pathways), results='asis'}
@@ -1767,7 +1815,7 @@ if ( as.integer(qiime2_filtertaxa_rm) > 0 ) {
 }
 ```
 ```{r, eval = !isFALSE(params$val_used_taxonomy), results='asis'}
-if (!isFALSE(params$barplot) || !isFALSE(params$alpha_rarefaction) || !isFALSE(params$diversity_indices_beta) || !isFALSE(params$ancom)) {
+if (!isFALSE(params$barplot) || !isFALSE(params$alpha_rarefaction) || !isFALSE(params$diversity_indices_beta) || !isFALSE(params$ancom) || !isFALSE(any_ancombc)) {
     qiime_final <- c("Within QIIME2, the final microbial community data was")
     if (!isFALSE(params$barplot)) {
         qiime_final <- c(qiime_final,"visualized in a barplot")
@@ -1782,6 +1830,9 @@ if (!isFALSE(params$barplot) || !isFALSE(params$alpha_rarefaction) || !isFALSE(p
     if (!isFALSE(params$ancom)) {
         qiime_final <- c(qiime_final,"used to find differential abundant taxa with ANCOM ([Mandal et al., 2015](https://pubmed.ncbi.nlm.nih.gov/26028277/))")
     }
+    if (!isFALSE(any_ancombc)) {
+        qiime_final <- c(qiime_final,"used to find differential abundant taxa with ANCOM-BC ([Lin and Peddada, 2020](https://pubmed.ncbi.nlm.nih.gov/32665548/))")
+    }
     cat(paste(qiime_final[1],qiime_final[2]))
     if (length(qiime_final) >= 3) {
         for (x in 3:length(qiime_final)) {

diff --git a/conf/modules.config b/conf/modules.config
@@ -969,6 +969,57 @@ process {
         ]
     }
 
+    withName: 'QIIME2_ANCOMBC_TAX|QIIME2_ANCOMBC_ASV' {
+        // additional arguments for "qiime composition ancombc", deviating from default: --p-lib-cut (0), --p-conserve (--p-no-conserve)
+        ext.args = '--p-prv-cut 0.1 --p-lib-cut 500 --p-alpha 0.05 --p-conserve'
+        // additional arguments for "qiime composition da-barplot"
+        ext.args2 = { [
+            params.ancombc_effect_size ? "--p-effect-size-threshold ${params.ancombc_effect_size}" : '',
+            params.ancombc_significance ? "--p-significance-threshold ${params.ancombc_significance}" : '',
+            '--p-label-limit 1000'
+        ].join(' ') }
+        publishDir = [
+            [
+                path: { "${params.outdir}/qiime2/ancombc" },
+                mode: params.publish_dir_mode,
+                saveAs: { filename -> filename.equals('versions.yml') || filename.endsWith('.qzv') || filename.endsWith('.qza') ? null : filename }
+            ],
+            [
+                path: { "${params.outdir}/qiime2/ancombc/qza_qzv" },
+                mode: params.publish_dir_mode,
+                pattern: "*{.qza,.qzv}",
+                enabled: params.save_intermediates
+            ]
+        ]
+    }
+
+    withName: 'ANCOMBC_FORMULA_TAX|ANCOMBC_FORMULA_ASV' {
+        // additional arguments for "qiime composition ancombc", deviating from default: --p-lib-cut (0), --p-conserve (--p-no-conserve)
+        ext.args = { [
+            params.ancombc_formula_reflvl ? "--p-reference-levels ${params.ancombc_formula_reflvl}" : '',
+            '--p-prv-cut 0.1 --p-lib-cut 500 --p-alpha 0.05 --p-conserve'
+        ].join(' ') }
+        // additional arguments for "qiime composition da-barplot"
+        ext.args2 = { [
+            params.ancombc_effect_size ? "--p-effect-size-threshold ${params.ancombc_effect_size}" : '',
+            params.ancombc_significance ? "--p-significance-threshold ${params.ancombc_significance}" : '',
+            '--p-label-limit 1000'
+        ].join(' ') }
+        publishDir = [
+            [
+                path: { "${params.outdir}/qiime2/ancombc_formula" },
+                mode: params.publish_dir_mode,
+                saveAs: { filename -> filename.equals('versions.yml') || filename.endsWith('.qzv') || filename.endsWith('.qza') ? null : filename }
+            ],
+            [
+                path: { "${params.outdir}/qiime2/ancombc_formula/qza_qzv" },
+                mode: params.publish_dir_mode,
+                pattern: "*{.qza,.qzv}",
+                enabled: params.save_intermediates
+            ]
+        ]
+    }
+
     withName: PICRUST {
         ext.args = "-t epa-ng --remove_intermediate"
         publishDir = [

diff --git a/conf/test.config b/conf/test.config
@@ -47,4 +47,11 @@ params {
     diversity_rarefaction_depth = 500
 
     vsearch_cluster = true
+
+    // Test ANCOMBC
+    ancombc = true
+    ancombc_formula = "treatment1"
+    ancombc_formula_reflvl = "treatment1::b"
+    ancombc_effect_size = 2
+    ancombc_significance = 0.00001
 }
diff --git a/conf/test_failed.config b/conf/test_failed.config
@@ -32,6 +32,7 @@ params {
     ignore_failed_trimming = true
     ignore_empty_input_files = true
     ignore_failed_filtering = true
+    ancombc = true
 
     //this is to remove low abundance ASVs to reduce runtime of downstream processes
     min_samples = 2

diff --git a/conf/test_full.config b/conf/test_full.config
@@ -35,4 +35,8 @@ params {
 
     //run adonis
     qiime_adonis_formula = "habitat"
+
+    //run ANCOM & ANCOMBC
+    ancom = true
+    ancombc = true
 }
diff --git a/conf/test_multiregion.config b/conf/test_multiregion.config
@@ -31,4 +31,7 @@ params {
     // Reduce runtimes
     skip_alpha_rarefaction = true
     tax_agglom_max = 3
+
+    // Run ANCOM
+    ancom = true
 }
diff --git a/conf/test_pplace.config b/conf/test_pplace.config
@@ -33,10 +33,10 @@ params {
     min_frequency = 10
 
     // pplace
-    pplace_tree = "https://github.com/nf-core/test-datasets/raw/phyloplace/testdata/cyanos_16s.newick"
-    pplace_aln = "https://github.com/nf-core/test-datasets/raw/phyloplace/testdata/cyanos_16s.alnfna"
+    pplace_tree = params.pipelines_testdata_base_path + "phyloplace/testdata/cyanos_16s.newick"
+    pplace_aln = params.pipelines_testdata_base_path + "phyloplace/testdata/cyanos_16s.alnfna"
     pplace_model = "GTR+F+I+I+R3"
-    pplace_taxonomy = "https://github.com/nf-core/test-datasets/raw/phyloplace/testdata/cyanos_16s.taxonomy.tsv"
+    pplace_taxonomy = params.pipelines_testdata_base_path + "phyloplace/testdata/cyanos_16s.taxonomy.tsv"
     pplace_name = "test_pplace"
 
     // Adjust taxonomic levels
@@ -46,5 +46,4 @@ params {
     // Skip some steps to reduce runtime
     skip_alpha_rarefaction = true
     skip_fastqc = true
-    skip_ancom = true
 }
diff --git a/conf/test_sintax.config b/conf/test_sintax.config
@@ -40,6 +40,7 @@ params {
 
     //restrict ANCOM analysis to higher taxonomic levels
     tax_agglom_max = 4
+    ancom = true
 
     sbdiexport = true
 

diff --git a/docs/images/ampliseq_workflow.png b/docs/images/ampliseq_workflow.png
diff --git a/docs/images/ampliseq_workflow.svg b/docs/images/ampliseq_workflow.svg
diff --git a/docs/output.md b/docs/output.md
@@ -43,7 +43,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
   - [Barplot](#barplot) - Interactive barplot
   - [Alpha diversity rarefaction curves](#alpha-diversity-rarefaction-curves) - Rarefaction curves for quality control
   - [Diversity analysis](#diversity-analysis) - High level overview with different diversity indices
-  - [ANCOM](#ancom) - Differential abundance analysis
+  - [Differential abundance analysis](#differential-abundance-analysis) - Calling differentially abundant features with ANCOM or ANCOM-BC
 - [PICRUSt2](#picrust2) - Predict the functional potential of a bacterial community
 - [SBDI export](#sbdi-export) - Swedish Biodiversity Infrastructure (SBDI) submission file
 - [Phyloseq](#phyloseq) - Phyloseq R objects
@@ -544,11 +544,13 @@ Furthermore, ADONIS permutation-based statistical test in vegan-R determine whet
 
 </details>
 
-#### ANCOM
+#### Differential abundance analysis
+
+##### ANCOM
 
 Analysis of Composition of Microbiomes ([ANCOM](https://www.ncbi.nlm.nih.gov/pubmed/26028277)) is applied to identify features that are differentially abundant across sample groups. A key assumption made by ANCOM is that few taxa (less than about 25%) will be differentially abundant between groups otherwise the method will be inaccurate. Parameter `--ancom_sample_min_count` sets the minimum sample counts to retain a sample for ANCOM analysis.
 
-ANCOM is applied to each suitable or specified metadata column for 5 taxonomic levels (2-6).
+On request (`--ancom`), ANCOM is applied to each suitable or specified metadata column for 5 taxonomic levels (2-6).
 
 <details markdown="1">
 <summary>Output files</summary>
@@ -560,6 +562,28 @@ ANCOM is applied to each suitable or specified metadata column for 5 taxonomic l
 
 </details>
 
+##### ANCOM-BC
+
+Analysis of Composition of Microbiomes with Bias Correction ([ANCOM-BC](https://www.ncbi.nlm.nih.gov/pubmed/32665548)) is applied to identify features that are differentially abundant across sample groups. Parameter `--ancom_sample_min_count` sets the minimum sample counts to retain a sample for ANCOM-BC analysis.
+
+On request (`--ancombc`), ANCOM-BC is applied to each suitable or specified metadata column for 5 taxonomic levels (2-6). Independently, multiple comma separated formula can be submitted to ANCOM-BC by `--ancombc_formula`.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `qiime2/ancombc/` or `qiime2/ancombc_formula/`
+  - `da_barplot/Category-<formula>-<taxonomic level>/`
+    - `index.html`: Links to interactive plots.
+    - `<formula><treatment>-ancombc-barplot.html`: Interactive plots.
+  - `differentials/Category-<formula>-<taxonomic level>/`
+    - `index.html`: Visualised table of statistical results.
+    - `*.csv*`: Comma-separated tables of statistical results.
+  - formula: metadata category / formula that was tested
+  - taxonomic level: level-2 (phylum), level-3 (class), level-4 (order), level-5 (family), level-6 (genus), ASV
+  - treatment: Changes for that treatment group
+
+</details>
+
 ### PICRUSt2
 
 PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is a software for predicting functional abundances based only on marker gene sequences. On demand (`--picrust`), Enzyme Classification numbers (EC), KEGG orthologs (KO) and MetaCyc ontology predictions will be made for each sample.

diff --git a/docs/usage.md b/docs/usage.md
@@ -16,6 +16,7 @@
   - [Taxonomic classification](#taxonomic-classification)
   - [Multiple region analysis with Sidle](#multiple-region-analysis-with-sidle)
   - [Metadata](#metadata)
+  - [Differential abundance analysis](#differential-abundance-analysis)
   - [Updating the pipeline](#updating-the-pipeline)
   - [Reproducibility](#reproducibility)
 - [Core Nextflow arguments](#core-nextflow-arguments)
@@ -306,6 +307,10 @@ Sample identifiers should be 36 characters long or less, and also contain only A
 
 The columns which are to be assessed can be specified by `--metadata_category`. If `--metadata_category` isn't specified than all columns that fit the specification are automatically chosen.
 
+### Differential abundance analysis
+
+Differential abundance analysis for relative abundance from microbial community analysis are plagued by multiple issues that aren't fully solved yet. But some approaches seem promising, for example Analysis of Composition of Microbiomes with Bias Correction ([ANCOM-BC](https://pubmed.ncbi.nlm.nih.gov/32665548/)). [ANCOM](https://pubmed.ncbi.nlm.nih.gov/26028277/) and ANCOM-BC are integrated into the pipeline, but only executed on request via `--ancom` and `--ancombc`, more details in the [nf-core/ampliseq website parameter documentation](https://nf-co.re/ampliseq/parameters/#differential-abundance-analysis).
+
 ### Updating the pipeline
 
 When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: