Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision #489

Merged 38 commits on Nov 3, 2023
Changes from 34 commits
Commits
fe5ff58
fix: tracking of pre-processing in meta to handle downstream handling
maxibor Aug 25, 2023
0ae55d2
update gtdb-tk version
maxibor Aug 28, 2023
ecb7e99
update CAT
maxibor Aug 28, 2023
9069426
cleanup: remove dump debugging
maxibor Aug 28, 2023
724c28d
feat: check named results also includes refinement info
maxibor Aug 28, 2023
4947462
cleanup: remove dumps
maxibor Aug 28, 2023
0958404
Merge branch 'custom' into custom2
maxibor Aug 28, 2023
9a3de4f
feat: update gtdb-tk to 2.3.2
maxibor Aug 30, 2023
0886ca0
doc: update output filenames
maxibor Sep 1, 2023
5a359f6
update GTDB-Tk module to 2.3.2
maxibor Sep 7, 2023
8af91a7
Merge branch 'custom2' of github.com:maxibor/mag into custom
maxibor Sep 7, 2023
498e5be
Merge branch 'custom' of github.com:maxibor/mag into custom
maxibor Sep 7, 2023
199ca1c
update GTDB-TK prefix
maxibor Sep 7, 2023
e9428d0
tmp: add dump for depth
maxibor Sep 11, 2023
1da34ba
update CAT
maxibor Sep 20, 2023
01b2a03
dump depth
maxibor Sep 20, 2023
78b090f
remove duplicate bins
maxibor Sep 20, 2023
a501fe3
fix: add depth
maxibor Sep 20, 2023
3153e40
unique busco bins
maxibor Sep 22, 2023
5da6c7d
log: add busco dump
maxibor Sep 22, 2023
611a53e
fix: add missing config entries for concoct
maxibor Sep 22, 2023
e4ea756
eukaryote bins are not sent to refined channel
maxibor Sep 22, 2023
f9b4a36
debug: add dump for bin metrics
maxibor Sep 22, 2023
0a16911
debug: ch_filtered_bins -> bins
maxibor Sep 22, 2023
7c94795
debug: add groupTuple
maxibor Sep 22, 2023
73a14b5
debug: flatten bins
maxibor Sep 22, 2023
07c6dbd
debug: add bin_unbins dump
maxibor Sep 22, 2023
548a03a
debug: new combine for depths to avoid duplicated entries
maxibor Sep 22, 2023
12a2420
debug: add depths dump
maxibor Sep 22, 2023
1d84692
pin nf-validation
maxibor Oct 10, 2023
04e2214
Merge remote-tracking branch 'upstream/dev' into custom2
maxibor Oct 16, 2023
db0c32e
dev: both is an accepted value
maxibor Oct 16, 2023
425079a
dev: reactivate bin mixing
maxibor Oct 16, 2023
1cfc3c5
[automated] Fix linting with Prettier
nf-core-bot Oct 30, 2023
9856ffe
Update CHANGELOG.md
jfy133 Nov 1, 2023
1ae1973
Update CHANGELOG.md
jfy133 Nov 1, 2023
86681f7
Update subworkflows/local/depths.nf
jfy133 Nov 1, 2023
1c41f35
Apply suggestions from code review
jfy133 Nov 2, 2023
7 changes: 4 additions & 3 deletions conf/modules.config
@@ -377,8 +377,8 @@ process {
  }

  withName: 'CHECKM_LINEAGEWF' {
- tag = { "${meta.assembler}-${meta.binner}-${meta.id}" }
- ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.id}_wf" }
+ tag = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}" }
+ ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}_wf" }
  publishDir = [
  path: { "${params.outdir}/GenomeBinning/QC/CheckM" },
  mode: params.publish_dir_mode,
@@ -387,7 +387,7 @@ process {
  }

  withName: 'CHECKM_QA' {
- ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.id}_qa" }
+ ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}_qa" }
  ext.args = "-o 2 --tab_table"
  publishDir = [
  path: { "${params.outdir}/GenomeBinning/QC/CheckM" },
@@ -458,6 +458,7 @@ process {

  withName: GTDBTK_CLASSIFYWF {
  ext.args = "--extension fa"
+ ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}" }
  publishDir = [
  path: { "${params.outdir}/Taxonomy/GTDB-Tk/${meta.assembler}/${meta.binner}/${meta.id}" },
  mode: params.publish_dir_mode,
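Reviewer note: a minimal shell sketch of what the extended prefix appears to buy, assuming two bin sets that share assembler, binner and sample ID but differ in domain. The field values (`MEGAHIT`, `MaxBin2`, `sample1`, `prokarya`, `eukarya`, `unrefined`) are illustrative assumptions, not taken from this PR; only the two prefix patterns come from the `CHECKM_QA` entries in the `conf/modules.config` diff.

```shell
#!/usr/bin/env bash
# Hypothetical meta values; only the prefix patterns mirror the config diff.
assembler="MEGAHIT"
binner="MaxBin2"
id="sample1"
refinement="unrefined"

# Old pattern: assembler-binner-id (domain is not part of the name).
old_prefix() {
    printf '%s-%s-%s_qa' "$assembler" "$binner" "$id"
}

# New pattern: assembler-binner-domain-refinement-id ($1 is the domain).
new_prefix() {
    printf '%s-%s-%s-%s-%s_qa' "$assembler" "$binner" "$1" "$refinement" "$id"
}

old_prok="$(old_prefix)"
old_euk="$(old_prefix)"
new_prok="$(new_prefix prokarya)"
new_euk="$(new_prefix eukarya)"

echo "old scheme: $old_prok | $old_euk"
echo "new scheme: $new_prok | $new_euk"
```

Under the old pattern both bin sets map to the same name, so files staged together into one downstream task would clash; under the new pattern the domain and refinement fields keep them apart.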
37 changes: 19 additions & 18 deletions docs/output.md
@@ -476,6 +476,7 @@ For each bin or refined bin the median sequencing depth is computed based on the
  - `predicted_genes/[assembler]-[bin].rna.gff`: Contig positions for rRNA genes in gff version 3 format
  - `predicted_genes/barrnap.log`: Barrnap log file (ribosomal RNA predictor)
  - `GenomeBinning/QC/`
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group]-quast_summary.tsv`: QUAST output summarized per sample/condition.
  - `quast_summary.tsv`: QUAST output for all bins summarized

  </details>
@@ -531,9 +532,9 @@ By default, nf-core/mag runs CheckM with the `check_lineage` workflow that place
  <summary>Output files</summary>

  - `GenomeBinning/QC/CheckM/`
- - `[assembler]-[binner]-[sample/group]_qa.txt`: Detailed statistics about bins informing completeness and contamamination scores (output of `checkm qa`). This should normally be your main file to use to evaluate your results.
- - `[assembler]-[binner]-[sample/group]_wf.tsv`: Overall summary file for completeness and contamination (output of `checkm lineage_wf`).
- - `[assembler]-[binner]-[sample/group]/`: intermediate files for CheckM results, including CheckM generated annotations, log, lineage markers etc.
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group]_qa.txt`: Detailed statistics about bins informing completeness and contamamination scores (output of `checkm qa`). This should normally be your main file to use to evaluate your results.
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group]_wf.tsv`: Overall summary file for completeness and contamination (output of `checkm lineage_wf`).
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: intermediate files for CheckM results, including CheckM generated annotations, log, lineage markers etc.
  - `checkm_summary.tsv`: A summary table of the CheckM results for all bins (output of `checkm qa`).

  </details>
@@ -581,14 +582,14 @@ If `--gunc_save_db` is specified, the output directory will also contain the req
  <summary>Output files</summary>

  - `Taxonomy/CAT/[assembler]/[binner]/`
- - `[assembler]-[binner]-[sample/group].ORF2LCA.names.txt.gz`: Tab-delimited files containing the lineage of each contig, with full lineage names
- - `[assembler]-[binner]-[sample/group].bin2classification.names.txt.gz`: Taxonomy classification of the genome bins, with full lineage names
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group].ORF2LCA.names.txt.gz`: Tab-delimited files containing the lineage of each contig, with full lineage names
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group].bin2classification.names.txt.gz`: Taxonomy classification of the genome bins, with full lineage names
  - `Taxonomy/CAT/[assembler]/[binner]/raw/`
- - `[assembler]-[binner]-[sample/group].concatenated.predicted_proteins.faa.gz`: Predicted protein sequences for each genome bin, in fasta format
- - `[assembler]-[binner]-[sample/group].concatenated.predicted_proteins.gff.gz`: Predicted protein features for each genome bin, in gff format
- - `[assembler]-[binner]-[sample/group].ORF2LCA.txt.gz`: Tab-delimited files containing the lineage of each contig
- - `[assembler]-[binner]-[sample/group].bin2classification.txt.gz`: Taxonomy classification of the genome bins
- - `[assembler]-[binner]-[sample/group].log`: Log files
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group].concatenated.predicted_proteins.faa.gz`: Predicted protein sequences for each genome bin, in fasta format
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group].concatenated.predicted_proteins.gff.gz`: Predicted protein features for each genome bin, in gff format
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group].ORF2LCA.txt.gz`: Tab-delimited files containing the lineage of each contig
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group].bin2classification.txt.gz`: Taxonomy classification of the genome bins
+ - `[assembler]-[binner]-[domain]-[refinement]-[sample/group].log`: Log files

  </details>

@@ -609,14 +610,14 @@ If the parameters `--cat_db_generate` and `--save_cat_db` are set, additionally
  <summary>Output files</summary>

  - `Taxonomy/GTDB-Tk/[assembler]/[binner]/[sample/group]/`
- - `gtdbtk.[assembler]-[binner]-[sample/group].{bac120/ar122}.summary.tsv`: Classifications for bacterial and archaeal genomes (see the [GTDB-Tk documentation for details](https://ecogenomics.github.io/GTDBTk/files/summary.tsv.html).
- - `gtdbtk.[assembler]-[binner]-[sample/group].{bac120/ar122}.classify.tree.gz`: Reference tree in Newick format containing query genomes placed with pplacer.
- - `gtdbtk.[assembler]-[binner]-[sample/group].{bac120/ar122}.markers_summary.tsv`: A summary of unique, duplicated, and missing markers within the 120 bacterial marker set, or the 122 archaeal marker set for each submitted genome.
- - `gtdbtk.[assembler]-[binner]-[sample/group].{bac120/ar122}.msa.fasta.gz`: FASTA file containing MSA of submitted and reference genomes.
- - `gtdbtk.[assembler]-[binner]-[sample/group].{bac120/ar122}.filtered.tsv`: A list of genomes with an insufficient number of amino acids in MSA.
- - `gtdbtk.[assembler]-[binner]-[sample/group].*.log`: Log files.
- - `gtdbtk.[assembler]-[binner]-[sample/group].failed_genomes.tsv`: A list of genomes for which the GTDB-Tk analysis failed, e.g. because Prodigal could not detect any genes.
- - `Taxonomy/GTDB-Tk/gtdbtk_summary.tsv`: A summary table of the GTDB-Tk classification results for all bins, also containing bins which were discarded based on the BUSCO QC, which were filtered out by GTDB-Tk ((listed in `*.filtered.tsv`) or for which the analysis failed (listed in `*.failed_genomes.tsv`).
+ - `gtdbtk.[assembler]-[binner]-[sample/group].{bac120/ar122}.summary.tsv`: Classifications for bacterial and archaeal genomes (see the [GTDB-Tk documentation for details](https://ecogenomics.github.io/GTDBTk/files/summary.tsv.html)).
+ - `gtdbtk.[assembler]-[binner]-[domain]-[refinement]-[sample/group].{bac120/ar122}.classify.tree.gz`: Reference tree in Newick format containing query genomes placed with pplacer.
+ - `gtdbtk.[assembler]-[binner]-[domain]-[refinement]-[sample/group].{bac120/ar122}.markers_summary.tsv`: A summary of unique, duplicated, and missing markers within the 120 bacterial marker set, or the 122 archaeal marker set for each submitted genome.
+ - `gtdbtk.[assembler]-[binner]-[domain]-[refinement]-[sample/group].{bac120/ar122}.msa.fasta.gz`: FASTA file containing MSA of submitted and reference genomes.
+ - `gtdbtk.[assembler]-[binner]-[domain]-[refinement]-[sample/group].{bac120/ar122}.filtered.tsv`: A list of genomes with an insufficient number of amino acids in MSA.
+ - `gtdbtk.[assembler]-[binner]-[domain]-[refinement]-[sample/group].*.log`: Log files.
+ - `gtdbtk.[assembler]-[binner]-[domain]-[refinement]-[sample/group].failed_genomes.tsv`: A list of genomes for which the GTDB-Tk analysis failed, e.g. because Prodigal could not detect any genes.
+ - `Taxonomy/GTDB-Tk/gtdbtk_summary.tsv`: A summary table of the GTDB-Tk classification results for all bins, also containing bins which were discarded based on the BUSCO QC, which were filtered out by GTDB-Tk (listed in `*.filtered.tsv`) or for which the analysis failed (listed in `*.failed_genomes.tsv`).

  </details>

2 changes: 1 addition & 1 deletion modules.json
@@ -118,7 +118,7 @@
  },
  "gtdbtk/classifywf": {
  "branch": "master",
- "git_sha": "c67eaf89682a12966f60008a8fa30f5dd29239df",
+ "git_sha": "898259a38563f29c3c5d2490876019ec2d6f49c5",
  "installed_by": ["modules"]
  },
  "gunc/downloaddb": {
43 changes: 23 additions & 20 deletions modules/local/cat.nf
@@ -1,39 +1,42 @@
  process CAT {
- tag "${meta.assembler}-${meta.binner}-${meta.id}-${db_name}"
+ tag "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}-${db_name}"

- conda "bioconda::cat=4.6 bioconda::diamond=2.0.6"
+ conda "bioconda::cat=5.2.3"
  container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
- 'https://depot.galaxyproject.org/singularity/mulled-v2-75e2a26f10cbf3629edf2d1600db3fed5ebe6e04:eae321284604f7dabbdf121e3070bda907b91266-0' :
- 'biocontainers/mulled-v2-75e2a26f10cbf3629edf2d1600db3fed5ebe6e04:eae321284604f7dabbdf121e3070bda907b91266-0' }"
+ 'https://depot.galaxyproject.org/singularity/cat:5.2.3--hdfd78af_1' :
+ 'biocontainers/cat:5.2.3--hdfd78af_1' }"

  input:
  tuple val(meta), path("bins/*")
  tuple val(db_name), path("database/*"), path("taxonomy/*")

  output:
- path("*.names.txt.gz") , emit: tax_classification
- path("raw/*.ORF2LCA.txt.gz") , emit: orf2lca
- path("raw/*.predicted_proteins.faa.gz"), emit: faa
- path("raw/*.predicted_proteins.gff.gz"), emit: gff
- path("raw/*.log") , emit: log
- path("raw/*.bin2classification.txt.gz"), emit: tax_classification_taxids
- path "versions.yml" , emit: versions
+ path("*.ORF2LCA.names.txt.gz") , emit: orf2lca_classification
+ path("*.bin2classification.names.txt.gz") , emit: tax_classification_names
+ path("raw/*.ORF2LCA.txt.gz") , emit: orf2lca
+ path("raw/*.predicted_proteins.faa.gz") , emit: faa
+ path("raw/*.predicted_proteins.gff.gz") , emit: gff
+ path("raw/*.log") , emit: log
+ path("raw/*.bin2classification.txt.gz") , emit: tax_classification_taxids
+ path "versions.yml" , emit: versions

  script:
  def official_taxonomy = params.cat_official_taxonomy ? "--only_official" : ""
+ def args = task.ext.args ?: ''
+ def prefix = task.ext.prefix ?: "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}"
  """
- CAT bins -b "bins/" -d database/ -t taxonomy/ -n "${task.cpus}" -s .fa --top 6 -o "${meta.assembler}-${meta.binner}-${meta.id}" --I_know_what_Im_doing
- CAT add_names -i "${meta.assembler}-${meta.binner}-${meta.id}.ORF2LCA.txt" -o "${meta.assembler}-${meta.binner}-${meta.id}.ORF2LCA.names.txt" -t taxonomy/ ${official_taxonomy}
- CAT add_names -i "${meta.assembler}-${meta.binner}-${meta.id}.bin2classification.txt" -o "${meta.assembler}-${meta.binner}-${meta.id}.bin2classification.names.txt" -t taxonomy/ ${official_taxonomy}
+ CAT bins $args -b "bins/" -d database/ -t taxonomy/ -n "${task.cpus}" -s .fa --top 6 -o "${prefix}" --I_know_what_Im_doing
+ CAT add_names -i "${prefix}.ORF2LCA.txt" -o "${prefix}.ORF2LCA.names.txt" -t taxonomy/ ${official_taxonomy}
+ CAT add_names -i "${prefix}.bin2classification.txt" -o "${prefix}.bin2classification.names.txt" -t taxonomy/ ${official_taxonomy}

  mkdir raw
  mv *.ORF2LCA.txt *.predicted_proteins.faa *.predicted_proteins.gff *.log *.bin2classification.txt raw/
- gzip "raw/${meta.assembler}-${meta.binner}-${meta.id}.ORF2LCA.txt" \
- "raw/${meta.assembler}-${meta.binner}-${meta.id}.concatenated.predicted_proteins.faa" \
- "raw/${meta.assembler}-${meta.binner}-${meta.id}.concatenated.predicted_proteins.gff" \
- "raw/${meta.assembler}-${meta.binner}-${meta.id}.bin2classification.txt" \
- "${meta.assembler}-${meta.binner}-${meta.id}.ORF2LCA.names.txt" \
- "${meta.assembler}-${meta.binner}-${meta.id}.bin2classification.names.txt"
+ gzip "raw/${prefix}.ORF2LCA.txt" \
+ "raw/${prefix}.concatenated.predicted_proteins.faa" \
+ "raw/${prefix}.concatenated.predicted_proteins.gff" \
+ "raw/${prefix}.bin2classification.txt" \
+ "${prefix}.ORF2LCA.names.txt" \
+ "${prefix}.bin2classification.names.txt"

  cat <<-END_VERSIONS > versions.yml
  "${task.process}":
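Reviewer note: the staging at the end of the `CAT` script (move the raw files into `raw/`, then gzip by explicit name) can be tried in isolation. The sketch below is an assumption-laden stand-in: the prefix value is hypothetical and the input files are empty placeholders rather than real CAT output. It only shows which files should end up where, so the `output:` globs can be checked by eye.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical prefix; in the module it comes from task.ext.prefix or the meta fields.
prefix="MEGAHIT-MaxBin2-prokarya-unrefined-sample1"
workdir="$(mktemp -d)"
cd "$workdir"

# Empty placeholder files standing in for real CAT output.
for suffix in ORF2LCA.txt ORF2LCA.names.txt \
              bin2classification.txt bin2classification.names.txt \
              concatenated.predicted_proteins.faa \
              concatenated.predicted_proteins.gff log; do
    : > "${prefix}.${suffix}"
done

# Same staging as the module: raw files go to raw/, the *.names.txt files stay put.
mkdir raw
mv *.ORF2LCA.txt *.predicted_proteins.faa *.predicted_proteins.gff *.log *.bin2classification.txt raw/

# Compress everything except the log, which the module emits uncompressed.
gzip "raw/${prefix}.ORF2LCA.txt" \
     "raw/${prefix}.concatenated.predicted_proteins.faa" \
     "raw/${prefix}.concatenated.predicted_proteins.gff" \
     "raw/${prefix}.bin2classification.txt" \
     "${prefix}.ORF2LCA.names.txt" \
     "${prefix}.bin2classification.names.txt"

# List the resulting layout.
find . -type f | sort
```

Note that the `mv` globs do not pick up the `*.names.txt` files, because those names end in `.names.txt` rather than `.ORF2LCA.txt` or `.bin2classification.txt`.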
9 changes: 5 additions & 4 deletions modules/local/quast_bins.nf
@@ -1,5 +1,5 @@
  process QUAST_BINS {
- tag "${meta.assembler}-${meta.binner}-${meta.id}"
+ tag "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}"

  conda "bioconda::quast=5.0.2"
  container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
@@ -15,15 +15,16 @@ process QUAST_BINS {
  path "versions.yml" , emit: versions

  script:
+ def prefix = task.ext.prefix ?: "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}"
  """
  BINS=\$(echo \"$bins\" | sed 's/[][]//g')
  IFS=', ' read -r -a bins <<< \"\$BINS\"
  for bin in \"\${bins[@]}\"; do
  metaquast.py --threads "${task.cpus}" --max-ref-number 0 --rna-finding --gene-finding -l "\${bin}" "\${bin}" -o "QUAST/\${bin}"
- if ! [ -f "QUAST/${meta.assembler}-${meta.domain}-${meta.binner}-${meta.id}-quast_summary.tsv" ]; then
- cp "QUAST/\${bin}/transposed_report.tsv" "QUAST/${meta.assembler}-${meta.domain}-${meta.binner}-${meta.id}-quast_summary.tsv"
+ if ! [ -f "QUAST/${prefix}-quast_summary.tsv" ]; then
+ cp "QUAST/\${bin}/transposed_report.tsv" "QUAST/${prefix}-quast_summary.tsv"
  else
- tail -n +2 "QUAST/\${bin}/transposed_report.tsv" >> "QUAST/${meta.assembler}-${meta.domain}-${meta.binner}-${meta.id}-quast_summary.tsv"
+ tail -n +2 "QUAST/\${bin}/transposed_report.tsv" >> "QUAST/${prefix}-quast_summary.tsv"
  fi
  done

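Reviewer note: the loop in `QUAST_BINS` builds the per-sample summary by copying the first report whole and appending later reports without their header line. A standalone sketch of that header-once pattern follows, assuming made-up two-column reports and a hypothetical prefix; `metaquast.py` is not run here, and the placeholder reports are written by `printf`.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"
prefix="demo-prefix"   # hypothetical; the module derives it from task.ext.prefix or meta

# Fake per-bin reports standing in for QUAST's transposed_report.tsv.
for bin in bin1 bin2 bin3; do
    mkdir -p "$workdir/QUAST/$bin"
    printf 'Assembly\tN50\n%s\t1000\n' "$bin" > "$workdir/QUAST/$bin/transposed_report.tsv"
done

summary="$workdir/QUAST/${prefix}-quast_summary.tsv"
for bin in bin1 bin2 bin3; do
    report="$workdir/QUAST/$bin/transposed_report.tsv"
    if ! [ -f "$summary" ]; then
        # First report: keep the header line.
        cp "$report" "$summary"
    else
        # Later reports: skip the header, append only the data rows.
        tail -n +2 "$report" >> "$summary"
    fi
done

cat "$summary"
```

Because the summary file name now comes from `prefix` in all three places, the existence check, the copy and the append cannot drift apart the way separately spelled-out names could.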
71 changes: 41 additions & 30 deletions modules/nf-core/gtdbtk/classifywf/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.