Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision #489

Merged
merged 38 commits into from
Nov 3, 2023

Conversation

maxibor
Copy link
Member

@maxibor maxibor commented Aug 8, 2023

This PR brings a fix to what was first described in #433
The issue happened when using --postbinning_input both in combination with domain annotation, bin_refinement and CAT profiling.
This is because bins annotated as eukaryotic are sent to the refined bins channels as is, bypassing dastool, and hence keeping the same name.
This fix adds a "refined" entry to the meta, and includes it in the new prefix for the CAT process.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs- [ ] If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@maxibor maxibor changed the title Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision` Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision Aug 8, 2023
@github-actions
Copy link

github-actions bot commented Aug 8, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 1c41f35

+| ✅ 158 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   3 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file [TODO: try and test using for --host_fasta and --host_genome]
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in WorkflowMag.groovy: Optionally add in-text citation tools to this list.

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy

✅ Tests passed:

Run details

  • nf-core/tools version 2.10
  • Run at 2023-11-02 13:27:11

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf)?

Otherwise LGTM :)

subworkflows/local/binning_refinement.nf Outdated Show resolved Hide resolved
workflows/mag.nf Outdated Show resolved Hide resolved
Copy link
Contributor

@prototaxites prototaxites left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

One minor thought - I added params.cat_official_taxonomy before I really got to grips with how $args works. Now that CAT has a $args option, might it make sense to move that into there inside the config?

A few minor comments, but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf)?

I am not sure that there are clashes in my experience, but I don't think more detailed labelling of files hurts (except in user readability down the line). This might be something to think about more deeply when planning v3.0!

@CarsonJM
Copy link
Contributor

CarsonJM commented Aug 8, 2023

@prototaxites I think it would make sense to move params.cat_official_taxonomy into modules.conf!

Also, do you think it would be a good idea to standardize how prefixes are defined (currently some prefixes are defined in within the module's def prefix line and others seem to be defined in modules.conf under ext.prefix)? Perhaps there's a reason for this, but just a thought!

Copy link
Contributor

@CarsonJM CarsonJM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Contributor

@tillenglert tillenglert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Had some minor comments/ideas though. Thanks for the implementation and Bugfix!

nextflow.config Outdated Show resolved Hide resolved
workflows/mag.nf Outdated Show resolved Hide resolved
@jfy133
Copy link
Member

jfy133 commented Aug 10, 2023

@maxibor (or anyone else) any thoughts on my original feedback:

<...> but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf)?

CHANGELOG.md Outdated Show resolved Hide resolved
@maxibor
Copy link
Member Author

maxibor commented Aug 24, 2023

Update: this is actually a more generalized issue with processes that perform a collect: CAT, but also Quast, and GTDB-Tk

@jfy133
Copy link
Member

jfy133 commented Aug 24, 2023

@maxibor (or anyone else) any thoughts on my original feedback:

<...> but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf)?

I guess this was correct then???

I wonder why this issue hasn't come up before...?!

@maxibor
Copy link
Member Author

maxibor commented Aug 24, 2023

It's a new issue because samples can now appear multiple times in downstream channels for the following reasons:

  • Split by domain (new, only happens when using domain classification)
  • Split by refined/not refined (only happens when using bin refinement)
  • Split by binned/unbinned (only happens when using --postbinning_input both)

When they're collected at the final steps (Quast, GTDB-Tk, or CAT), bins that were not renamed by the way they were treated now appear multiple times with same name, and triggers the input file name collision.

Which in addition was particularly challenging to debug because of a misreporting of the actual error nextflow-io/nextflow#3828

@prototaxites
Copy link
Contributor

prototaxites commented Aug 27, 2023

I’ve been thinking about this and I’m not sure we need the refined/unrefined label to fix this - this might specifically just be a bug in the code that assigns bins post-refinement.

If we think about the original sources of bins, we have bins coming from the three binners - metabat, maxbin, and concoct. Each of these bins should have a unique numbered file name. If refinement is enabled, these bins are fed into DAS tool and a new set of bins are returned - that are not necessarily the same as the input bins. These should also be uniquely named, differently to the original bins because their binner is listed as DAS tool.

If domain classification is enabled with bin refinement, then undefined eukaryotic bins are passed as “refined” at this stage as otherwise they will be lost, unless ‘both’ is set for post-binning input. I think this is the only time that a duplicate bin can slip through. So don’t we just have to fix the code so that eukaryotic bins are only fed to the refined bin channel if ‘both’ is not selected? Otherwise, every bin should be unique and only be able to enter each process downstream exactly once, because they might get split into separate channels, and recombined, but never duplicated?

@maxibor
Copy link
Member Author

maxibor commented Aug 28, 2023

A small schematic to illustrate the issue.
The bins do not need to be renamed, only the results of GTDB-Tk, Quast, and CAT. Otherwise the .collect() for the summary process of these three tools will fail with an input file name collision.

mag meta flowthrough_230828_161552

I've now update the meta's accordingly, and passed them through when needed.

@maxibor
Copy link
Member Author

maxibor commented Aug 28, 2023

There are now 4 possible values for the refinement meta:

  • unrefined for bins that won't be refined
  • unrefined_unbinned for the contigs that were not binned and will not be refined
  • dastool_refined for bins that were refined with DASTool
  • dastool_refined_unbinned for the unbinned processed with DASTool

Note that the input file name collision was not only be triggered by the lack of refinment tracking, but also by the lack of domain tracking.
So using all the meta to Quast/GTDB-TK/CAT output filenames will avoid this error (while not affecting bin names).

@jfy133
Copy link
Member

jfy133 commented Aug 28, 2023

That's a really helpful diagram @maxibor thank you very much!

@jfy133
Copy link
Member

jfy133 commented Oct 1, 2023

I am currently downloading the latest GTDB database, then I'll do a manual 'full test' and if all looks good then we can merge!

@jfy133
Copy link
Member

jfy133 commented Oct 2, 2023

Self notes:

  • CheckM looks good: summary seems to be raw bins (MEGAHIT-MetaBAT2-test_minigut_sample2.unbinned.1) but per CheckM run looks good1
  • QUAST looks good: summary seems to be raw bins (MEGAHIT-CONCOCT-test_minigut_67.fa) but per QUAST run looks good1
  • CAT looks good: summary seems to be raw bins (MEGAHIT-CONCOCT-test_minigut_sample2_24.fa) but per CAT run output has correct names1
  • GTDBTK looks good: couldn't test due to db issue

This can be merged once I've done one test without domain classsifcation, and I'll follow up with a few other things I noticed with a separate PR

Footnotes

  1. Indeed because they do report on physical binned collection of bin FASTAs rather than the file itself? When looking in the work directory of QUAST_BINS_SUMMARY all the files are there, but the contents of the files have the original bin names 2 3

@jfy133
Copy link
Member

jfy133 commented Oct 2, 2023

I propose merge this, patch release, then I'll make issues and a further patch release of the other more minor things I found (once i've had a chance to look into them)

@jfy133
Copy link
Member

jfy133 commented Oct 3, 2023

I'm happy with this now after some testing! Just need pull upstream from dev and CHANGELOG update :)

@jfy133
Copy link
Member

jfy133 commented Oct 4, 2023

SPeaking privately apparently Maxime found more issues (but is currently on holiday)

@jfy133
Copy link
Member

jfy133 commented Oct 30, 2023

Ok, there are multiple issues but not related directly to problem this PR is solving.

We will break this down bit-by-bit so merging this now, and fixes will come in over time 👍

Thanks again @maxibor !

EDIT: once the latest dev branch has been pulled into this PR and changelog updated!

@jfy133
Copy link
Member

jfy133 commented Oct 30, 2023

@nf-core-bot fix linting

subworkflows/local/depths.nf Outdated Show resolved Hide resolved
@jfy133
Copy link
Member

jfy133 commented Nov 1, 2023

Just need to remove some dumps, but hopefully the .flatten() of the bins going into depth fixes the bug, then will merge

subworkflows/local/busco_qc.nf Outdated Show resolved Hide resolved
subworkflows/local/depths.nf Outdated Show resolved Hide resolved
subworkflows/local/depths.nf Outdated Show resolved Hide resolved
subworkflows/local/gtdbtk.nf Outdated Show resolved Hide resolved
@maxibor
Copy link
Member Author

maxibor commented Nov 2, 2023

Long time in the making update:
That'll would be the pattern for better solution regarding the input file name collision

new_channel = channel_name.collectFile(keepHeader: true) {
        meta, f ->
        [f.getBaseName()+'.tsv', f]
    }
WHATEVER_SUMMARY_PROCESS(new_channel.collect())

This pattern takes all the files from the channel_name channel that share the same name and concatenates them into one (and keepHeader makes sure only header is written).
It's the equivalent of

cat a/test.tsv  b/test.tsv > test.tsv

@jfy133
Copy link
Member

jfy133 commented Nov 3, 2023

Long time in the making update: That'll would be the pattern for better solution regarding the input file name collision

new_channel = channel_name.collectFile(keepHeader: true) {
        meta, f ->
        [f.getBaseName()+'.tsv', f]
    }
WHATEVER_SUMMARY_PROCESS(new_channel.collect())

This pattern takes all the files from the channel_name channel that share the same name and concatenates them into one (and keepHeader makes sure only header is written). It's the equivalent of

cat a/test.tsv  b/test.tsv > test.tsv

Hm interesting... but do we want to collapse those bins back into a single bin again? I'm pretty sure of the tools rely on the file name to disambiguate between bins so I'm worried we would then not be able to distinguish between eukaryotic vs non-eukaryotic bins in the summary tools.

Furthermore, does it support gzipped files? The original error you reported are gzipped output from CAT and I don't see reference to that in the NXF docs on collectFile

That said, I may be being overly paranoid there. So I suggest that I merge this current PR as a conservative approach (making sure everything is very explicitly independent). But then I will make an issue describing what you're suggesting as an enhancement - as other than my (maybe unfounded) worry above - that would indeed be an elegant solution 👍

@jfy133 jfy133 merged commit 9917795 into nf-core:dev Nov 3, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants