Overhaul strandedness detection / comparison #1306
Conversation
Co-authored-by: Harshil Patel <[email protected]>
I agree that the ratio between "forward" and "reverse" is more meaningful than some fixed percentage of either. Ideally I would like to see a different message, though, because there are basically three scenarios:
It would be nice if we could differentiate between those cases.
minor typos, once fixed, ok to me
@nf-core-bot fix linting pretty please 🙏
…ypes in strand totals
… insert snapshots from the only tests I do want to update
Co-authored-by: Maxime U Garcia <[email protected]>
…q into improve_rseqc_strandedness
I'm not 100% convinced by the way the RSeQC results are used to check strandedness, and it leads to confusion.
To illustrate, unstranded data in RSeQC looks like:
... not:
The latter case is the consequence of reads aligning to regions where strand cannot be determined. This would include:
(2) might be quite common with either genomic DNA contamination or intronic reads.
As it stands, the supplied strandedness must match at least 70% of reads for the check to pass. The problem with this is that where there is a high level of undetermined reads, this check can fail easily, even where the Salmon-based check generated the strandedness automatically, which is confusing.
The more important statistic in determining strandedness is whether the two 'Fraction of reads explained by' lines are similar, or not. The undetermined section might make you worry about why, but it shouldn't concern you if you're just checking the strand bias.
That's what I'm proposing here: that one of the 'Fraction of reads explained by' values should be at least, e.g., 5x the magnitude of the other. This would be consistent with the Salmon check and reduce confusion.
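To make the proposal concrete, here is a minimal sketch of the ratio-based check. The sample report text, the 5x threshold, and the function name are illustrative assumptions, not the pipeline's actual implementation; note that the undetermined fraction is deliberately ignored, so a library like this one (45% undetermined) still gets a confident call even though it would fail a 70%-of-all-reads rule.

```python
import re

# Hypothetical infer_experiment.py output for a library with a large
# undetermined fraction (e.g. gDNA contamination or intronic reads).
RSEQC_OUTPUT = """\
This is PairEnd Data
Fraction of reads failed to determine: 0.4500
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4800
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0700
"""

RATIO_THRESHOLD = 5.0  # proposed: one fraction must be >= 5x the other


def infer_strandedness(report: str, ratio: float = RATIO_THRESHOLD) -> str:
    """Classify strandedness from the two 'explained by' fractions only,
    ignoring the undetermined fraction entirely."""
    fractions = [
        float(m)
        for m in re.findall(
            r'Fraction of reads explained by .+?: ([\d.]+)', report
        )
    ]
    forward, reverse = fractions  # in the order printed by the tool
    if forward >= ratio * reverse:
        return "forward"
    if reverse >= ratio * forward:
        return "reverse"
    return "unstranded"


print(infer_strandedness(RSEQC_OUTPUT))  # -> forward
```

With the 70% rule this sample would fail (0.48 of reads match the supplied strandedness), despite an obvious ~7:1 strand bias; the ratio rule calls it cleanly.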
Hoping for some discussion / agreement!
Phase 2: the bigger picture
I thought harder about this, and realised that a lot of issues stemmed from comparing Salmon's internal strand inference (over which we have little control) and the bespoke way we were inferring strandedness from RSeQC results.
My proposal is now as follows:
Use Salmon's `lib_format_counts.json` output to derive our own strandedness from its numbers. I think that by doing this we serve the nuances alluded to by @tdanhorn without having to engineer a variety of error messages.
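A rough sketch of what deriving strandedness from `lib_format_counts.json` could look like. The key names used here (`ISF`/`ISR` for paired-end, `SF`/`SR` for single-end, following Salmon's library-type codes), the example counts, and the 0.8 threshold are all assumptions for illustration, not the values this PR necessarily uses.

```python
# In practice the counts would come from Salmon's output, e.g.:
#   with open("lib_format_counts.json") as fh:
#       counts = json.load(fh)
# Hypothetical example content:
example = {
    "num_assigned_fragments": 1000,
    "ISF": 120,  # inward, stranded, forward (assumed key name)
    "ISR": 800,  # inward, stranded, reverse (assumed key name)
    "SF": 0,     # single-end forward
    "SR": 0,     # single-end reverse
}


def salmon_strandedness(counts: dict, threshold: float = 0.8) -> str:
    """Classify strandedness from Salmon's per-library-type fragment
    counts, using only the forward/reverse ratio (illustrative threshold)."""
    forward = counts.get("ISF", 0) + counts.get("SF", 0)
    reverse = counts.get("ISR", 0) + counts.get("SR", 0)
    total = forward + reverse
    if total == 0:
        return "undetermined"
    if forward / total >= threshold:
        return "forward"
    if reverse / total >= threshold:
        return "reverse"
    return "unstranded"


print(salmon_strandedness(example))  # -> reverse
```

Because the same numbers feed both the automatic inference and the later check, the two can no longer disagree with each other, which was the source of the confusion described above.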
The result currently looks like this:
PR checklist
- Make sure your code lints (`nf-core lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- Check for unexpected warnings in the debug profile (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).