covstats: % unmapped calculation seems inaccurate (>3000%, etc) #73
Comments
thanks for the clear report and diagnosis. |
Updated version
Previous version
TBProfiler
TBProfiler was also run on the fastqs before variants were called. Because it uses a different aligner, its results can be expected to differ somewhat from covstats, but they should be similar since both align to the same reference genome (H37Rv).
The fact that TBProfiler seems to report 81.55% of a sample as unmapped while neo-covstats reports 95.33% is worth noting, but I don't think it's unreasonable, especially considering these samples are specifically designed to be ornery. |
Hi, thanks for following up. You could try increasing the number of sampled reads, e.g.:
to see if it converges to match TBProfiler a bit better. |
Sorry for the slow response, this fell off my radar. The samples I'm processing vary hugely in size -- we're running this on almost every Illumina-processed tuberculosis sample on SRA, and some of that is in a bit of a rough state -- so we're concerned adjusting the number of sampled reads may cause issues with the smaller samples. In any case, this is definitely much more accurate than what we were seeing earlier. Would it be appropriate to make a release? |
Hi, I tagged a release here: https://github.com/brentp/goleft/releases/tag/v0.2.6. Let me know if you have any further suggestions. I see what you mean about increasing the number of sampled reads. |
Covstats can report that over 3000% of a sample is unmapped.
Example
BioSample SAMN18146202 is a simulated Mycobacterium tuberculosis sputum sample which has been contaminated in silico with common contaminants to test bioinformatics pipelines. Due to how the sample was made, essentially anything that is not contamination is reference (H37Rv).
When this sample is run through covstats, the tool claims that 3683.32% or 3671.93% of the data (depending on how we decontaminate the fastqs) is unmapped. It makes similarly bold (>100%) claims about SAMN18146222, SAMN18146203, SAMN18146200, etc.
Possible cause
It appears that incrementing the numerator (nUnmapped) is mutually exclusive with incrementing the denominator (k), yet at the end of the script the ratio is treated as a percentage and multiplied by 100. For example, if there are 30 unmapped reads for every 1 mapped read, the result should be 30/31 --> ~97%, but because of where k is incremented relative to the continue, it comes out as 30/1 --> 3000%.
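To make the suspected arithmetic concrete, here is a minimal Go sketch of the pattern described above. It is not the actual covstats source: the `Read` type and both function names are hypothetical, and only the counter names nUnmapped and k come from this report. The "fixed" variant is just one possible correction; it leaves k as the mapped-read count and adds the unmapped reads to the denominator only for this one percentage.

```go
package main

import "fmt"

// Read is a hypothetical stand-in for an alignment record;
// only the unmapped flag matters for this illustration.
type Read struct {
	Unmapped bool
}

// pctUnmappedBuggy mimics the reported behaviour: the increment of k sits
// after the continue, so unmapped reads never reach the denominator.
func pctUnmappedBuggy(reads []Read) float64 {
	nUnmapped, k := 0, 0
	for _, r := range reads {
		if r.Unmapped {
			nUnmapped++
			continue // k is not incremented for this read
		}
		k++ // only mapped reads are counted here
	}
	if k == 0 {
		return 0
	}
	return 100 * float64(nUnmapped) / float64(k)
}

// pctUnmappedFixed divides by all sampled reads (mapped + unmapped).
// k keeps its meaning as the mapped-read count for any other statistics.
func pctUnmappedFixed(reads []Read) float64 {
	nUnmapped, k := 0, 0
	for _, r := range reads {
		if r.Unmapped {
			nUnmapped++
			continue
		}
		k++
	}
	total := nUnmapped + k
	if total == 0 {
		return 0
	}
	return 100 * float64(nUnmapped) / float64(total)
}

// exampleReads builds the 30-unmapped-to-1-mapped case from the text.
func exampleReads() []Read {
	reads := make([]Read, 0, 31)
	for i := 0; i < 30; i++ {
		reads = append(reads, Read{Unmapped: true})
	}
	return append(reads, Read{Unmapped: false})
}

func main() {
	reads := exampleReads()
	fmt.Printf("buggy: %.2f%%\n", pctUnmappedBuggy(reads)) // buggy: 3000.00%
	fmt.Printf("fixed: %.2f%%\n", pctUnmappedFixed(reads)) // fixed: 96.77%
}
```

With 30 unmapped reads and 1 mapped read, the first function reproduces the 3000% figure and the second gives roughly 96.77%.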
Scope
k is the denominator for several other values too, so this may affect them as well.