covstats: % unmapped calculation seems inaccurate (>3000%, etc) #73

aofarrel · 2023-12-12T23:22:08Z

Covstats can report that over 3000% of a sample is unmapped.

Example

BioSample SAMN18146202 is a simulated Mycobacterium tuberculosis septum sample which has been contaminated in silico with common contaminants to test bioinformatics pipelines. Due to how the sample was made, essentially anything that is not contamination is reference (H37Rv).

When run on covstats, covstats claims 3683.32% or 3671.93% (depending on how we decontaminate the fastqs) of the data is unmapped. It makes similar bold (>100%) claims of SAMN18146222, SAMN18146203, SAMN18146200, etc.

Possible cause

for len(insertSizes) < n {
    rec, err := br.Read()
    if err == io.EOF {
	    break
    }
    pcheck(err)
    if rec.Flags&sam.Unmapped != 0 {
	    nUnmapped++
	    continue
    }
    k++
    [...]
}
[...]
s.ProportionUnmapped = float64(nUnmapped) / float64(k)
[...]
fmt.Fprintf(os.Stdout, "%.2f\t%s\t%.2f\t%.1f\t%.1f\t%.1f\t%d\t%s\t%s\n", coverage, sizes.String(), 100*sizes.ProportionUnmapped,[...]

It appears the that the numerator (nUnmapped) increasing is mutually exclusive with the denominator (k) increasing, but at the end of the script it's treated like a percentage and multiplied by 100 -- eg, if there's 30 unmapped for every 1 mapped, it should be like 30/31 --> 96%, but because of where k is relative to the continue it's more like 30/1 --> 3000%.

Scope

k is the denominator for several other values too, so this may affect them as well.

The text was updated successfully, but these errors were encountered:

brentp · 2023-12-13T07:52:03Z

thanks for the clear report and diagnosis.
I have pushed a fix. would you try the attached binary (gunzip, chmod +x and ./goleft_linux64 covstats ... )
and verify it looks good for you? If so I will make a new release.
goleft_linux64.gz

aofarrel · 2023-12-15T22:26:38Z

Updated version

coverage	insert_mean	insert_sd	insert_5th	insert_95th	template_mean	template_sd	pct_unmapped	pct_proper_pair	read_length	bam	sample
4.34	-0.44	17.25	-29	29	299.57	17.24	95.33	4.6	150	SAMN18146222_to_H37Rv.bam	SAMN18146222
3.93	-0.5	17.25	-28	28	299.47	17.47	97.35	2.6	150	SAMN18146202_to_H37Rv.bam	SAMN18146202
18.61	-0.55	17.35	-29	29	299.45	17.36	0.2	98.7	150	SAMN18146198_to_H37Rv.bam	SAMN18146198

Previous version

coverage	insert_mean	insert_sd	insert_5th	insert_95th	template_mean	template_sd	pct_unmapped	pct_proper_pair	read_length	bam	sample
4.34	-0.44	17.25	-29	29	299.57	17.24	2043.06	98.6	150	SAMN18146222_to_H37Rv.bam	SAMN18146222
3.93	-0.5	17.25	-28	28	299.47	17.47	3671.93	98.9	150	SAMN18146202_to_H37Rv.bam	SAMN18146202
18.61	-0.55	17.35	-29	29	299.45	17.36	0.2	98.9	150	SAMN18146198_to_H37Rv.bam	SAMN18146198

TBProfiler

TBProfiler is was also run on the fastqs before variants were called. Because it uses a different aligner, it can be expected that its results may differ from covstats, but they should be similar since they align to the same reference genome (H37Rv).

sample	% mapped	100 - % mapped	median coverage (always rounds to integer)
SAMN18146222	18.45%	81.55%	4
SAMN18146202	16.8%	83.2%	4
SAMN18146198	99.83%	0.17%	18

The fact TBProfiler seems to be saying 81.55% of a sample is unmapped while neo-covstats says 95.33% of a sample is unmapped is worth noting, but I don't think it's unreasonable, especially considering these are samples specifically designed to be ornery.

brentp · 2023-12-18T11:06:59Z

Hi, thanks for following up. You could try increasing the number of sampled reads, e.g.:

goleft covstats -n 10000000 ...

to see if it converges to match TBProfiler a bit better.

aofarrel · 2024-02-13T20:50:07Z

Sorry for the slow response, this fell off my radar. The samples I'm processing vary hugely in size -- we're running this on almost every Illumina-processed tuberculosis sample on SRA, and some of that is in a bit of a rough state -- so we're concerned adjusting the number of sampled reads may cause issues with the smaller samples.

In any case, this is definitely much more accurate than what we were seeing earlier. Would it be appropriate to make a release?

brentp · 2024-02-14T07:49:59Z

Hi, I tagged a release here: https://github.com/brentp/goleft/releases/tag/v0.2.6

let me know if you have any further suggestions. I see what you mean about increasing the number of sampled reads.

brentp added a commit that referenced this issue Dec 13, 2023

fix for #73

3ba2e6f

aofarrel mentioned this issue Dec 19, 2023

Refactor & run fastp before decontam & guardrail mode aofarrel/myco#45

Merged

aofarrel added a commit to aofarrel/goleft-wdl that referenced this issue Mar 19, 2024

Update Docker (addresses brentp/goleft#73)

c6886ff

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

covstats: % unmapped calculation seems inaccurate (>3000%, etc) #73

covstats: % unmapped calculation seems inaccurate (>3000%, etc) #73

aofarrel commented Dec 12, 2023 •

edited

Loading

brentp commented Dec 13, 2023

aofarrel commented Dec 15, 2023

brentp commented Dec 18, 2023 •

edited

Loading

aofarrel commented Feb 13, 2024

brentp commented Feb 14, 2024

covstats: % unmapped calculation seems inaccurate (>3000%, etc) #73

covstats: % unmapped calculation seems inaccurate (>3000%, etc) #73

Comments

aofarrel commented Dec 12, 2023 • edited Loading

Example

Possible cause

Scope

brentp commented Dec 13, 2023

aofarrel commented Dec 15, 2023

Updated version

Previous version

TBProfiler

brentp commented Dec 18, 2023 • edited Loading

aofarrel commented Feb 13, 2024

brentp commented Feb 14, 2024

aofarrel commented Dec 12, 2023 •

edited

Loading

brentp commented Dec 18, 2023 •

edited

Loading