Heatmap for Per Sequence Quality Scores is an extra base long #812

MattBashton · 2018-08-06T13:37:43Z

Description of bug:
Per Sequence Quality Scores heatmap is incorrect on the final base (which is actually non existent), my reads are 100bp long, the individual sample report will show positions numbered 1 to 100 with the final base (100) having sane levels of A, T, C and G i.e. 24-26% in my various input files. Using the heatmap however will produce an image with bases numbered positions 0-100 (i.e. 101 bp in length) here the final base is always 100% G however this data point is bogus. I have confirmed the actual length of my input reads to be 100bp using a text editor (emacs with column numbering mode enabled). Additionally these two plots should ideally either be both 1 indexed or 0 indexed but not a mixture of both

MultiQC Error log:
None generated

Files that triggers the error:
Case_001_R2_fastqc.zip
Case_001_R1_fastqc.zip

MultiQC run details (please complete the following):

Command used to run MultiQC: multiqc <dir>
MultiQC Version: MultiQC v1.16
Operating System: Ubuntu 16.04.5 LTS
Python Version: 3.5.2
Method of MultiQC installation: pip install --user

Additional context
FastQC v0.11.7

The text was updated successfully, but these errors were encountered:

ewels · 2018-08-17T09:24:03Z

Hi @MattBashton,

Thanks for the detailed report. I guess you are specifically talking about the hover text when moving the cursor over the heatmap? The colouring of the heatmap looks fine to me when I make a report with your files, but I see that the text does report 0-100 with 100% on base 100 as you say.

I've just had a look at the code and this hover text is generated dynamically by taking the cursor position and reverse-engineering the colour code underneath it. The 100% G is because the browser is incorrectly reporting rgb(0,0,0) at the far extreme of the plot. I don't think I can do anything about this as it's due to the core browser canvas rendering. The base position 0 is because it's doing Math.floor of the current x position, so when you get to less than 1bp it reports 0. I've just switched this for Math.round and added a check to not allow it to report 0bp. This should now be available in MultiQC v1.7dev.

Note that it can be a little tricky to get precisely the correct base numbers because of the way that FastQC reports this data. It does so with a base position range, which can vary:

94-95	23.746249462902973	26.6517600021941	26.067876874506705	23.534113660396216
96-97	23.967289520566574	26.26382242694542	25.883346696990806	23.885541355497196
98-99	23.938792540213242	26.38431566079754	26.243589803749835	23.43330199523938
100-101	24.42109582916103	26.361364398196013	25.320439466630525	23.897100306012433

The data behind the MultiQC plot needs to have single data points (for the line graph), so I get around this by just picking the average value of the range and using that.

In short - it's best to be a little careful with either a MultiQC or FastQC report if you really need per-base accuracy, especially at the end of reads.

Interestingly the FastQC HTML reports themselves say that your reads are 101bp long! Maybe a bug in FastQC..?

Phil

MattBashton · 2018-08-17T10:25:12Z

Hi Phil, yes just to confirm, will the fix prevent the 100% G base being reported on mouse over? I feel this is important because checking for poor calls at the end of reads is one of them main usage cases of both FastQC and MulltiQC prior to trimming, so eliminating this bug is a good thing. Given this behaviour (at present) one would incorrectly assume their reads were all 100% G at position 100, (which is of course a bogus position 101) if we incorrectly assuming 1 based indexing (which is reasonable), but incorrectly interpreting this plot readout could cause the end user to consider trimming base 100 - which would be unnecessary. You sate that you can't prevent the 100% G being reported? This is a critical issue IMO.

Another problem this plot has too is that the averages (par the final 100 position) appear to be reported in bins of two, so positions 0-1, have the same Q score, as do 1-2, and 2-3, i.e for 0,1 is the same value as is 1-2 and 2-3. However here is a another issue, every base is reported twice, so moving left to right the mouse over shows stats at the top of the plot for 0,1,1,2,2 but each successive report of frequencies for the bases differs, so bp 2 is reported as: 34, 21, 22, 23% initially but move 1 pixel to the left and the readout for bp 2 is now: 26, 24, 25, 25 for read 1. This is somewhat broken because how is the end user supposed to interpret the correct value for the 2nd base pair when two are reported. I fully understand that underlying this data FastQC is reporting them in 2bp bins, which is why this happens but the heatmap needs to be transparent to this, otherwise mis interpretation can occur. Perhaps you should render the heatmap at 2pb resolution, rather than reporting every data point twice with different values?

Yes I had noticed that current FastQC reports these reads to be 101bp in length, although it's plots don't runt to 101bp, suspect this is also a bug in FastQC, although as downstream input it's one you might want to be aware off.

MattBashton · 2018-08-17T10:26:40Z

I've attached my report output to show what I mean.

multiqc_report_error.html.zip

MattBashton · 2018-08-17T11:06:59Z

I should update this the reads are indeed 101bp long, forgot emacs uses 0 based indexing whilst checking. However the 101 based being reported as 100% G is clearly not correct.

Show the original base pair range label when hovering on the sequence content line plot data. see #812.

Use the original data instead of calculating it from the visible pixels. See issue #812

ewels · 2018-08-17T11:31:52Z

I was a little concerned that implementing the changes you suggest would involve rewriting a tonne of the code. However, after having a poke around and refreshing myself with how the plot is created, I realised that most of the pieces were already in place. I've refactored the code for both the heatmap and the line plot to show the original FastQC base pair range. I've also updated the heatmap to show the raw base percentages instead of back-converting from the colours (though with no decimal places showing it's unlikely that this will make a detectable difference). The side effect of pulling the original data in this way is that it's fixed the 100% G bug 🎉

I quite enjoyed fixing this one! Please give it a try and confirm that it works as expected now.

Added 3px x offset for picking up the base position when hovering on the sequence content heatmap. Done empirically...! See #812

MattBashton · 2018-08-17T13:01:31Z

Hey glad to hear you enjoyed fixing it :) I've re ran my end, and the output looks flawless with respect to this issue, so glad this got fixed!!! Which was my motivation for reporting it! Attached is the output.
multiqc_report_fixed.html.zip

ewels · 2018-08-17T13:08:29Z

Great! Thanks for reporting and testing 👍

ewels added the bug: core Bug in the main MultiQC code label Aug 17, 2018

ewels closed this as completed in 26f894e Aug 17, 2018

ewels added a commit that referenced this issue Aug 17, 2018

FastQC seq content line plot tooltip

4268999

Show the original base pair range label when hovering on the sequence content line plot data. see #812.

ewels added a commit that referenced this issue Aug 17, 2018

Refactor FastQC heatmap hover key.

51cb4f1

Use the original data instead of calculating it from the visible pixels. See issue #812

ewels added a commit that referenced this issue Aug 17, 2018

FastQC heatmap: add x offset to hover

1b04f6f

Added 3px x offset for picking up the base position when hovering on the sequence content heatmap. Done empirically...! See #812

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Heatmap for Per Sequence Quality Scores is an extra base long #812

Heatmap for Per Sequence Quality Scores is an extra base long #812

MattBashton commented Aug 6, 2018

ewels commented Aug 17, 2018

MattBashton commented Aug 17, 2018

MattBashton commented Aug 17, 2018

MattBashton commented Aug 17, 2018

ewels commented Aug 17, 2018

MattBashton commented Aug 17, 2018

ewels commented Aug 17, 2018

Heatmap for Per Sequence Quality Scores is an extra base long #812

Heatmap for Per Sequence Quality Scores is an extra base long #812

Comments

MattBashton commented Aug 6, 2018

ewels commented Aug 17, 2018

MattBashton commented Aug 17, 2018

MattBashton commented Aug 17, 2018

MattBashton commented Aug 17, 2018

ewels commented Aug 17, 2018

MattBashton commented Aug 17, 2018

ewels commented Aug 17, 2018