MBias plots #473

Closed
LandiMi2 opened this issue Jan 20, 2022 · 4 comments

Comments

@LandiMi2

Hey @FelixKrueger, I have been struggling to understand why the R2 read from an Illumina library shows biases in its methylation calls, and how to correct them. I understand you can ignore a few bases at the 5' or 3' end, but what about cases like
CMV-8-R2? The read quality was okay (poor quality would otherwise have been a possible reason), so I don't really know how to correct it. Is it even relevant to correct these graphs so that they at least look like
CMV-5-R1 (Read 1), or should I just ignore the bias and carry on with the downstream analysis? What would be the impact on the downstream analysis?
I have looked for literature explaining the cause of these biases but have found none. Please comment.

@FelixKrueger
Owner

If you see dramatic biases in the M-bias plot, they are typically indicative of either technical issues or a consequence of the type of library preparation and/or procedure. As long as these methylation values do not reflect true methylation values, they introduce spurious methylation calls, and thus noise. Arguably, if you are looking for very strong effects you might get away with a bit more noise in the system, but ideally you would want to start your downstream analysis with data that is as clean as possible (that is at least my opinion).

Sometimes you may end up with fairly easy-to-remedy technical artefacts, such as end-repair fill-in biases (https://sequencing.qcfail.com/articles/library-end-repair-reaction-introduces-methylation-biases-in-paired-end-pe-bisulfite-seq-applications/), which can simply be corrected by using --ignore 3 or similar.

Some other techniques or kits, e.g. PBAT, single-cell applications, Zymo Pico-methyl, Accel Swift, to name just a few, introduce their own biases (see e.g. here: https://sequencing.qcfail.com/articles/mispriming-in-pbat-libraries-causes-methylation-bias-and-poor-mapping-efficiencies/ or here for recommendations for trimming: https://github.com/FelixKrueger/Bismark/tree/master/Docs#ix-notes-about-different-library-types-and-commercial-kits).

In your specific case, Read 1 looks like what you would hope to get (assuming this is a plant species?). Read 2 certainly has a somewhat spiky methylation pattern over the first 8-10bp (?), which is quite clearly much lower than for the rest of the read. Whether you want to hard-clip the reads (e.g. with Trim Galore --clip_r2 10, maybe this would also improve the alignment rate?) or simply ignore these residues within the methylation extractor is kind of your choice. If you look at the number of actual methylation calls performed, you will see that over the first ~10bp you have a fairly high number of calls compared to the more 3' end of Read 2 (an expected consequence of overlap detection and removal), so your Read 2 calls will contribute a comparatively high number of biased (and potentially spurious) calls.
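To make the "how many bases should I ignore" decision a bit less of an eyeball exercise, here is a minimal Python sketch. It assumes you have already pulled the per-position % methylation values for one read and one context out of the M-bias report into a plain list; the function name `suggest_ignore_5prime`, the 5-percentage-point tolerance and the 20 bp search window are all hypothetical choices for illustration, not anything Bismark provides.

```python
from statistics import median


def suggest_ignore_5prime(pct_by_pos, tolerance=5.0, max_ignore=20):
    """Suggest how many 5' positions deviate from the rest of the read.

    pct_by_pos: % methylation per read position (index 0 = position 1).
    tolerance:  allowed absolute deviation from the read-body median,
                in percentage points (an arbitrary assumption).
    max_ignore: only the first `max_ignore` positions are considered
                candidates for being ignored (also an assumption).
    """
    # Use the part of the read beyond the search window as the baseline;
    # fall back to the whole read if it is shorter than the window.
    body = pct_by_pos[max_ignore:] or pct_by_pos
    baseline = median(body)

    n_ignore = 0
    for i, pct in enumerate(pct_by_pos[:max_ignore]):
        if abs(pct - baseline) > tolerance:
            # Ignore everything up to and including the last outlier.
            n_ignore = i + 1
    return n_ignore


# Made-up values mimicking a spiky, depressed 5' end followed by a flat read body.
example = [20, 35, 22, 30, 25, 38, 28, 33, 40, 36] + [45] * 40
print(suggest_ignore_5prime(example))
```

The returned number is only a starting point: it could be used as the value for the methylation extractor's `--ignore_r2` option (or for hard-clipping Read 2 before alignment), but it is worth checking against the plot itself before committing to it.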

I would be somewhat more alarmed by the fact that your Read 1 methylation levels are around 30/15/3% in CpG/CHG/CHH context, compared to 45/25/10% for Read 2. Arguably that difference is much bigger than the biases observed at the 5' end of Read 2. The easiest explanation for this would be that the reads do not belong to the same sample - which would be great. If they are from the same sample, you would be in the awkward position of having to decide how to proceed - do you want to use just R1, or just R2, or simply use both and see what you get? You could also go back to the sequencing facility to see if something appeared weird, check which kind of sequencer your data was run on (overcalling of Gs for Read 2?) etc. But that is kind of yet another question...
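The Read 1 versus Read 2 comparison above is simple arithmetic, sketched here in Python with the approximate percentages quoted in this comment. The helper `read_discrepancy` and its 5-percentage-point threshold are hypothetical and only meant to illustrate the check.

```python
def read_discrepancy(r1, r2, threshold=5.0):
    """Return contexts where Read 2 and Read 1 methylation differ notably.

    r1, r2:    dicts mapping context name -> overall % methylation.
    threshold: minimum absolute difference (percentage points) to report;
               an arbitrary assumption.
    """
    return {
        ctx: r2[ctx] - r1[ctx]
        for ctx in r1
        if abs(r2[ctx] - r1[ctx]) > threshold
    }


# Approximate values read off the two M-bias plots discussed in this thread.
read1 = {"CpG": 30.0, "CHG": 15.0, "CHH": 3.0}
read2 = {"CpG": 45.0, "CHG": 25.0, "CHH": 10.0}

for context, diff in read_discrepancy(read1, read2).items():
    print(f"{context}: Read 2 is {diff:+.1f} percentage points vs Read 1")
```

A difference of this size in every context is larger than what the 5' bias alone would produce, which is why checking whether both reads really come from the same sample comes first.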

@LandiMi2
Author

Thanks for your response, @FelixKrueger. I guess in my case I will proceed with the analysis using only R1. I understand that I will lose coverage. The sequencing was done a long time ago, so tracking down where the problem was in the library is a bit tricky. Yes, these are sequences from a plant species.

@shaohuaihan

Should the total calls line for Read 2 also be smooth?

image

@FelixKrueger
Owner

Duplicate post (see #673).
