
Normalized the DMR scores #261

Open
Congnguyenn opened this issue Sep 13, 2024 · 4 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@Congnguyenn

Hello @ArtRand,
Thank you for your tool; it helps me a lot.
I have a couple of questions about the DMR score. I am trying to find DMRs between two samples over TSS regions provided by a BED file, each region with a fixed size (2 kbp). The command I used is as follows:

modkit dmr pair \
    -a \${methylbed[i]} \
    -b \${methylbed[j]} \
    --regions-bed \${filename_i}_\${filename_j}.${bedtype}.intersected \
    --min-valid-coverage ${cov_cutoff} \
    --ref ${reference} \
    --missing quiet \
    --base C \
    --threads ${task.cpus} \
    --header \
    --log-filepath \${filename_i}_\${filename_j}.dmr.log \
    --out-path \${filename_i}_\${filename_j}.dmr \
    --force

  • I found that the score column is not well correlated with difference_pct_modified = abs(a_pct_modified - b_pct_modified) * 100, as shown in this plot. Some points (TSSs) have extremely high scores (log2 scale) even though their difference_pct_modified is not really high. This is a bit counter-intuitive to me.
    [image: scatter plot of score (log2 scale) vs. difference_pct_modified per TSS]

  • In #191, you mentioned that "the score is, unfortunately, somewhat correlated with the number of potentially modified positions (CpGs in this case I believe) in the region."

So my questions are:

  • Should I normalize the score by the number of CpG sites in a given TSS and by the depth at each TSS (or whatever regions are used), so that we have a more standardized score? If yes, is there an available or suggested formula?
  • Should I call the actual DMR regions based on both the score (from Modkit) and difference_pct_modified (as I calculated above), rather than solely on the score? If yes, can you suggest a suitable cut-off, for example score >= 4 and difference_pct_modified >= 50, as illustrated in this image? A small pandas sketch of this filter follows the image below.
    [image: score vs. difference_pct_modified with the proposed cut-offs marked]
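
For reference, here is a minimal pandas sketch of the joint filter I have in mind. The column names (score, a_pct_modified, b_pct_modified) are assumptions based on my formula above; they should be adjusted to the actual header written by modkit dmr pair with --header.

```python
import pandas as pd

# Load the modkit dmr pair output written with --header (tab-separated).
# Column names below are assumptions; adjust them to the actual header.
dmr = pd.read_csv("sampleA_sampleB.dmr", sep="\t")

# Effect size: absolute difference in percent modified between the two samples.
dmr["difference_pct_modified"] = (dmr["a_pct_modified"] - dmr["b_pct_modified"]).abs() * 100

# Joint cut-off: require both a high score and a large effect size.
candidates = dmr[(dmr["score"] >= 4) & (dmr["difference_pct_modified"] >= 50)]
print(f"{len(candidates)} of {len(dmr)} regions pass both cut-offs")
```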
@Congnguyenn
Author

Hello @ArtRand
It would be nice to have your inputs on the above queries.

Thanks

@ArtRand
Contributor

ArtRand commented Sep 19, 2024

Hello @Congnguyenn,

I apologize for the delay in getting back to you.

Should I normalize the score by the number of CpG sites in a given TSS and by the depth at each TSS (or whatever regions are used), so that we have a more standardized score? If yes, is there an available or suggested formula?

Looking at your first set of plots, are the low-%change/high-scoring (circled) points particularly long TSSs? Did you perform 5hmC calling?
To answer your question, I don't have a suggested formula. Normalizing by the number of CpGs is a reasonable idea. One potential drawback, however, is that some TSSs may have very dense CpGs; compared to a TSS with lower density but the same CpG count, there will be less variability in the dense-CpG TSS, since the correlation of neighboring CpGs tends to be high.
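
Purely as an illustration (not a recommended formula), you could check whether a naive per-CpG normalization removes the correlation with CpG count. In this sketch, n_cpgs is a per-region CpG count you would have to compute yourself, and the "score" column name is an assumption about the output header:

```python
import pandas as pd

def per_cpg_normalization_check(dmr: pd.DataFrame, n_cpgs: pd.Series) -> pd.DataFrame:
    """Attach a naive per-CpG score and report how each version tracks CpG count.

    `dmr` is the modkit dmr pair output (the "score" column name is an
    assumption); `n_cpgs` is a per-region CpG count computed separately,
    for example from the reference sequence of each region.
    """
    out = dmr.copy()
    out["n_cpgs"] = n_cpgs.values
    out["score_per_cpg"] = out["score"] / out["n_cpgs"].clip(lower=1)
    print("score         vs n_cpgs:", out["score"].corr(out["n_cpgs"]))
    print("score_per_cpg vs n_cpgs:", out["score_per_cpg"].corr(out["n_cpgs"]))
    return out
```

The dense-CpG caveat above still applies: two regions with the same CpG count but different densities are not equivalent.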

Should I call the actual DMR regions based on both the score (from Modkit) and difference_pct_modified (as I calculated above), rather than solely on the score? If yes, can you suggest a suitable cut-off, for example score >= 4 and difference_pct_modified >= 50, as illustrated in this image?

I'm working on a method that will let users "count" the number of DMRs in regions, but I don't have anything yet. Your decision function seems reasonable to me; you could also try to build a null distribution for your problem. The latter is closer to the route I've been going down.
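
If you go the null-distribution route, one hedged sketch: generate scores from a comparison where you expect no real difference (for example, one sample split into two halves of reads, or regions with shuffled labels; how you construct it is up to you, it isn't built into modkit), then take a high quantile of those null scores as your cut-off:

```python
import numpy as np

def null_score_cutoff(null_scores, quantile=0.99):
    """Score at the given quantile of an empirical null distribution.

    `null_scores` come from a comparison with no expected difference
    (how that comparison is constructed is an assumption of this sketch,
    not modkit functionality); regions scoring above the returned value
    in the real comparison are candidate DMRs.
    """
    return float(np.quantile(np.asarray(null_scores, dtype=float), quantile))

# Usage (column names assumed):
# cutoff = null_score_cutoff(null_dmr["score"])
# candidate_dmrs = dmr[dmr["score"] >= cutoff]
```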

ArtRand added the question (Looking for clarification on inputs and/or outputs) label Sep 19, 2024
@Congnguyenn
Author

Congnguyenn commented Sep 20, 2024

Thanks @ArtRand for your response!

Looking at your first set of plots, are the low-%change/high-scoring (circled) points particularly long TSS? Did you perform 5hmC calling?

No, I set a fixed TSS size of 2000 bp, but the number of CpG sites within each TSS can vary. I only analyzed 5mC, using modkit adjust-mods to combine 5hmC and 5mC. Additionally, I want to understand whether the number of reads in a region (depth) affects the score value. For example, consider two TSS regions (TSS1 and TSS2). TSS1 has lower depth in sample A and sample B (10X and 12X, respectively), while TSS2 has higher depth in sample A and sample B (50X and 60X, respectively). Is the score affected by the different depths?

To my knowledge, higher depth increases confidence in the probability (more observations lead to smaller variance in the posterior) and does not affect the score; only the number of CpG sites and the ratio between methylated and unmethylated sites affect the score.

I also want to know whether you are working on functional annotation; the tool would need to consider the methylation status (methylated or unmethylated). I believe this would be a useful feature because it would help us interpret the differences.

@ArtRand
Contributor

ArtRand commented Sep 25, 2024

Hello @Congnguyenn,

I want to understand whether the number of reads in a region (depth) affects the score value. For example, consider two TSS regions (TSS1 and TSS2). TSS1 has lower depth in sample A and sample B (10X and 12X, respectively), while TSS2 has higher depth in sample A and sample B (50X and 60X, respectively). Is the score affected by the different depths?

It depends. Higher depth will allow the model to be more confident that the observed methylation distribution is close to the true (latent) methylation distribution. The effect of higher coverage is that you'll get higher scores at smaller effect sizes (differences in methylation). If the effect size is very small or zero, increasing coverage will not increase the score. Practically speaking, however, you'll probably observe higher scores in regions with higher coverage, but I don't think that makes these scores "incorrect". I could see how you might want to sub-sample the regions of high coverage to be equivalent to the regions of lower coverage; this functionality isn't available in modkit dmr pair with --regions right now. Let me think about it and do some experiments, and maybe I can add it in the next release (should be pretty easy).
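
In the meantime, if you want a rough check of your own, you could downsample each region's pileup counts to a matched valid coverage before comparing percent modified. A minimal sketch (this is not modkit functionality; the counts are whatever you pull out of your bedMethyl or DMR table):

```python
import numpy as np

rng = np.random.default_rng(42)

def downsample_counts(n_mod: int, n_valid: int, target: int) -> tuple[int, int]:
    """Downsample one region's counts to at most `target` valid calls.

    Drawing from a hypergeometric distribution mimics picking `target`
    calls without replacement from the region's n_valid calls, n_mod of
    which are modified. Sketch only; not part of modkit.
    """
    if n_valid <= target:
        return n_mod, n_valid
    n_mod_sub = int(rng.hypergeometric(n_mod, n_valid - n_mod, target))
    return n_mod_sub, target

# Example: a 60x region downsampled to match a 10x region.
print(downsample_counts(n_mod=45, n_valid=60, target=10))
```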
