Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does pileup always output a value for a covered base? #200

Closed
Ge0rges opened this issue Jun 6, 2024 · 3 comments
Closed

Does pileup always output a value for a covered base? #200

Ge0rges opened this issue Jun 6, 2024 · 3 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@Ge0rges
Copy link

Ge0rges commented Jun 6, 2024

Hi,

Just a simple question (for once), I wanted to ensure I understand the output of pileup correctly. For any given location/nucleotide, if there is no methylation but there is coverage of that base, will N canonical always be non-zero?

@ArtRand
Copy link
Contributor

ArtRand commented Jun 7, 2024

Hello @Ge0rges,

Any position with valid coverage >0 will have a bedMethyl record.

For any given location/nucleotide, if there is no methylation but there is coverage of that base, will N canonical always be non-zero?

I would say "For any given location/nucleotide, if there is no methylation but there is coverage of that base and the modification call passes the confidence threshold, will N canonical always be non-zero?"

Let me know if you find this to not be the case.

@Ge0rges
Copy link
Author

Ge0rges commented Jun 7, 2024

So just to double check, even if a base is not methylated but covered it could not show up if the modification call isn't above the threshold?

Would it be incorrect to treat covered bases below the threshold as unmethylated? I guess unknown is different that unmethylated...

@ArtRand
Copy link
Contributor

ArtRand commented Jun 7, 2024

@Ge0rges

A potential confusion is that modkit does not assume a base is canonical if the probability of modification is close to $\frac{1}{N_{\text{classes}}}$. A modification call can be canonical or any of the potential modification codes/states. So if a base is covered, but the call is canonical, there is a confidence/probability associated with that canonical call. If that canonical probability is below the pass threshold, it will be filtered out.

For example (take a simple 5mC/canonical example):

$$ P_{\text{5mC}} = 0.7, P_{\text{5hmC}} = 0.2, P_{\text{canonical}} = 1 - P_{\text{5mC}} + P_{\text{5hmC}} = 0.1, \text{Call} \Rightarrow \text{5mC} $$

However if the threshold is 0.9, then this call will fail not be considered canonical. There are more examples in the documentation.

If a read is filtered out, it adds a count to column 16 not $N_{\text{valid cov}}$. If a row has $N_{\text{valid cov}}$ of zero, it's not included in the output. You can see this easily when you run modkit pileup with --no-filtering and --filter-threshold 0.95 (or higher) and count the rows.

You can, however, force a default to canonical call by setting the thresholds specifically for each modified base --mod-threshold h:0.95 --mod-threshold m:0.95 --filter-threshold C:0.05 in this case what happens is the canonical probability (0.1) is still above the requirement for the base (0.05) and a canonical call is emitted. Using the thresholds this way is only advised when you have a very strong prior that bases should be canonical unless there is strong evidence of modification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

2 participants