modkit results look off for CHG and CHH contexts - plant methylation #234
Hello @colindaven, How are you producing the basecalls/base modification calls now? Are you using dorado/remora? Your commands look fine to me. One note is that
You can run them together, however
As mentioned above, it's likely that the missing sites have zero valid coverage. Could you use
If you have a strong prior that bases are canonical, you can force the behavior that bases are considered canonical unless there is sufficient evidence that they are modified. There is more information in this thread.
Hi @ArtRand, thanks for the quick and informative reply. I called this data with Things are looking much better now due to adding the See the bottom tracks in red - these were taken with default settings and are very sparse. The 3 upper CG, CHG and CHH tracks are closer to what I'd expect and what I saw from Megalodon and modbamtobed. Note the extreme sparsity of the red calls (mostly nothing at all). I can't imagine effective coverage will be zero, since these are 50X datasets. Do you avoid calling in repeat regions, since plants are full of repeats? I'll keep experimenting and discuss these results with colleagues as I get more datasets coming in. Quantitatively, the numbers of CHG and CHH are looking a lot better relative to the CPGs. I'll have to check how many CPGs there are in this genome though.
I don't think this is the threshold you mean - from the command modkit summary. I ran each command above with
BTW, the performance of the tool is really nice! Thanks.
Hello @colindaven, Sorry for the delay. Just to recap, if a genomic position has 0 valid coverage, it won't be written in the bedMethyl/bedGraph table. The
The threshold value of 0.7 is just fine from my experience. One thing you could look at is the raw read call probabilities in these regions of sparse calls:
$ modkit extract ${bam} null --read-calls --region ${region} --filter-threshold 0.7050781
where
Glad you like the tool, the next major release should have more performance improvements as well.
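To illustrate how a well-covered position can still end up with zero valid coverage, here is a minimal Python sketch of the filtering idea discussed above. It is a simplified model for a single modification (modified vs. canonical), not modkit's implementation: the names `SiteCounts`, `tally_site` and `reported` are hypothetical, and the rule that a call fails when its confidence `max(p_mod, 1 - p_mod)` is below the threshold is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Per-position tallies in a simplified two-class model (assumption)."""
    n_mod: int = 0
    n_canonical: int = 0
    n_fail: int = 0

    @property
    def valid_cov(self) -> int:
        # Only calls that pass the threshold count toward valid coverage.
        return self.n_mod + self.n_canonical


def tally_site(p_mod_per_read, threshold=0.7):
    """Classify each read's call at one position as mod, canonical, or fail."""
    counts = SiteCounts()
    for p_mod in p_mod_per_read:
        confidence = max(p_mod, 1.0 - p_mod)
        if confidence < threshold:
            counts.n_fail += 1          # too uncertain: filtered out
        elif p_mod > 0.5:
            counts.n_mod += 1
        else:
            counts.n_canonical += 1
    return counts


def reported(counts):
    """A position with zero valid coverage would not appear in the output."""
    return counts.valid_cov > 0


# Four reads each: one position with mostly confident calls, one ambiguous.
confident = tally_site([0.95, 0.10, 0.85, 0.60])
ambiguous = tally_site([0.55, 0.62, 0.45, 0.66])
print(confident, reported(confident))
print(ambiguous, reported(ambiguous))
```

In this toy model the second position has four reads of coverage but no passing calls, so it would be absent from the table. Checking the raw call probabilities in a sparse region shows whether something similar is happening there.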
@colindaven Any luck here?
This one slipped through, apologies. I'm not setting a Command - note this is quite different to yours - I omitted Should <READ_CALLS_PATH> be the path to the bam/cram again (or the CPG calls tsv output file?).
Hi,
I've been puzzled by the lack of calls/site for CHG and CHH methylation for a few days now. I called similar data with Megalodon and modbamtobed a year or two ago and got very different results (similar numbers of sites across all three, and very densely across the genome - with modkit most of the genome is not covered by sites so far). CHH sites are quite broadly defined, so there should be quite a few present in the genome.
Workarounds tried so far
Yet I still seem to get way too few CHG and CHH sites, especially after filtering
Code examples (in nextflow) - just the modkit lines
Awk filtering by aligned read coverage to get bedGraph
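For readers who want to reproduce this kind of coverage filter, a rough Python equivalent might look like the sketch below. It is not the awk command from the pipeline. It assumes the bedMethyl layout described in the modkit documentation, with valid coverage in column 10 and percent modified in column 11; the function name `bedmethyl_to_bedgraph` and the `min_cov` cutoff of 5 are hypothetical. Splitting on any whitespace is deliberate, so the parser does not depend on whether the count columns are separated by tabs or spaces.

```python
def bedmethyl_to_bedgraph(lines, min_cov=5):
    """Yield bedGraph lines (chrom, start, end, percent modified) for
    bedMethyl rows whose valid coverage is at least min_cov.

    Assumed columns (1-based): 10 = valid coverage, 11 = percent modified.
    """
    for line in lines:
        if line.startswith("#"):
            continue                      # skip header or comment lines
        fields = line.split()             # tolerate tabs or spaces
        if len(fields) < 11:
            continue                      # skip malformed or empty rows
        chrom, start, end = fields[0], fields[1], fields[2]
        valid_cov = int(fields[9])
        pct_mod = float(fields[10])
        if valid_cov >= min_cov:
            yield f"{chrom}\t{start}\t{end}\t{pct_mod:.2f}"


# Made-up rows for illustration only (not real data).
rows = [
    "chr1\t100\t101\tm\t12\t+\t100\t101\t255,0,0\t12 75.00 9 3 0 0 1 0 0",
    "chr1\t205\t206\tm\t2\t-\t205\t206\t255,0,0\t2 50.00 1 1 0 0 6 0 0",
]
out = list(bedmethyl_to_bedgraph(rows))
print("\n".join(out))
```

Note that a filter like this can only drop rows, so positions that are already missing from the bedMethyl cannot be recovered at this step.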
Thanks
Colin