Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modkit results look off for CHG and CHH contexts - plant methylation #234

Open
colindaven opened this issue Jul 19, 2024 · 5 comments
Open
Labels
troubleshooting workflow and data preparation questions

Comments

@colindaven
Copy link

Hi,

I've been puzzled by the lack of calls/site for CHG and CHH methylation for a few days now. I called similar data with Megalodon and modbamtobed a year or two ago and got very different results (similar numbers of sites across all three, and very densely across the genome - with modkit most of the genome is not covered by sites so far). CHH sites are quite broadly defined, so there should be quite a few present in the genome.

Workarounds tried so far

  • converted to bam instead of default cram after noting log entry saying cram may be unstable - no change
  • converted bedMethyl to bedgraph to bigwig and inspected their
  • modified all sorts of conversion min coverage thresholds from 1 to 10
  • used modkit bedgraph directly

Yet I still seem to get way too few CHG and CHH sites, especially after filtering

wc -l *.bedMethyl
  13432960 TO_mod_pos_CG.bedMethyl
   2077794 TO_mod_pos_CHG.bedMethyl
   7516912 TO_mod_pos_CHH.bedMethyl

wc -l *.bedGraph
 13243677 TO_mod_pos_CG.bedMethyl.bedGraph
    85938 TO_mod_pos_CHG.bedMethyl.bedGraph
   324101 TO_mod_pos_CHH.bedMethyl.bedGraph

Code examples (in nextflow) - just the modkit lines

process modkit_pileup_cpg {
    modkit pileup $cram ${prefix}_CG.bedMethyl -t $task.cpus --ref $params.fasta --motif CG 0 --combine-strands --ignore h
process modkit_pileup_chg {
    modkit pileup $cram ${prefix}_CHG.bedMethyl -t $task.cpus --ref $params.fasta --motif CHG 0 --ignore h
process modkit_pileup_chh {
    modkit pileup $cram ${prefix}_CHH.bedMethyl -t $task.cpus --ref $params.fasta --motif CHH 0 --ignore h

Awk filtering by aligned read coverage to get bedGraph

    awk '\$10 >= $params.minCoverageForMethylation' $bedMethyl | cut -f1-3,11 | sort -T '.' -k1,1 -k2,2n >  ${bedMethyl}.bedGraph
    awk '\$10 >= $params.minCoverageForMethylation' $bedMethyl | cut -f1-3,11 | sort -T '.' -k1,1 -k2,2n >  ${bedMethyl}.bedGraph
    awk '\$10 >= $params.minCoverageForMethylation' $bedMethyl | cut -f1-3,11 | sort -T '.' -k1,1 -k2,2n >  ${bedMethyl}.bedGraph
  • Q1 I am running CPG, CHG and CHH filtering separately - should I be running them together then separating the output file by context ?
  • Q2 - can you see any other problems or internal modkit reasons why I am getting so few bedMethyl and bedGraph results ?
  • Q3 - does modkit have other parameters to not produce results for unmethylated Cs in CHG or CHH contexts ?

Thanks
Colin

@ArtRand
Copy link
Contributor

ArtRand commented Jul 19, 2024

Hello @colindaven,

How are you producing the basecalls/base modification calls now? Are you using dorado/remora? Your commands look fine to me. One note is that modkit will not emit a bedmethyl record when valid coverage is zero (as documented in the troubleshooting), you may try adding the --no-filtering flag.

Q1 I am running CPG, CHG and CHH filtering separately - should I be running them together then separating the output file by context ?

You can run them together, however --combine-strands requires that the motif is reverse-complement palindromic, CHH and CHG are not. So you'll have to either drop --combine-strands or change the motifs to ones that can be combined.

Q2 - can you see any other problems or internal modkit reasons why I am getting so few bedMethyl and bedGraph results ?

As mentioned above, it's likely that the missing sites have zero valid coverage. Could you use --log-filepath and report back what the estimated thresholds are?

Q3 - does modkit have other parameters to not produce results for unmethylated Cs in CHG or CHH contexts ?

If you have a strong prior that bases are canonical, you can force the behavior that bases are considered canonical unless there is sufficient evidence that they are modified. There is more information in this thread.

@ArtRand ArtRand added the troubleshooting workflow and data preparation questions label Jul 19, 2024
@colindaven
Copy link
Author

colindaven commented Jul 23, 2024

Hi @ArtRand

thanks for the quick and informative reply.

I called this data with dorado - 0.7.2 I believe - and used a pipeline with minimap -y to map the MM and ML tags from the fastq to the final cram.

Things are looking much better now due to adding the --no-filtering flag. However, the default settings - at least for this application and my testing so far - look to be quite off what I'd expect, as I mentioned above.

See the bottom tracks in red - these were taken with default settings are are very sparse. The 3 upper CG, CHG and CHH tracks are closer to what I'd expect and what I saw from Megalodon and modbamtobed. Note the extreme sparsity of the red calls (mostly nothing at all).

modkit_methyl_no_filter_vs_defaults

I can't imagine effective coverage will be zero since these are 50X datasets. Do you avoid calling in repeat regions since plants are full of repeats.

I'll keep experimenting and discuss these results with colleagues as I get more datasets coming in.

Quantitatively the numbers of CHG and CHH are looking a lot better, relative to the CPGs. I'll have to check how many CPGs there are in this genome though.


wc -l *yl.bedGraph
  13459001 TO_mod_pos_CG.bedMethyl.bedGraph
   3469108 TO_mod_pos_CHG.bedMethyl.bedGraph
  12144973 TO_mod_pos_CHH.bedMethyl.bedGraph

wc -l *yl
  13459001 TO_mod_pos_CG.bedMethyl
   3469108 TO_mod_pos_CHG.bedMethyl
  12144973 TO_mod_pos_CHH.bedMethyl

I don't think this is the threshold you mean - from the command modkit summary. I ran each command above with
--log-filepath cpg_modkit.log but could not see meaningful coverage data/est thresholds (for me). Maybe that is due to the --no-filtering tag?

T053_mod_pos_summary.log:[src/commands.rs::837][2024-07-22 18:12:58][INFO] calculating threshold at 10% percentile
T053_mod_pos_summary.log:[src/thresholds.rs::94][2024-07-22 18:12:58][DEBUG] calculating per base thresholds
T053_mod_pos_summary.log:[src/thresholds.rs::97][2024-07-22 18:13:05][DEBUG] probs per base took 7s
T053_mod_pos_summary.log:[src/thresholds.rs::114][2024-07-22 18:13:06][DEBUG] filter thresholds took 0s
T053_mod_pos_summary.log:[src/thresholds.rs::85][2024-07-22 18:13:06][INFO] calculated thresholds: C: 0.7050781

BTW, the performance of the tool is really nice!

Thanks.
Colin

@ArtRand
Copy link
Contributor

ArtRand commented Jul 24, 2024

Hello @colindaven,

Sorry for the delay.

Just to recap, if a genomic position has 0 valid coverage, it won't be written in the bedMethyl/bedGraph table. The --no-filtering flag essentially disables the pass/fail behavior so valid coverage should be roughly equivalent to the stranded read coverage. It seems strange to me that the modification calls in these regions would all be low confidence. The log file snippet you send has a record:

T053_mod_pos_summary.log:[src/thresholds.rs::85][2024-07-22 18:13:06][INFO] calculated thresholds: C: 0.7050781

The threshold value of 0.7 is just fine from my experience. One thing you could look at is the raw read call probabilities in these regions of sparse calls:

$ modkit extract ${bam} null --read-calls --region ${region} --filter-threshold 0.7050781

where ${region} is chr:start-stop like usual. This will produce a read calls table where the fail column will tell you if the call would have failed with the provided threshold. You can then load up this table into a data frame and look at the distribution of call_prob. I'd be curious to know if it's systematically low in these regions and if so by how much.

Glad you like the tool, the next major release should have more performance improvements as well.

@ArtRand
Copy link
Contributor

ArtRand commented Aug 27, 2024

@colindaven Any luck here?

@colindaven
Copy link
Author

This one slipped through, apologies.

I'm not setting a fail column - do you mean the inferred column - and false? Or I probably need to adjust my command but don't know how quite yet.

Command - note this is quite different to yours -

I omitted null and --read-calls after they failed with this error
error: a value is required for '--read-calls-path <READ_CALLS_PATH>' but none was supplied

Should <READ_CALLS_PATH> by the path to the bam/cram again (or the CPG calls tsv output file?).

modkit extract --region TO_Heinz1706_hq-KG-ONTv1.chrSL4.0ch03:8400000-8500000 --filter-threshold 0.7050781 /mnt/scratch/TO_mod_pos.cram out3.tsv

_mod_strand  fw_soft_clipped_start  fw_soft_clipped_end  read_length  mod_qual     mod_code  base_qual  ref_kmer  query_kmer  canonical_base  modified_primary_base  inferred  flag
             22                     6                    248321       0.001953125  h         50         .         TTCGC       C               C                      false     16
             22                     6                    248321       0.009765625  m         50         .         TTCGC       C               C                      false     16
             22                     6                    248321       0.009765625  h         46         .         TTCGG       C               C                      false     16
             22                     6                    248321       0.017578125  m         46         .         TTCGG       C               C                      false     16
             22                     6                    248321       0.005859375  h         50         .         AACGA       C               C                      false     16
             22                     6                    248321       0.009765625  m         50         .         AACGA       C               C                      false     16
             22                     6                    248321       0.001953125  h         50         .         ATCGA       C               C                      false     16
             22                     6                    248321       0.001953125  m         50         .         ATCGA       C               C                      false     16
             22                     6                    248321       0.005859375  h         50         .         ATCGT       C               C                      false     16
             22                     6                    248321       0.009765625  m         50         .         ATCGT       C               C                      false     16
             22                     6                    248321       0.001953125  h         50         .         TCCGA       C               C                      false     16
             22                     6                    248321       0.005859375  m         50         .         TCCGA       C               C                      false     16
             22                     6                    248321       0.001953125  h         44         .         ATCGA       C               C                      false     16
             22                     6                    248321       0.009765625  m         44         .         ATCGA       C               C                      false     16
             22                     6                    248321       0.001953125  h         50         .         TGCGA       C               C                      false     16
             22                     6                    248321       0.005859375  m         50         .         TGCGA       C               C                      false     16
             22                     6                    248321       0.009765625  h         50         .         GCCGA       C               C                      false     16
             22                     6                    248321       0.013671875  m         50         .         GCCGA       C               C                      false     16
             22                     6                    248321       0.021484375  h         50         .         GGCGA       C               C                      false     16
             22                     6                    248321       0.029296875  m         50         .         GGCGA       C               C                      false     16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
troubleshooting workflow and data preparation questions
Projects
None yet
Development

No branches or pull requests

2 participants