Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundant motifs from modkit motif search #272

Open
galenptacek opened this issue Sep 30, 2024 · 2 comments
Open

Redundant motifs from modkit motif search #272

galenptacek opened this issue Sep 30, 2024 · 2 comments

Comments

@galenptacek
Copy link

galenptacek commented Sep 30, 2024

Hello!

I'm using modkit for a bacterial (epi)genomics project. I have a large, GC-rich bacterial genome with several methyltransferases which may or may not be expressed under culture conditions. I'd like to find out the recognition sequences of these enzymes and how many of them are active (how many unique methylated motifs are there?), and motif search seems like the tool for that job.

My only issue is that when I search for motifs exhaustively (using the defaults for min-sites, log-odds and fraction in the high modified), I get a few confident (log_odds > 10), parsimonious motifs followed by a number of redundant entries. These either contain another motif as a substring (e.g. AGCT is highly methylated, GNNNNSAGCTSNNG gets trawled up as its own motif), or a permutation of a motif with a canonical nucleotide replaced by an IUPAC ambiguity (e.g. GCCGGC turns into GCCGGNG), or a bunch of motifs split off from a more parsimonious motif with degenerate bases (SAGCTS becomes GAGCTC, CAGCTC, ...)

For example, here's the first 12 results from my attempt:
modkit pileup -t 16 --filter-threshold 0.66 bam/alignment_sorted.bam pileup/mods.bed
modkit motif search -t 16 -i pileup/mods.bed -r alignment/ref.fasta -o motifs/motifs.tsv # all defaults
image

Modkit confidently discovers the GCCGGC motif, but it also a bunch of redundant entries containing the GCCGGC motif:
m GCCGGC 2
m GNNNGNNGGGSCGGNG 11 -> GNNNGNNGGGCCGGCG | two substitutions, S->C, N->C
m GGGGCCGGNG 5 -> GGGGCCGGCG | one substitution, N->C
m GNNGGGGCCGG 8 -> GNNGGGGCCGGC | insert one C at the end
m GNNNGGGGCCGG 9 -> GNNNGGGGCCGGC | insert one C at the end
seem to just be permutations of each other in different genome contexts.

I guess my question is how to approach reporting unique motifs. Is there a principled way to filter out redundant entries or align and cluster them? Can I adjust the parameters of motif search or motif refine to be more permissive about merging sequences?

I am in no way a Bioinformatics Expert so if I'm missing something obvious or making a mistake processing my data or setting parameters, feel free to let me know. Also happy to send over any data or set up a reproducible example.

Thanks,
Galen

@marcus1487
Copy link
Contributor

I can certainly see how these appear to be over-specified motifs. We can certainly look into how to make the command report the best core motif in these cases. It would be good to get some more information about your data to best explore options for more robust results. Specifically I would recommend using the modkit motif evaluate command. This will produce a report much like the search command, but with user defined motifs. In your case it seems that the CCGG and AGCT appear to be the best motifs to evaluate. This command would look like this: modkit motif evaluate -i pileup/mods.bed -r alignment/ref.fasta --known-motif CCGG m 1 --known-motif AGCT m 2. My hypothesis is that one or more of the metrics may not pass the default thresholds. If you could report this result that would be very helpful.

In terms of adjusting the search command, we can certainly look into this, but it would seem that this might involve dynamically allowing some motifs to relax the thresholds when such over-specified groups of motifs are found. We will consider this idea for future development of this command.

@galenptacek
Copy link
Author

Thanks for the quick response! I can absolutely report the result:

image

In this case I think both of them fail the default threshold for frac_mod (0.85?) and AGCT also fails the default threshold for minimum log-odds (2.5). If I'm interpreting this correctly, CmGG is enriched in the set of high-modified positions, but the fraction of CmGG motifs that are highly modified is too low for consideration as a motif?

Over-specified is probably a better word than redundant; I get the sense that there are probably different definitions of 'motif' for different situations, and I can imagine a situation where additional specificity surrounding a core motif is biologically meaningful, especially if the sequence is variably methylated and there's competition for binding sites, etc.

I'm curious if there's a quick and dirty way to find the 'core' motifs from the output, other than by eye. I guess I could read through the table from high to low log-odds, cache the first entry, move to the next, check if it contains one of the cached motifs (matching IUPAC codes), if not cache it, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants