computing ANI for metagenomes vs ref genomes #18
OK - did this for GCA_001509055.1 in hu-s1 (SRR1976948) and it was illuminating (hah!) but not really surprising -
FastANI sez they're all basically the same, as we would expect --
sourmash compare sez the genbank and consensus are quite similar, but the megahit one is quite different
this highlights two things - ANI != Jaccard similarity, and |
That's interesting. What does the megahit assembly look like? From the limited information in the two tables I'd assume that there were regions of low similarity/absence in both assemblies (but especially megahit's) that fastANI was ignoring - i.e. coverage is effectively lower. It may be that, in low coverage comparisons, Jaccard approaches give a more "honest" estimate of genome similarity (taking into account overall composition), but may underestimate the relatedness of regions that are homologous. |
I agree! (I'm not sure how to answer the question "what does the megahit assembly look like" tho :) This is just more of the "ANI is about shared regions, Jaccard highlights differences outside of shared regions too" concept. I'm going to do this on a bunch more genomes and look into the results more systematically once I have a bigger n; I have all the raw results, just need to do the preprocessing! |
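To make the "shared regions vs. everything" distinction concrete, here is a minimal sketch using exact k-mer sets on toy sequences. It is not how sourmash or FastANI compute their numbers (sourmash works on sketches, FastANI on mapped fragments); the function names, the toy sequences, and `k=5` are all assumptions for illustration. Containment stands in for "how much of the smaller genome is found in the larger one", which is closer in spirit to an ANI-style view than Jaccard is.

```python
def kmers(seq, k=5):
    """Return the set of all k-mers in seq (exact, no sketching)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """|A & B| / |A | B|: penalizes anything present in only one set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """|A & B| / |A|: fraction of A that is also found in B."""
    return len(a & b) / len(a) if a else 0.0


# Toy data (made up): genome_b carries all of genome_a plus extra sequence.
shared = "ACGTTGCAAGCTTAGGCTA"
extra = "GGGGGCCCCCTTTTTAAAAA"
genome_a = kmers(shared)
genome_b = kmers(shared + extra)

print("containment of a in b:", containment(genome_a, genome_b))
print("jaccard(a, b):", round(jaccard(genome_a, genome_b), 3))
```

The shared region is identical, so containment is complete, while Jaccard drops because the extra sequence in `genome_b` enlarges the union.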
Here's an interesting case -- FastANI output:
sourmash compare output:
- both the mapping-based consensus genome and the sgc genome are more similar to each other (by a lot) than they are to the genbank genome! Note also that we searched ALL of genbank, so there is no better genome out there than that one - and we built two that match the SRR1976948 metagenome much more closely! |
I have dozens of these now. I wonder what the best way to summarize is 🤔 .
I mean, it's kinda clear that ANI and Jaccard similarities are poor proxies for each other... I guess a different thing we really want is a summary table for metagenome-derived genome ANIs vs genbank. So, for each metagenome, we'd produce a table:
and maybe that would be a valuable report for people considering whether or not to construct a new metagenome-derived genome? |
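A rough sketch of how such a per-metagenome report could be assembled from FastANI's tab-separated output (query, reference, ANI, mapped fragments, total query fragments). The column layout of the intended table isn't shown in this thread, so the columns below, the `AniRow` and `summarize` names, and the derived `aligned_fraction` are assumptions, not the actual report format.

```python
from typing import Iterable, List, NamedTuple


class AniRow(NamedTuple):
    query: str
    reference: str
    ani: float
    mapped_fragments: int
    total_fragments: int

    @property
    def aligned_fraction(self) -> float:
        # Share of query fragments that FastANI mapped; a crude coverage proxy.
        if not self.total_fragments:
            return 0.0
        return self.mapped_fragments / self.total_fragments


def parse_fastani(lines: Iterable[str]) -> List[AniRow]:
    """Parse FastANI output lines (5 tab-separated columns) into rows."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        query, ref, ani, mapped, total = line.split("\t")[:5]
        rows.append(AniRow(query, ref, float(ani), int(mapped), int(total)))
    return rows


def summarize(rows: Iterable[AniRow]) -> str:
    """Render a small tab-separated table, highest ANI first."""
    out = ["query\treference\tANI\taligned_fraction"]
    for r in sorted(rows, key=lambda r: r.ani, reverse=True):
        out.append(
            f"{r.query}\t{r.reference}\t{r.ani:.2f}\t{r.aligned_fraction:.2f}"
        )
    return "\n".join(out)
```

Usage would be something like `print(summarize(parse_fastani(open(path))))`, where `path` points at a FastANI output file. Reporting the aligned fraction next to ANI keeps a high ANI over a small shared region from looking like a whole-genome match.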
Hmm, one more thing is, how much of the genbank/consensus genome is covered by the reads here? (100% of the megahit genome is covered by the reads) |
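One way to put a number on "how much is covered by the reads", sketched under the assumption that the reads have been mapped to the genome and per-base depth exported as three whitespace-separated columns (contig, position, depth), e.g. from `samtools depth -a`. The function name and the `min_depth` cutoff are hypothetical; nothing here reflects how coverage was actually measured in this analysis.

```python
def covered_fraction(depth_lines, min_depth=1):
    """Fraction of reported positions with depth >= min_depth.

    Expects one position per line: contig, position, depth.
    Positions missing from the input are not counted, so the input
    should include zero-depth positions for the result to be meaningful.
    """
    total = 0
    covered = 0
    for line in depth_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        total += 1
        if int(fields[2]) >= min_depth:
            covered += 1
    return covered / total if total else 0.0
```

Called as `covered_fraction(open(depth_path))`, with `depth_path` a placeholder for the depth file. Raising `min_depth` gives a stricter notion of "covered" if single stray reads shouldn't count.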
ok, some specific questions to ask -
(and then we need to get into "what is ANI useful for?") |
huh. not what I was expecting! misc notes --
dunno if this is useful. things to ruminate on. |
May not be relevant and I may be misinterpreting but -- some of Hu SB1 was in Hu SB2. Best match in genbank could originate from SB2 and not SB1, which could cause weird results? Seems weird because gather should already pull the best results...just thinking out loud, and as I do so this comment is making less sense, yet I'm going to post it here anyway :) |
I was thinking about the same thing and came to the same realization (that it shouldn't matter). However, it is definitely important to realize that some of the reference genomes were built from hu-s1, and some weren't! That some were taken directly from hu-s1 is why we have a lot of ANI~100% in there, I think! |
Fair point… ;) I think I mean in terms of genome completeness, CDS accuracy/completeness, that sort of thing. I'm not very familiar with output from MEGAHIT. [edit] sorry - I'm being a bit slow… I just read properly and realised you're using a community read dataset as input; I had misunderstood. Apologies. |
I agree (now I've better internalised what's going on). I think we may be able to make a case for a threshold coverage (or something that expresses relative confidence) where we have previously observed high ANI %ID, but very low %coverage between distantly-related complete genomes. |
THAT's what a megahit assembly looks like 😆 I haven't measured any of that stuff. Easy enough to do, but MAGs are usually a mess so 🤷 I had some more thoughts about how I need to be looking at ANI with only the covered bits of the genome. Tricky-ish to do. Will think about how to do it best. |
there are three ways to do this.
I’m in the process of trying (2) and (3) now. All very straightforward with tools we have.
Not clear what to do about genomes that are not fully covered, will do some exploratory analysis! probably best to eliminate those regions from consideration (which is what an assembly based approach will do).
comments from @taylorreiter -
comments from @widdowquinn -
and a somewhat tangential comment from nanopore considerations -