provide a new cleaning approach based on whole-genome gather? #33
Can we check if a contig only has one hash? I worry that short contigs, which I think are more likely to be binned wrong, will only have one hash on them (or at least only one identifiable hash), and a threshold of at least 3 matches will let those contigs slip through. Whole-genome gather might help in this situation in particular.
maybe a middle ground approach could work? we do a whole-genome gather and identify matches that look well-supported that way, and then take those matches more seriously at a contig level, e.g. lower the threshold for removing contigs with such hashes in there.
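A minimal sketch of that middle ground, assuming we already have per-contig counts of hashes assigned to each out-of-lineage match. The function name, argument names, and the default thresholds are hypothetical and are not charcoal's actual code; the only idea taken from the thread is "use a lower removal threshold for matches the whole-genome gather already supports".

```python
def should_remove_contig(foreign_counts, well_supported,
                         default_threshold=3, supported_threshold=1):
    """Decide whether a contig looks like contamination.

    foreign_counts: dict mapping out-of-lineage match name -> number of
        hashes on this contig assigned to that match.
    well_supported: set of out-of-lineage match names that the
        whole-genome gather already flagged as likely contaminants.
    Both thresholds are illustrative assumptions.
    """
    for name, n_hashes in foreign_counts.items():
        # Matches with whole-genome support get the lower threshold.
        if name in well_supported:
            threshold = supported_threshold
        else:
            threshold = default_threshold
        if n_hashes >= threshold:
            return True
    return False


suspects = {"match_X"}  # made-up name for a whole-genome contaminant match
print(should_remove_contig({"match_X": 1}, suspects))  # single supported hash
print(should_remove_contig({"match_Y": 1}, suspects))  # single unsupported hash
```

Note that this sketch only looks at the foreign hashes; it does not weigh them against hashes that match the expected lineage on the same contig.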
Just to get a start on this, I added a flag to the breakdown of "clean" output that identifies whole-genome matches whose taxonomy is outside that of the genome lineage at match_rank, and that are therefore presumed contaminants -- see below for the results, '!!' indicates contamination.
I haven't dug into why those matches aren't being removed (what thresholds apply etc. etc.), will do so next. But, in any case, we could maybe do this at the very beginning instead of the end, and then focus more aggressively on removing contigs with those hashes.
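For illustration, the flagging step could look roughly like the sketch below. It assumes lineages are plain tuples of names ordered from superkingdom down to species; the rank list, function names, and the example lineages are all made up for this sketch and are not the reporting code itself.

```python
# Rank order is an assumption for this sketch.
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]


def lineage_prefix(lineage, rank):
    """Return the lineage names from the root down to `rank`, inclusive."""
    depth = RANKS.index(rank) + 1
    return tuple(lineage[:depth])


def flag_contaminants(genome_lineage, matches, match_rank="genus"):
    """Return (name, flagged) pairs for whole-genome gather matches.

    A match is flagged when its lineage disagrees with the genome lineage
    anywhere from the root down to match_rank. A lineage that stops above
    match_rank is also flagged, because its prefix is shorter.
    """
    expected = lineage_prefix(genome_lineage, match_rank)
    return [(name, lineage_prefix(lineage, match_rank) != expected)
            for name, lineage in matches]


# Placeholder lineages, not real taxa.
genome = ("d", "p", "c", "o", "f", "gA", "sA1")
matches = [
    ("m1", ("d", "p", "c", "o", "f", "gA", "sA2")),  # same genus
    ("m2", ("d", "p", "c", "o", "f", "gB", "sB1")),  # different genus
]
for name, flagged in flag_contaminants(genome, matches):
    print("!!" if flagged else "  ", name)
```

Running this at the start, rather than at the end, would give the set of suspect matches that later per-contig steps can treat more aggressively.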
Digging into Loomba
(The Loomba* genome is in the demo.) With the latest reporting code af435fc and a match_rank of genus (which is intentionally overly strict :), we get the following out-of-genus matches when doing gather on the clean contigs (marked with !!).
In #110, a work-in-progress PR, I added some code to report on which contigs passed Reason 1 as "clean", yet still had some of the hashes above. (I also removed LCA cleaning, reasons 2 and 3.) Here are some of the results:
This shows that there's a mixture of reasons why contigs are passing our filters as "not contaminants". First, some of them look like the putative contamination is in the (significant) minority; consider
where the contig has 74% of 86 hashes identified, and 84% of those 74% match to one lineage, while a paltry 8 hashes (10%) match outside the genus. This looks pretty legit to me, although we can dig to see if it's a chimeric contig or something. Second, some of the contigs are just too small for our current filters. See
- only one hash, so this doesn't pass our GATHER_THRESHOLD of 3 for removal. This is a good candidate for removal since we have whole-genome evidence that this is probably a whack contig. Third, there are puzzling in-between situations that need more investigation. Consider:
or
where I don't even understand what's going on, because the numbers don't make sense, but naively I would expect the majority-match to a non-genus lineage would have resulted in removal... Isn't debugging fun? :)
Loomba more
Looking into:
I printed out the gather results for that contig, and we see:
and we see that the majority match is to the correct lineage. So this is either a chimeric contig or ...something weirder. 🤷 Charcoal is correct not to remove it, based on the current per-contig understanding. For another weird one, we see:
The problem here seems to be that the gather results are individually below our match threshold, but collectively are problematic. ...come to think of it, that describes the first situation in this comment - the 21 of 39 one - too. Hmm. Incidentally, the percentages seem to be incorrect - they shouldn't add up to over 100%. I think I need to fix that reporting...
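To make the "individually below threshold, collectively problematic" case concrete, here is a small sketch. It assumes each hash on a contig is assigned to at most one gather match, so that percentages computed against the contig's total hash count cannot exceed 100%. The function name, field names, and the numbers in the example are invented for illustration and are not taken from the reports above.

```python
def summarize_contig(n_hashes, assignments, own_matches, threshold=3):
    """Summarize gather assignments for one contig.

    n_hashes: total number of hashes on the contig.
    assignments: dict match name -> number of hashes assigned to it;
        assumed disjoint, i.e. each hash is counted for one match only.
    own_matches: set of match names within the expected lineage.
    threshold: minimum hash count for flagging (illustrative value).
    """
    foreign = {m: n for m, n in assignments.items() if m not in own_matches}
    foreign_total = sum(foreign.values())
    return {
        # Percentages are relative to all hashes on the contig.
        "percent_by_match": {m: 100.0 * n / n_hashes
                             for m, n in assignments.items()},
        "individually_flagged": sorted(m for m, n in foreign.items()
                                       if n >= threshold),
        "foreign_total": foreign_total,
        "collectively_flagged": foreign_total >= threshold,
    }


# Made-up contig: 40 hashes, 10 in-lineage, three foreign matches of 2 each.
summary = summarize_contig(40, {"own": 10, "f1": 2, "f2": 2, "f3": 2}, {"own"})
print(summary["individually_flagged"], summary["collectively_flagged"])
```

In this example no single foreign match reaches the threshold, but the pooled foreign count does, which is the situation described above.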
For the case where a contig is removed because it has one identifiable hash that matches another species, should we try and increase the k-size to increase our confidence in the match? e.g. if it matches at a k = 41 or k = 51, I would be more confident perhaps, although k = 31 is already pretty specific.
I think that in this issue's case we are lending support to the "one identifiable hash being a problem" by looking at the whole genome gather first - we already know this species is a problem. It's not like the LCA situation where we are condemning a contig on too little evidence. Multiple ksizes is an interesting idea too! It's actually fairly straightforward now with zipped SBTs and some of the sourmash accession retrieval stuff (sourmash-bio/sourmash#993) to do this at a technical level (tl;dr retrieve matching signatures at different ksizes by name, for fun and profit).
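As a toy illustration of the multiple-ksize idea, the sketch below checks at which k sizes a contig shares any exact k-mer with a reference sequence. This deliberately swaps in plain k-mer sets for sourmash signatures: it ignores reverse complements and does no hashing or downsampling, and the function names and sequences are invented for the example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ksizes_supporting(contig, reference, ksizes=(31, 41, 51)):
    """Return the k sizes at which contig and reference share a k-mer."""
    return [k for k in ksizes if kmers(contig, k) & kmers(reference, k)]


# Made-up sequences sharing a 45-base stretch with different flanks.
shared = ("AC" * 23)[:45]
contig = "T" * 30 + shared + "T" * 30
reference = "G" * 30 + shared + "G" * 30

print(ksizes_supporting(contig, reference))
```

A match that persists at larger k would count as stronger evidence; one that disappears above k = 31 rests on a shorter shared stretch.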
ah i forgot about the whole genome gather first! yes, I agree, one is probably enough in this case.
in LoombaR_2017__SID1050_bax__bin.11.fa.gz, we see:
and many of these look ...suspect. But they are not being removed, probably because the GATHER_MIN_MATCHES=3 is too stringent at the per-contig level; the report above is at the whole-genome level, and so takes full advantage of the combinatorics of gather.
If we truly believe that gather is nicely specific, we could identify any of these matches that are over the GATHER_MIN_MATCHES threshold as likely contaminants and remove any contig that contains any hashes from these matches.
then again, that might effectively be what we're already doing, since we're already only looking at matches that are from the whole-genome gather. ergh. leaving this here for further thought.