provide a new cleaning approach based on whole-genome gather? #33

ctb · 2020-05-14T14:33:23Z

in LoombaR_2017__SID1050_bax__bin.11.fa.gz, we see:

breakdown of clean contigs w/gather:
  75.29% - to GCF_002159845 s__Anaeromassilibacillus sp002159845
  1.56% - to GCF_900104675 s__Angelakisella massiliensis
  1.42% - to GCF_002160955 s__Gemmiger_A sp002160955
  1.28% - to GCA_900066645 s__Lawsonibacter sp900066645
  0.97% - to GCF_900199515 s__OEMR01 sp900199515
  0.33% - to GCF_001261775 s__Anaeromassilibacillus senegalensis
  0.33% - to GCF_002159455 s__Flavonifractor sp002159455
  0.33% - to GCF_000239295 s__Flavonifractor plautii
  0.33% - to GCF_900199495 s__Flavonifractor sp900199495
  0.17% - to GCF_000169255 s__Pseudoflavonifractor capillosus
  0.17% - to GCA_000435295 s__Ruminiclostridium_C sp000435295
  0.17% - to GCF_001305115 s__Anaeromassilibacillus sp001305115
  0.17% - to GCF_000406165 s__Dickeya zeae
  0.17% - to GCA_002313795 s__Faecalibacterium prausnitzii_L
  0.17% - to GCF_900240245 s__Lachnoclostridium_A edouardi
  0.17% - to GCF_900113995 s__D5 sp900113995
  0.17% - to GCF_000723885 s__Paenibacillus camerounensis

and many of these look ...suspect. But they are not being removed, probably because the GATHER_MIN_MATCHES=3 is too stringent at the per-contig level; the report above is at the whole-genome level, and so takes full advantage of the combinatorics of gather.

If we truly believe that gather is nicely specific, we could identify any of these matches that are over the GATHER_MIN_MATCHES threshold as likely contaminants and remove any contig that contains any hashes from these matches.

then again, that might effectively be what we're already doing, since we're already only looking at matches that are from the whole-genome gather. ergh. leaving this here for further thought.

The text was updated successfully, but these errors were encountered:

taylorreiter · 2020-05-14T15:48:07Z

Can we check if a contig only has one hash? I worry that short contigs, which I think are more likely to be binned wrong, will only have one hash on them (or at least only one identifiable hash), and having a threshold of at least 3 matches will allow those contigs to slip through. Whole genome gather might help in this situation in particular

ctb · 2020-06-01T13:48:52Z

maybe a middle ground approach could work? we do a whole-genome gather and identify matches that look well-supported that way, and then take those matches more seriously at a contig level, e.g. lower the threshold for removing contigs with such hashes in there.

ctb · 2020-06-01T13:57:42Z

Just to get a start on this, I added a flag on the breakdown of "clean" output that identified whole-genome matches whose taxonomy is outside that of genome lineage at match_rank, and are presumed contaminants -- see below for the results, '!!' indicates contamination.

breakdown of clean contigs w/gather:
     74.91% - to GCF_002159845 s__Anaeromassilibacillus sp002159845
     2.56% - to GCF_900104675 s__Angelakisella massiliensis
     1.54% - to GCF_002160955 s__Gemmiger_A sp002160955
     1.25% - to GCF_002159455 s__Flavonifractor sp002159455
     1.27% - to GCA_900066645 s__Lawsonibacter sp900066645
  !! 0.48% - to GCF_900199515 s__OEMR01 sp900199515
     0.32% - to GCF_001261775 s__Anaeromassilibacillus senegalensis
     0.32% - to GCF_002160025 s__An200 sp002160025
     0.33% - to GCF_000239295 s__Flavonifractor plautii
     0.33% - to GCF_900199495 s__Flavonifractor sp900199495
     0.16% - to GCF_000169255 s__Pseudoflavonifractor capillosus
     0.16% - to GCA_000435295 s__Ruminiclostridium_C sp000435295
     0.16% - to GCF_001305115 s__Anaeromassilibacillus sp001305115
  !! 0.16% - to GCF_000406165 s__Dickeya zeae
     0.16% - to GCA_002313795 s__Faecalibacterium prausnitzii_L
  !! 0.17% - to GCF_900240245 s__Lachnoclostridium_A edouardi
     0.17% - to GCF_900169495 s__Provencibacterium massiliense
     0.17% - to GCF_900113995 s__D5 sp900113995
  !! 0.17% - to GCF_000723885 s__Paenibacillus camerounensis

I haven't dug into why those matches aren't being removed (what thresholds apply etc. etc.), will do so next.

But, in any case, we could maybe do this at the very beginning instead of the end, and then focus more aggressively on removing contigs with those hashes.

ctb · 2020-06-09T12:36:32Z

Digging into Loomba

(The Loomba* genome is in the demo.)

With the latest reporting code af435fc and a match_rank of genus (which is intentionally overly strict :), we get the following out-of-genus matches when doing gather on the clean contigs (marked with !!).

breakdown of clean contigs w/gather:
     74.48% - to GCF_002159845 s__Anaeromassilibacillus sp002159845
  !! 2.35% - to GCF_900104675 s__Angelakisella massiliensis
  !! 1.36% - to GCF_002160955 s__Gemmiger_A sp002160955
  !! 1.22% - to GCA_900066645 s__Lawsonibacter sp900066645
  !! 0.93% - to GCF_002159455 s__Flavonifractor sp002159455
  !! 0.47% - to GCF_900199515 s__OEMR01 sp900199515
  !! 0.47% - to GCF_900142265 s__Anaerotignum lactatifermentans
     0.31% - to GCF_001261775 s__Anaeromassilibacillus senegalensis
  !! 0.32% - to GCF_900104665 s__Traorella massiliensis
  !! 0.32% - to GCF_002160825 s__GCA-900066575 sp002160825
  !! 0.32% - to GCF_000239295 s__Flavonifractor plautii
  !! 0.32% - to GCF_900199495 s__Flavonifractor sp900199495
  !! 0.32% - to GCA_001940855 s__Phil1 sp001940855
  !! 0.16% - to GCF_000169255 s__Pseudoflavonifractor capillosus
  !! 0.16% - to GCA_000435295 s__Ruminiclostridium_C sp000435295
  !! 0.16% - to GCA_003287405 s__Faecalibacterium prausnitzii_J
     0.16% - to GCF_001305115 s__Anaeromassilibacillus sp001305115
  !! 0.16% - to GCF_000406165 s__Dickeya zeae
  !! 0.16% - to GCA_002313795 s__Faecalibacterium prausnitzii_L
  !! 0.16% - to GCF_900240245 s__Lachnoclostridium_A edouardi
  !! 0.16% - to GCF_002161595 s__Fournierella sp002161595
  !! 0.16% - to GCF_900169495 s__Provencibacterium massiliense
  !! 0.16% - to GCF_900113995 s__D5 sp900113995
  !! 0.16% - to GCF_000723885 s__Paenibacillus camerounensis
  !! 0.16% - to GCA_002472405 s__F23-B02 sp002472405

In #110, a work-in-progress PR, I added some code to report on which contigs passed Reason 1 as "clean", yet still had some of the hashes above. (I also removed LCA cleaning, reasons 2 and 3.)

Here are some of the results:

ZZZ intersection with bad hashes: 13 of 201
f_ident 0.81 / f_match 0.89
ZZZ intersection with bad hashes: 2 of 155
f_ident 0.75 / f_match 0.97
ZZZ intersection with bad hashes: 2 of 133
f_ident 0.73 / f_match 0.98
ZZZ intersection with bad hashes: 5 of 114
f_ident 0.75 / f_match 0.93
ZZZ intersection with bad hashes: 8 of 86
f_ident 0.74 / f_match 0.84
ZZZ intersection with bad hashes: 1 of 84
f_ident 0.75 / f_match 0.98
ZZZ intersection with bad hashes: 1 of 59
f_ident 0.78 / f_match 0.85
ZZZ intersection with bad hashes: 11 of 63
f_ident 0.86 / f_match 0.78
ZZZ intersection with bad hashes: 4 of 27
f_ident 0.63 / f_match 0.76
ZZZ intersection with bad hashes: 21 of 39
f_ident 0.82 / f_match 0.44
ZZZ intersection with bad hashes: 1 of 31
f_ident 0.81 / f_match 0.96
ZZZ intersection with bad hashes: 1 of 27
f_ident 0.67 / f_match 0.89
ZZZ intersection with bad hashes: 4 of 18
f_ident 0.50 / f_match 0.56
ZZZ intersection with bad hashes: 9 of 13
f_ident 0.69 / f_match 0.22
ZZZ intersection with bad hashes: 1 of 17
f_ident 0.82 / f_match 0.93
ZZZ intersection with bad hashes: 1 of 13
f_ident 0.31 / f_match 0.75
ZZZ intersection with bad hashes: 3 of 9
f_ident 0.33 / f_match 0.67
ZZZ intersection with bad hashes: 1 of 1
f_ident 1.00 / f_match 1.00
ZZZ intersection with bad hashes: 1 of 1
f_ident 1.00 / f_match 1.00

This shows that there's a mixture of reasons why contigs are passing our filters as "not contaminants".

First, some of them look like the putative contamination is in the (significant) minority; consider

ZZZ intersection with bad hashes: 8 of 86
f_ident 0.74 / f_match 0.84

where the contig has 74% of 86 hashes identified, and 84% of those 74% match to one lineage, while a paltry 8 hashes (10%) match outside the genus. This looks pretty legit to me, although we can dig to see if it's a chimeric contig or something.

Second, some of the contigs are just too small for our current filters. See

ZZZ intersection with bad hashes: 1 of 1
f_ident 1.00 / f_match 1.00

- only one hash, so this doesn't pass our GATHER_THRESHOLD of 3 for removal. This is a good candidate for removal since we have whole-genome evidence that this is probably a whack contig.

Third, there are puzzling in-between situations that need more investigation. Consider:

ZZZ intersection with bad hashes: 21 of 39
f_ident 0.82 / f_match 0.44

or

ZZZ intersection with bad hashes: 9 of 13
f_ident 0.69 / f_match 0.22

where I don't even understand what's going on, because the numbers don't make sense, but naively I would expect the majority-match to a non-genus lineage would have resulted in removal...

Isn't debugging fun? :)

ctb · 2020-06-09T13:04:08Z

Loomba more

Looking into:

ZZZ intersection with bad hashes: 21 of 39
f_ident 0.82 / f_match 0.44

I printed out the gather results for that contig, and we see:

ZZZ intersection with bad hashes: 21 of 39
     64.10% - to GCF_002159845 s__Anaeromassilibacillus sp002159845
  !! 35.71% - to GCF_900104675 s__Angelakisella massiliensis
  !! 11.11% - to GCF_002159455 s__Flavonifractor sp002159455
  !! 12.50% - to GCF_900169495 s__Provencibacterium massiliense

f_ident 0.82 / f_match 0.44

and we see that the majority match is to the correct lineage.

So this is either a chimeric contig or ...something weirder. 🤷 Charcoal is correct not to remove it, based on the current per-contig understanding.

For another weird one, we see:

ZZZ intersection with bad hashes: 9 of 13
  --------- (likely false positives below this line) ---------
  !! 15.38% - to GCF_900104665 s__Traorella massiliensis
  !! 18.18% - to GCF_900142265 s__Anaerotignum lactatifermentans
  !! 22.22% - to GCA_001940855 s__Phil1 sp001940855
  !! 14.29% - to GCF_900104675 s__Angelakisella massiliensis
  !! 16.67% - to GCA_002472405 s__F23-B02 sp002472405
  !! 20.00% - to GCA_003287405 s__Faecalibacterium prausnitzii_J
** note: matches under 23.077% may be false positives

f_ident 0.69 / f_match 0.22

the problem here seems to be that the gather results are individually below our match threshold, but collectively are problematic.

...come to think of it, that describes the first situation in this comment - the 21 of 39 one - too. Hmm.

Incidentally, the %s seem to be incorrect - they shouldn't add to over 100%. I think I need to fix that reporting...

ctb · 2020-06-09T13:05:03Z

code for the above in 6bec2d1 in #110

taylorreiter · 2020-06-09T14:40:57Z

For the case where a contig is removed because it has one identifiable hash that matches another species, should we try and increase the k-size to increase our confidence in the match? e.g. if it matches at a k = 41 or k = 51, I would be more confident perhaps, although k = 31 is already pretty specific.

ctb · 2020-06-09T14:46:59Z

For the case where a contig is removed because it has one identifiable hash that matches another species, ...

I think that in this issue's case we are lending support to the "one identifiable hash being a problem" by looking at the whole genome gather first - we already know this species is a problem. It's not like the LCA situation where we are condemning a contig on too little evidence.

Multiple ksizes is an interesting idea too! It's actually fairly straightforward now with zipped SBTs and some of the sourmash accession retrieval stuff (sourmash-bio/sourmash#993) to do this at a technical level (tl;dr retrieve matching signatures at different ksizes by name, for fun and profit).

taylorreiter · 2020-06-09T14:48:31Z

ah i forgot about the whole genome gather first! yes, I agree, one is probably enough in this case.

This was referenced Jun 1, 2020

switch to using gather exclusively, away from LCA methods. #110

Merged

contig-level statistics, more thoughts #113

Open

ctb mentioned this issue Jun 9, 2020

look at matches across multiple ksizes #117

Open

ctb mentioned this issue Dec 21, 2022

adjust default thresholding for prefetch and gather? sourmash-bio/sourmash#2360

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

provide a new cleaning approach based on whole-genome gather? #33

provide a new cleaning approach based on whole-genome gather? #33

ctb commented May 14, 2020

taylorreiter commented May 14, 2020

ctb commented Jun 1, 2020

ctb commented Jun 1, 2020 •

edited

Loading

ctb commented Jun 9, 2020 •

edited

Loading

ctb commented Jun 9, 2020

ctb commented Jun 9, 2020

taylorreiter commented Jun 9, 2020

ctb commented Jun 9, 2020

taylorreiter commented Jun 9, 2020

provide a new cleaning approach based on whole-genome gather? #33

provide a new cleaning approach based on whole-genome gather? #33

Comments

ctb commented May 14, 2020

taylorreiter commented May 14, 2020

ctb commented Jun 1, 2020

ctb commented Jun 1, 2020 • edited Loading

ctb commented Jun 9, 2020 • edited Loading

Digging into Loomba

ctb commented Jun 9, 2020

Loomba more

ctb commented Jun 9, 2020

taylorreiter commented Jun 9, 2020

ctb commented Jun 9, 2020

taylorreiter commented Jun 9, 2020

ctb commented Jun 1, 2020 •

edited

Loading

ctb commented Jun 9, 2020 •

edited

Loading