[MRG] add max_containment to `MinHash` class. #1346

ctb · 2021-02-23T14:37:59Z

Add max_containment() and --max-containment per #1343.

Fixes #1247
Fixes #1343

Main features:

adds MinHash.max_containment(other, downsample=False)
adds SourmashSignature.max_containment(other, downsample=False)
adds --max-containment flag to sourmash search
adds do_max_containment flag to search for Index classes (LinearIndex, SBT, LCA_Database)
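
For readers new to the metric: the max containment of two sketches is the size of their intersection divided by the size of the smaller sketch, which equals the larger of the two directional containment values. Below is a minimal pure-Python sketch over plain sets of integer hashes. It is not the sourmash implementation: the helper names, the `scaled_a`/`scaled_b` parameters, and the cutoff rule in `downsample_to` are assumptions made for illustration only.

```python
# Illustrative only: plain sets of integer hashes stand in for scaled MinHash
# sketches. None of these helpers are part of the sourmash API.

MAX_HASH_SPACE = 2**64  # assumed size of the 64-bit hash space


def downsample_to(hashes, scaled):
    """Keep only hashes below the cutoff implied by `scaled` (assumed rule)."""
    cutoff = MAX_HASH_SPACE // scaled
    return {h for h in hashes if h < cutoff}


def containment(a, b):
    """Fraction of `a` that is also present in `b`."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def max_containment(a, b, scaled_a=1, scaled_b=1, downsample=False):
    """Intersection size divided by the size of the smaller set."""
    if scaled_a != scaled_b:
        if not downsample:
            raise ValueError("mismatched scaled values; pass downsample=True")
        # Bring both sets to the coarser resolution before comparing.
        coarsest = max(scaled_a, scaled_b)
        a = downsample_to(a, coarsest)
        b = downsample_to(b, coarsest)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller
```

Under these assumptions, `max_containment(a, b)` is symmetric in its arguments, unlike `containment(a, b)`.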

In addition, this PR indulges in misc cleanup:

refactors some of the Index code to use named arguments, now that legacy Python 2.7 support has been removed.
removes untested result caching in sbtmh.py.
fixes a namespace collision and the associated broken test in test_signature.py -- there were two test_str functions.
fixes a namespace collision and the associated broken test in test__minhash.py -- there were two test_mh_len functions.
tests that SBT.search(...) properly fails when threshold=None (this code did not have test coverage).
removes a duplicate test_compare_containment_abund_flatten test function in test_sourmash.py.

TODO:

add feature for compare --max-containment along with tests
check that max_containment does not allow non-scaled signatures
add test for if do_containment and do_max_containment conditions
add test for if threshold is None: in sbt.py
add command line tests for max containment
check SBT especially

Checklist

Is it mergeable?
`make test` Did it pass the tests?
`make coverage` Is the new code covered?
Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
Was a spellchecker run on the source code and documentation after changes were made?

codecov · 2021-02-23T14:44:26Z

Codecov Report

Merging #1346 (108748a) into latest (55741dc) will increase coverage by 0.27%.
The diff coverage is 97.82%.

@@            Coverage Diff             @@
##           latest    #1346      +/-   ##
==========================================
+ Coverage   88.88%   89.15%   +0.27%     
==========================================
  Files         123      123              
  Lines       18321    18593     +272     
  Branches     1410     1432      +22     
==========================================
+ Hits        16284    16577     +293     
+ Misses       1800     1780      -20     
+ Partials      237      236       -1

Flag	Coverage Δ
python	`94.41% <97.82%> (+0.24%)`	⬆️
rust	`67.37% <ø> (ø)`

Impacted Files	Coverage Δ
src/sourmash/index.py	`93.10% <70.00%> (-2.20%)`	⬇️
src/sourmash/sbt.py	`84.14% <84.61%> (-0.16%)`	⬇️
src/sourmash/lca/lca_db.py	`92.42% <85.71%> (+0.02%)`	⬆️
src/sourmash/sbtmh.py	`89.83% <89.47%> (+6.65%)`	⬆️
src/sourmash/cli/compare.py	`100.00% <100.00%> (ø)`
src/sourmash/cli/search.py	`100.00% <100.00%> (ø)`
src/sourmash/commands.py	`82.63% <100.00%> (+0.41%)`	⬆️
src/sourmash/compare.py	`89.18% <100.00%> (+1.31%)`	⬆️
src/sourmash/minhash.py	`93.51% <100.00%> (+0.15%)`	⬆️
src/sourmash/search.py	`91.22% <100.00%> (ø)`
... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

src/sourmash/index.py

        ignore_abundance = kwargs.get('ignore_abundance', False)

        # configure search - containment? ignore abundance?
        if do_containment:
            query_match = lambda x: query.contained_by(x, downsample=True)
+        elif do_max_containment:
+            query_match = lambda x: query.max_containment(x, downsample=True)


ctb · 2021-02-23T22:48:41Z

witness!

search

% sourmash search genome-s10.fa.gz.sig all.sbt.zip  

== This is sourmash version 4.0.0rc2.dev1+gce950caa. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: ../genome-s10.fa.gz... (k=31, DNA)
loaded 1 databases.

3 matches:
similarity   match
----------   -----
100.0%       ../genome-s10.fa.gz
 51.6%       ../genome-s10+s11.fa.gz
 16.7%       ../genome-s10-small.fa.gz

search --containment

% sourmash search genome-s10.fa.gz.sig all.sbt.zip  --containment

== This is sourmash version 4.0.0rc2.dev1+gce950caa. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: ../genome-s10.fa.gz... (k=31, DNA)
loaded 1 databases.

3 matches:
similarity   match
----------   -----
100.0%       ../genome-s10.fa.gz
100.0%       ../genome-s10+s11.fa.gz
 16.7%       ../genome-s10-small.fa.gz

search --max-containment

% sourmash search genome-s10.fa.gz.sig all.sbt.zip  --max-containment

== This is sourmash version 4.0.0rc2.dev1+gce950caa. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: ../genome-s10.fa.gz... (k=31, DNA)
loaded 1 databases.

3 matches:
similarity   match
----------   -----
100.0%       ../genome-s10-small.fa.gz
100.0%       ../genome-s10.fa.gz
100.0%       ../genome-s10+s11.fa.gz
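
The pattern across the three runs above (a small subset of the query scoring low under Jaccard and containment but 100% under max containment) can be reproduced with toy sets. This is a hedged sketch: the sets below are invented stand-ins rather than the actual genome-s10 signatures, and the three helper functions are hypothetical, not sourmash calls.

```python
# Toy stand-ins for the three signatures; sizes are chosen only for illustration.
query = set(range(0, 60))       # plays the role of the query genome
superset = set(range(0, 120))   # the query plus as much again
subset = set(range(0, 10))      # a small piece of the query


def jaccard(a, b):
    """Intersection over union."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of `a` found in `b`."""
    return len(a & b) / len(a)


def max_containment(a, b):
    """Intersection over the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))


for name, match in [("query", query), ("superset", superset), ("subset", subset)]:
    print(f"{name:9s} jaccard={jaccard(query, match):.3f} "
          f"containment={containment(query, match):.3f} "
          f"max_containment={max_containment(query, match):.3f}")
```

Only max containment scores the subset as a full match, because it normalizes by whichever set is smaller.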

ctb · 2021-03-05T15:08:12Z

@bluegenes this might be ready to try out. I wouldn't trust the SBT code just yet, but everything else should work I think.

bluegenes · 2021-03-05T15:27:55Z

Thanks @ctb!! Will take it for a spin

bluegenes · 2021-03-10T01:45:28Z

sourmash search --max-containment seems to be running on my testing! YAY.

I somewhat forgot that the search csv output is less informative than the gather csv output -- ultimately, it would be really nice to have add'l info, particularly signature name + matched bp, etc, as you suggest in #1366 for 5.0.

some mini results:

sourmash search --max-containment output.protein-pigeon/prodigal/signatures/pigeon1.0-GOV_18503.prodigal.sig output.pigeon1.0-cluster/pigeon1.0.protein-k10.mc0.05.founders.sbt.zip -o output.pigeon1.0-cluster/cluster_info/GOV_18503_x_pigeon1.0.protein-k10.mc0.05.best-founder.txt --threshold 0.01 --ksize 10 --protein

--max-containment:

4 matches; showing first 3:
similarity   match
----------   -----
 10.0%       EarthsVirome_13607
  8.7%       GOV_66421
  4.5%       EarthsVirome_15562

** note, none of these show up with the default threshold (0.08). I needed --threshold 0.05 for the first match to show up, 0.03 for the next and 0.02 for the 3rd match. I guess I thought search used a percentage threshold -- will go look at it, but wanted to record here in the meantime.

--containment:

4 matches; showing first 3:
similarity   match
----------   -----
  5.3%       EarthsVirome_13607
  3.1%       GOV_66421
  2.3%       EarthsVirome_15562

jaccard:

3 matches:
similarity   match
----------   -----
  3.8%       EarthsVirome_13607
  2.7%       GOV_66421
  1.7%       EarthsVirome_15562

bluegenes · 2021-03-10T01:51:38Z

would it be straightforward to add compare --max-containment here, or would that be better in a separate PR?

ctb · 2021-03-11T14:43:25Z

> would it be straightforward to add compare --max-containment here, or would that be better in a separate PR?

Totes. Great idea.

ctb · 2021-03-11T14:55:15Z

> sourmash search --max-containment seems to be running on my testing! YAY.
>
> I somewhat forgot that the search csv output is less informative than the gather csv output -- ultimately, it would be really nice to have add'l info, particularly signature name + matched bp, etc, as you suggest in #1366 for 5.0.

Yeah, search CSV output is terrible! As you will see in #1370 😉 I am planning to output the following in the prefetch command,

intersect bp
match bp
query bp
jaccard similarity
query containment
match containment
max_containment
and the only hesitation I have in adding all of these to search output is that they would be subject to different thresholds depending on the command line arguments, and my hot take is that I don't like that.

Specifically, imagine that you do a search with --max-containment vs --similarity - for the same query, the results will (of course) be different, and the order and thresholding in the output CSV will be different.

Of course, this is a problem for the current output, too, in that the actual meaning of the similarity number in the output file changes depending on the search parameters. So that's bad. Grr.
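
To make the column list above concrete, here is a rough sketch of how one row of those values could be derived from two scaled hash sets. It assumes base pairs are estimated as the number of hashes times the scaled value; the function name and the dictionary keys are hypothetical and are not taken from the prefetch implementation.

```python
def summarize_overlap(query_hashes, match_hashes, scaled):
    """Return one row of overlap statistics for two scaled hash sets."""
    if not query_hashes or not match_hashes:
        raise ValueError("both hash sets must be non-empty")
    shared = query_hashes & match_hashes
    f_query = len(shared) / len(query_hashes)   # query containment
    f_match = len(shared) / len(match_hashes)   # match containment
    return {
        # Assumed estimate: each retained hash represents ~`scaled` base pairs.
        "intersect_bp": len(shared) * scaled,
        "match_bp": len(match_hashes) * scaled,
        "query_bp": len(query_hashes) * scaled,
        "jaccard": len(shared) / len(query_hashes | match_hashes),
        "query_containment": f_query,
        "match_containment": f_match,
        "max_containment": max(f_query, f_match),
    }


row = summarize_overlap({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8}, scaled=1000)
for key, value in row.items():
    print(key, value)
```

Because every metric is reported in the same row, the values themselves do not depend on which one was used for thresholding.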

🤔

Hmm, I wonder if we could include an extra column that is something like "search key" to indicate what the search was? It'll be redundant (since it would have to be in every row for this columnar data output) but at least it'd be there.
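
One way the "search key" idea could look, sketched with the standard library `csv` module. The `search_key` column name, the `write_results` helper, and the example row (including the placeholder md5 value) are all hypothetical; only the first four column names mirror the existing search output.

```python
import csv
import io


def write_results(results, search_key, fp):
    """Write search rows, repeating `search_key` in every row."""
    fieldnames = ["similarity", "name", "filename", "md5", "search_key"]
    writer = csv.DictWriter(fp, fieldnames=fieldnames)
    writer.writeheader()
    for row in results:
        # The key is redundant per row, but keeps the file self-describing.
        writer.writerow({**row, "search_key": search_key})


buf = io.StringIO()
write_results(
    [{"similarity": 0.1, "name": "example match",
      "filename": "db.sbt.zip", "md5": "abc123"}],  # placeholder values
    search_key="max_containment",
    fp=buf,
)
print(buf.getvalue())
```

A reader of the file can then tell what the `similarity` number means without knowing the original command line.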

Might also be a good opportunity to support JSON or YAML output from search, gather, and prefetch - then those files could contain far more information, including full command-line parameters, threshold, etc. etc. See #448.

We could also turn off CSV output from search entirely, and tell people to use prefetch for programmatic foo...

> ** note, none of these show up with the default threshold (0.08). I needed --threshold 0.05 for the first match to show up, 0.03 for the next and 0.02 for the 3rd match. I guess I thought search used a percentage threshold -- will go look at it, but wanted to record here in the meantime.

Curious what you mean by "percentage threshold"? It just takes a floating point number that is the lowest similarity etc to report.

OH! You mean the thresholding is actually just wrong for --max-containment. I, umm, should fix that! I appreciate the implication that this was actually something intelligent, and not just a dumb bug...
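
Since the threshold is described above as a plain floating-point fraction (so the default 0.08 corresponds to 8.0% in the printed table), the intended filtering can be sketched as follows. This assumes scores are fractions in [0, 1]; `filter_matches` is a hypothetical helper and does not reproduce the actual bug or its fix.

```python
def filter_matches(scored, threshold=0.08):
    """Keep (score, name) pairs with score >= threshold, best first."""
    kept = [(score, name) for score, name in scored if score >= threshold]
    return sorted(kept, reverse=True)


# Scores taken from the --max-containment table reported earlier in the thread.
scored = [
    (0.100, "EarthsVirome_13607"),
    (0.087, "GOV_66421"),
    (0.045, "EarthsVirome_15562"),
]
for score, name in filter_matches(scored):
    print(f"{score * 100:5.1f}%       {name}")
```

Under this reading, the first two matches would be expected to clear the default threshold without lowering it.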

ctb · 2021-03-12T00:34:07Z

hi @bluegenes this PR is ready for review (and merge?)!

Other than code review (sorry for all the mess...) the two remaining things are --

do you want me to add more stuff to the CSV output in this PR, or should I start a new PR for that? I think this one is getting kind of big.
could you do a test on a large-ish SBT, by running sourmash search --max-containment on both the SBT and on a directory of its contents created by running sourmash sig split on the SBT? I'd like a double-check on the SBT max containment code - no reason it shouldn't work, just slightly paranoid about computers...

ctb · 2021-03-12T00:54:45Z

I added the query information into the search CSV output in cbd2503. I couldn't add all of the info that @bluegenes might want in, however, because search works on regular MinHashes, too, which don't support containment or max_containment.

Perhaps another reason to consider getting rid of "regular" MinHash per #1354

ctb · 2021-03-12T00:55:09Z

(I suppose I could put zeros in for all those numbers when running on regular minhashes...)

bluegenes · 2021-03-12T01:01:25Z

> (I suppose I could put zeros in for all those numbers when running on regular minhashes...)

NA's?

ctb · 2021-03-12T01:02:22Z

On Thu, Mar 11, 2021 at 05:01:40PM -0800, Tessa Pierce wrote:
> >(I suppose I could put zeros in for all those numbers when running on regular minhashes...)
> NA's?

<sigh>

src/sourmash/commands.py

bluegenes · 2021-03-12T14:46:54Z

+        fieldnames = ['similarity', 'name', 'filename', 'md5',
+                      'query_filename', 'query_name', 'query_md5']


It would be nice to modify similarity to containment / max_containment for csv output. Thoughts?

semantic versioning prevents us from removing the similarity header before 5.0. we could add new columns, I 'spose. I don't like the idea that column headers change depending on command line arguments, though. Not sure how to think about it.

(suggest we punt this to a new issue and discuss it there.)

punted to #1390

Co-authored-by: Tessa Pierce <[email protected]>

bluegenes · 2021-03-12T15:14:45Z

src/sourmash/search.py

+                              query=query,
+                              query_filename=query.filename,
+                              query_name=query.name,
+                              query_md5=query.md5sum()[:8]


is there a reason to truncate one md5 and not the other?

...and now that I think about it, it's for semantic versioning, eh?

err, nope, didn't notice I wasn't truncating the md5sum for the match. Hrm.

punted to #1390

bluegenes

the rest lgtm!

ctb · 2021-03-12T15:38:15Z

OK, I think I addressed everything. Will wait for tests to pass before merging.

add max_containment to MinHash

ba21268

ctb added 5 commits February 23, 2021 06:50

add max_containment to SourmashSignature

74e8e07

add initial scaffolding for max_containment

7e1bdca

compute actual max containment

8f73291

interim comments

402ca77

ok, the basic logic should be laid out

21d6fdb

bluegenes reviewed Feb 23, 2021

ctb added 6 commits February 23, 2021 11:09

fix typo per tessa

787764a

more typo

7d2dae3

cleanup and fixes

bed7110

change implementation away from **kwargs

e3b0f61

update lca_db.search for max_containment

9ffa5ca

Merge branch 'latest' of github.com:dib-lab/sourmash into add/max_containment

3a22806

ctb added 8 commits February 23, 2021 14:52

implement sourmash search --max-containment

ae92d08

add lca database for --max-containment test

61e9888

fix some issues with identifiers in the LCA code

30df58d

Merge branch 'latest' of github.com:dib-lab/sourmash into add/max_containment

f014cca

fix duplicate test name

ba79e9c

test basic search, no SBT

d8d3657

fix previously hidden test

fc499d5

Merge branch 'latest' into add/max_containment

77f5407

Merge branch 'latest' of github.com:dib-lab/sourmash into add/max_containment

4810b00

ctb added 5 commits March 11, 2021 16:18

remove duplicated test function

a583783

test compare --max-containment

e5c67d8

add test for both --containment and --max-containment

bbd3898

produce friendlier error message in search

8e63268

grammar fix in comments

356e934

ctb changed the title ~~[WIP] add max_containment to MinHash class.~~ [MRG] add max_containment to MinHash class. Mar 12, 2021

improve CSV output for search, marginally

cbd2503

ctb mentioned this pull request Mar 12, 2021

add name, md5 of query signature to gather and search output #1247

Closed