[MRG] expand signature selection and compatibility checking in database loading code #1420

ctb · 2021-03-29T14:50:45Z

This is a PR into #1406.

Building off of the refactored database loading in #1406, this PR introduces expanded signature selection and compatibility checking for Index classes.

This results in much simpler code with much less special-casing. 🎉

Fixes #1376
Addresses #1072

Some searches may now work where they didn't before. For example, test_search_traverse_incompatible will now succeed, because the search selects the compatible signatures and ignores the incompatible ones.
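
To make the "select the compatible signatures, ignore the rest" behavior concrete, here is a minimal sketch. It does not use the sourmash API: `SketchInfo` and `select_compatible` are hypothetical names invented for illustration, and the only assumption is that a restriction left as `None` is simply not applied.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SketchInfo:
    """Hypothetical stand-in for the metadata of one signature's sketch."""
    name: str
    ksize: int
    moltype: str


def select_compatible(
    sketches: Iterable[SketchInfo],
    ksize: Optional[int] = None,
    moltype: Optional[str] = None,
) -> List[SketchInfo]:
    """Keep only the sketches that match every restriction that was given.

    A restriction left as None is not applied (a "partial" restriction).
    """
    selected = []
    for sk in sketches:
        if ksize is not None and sk.ksize != ksize:
            continue  # incompatible ksize: skip instead of failing
        if moltype is not None and sk.moltype != moltype:
            continue  # incompatible molecule type: skip instead of failing
        selected.append(sk)
    return selected


# A mixed collection, like a directory holding incompatible signatures.
collection = [
    SketchInfo("a", 31, "DNA"),
    SketchInfo("b", 21, "DNA"),
    SketchInfo("c", 31, "protein"),
]

matches = select_compatible(collection, ksize=31, moltype="DNA")
print([sk.name for sk in matches])  # prints ['a']
```

The point of the sketch is only that incompatible entries are filtered out rather than causing the whole search to fail.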

TODO:

write specific tests against the select interface for key Index subclasses (LinearIndex, MultiIndex, SBT, LCA_Database); punted to "write more comprehensive Index.select(...) tests" #1427
also test that partial restrictions (e.g. just ksize) are appropriately implemented; punted to "write more comprehensive Index.select(...) tests" #1427
clean up/simplify LCA_Database.select(...) logic
write up downsampling details somewhere (codebase or docs?), since they are now cleanly separated out in select(...); ref "write up downsampling details" #407
document in this PR which previously-failing commands/tests now succeed and why :)

Checklist

Is it mergeable?
make test Did it pass the tests?
make coverage Is the new code covered?
Did it change the command-line interface? Only additions are allowed
without a major version increment. Changing file formats also requires a
major version number increment.
Was a spellchecker run on the source code and documentation after
changes were made?

ctb · 2021-03-31T17:47:02Z

Ready for review and merge @luizirber @bluegenes!

bluegenes · 2021-04-01T17:41:14Z

src/sourmash/index.py

+            return False
+
+    # 'scaled' and 'num' are incompatible
+    if scaled:


What's the reasoning for not checking the value of scaled here?

Oooh, great question! I have two answers :). One is pragmatic, and one is conceptual.

The pragmatic answer is, we didn't need it in order for all the tests to pass 😆

The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?

I think the right thing to do is to punt this to a new issue for discussion, since I don't think we have any situations in the codebase that depend on answering it right now (see first point, "all tests pass").

(I will take care of punting to the new issue, but would appreciate pushback and/or your thoughts!)

> The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?

I like the approach of selecting sigs that could be converted! ...but not all scaled sigs can be downsampled (desired scaled < sig scaled), right?

> I think the right thing to do is to punt this to a new issue for discussion, since I don't think we have any situations in the codebase that depend on answering it right now (see first point, "all tests pass").

😆

> > The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?
>
> I like the approach of selecting sigs that could be converted! ...but not all scaled sigs can be downsampled (desired scaled < sig scaled), right?

yep. I think this is a great idea for a targeted test, I'll see what I can do!

I think I prefer returning without explicit downsampling, and maybe make another function that uses select_signatures to do the actual downsampling.

(depending on the application that can be a lot of downsampling without really needing it...)
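
The split discussed in this thread (select what could be downsampled, and downsample separately only when needed) can be sketched as follows. This is an illustration under stated assumptions, not sourmash code: `ScaledSketch`, `can_downsample_to`, `select_downsampleable`, and `downsample` are hypothetical names, and the threshold `HASH_SPACE // scaled` is a simplified model of how a scaled sketch bounds the hashes it keeps.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

HASH_SPACE = 2**64  # assumed size of the 64-bit hash space (simplified model)


@dataclass(frozen=True)
class ScaledSketch:
    """Hypothetical scaled sketch: keeps hashes below HASH_SPACE // scaled."""
    scaled: int
    hashes: FrozenSet[int]


def can_downsample_to(sketch: ScaledSketch, scaled: int) -> bool:
    """A sketch can only be moved to an equal or larger scaled value."""
    return sketch.scaled <= scaled


def select_downsampleable(
    sketches: Iterable[ScaledSketch], scaled: int
) -> List[ScaledSketch]:
    """Selection only: return the sketches unchanged, without converting them."""
    return [sk for sk in sketches if can_downsample_to(sk, scaled)]


def downsample(sketch: ScaledSketch, scaled: int) -> ScaledSketch:
    """Conversion step, kept separate so callers pay for it only when needed."""
    if not can_downsample_to(sketch, scaled):
        raise ValueError(
            f"cannot downsample from scaled={sketch.scaled} to scaled={scaled}"
        )
    max_hash = HASH_SPACE // scaled
    kept = frozenset(h for h in sketch.hashes if h < max_hash)
    return ScaledSketch(scaled, kept)


fine = ScaledSketch(1000, frozenset({10, HASH_SPACE // 1500}))
coarse = ScaledSketch(10000, frozenset({5}))

# Only `fine` can be converted to scaled=2000; `coarse` is filtered out.
chosen = select_downsampleable([fine, coarse], scaled=2000)
converted = [downsample(sk, 2000) for sk in chosen]
```

Keeping the two functions apart means a caller that only needs to know which signatures are usable never triggers a conversion.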

luizirber · 2021-04-01T21:11:27Z

src/sourmash/sbt.py

+        # check ksize.
+        if ksize is not None and db_mh.ksize != ksize:
+            raise ValueError(f"search ksize {ksize} is different from database ksize {db_mh.ksize}")
+
+        # check moltype.
+        if moltype is not None and db_mh.moltype != moltype:
+            raise ValueError(f"search moltype {moltype} is different from database moltype {db_mh.moltype}")
+
+        # containment requires 'scaled'.
+        if containment:
+            if not scaled:
+                raise ValueError("'containment' requires 'scaled' in SBT.select")
+            if not db_mh.scaled:
+                raise ValueError("cannot search this SBT for containment; signatures are not calculated with scaled")
+
+        # 'num' and 'scaled' do not mix.
+        if num:
+            if not db_mh.num:
+                raise ValueError(f"this database was created with 'scaled' MinHash sketches, not 'num'")
+            if num != db_mh.num:
+                raise ValueError(f"num mismatch for SBT: num={num}, {db_mh.num}")
+
+        if scaled:
+            if not db_mh.scaled:
+                raise ValueError(f"this database was created with 'num' MinHash sketches, not 'scaled'")
+
+            # we can downsample SBTs for containment operations.
+            if scaled > db_mh.scaled and not containment:
+                raise ValueError(f"search scaled value {scaled} is greater than database scaled value of {db_mh.scaled}")


This looks a lot like select_signature, are you duplicating to have better error messages?

precisely - better error messages! (it's not actually identical, I don't think, and I shied away from forcing it into the same code. If, after a lot of testing, it turns out to be identical, we can refactor.)
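
If the two code paths did turn out to be identical, one possible shape for the refactor mentioned above is a single generator of problem messages that feeds both a boolean check and a raising check. This is a hedged sketch only: `compatibility_problems`, `is_compatible`, and `require_compatible` are hypothetical names, and only the ksize and moltype checks from the diff are modeled.

```python
from typing import Iterator, Optional


def compatibility_problems(
    db_ksize: int,
    db_moltype: str,
    ksize: Optional[int] = None,
    moltype: Optional[str] = None,
) -> Iterator[str]:
    """Yield one human-readable message per violated restriction."""
    if ksize is not None and db_ksize != ksize:
        yield f"search ksize {ksize} is different from database ksize {db_ksize}"
    if moltype is not None and db_moltype != moltype:
        yield (
            f"search moltype {moltype} is different from "
            f"database moltype {db_moltype}"
        )


def is_compatible(db_ksize: int, db_moltype: str, **restrictions) -> bool:
    """Boolean form: True when no restriction is violated."""
    problems = compatibility_problems(db_ksize, db_moltype, **restrictions)
    return next(problems, None) is None


def require_compatible(db_ksize: int, db_moltype: str, **restrictions) -> None:
    """Raising form: fail with the message of the first violated restriction."""
    problems = compatibility_problems(db_ksize, db_moltype, **restrictions)
    problem = next(problems, None)
    if problem is not None:
        raise ValueError(problem)
```

With this shape the informative error messages live in one place, and the boolean form used for filtering cannot drift from the raising form.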

luizirber · 2021-04-01T21:13:27Z

CI thoughts: only 4 checks were executed, because this is a PR into a branch that is not latest. Should it be fixed (PRs always run, against any base branch)?

ctb · 2021-04-01T21:30:53Z

> CI thoughts: only 4 checks were executed, because this is a PR into a branch that is not latest. Should it be fixed (PRs always run, against any base branch)?

whoops, didn't catch that this was being reviewed before #1406 :).

Not sure it matters one way or another?

ctb · 2021-04-01T21:31:12Z

(in the sense that presumably this code would be checked thoroughly before merge into latest)

…1406)
* add an IndexOfIndexes class
* rename to MultiIndex
* switch to using MultiIndex for loading from a directory
* some more MultiIndex tests
* add test of MultiIndex.signatures
* add docstring for MultiIndex
* stop special-casing SIGLISTs
* fix test to match more informative error message
* switch to using LinearIndex.load for stdin, too
* add __len__ to MultiIndex
* add check_csv to check for appropriate filename loading info
* add comment
* fix databases load
* more tests needed
* add tests for incompatible signatures
* add filter to LinearIndex and MultiIndex
* clean up sourmash_args some more
* shift loading over to Index classes
* refactor, fix tests
* switch to a list of loader functions
* comments, docstrings, and tests passing
* update to use f strings throughout sourmash_args.py
* add docstrings
* update comments
* remove unnecessary changes
* revert to original test
* remove unneeded comment
* clean up a bit
* debugging update
* better exception raising and capture for signature parsing
* more specific error message
* revert change in favor of creating new issue
* add commentary => TODO
* add tests for MultiIndex.load_from_directory; fix traverse code
* switch lca summarize over to usig MultiIndex
* switch to using MultiIndex in categorize
* remove LoadSingleSignatures
* test errors in lca database loading
* remove unneeded categorize code
* add testme info
* verified that this was tested
* remove testme comments
* add tests for MultiIndex.load_from_file_list
* Expand signature selection and compatibility checking in database loading code (#1420)
* refactor select, add scaled/num/abund
* fix scaled check for LCA database
* add debug_literal
* fix scaled check for SBT
* fix LCA database ksize message & test
* add 'containment' to 'select'
* added 'is_database' flag for nicer UX
* remove overly broad exception catching
* document downsampling foo
* fix file_list -> pathlist
* fix typo

ctb added 12 commits March 28, 2021 14:23

refactor select, add scaled/num/abund

7f52d7c

more work

dde14fd

catch ValueError from db.select

3f498a4

update debug print to sys.stder

df19926

fix scaled check for LCA database

e8233ca

add debug_literal

b44c3cf

break things when filter returns empty Index

7133ac1

fix scaled check for SBT

f5f1c9c

fix a few tests

d6f156f

fix LCA database ksize message & test

785a9a4

flag for removal

23d7ac4

Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/siglist_loading

efc07cd

ctb added 6 commits March 31, 2021 07:06

add 'containment' to 'select'

12399e7

fix remaining tests

2b7acb9

Merge branch 'refactor/db_load_multiindex' into refactor/siglist_loading

f663426

update comments

9aae1cb

remove all the cruft, yay

2630be2

added 'is_database' flag for nicer UX

4f1a7fe

This was referenced Mar 31, 2021

Improvement: allow both num and scaled to be set #538

Open

start thinking about a standard selector framework for signature search/compatibility #1072

Closed

[MRG] refactor & clean up database loading around MultiIndex class #1406

Merged

ctb added 2 commits March 31, 2021 09:46

remove overly broad exception catching

736ddf3

add docstrings

16719ce

This was referenced Mar 31, 2021

question: can MultiIndex wrap other indices? #1425

Closed

should _load_databases indicate how many incompatible signatures were filtered out? #1426

Open

write more comprehensive Index.select(...) tests #1427

Open

document downsampling foo

6d8663e

ctb changed the title ~~[WIP] expand signature selection and compatibility checking in database loading code~~ [MRG] expand signature selection and compatibility checking in database loading code Mar 31, 2021

ctb mentioned this pull request Apr 1, 2021

[WIP] add zipfile collection support #1429

Closed

6 tasks

bluegenes reviewed Apr 1, 2021

luizirber reviewed Apr 1, 2021

luizirber approved these changes Apr 1, 2021

luizirber merged commit e4e20de into refactor/db_load_multiindex Apr 1, 2021

luizirber deleted the refactor/siglist_loading branch April 1, 2021 22:23

ctb mentioned this pull request Apr 2, 2021

revisit LinearIndex.select behavior for scaled (and num?) #1433

Closed

This was referenced Apr 3, 2021

[MRG] implement a simple ZipFileLinearIndex class #1349

Merged

Draft release notes for v4.1.0 #1391

Closed

write up downsampling details #407

Closed

ctb mentioned this pull request May 15, 2021

summary: selectors are good, let's maybe have more of them. #1524

Open

2 tasks

ctb mentioned this pull request Jan 18, 2022

add docs on FracMinHash downsampling #1799

Open
