Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] expand signature selection and compatibility checking in database loading code #1420

Merged
merged 21 commits into from
Apr 1, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Mar 29, 2021

This is a PR into #1406.

Building off of the refactored database loading in #1406, this PR introduces expanded signature selection and compatibility checking for Index classes.

This results in much simpler code with much less special-casing. 🎉

Fixes #1376
Addresses #1072

Some searches may now work where they didn't before - for example,

  • test_search_traverse_incompatible will now succeed, because it now selects the compatible signatures and ignores the incompatible ones.

TODO:

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@ctb

This comment has been minimized.

@ctb ctb changed the title [WIP] expand signature selection and compatibility checking in database loading code [MRG] expand signature selection and compatibility checking in database loading code Mar 31, 2021
@ctb
Copy link
Contributor Author

ctb commented Mar 31, 2021

Ready for review and merge @luizirber @bluegenes!

@ctb ctb mentioned this pull request Apr 1, 2021
6 tasks
return False

# 'scaled' and 'num' are incompatible
if scaled:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning for not checking the value of scaled here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, great question! I have two answers :). One is pragmatic, and one is conceptual.

The pragmatic answer is, we didn't need it in order for all the tests to pass 😆

The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?

I think the right thing to do is to punt this to a new issue for discussion, since I don't think we have any situations in the codebase that depend on answering it right now (see first point, "all tests pass").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I will take care of punting to the new issue, but would appreciate pushback and/or your thoughts!)

Copy link
Contributor

@bluegenes bluegenes Apr 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?

I like the approach of selecting sigs that could be converted! ...but not all scaled sigs can be downsampled (desired scaled < sig scaled), right?

I think the right thing to do is to punt this to a new issue for discussion, since I don't think we have any situations in the codebase that depend on answering it right now (see first point, "all tests pass").

😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?

I like the approach of selecting sigs that could be converted! ...but not all scaled sigs can be downsampled (desired scaled < sig scaled), right?

yep. I think this is a great idea for a targeted test, I'll see what I can do!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer returning without explicit downsampling, and maybe make another function that uses select_signatures to do the actual downsampling.

(depending on the application that can be a lot of downsampling without really needing it...)

Comment on lines +216 to +244
# check ksize.
if ksize is not None and db_mh.ksize != ksize:
raise ValueError(f"search ksize {ksize} is different from database ksize {db_mh.ksize}")

# check moltype.
if moltype is not None and db_mh.moltype != moltype:
raise ValueError(f"search moltype {moltype} is different from database moltype {db_mh.moltype}")

# containment requires 'scaled'.
if containment:
if not scaled:
raise ValueError("'containment' requires 'scaled' in SBT.select'")
if not db_mh.scaled:
raise ValueError("cannot search this SBT for containment; signatures are not calculated with scaled")

# 'num' and 'scaled' do not mix.
if num:
if not db_mh.num:
raise ValueError(f"this database was created with 'scaled' MinHash sketches, not 'num'")
if num != db_mh.num:
raise ValueError(f"num mismatch for SBT: num={num}, {db_mh.num}")

if scaled:
if not db_mh.scaled:
raise ValueError(f"this database was created with 'num' MinHash sketches, not 'scaled'")

# we can downsample SBTs for containment operations.
if scaled > db_mh.scaled and not containment:
raise ValueError(f"search scaled value {scaled} is less than database scaled value of {db_mh.scaled}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a lot like select_signature, are you duplicating to have better error messages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

precisely - better error messages! ( it's not actually identical, I don't think, and I shied away from forcing it into the same code - if, after a lot of testing, it turns out to be identical, we can refactor.)

@luizirber
Copy link
Member

CI thoughts: only 4 checks were executed, because this is a PR into a branch that is not latest. Should it be fixed (PRs always run, against any base branch)?

@ctb
Copy link
Contributor Author

ctb commented Apr 1, 2021

CI thoughts: only 4 checks were executed, because this is a PR into a branch that is not latest. Should it be fixed (PRs always run, against any base branch)?

whoops, didn't catch that this was being reviewed before #1406 :).

Not sure it matters one way or another?

@ctb
Copy link
Contributor Author

ctb commented Apr 1, 2021

(in the sense that presumably this code would be checked thoroughly before merge into latest)

@luizirber luizirber merged commit e4e20de into refactor/db_load_multiindex Apr 1, 2021
@luizirber luizirber deleted the refactor/siglist_loading branch April 1, 2021 22:23
ctb added a commit that referenced this pull request Apr 2, 2021
…1406)

* add an IndexOfIndexes class

* rename to MultiIndex

* switch to using MultiIndex for loading from a directory

* some more MultiIndex tests

* add test of MultiIndex.signatures

* add docstring for MultiIndex

* stop special-casing SIGLISTs

* fix test to match more informative error message

* switch to using LinearIndex.load for stdin, too

* add __len__ to MultiIndex

* add check_csv to check for appropriate filename loading info

* add comment

* fix databases load

* more tests needed

* add tests for incompatible signatures

* add filter to LinearIndex and MultiIndex

* clean up sourmash_args some more

* shift loading over to Index classes

* refactor, fix tests

* switch to a list of loader functions

* comments, docstrings, and tests passing

* update to use f strings throughout sourmash_args.py

* add docstrings

* update comments

* remove unnecessary changes

* revert to original test

* remove unneeded comment

* clean up a bit

* debugging update

* better exception raising and capture for signature parsing

* more specific error message

* revert change in favor of creating new issue

* add commentary => TODO

* add tests for MultiIndex.load_from_directory; fix traverse code

* switch lca summarize over to usig MultiIndex

* switch to using MultiIndex in categorize

* remove LoadSingleSignatures

* test errors in lca database loading

* remove unneeded categorize code

* add testme info

* verified that this was tested

* remove testme comments

* add tests for MultiIndex.load_from_file_list

* Expand signature selection and compatibility checking in database loading code (#1420)

* refactor select, add scaled/num/abund
* fix scaled check for LCA database
* add debug_literal
* fix scaled check for SBT
* fix LCA database ksize message & test
* add 'containment' to 'select'
* added 'is_database' flag for nicer UX
* remove overly broad exception catching
* document downsampling foo

* fix file_list -> pathlist

* fix typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants