sourmash gather doesn't automatically figure out ksize from database #809

Closed
ctb opened this issue Dec 24, 2019 · 8 comments

Comments

@ctb
Contributor

ctb commented Dec 24, 2019

despite SBTs currently being limited to one ksize, sourmash gather doesn't automatically figure out which ksize to select from a signature.

@ctb
Contributor Author

ctb commented Apr 4, 2020

Verified on latest master. The problem is that sourmash_args.load_query_signature is run before we know the database's ksize, moltype, etc.

@ctb
Contributor Author

ctb commented Apr 4, 2020

(note that it works if the default ksize = 31 is present in the signature, but this is not ideal behavior...)
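
To make the ordering problem concrete, here is a minimal sketch in plain Python. It is not sourmash code: `pick_query_ksize` and its arguments are hypothetical names used only for illustration. The only detail taken from this thread is the default ksize of 31. It shows how a query containing only k=21 and k=51 fails when the default is applied first, but resolves once the database's ksize is known.

```python
DEFAULT_KSIZE = 31  # default ksize mentioned in this thread


def pick_query_ksize(query_ksizes, db_ksize=None, requested=None):
    """Choose which ksize to use from a multi-ksize query (hypothetical helper).

    Priority: an explicit user request, then the database's ksize,
    then the default.
    """
    if requested is not None:
        wanted = requested
    elif db_ksize is not None:
        wanted = db_ksize
    else:
        wanted = DEFAULT_KSIZE
    if wanted not in query_ksizes:
        raise ValueError(f"query has no signature with ksize={wanted}")
    return wanted


# Database ksize known before the query is selected: works.
print(pick_query_ksize({21, 51}, db_ksize=21))  # 21

# Database ksize unknown: falls back to 31, which only works if 31 is present.
print(pick_query_ksize({21, 31}))  # 31
```

Calling `pick_query_ksize({21, 51})` without `db_ksize` raises `ValueError`, which mirrors the behavior reported above.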

@ctb
Contributor Author

ctb commented Apr 4, 2020

sourmash search should have the same problem. sigh.

@ctb
Contributor Author

ctb commented Apr 4, 2020

triple sigh. Fixing this will require refactoring load_dbs_and_sigs.

@ctb
Contributor Author

ctb commented Apr 5, 2020

Man, the code in #934 is getting ugly. I think it's because there's so many special cases etc. etc. Here are some of my initial design considerations --

  • in general we should take into account the list of possible query signatures, the selector args for query signature, the set of subject databases, and the set of possible subject signatures.
  • SBT and LCA databases are special in that (unlike normal sig files/lists) they will have a fixed set of requirements for ksize, moltype, and num/scaled. They can also be slow to load. So (where possible) we should error out as soon as we reach an argument that is incompatible.
  • for lists of (query and subject) signatures, we should be ok with eliminating a bunch of them, as long as some remain. two guiding principles: if a signature file is explicitly specified on the command line, it must have at least one useful sig; but for large scale signature loading (from directories), there is no such requirement.
  • where possible, we should figure out if there is a unique set of conditions that can be used to figure out the input query parameters
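
The "explicit file vs. directory" rule in the list above could be sketched roughly as follows. This is an illustration in plain Python under stated assumptions, not the code in #934: `filter_loaded`, `sigs_by_source`, and `explicit_sources` are hypothetical names, and signatures are represented as plain dicts.

```python
def filter_loaded(sigs_by_source, is_compatible, explicit_sources):
    """Drop incompatible signatures, applying the two guiding principles.

    sigs_by_source:   dict mapping a source path to a list of signature dicts
    is_compatible:    function taking a signature dict, returning True/False
    explicit_sources: set of paths the user named directly on the command line
    """
    kept = []
    for source, sigs in sigs_by_source.items():
        usable = [s for s in sigs if is_compatible(s)]
        # An explicitly named file must contribute at least one useful sig.
        if not usable and source in explicit_sources:
            raise ValueError(f"no compatible signatures in {source}")
        # Files found by directory traversal may be dropped entirely.
        kept.extend(usable)
    # Eliminating many signatures is fine, as long as some remain.
    if not kept:
        raise ValueError("no compatible signatures remain")
    return kept


loaded = {
    "a.sig": [{"ksize": 21}, {"ksize": 31}],
    "dir/b.sig": [{"ksize": 51}],
}
result = filter_loaded(loaded, lambda s: s["ksize"] == 31,
                       explicit_sources={"a.sig"})
print(result)  # [{'ksize': 31}]
```

If `dir/b.sig` were listed in `explicit_sources` instead, the same call would raise, since that file has no k=31 signature.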

also, not to be too ambitious, but

  • perhaps we can provide nicer signature loading logic, c.f. Document and streamline/refactor signature loading and saving #919, as part of this?
  • we should design to permit additional selector arguments on the query like name/md5sum matching.
  • we should also plan a bit for a future where SBTs and LCAs become more flexible and permit several different kinds of signatures.

@ctb
Contributor Author

ctb commented Apr 5, 2020

A few more quick thoughts --

  • we should expect intersections across unions of parameters to work for collections of signatures
  • we should build individual loading functions for DB.
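
One way to read "intersections across unions of parameters": each collection offers the union of the parameter combinations found in its signatures, and the usable combinations are the intersection of those unions across all collections. A minimal sketch, assuming signatures are plain dicts with `ksize` and `moltype` keys (`params_of` and `common_params` are hypothetical names, not sourmash functions):

```python
def params_of(collection):
    """Union of (ksize, moltype) pairs present in one collection."""
    return {(s["ksize"], s["moltype"]) for s in collection}


def common_params(*collections):
    """Intersection, across collections, of each collection's union."""
    if not collections:
        return set()
    common = params_of(collections[0])
    for coll in collections[1:]:
        common &= params_of(coll)
    return common


query = [
    {"ksize": 21, "moltype": "DNA"},
    {"ksize": 31, "moltype": "DNA"},
]
database = [
    {"ksize": 31, "moltype": "DNA"},
]
print(common_params(query, database))  # {(31, 'DNA')}
```

When the result holds exactly one combination, the query parameters can be inferred without any `-k`-style argument; an empty result means nothing is compatible, and more than one means the user has to disambiguate.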

@ctb
Contributor Author

ctb commented May 3, 2020

I wonder if a better approach than #934 is to delegate signature compatibility checking to the collections directly? e.g. standardize the selector framework in #936 and then have the query signature collection "check itself" progressively against loaded signatures/databases by using the selector framework to narrow down compatible signatures. The code could then output a complaint when it gets down to zero compatible signatures.
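
A rough sketch of that idea, in plain Python. The class and function names here (`SigCollection`, `select`, `narrow_query`) are hypothetical and are not the selector framework from #936 or the eventual fix; signatures are again plain dicts.

```python
class SigCollection:
    """A toy collection of signature dicts that can narrow itself."""

    def __init__(self, sigs):
        self.sigs = list(sigs)

    def __len__(self):
        return len(self.sigs)

    def select(self, **criteria):
        """Return a new collection keeping only sigs matching all criteria."""
        kept = [
            s for s in self.sigs
            if all(s.get(key) == value for key, value in criteria.items())
        ]
        return SigCollection(kept)


def narrow_query(query, db_requirements):
    """Check the query against each database's requirements in turn.

    db_requirements: list of (database name, dict of required parameters).
    Complains as soon as the query is down to zero compatible signatures.
    """
    for name, required in db_requirements:
        query = query.select(**required)
        if len(query) == 0:
            raise ValueError(f"no query signatures compatible with {name}")
    return query


query = SigCollection([
    {"ksize": 21, "moltype": "DNA"},
    {"ksize": 31, "moltype": "DNA"},
    {"ksize": 31, "moltype": "protein"},
])
remaining = narrow_query(query, [
    ("db1", {"ksize": 31}),
    ("db2", {"moltype": "DNA"}),
])
print(remaining.sigs)  # [{'ksize': 31, 'moltype': 'DNA'}]
```

Because each `select` call returns a new collection, the error can name the first database that eliminated every remaining candidate.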

@ctb
Contributor Author

ctb commented Mar 30, 2022

This was fixed in #1406. 🎉

@ctb ctb closed this as completed Mar 30, 2022