Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use new load_file_as_signatures function more broadly #1059

Merged
merged 57 commits into from
Jul 2, 2020

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jun 28, 2020

#1044 added a new function load_file_as_signatures that permitted sig split to load signatures from databases as well as signature files. This PR deploys that function a bit more widely, and adds selector functionality.

This PR permits many (most?) CLI functions to load signatures from databases without any special sauce.
So, for example sourmash sig describe can now be run on SBTs and LCAs. It also addresses many generic loading issues, including stdin loading (#1049), loading from directories via subdir traversal, and the ability to provide lists of signature files in a text file as queries or database index builds.

Examples

This will update old.sig and add the newly computed signature(s) from input.fa --

sourmash compute input.fa -o - | sourmash sig cat old.sig - -o old.sig

This will concatenate all signatures under tests/test-data --

sourmash sig cat tests/test-data

This will list all signatures in that database --

sourmash sig describe tests/test-data/prot/protein.sbt.zip

This will use the signature with that md5sum in that database as a query against all the signatures under that directory --

sourmash search tests/test-data/prot/protein.sbt.zip --md5 120d311c  tests/test-data/prot/protein/

Testing thoughts

While the actual changes are relatively minor (no significant changes to current tests!) the implications are pretty big and so I'm planning on adding quite a few tests. We're also hitting on some untested behavior and/or grey areas in current behavior, so I'll add some lockdown tests to do my best to make that our current behavior for 3.x isn't too changed.

Affected issues

Fixes #1050 - load_signature now fails properly on non-existent file
Fixes #1048 - traverse-directory behavior now on by default in signature submodule commands
Fixes #1049 - stdin now accepted for signature input
Fixes #672 - compare now runs on SBTs/LCAs as well
Fixes #594 - sourmash sig extract now works on databases
Fixes #662 - added --from-file and --query-from-file in appropriate places
Closes #1039 - can convert b/t databases now, on command line
Fixes #978 - new function load_file_as_index now loads databases as Index subclasses.

Detailed list of new functionality

General functionality

Almost all commands that take a list of signatures can now load signatures from LCA and SBT databases. Commands that need a single query signature also support loading from databases, as well as signature files, and can usually use an --md5 selector to pick out the right signature.

Expansion of support for reading signatures from stdin.

Specific additions to CLI

  • add --from-file to sourmash compare CLI
  • add --md5 selector to sourmash gather and sourmash search to pick a query signature.
  • add --from-file to sourmash index
  • add --query-from-file to sourmash lca classify and sourmash lca summarize
  • add --from-file to sourmash lca index
  • add --query-from-file to sourmash multigather
  • add --md5 selector to sourmash sig export to pick a single signature
  • add --from-file to sourmash compare

Specific additions to API

  • top-level load_file_as_index is new top-level API for loading Index objects from files.
  • top-level load_file_as_signatures is new top-level API for loading of collections of signatures, including from SBT dbs, LCA dbs, and directories.

Bug fixes

  • fix moltype selector bug on SBTs (+ tested)
  • sourmash sig rename now correctly fails on files that it could not open (+ tested)
  • fixed signature.load_signatures behavior where do_raise is now respected (+ added test)
  • fixed bug in testing compatibility of LCA databases when loaded by sourmash_args functions (+ added test)

TODO:

  • document new capabilities
  • lots of tests...
  • for add --traverse-directory to sourmash sig split and sourmash sig cat #1048, do we need to provide an explicit --traverse-directory flag? maybe just traverse by default?
  • double check sig rename that did NOT use load_signatures(..., do_raise=True)
  • provide traversal functionality
  • provide list-of-file functionality, maybe
  • XXX refactor load_single_signature to allow md5 selector - punt to make load_one_signature use load_file_as_signatures API #1062
  • look into refactoring sourmash_args.load_query_signature
  • add md5sum selector to functions that use load_query_signature
  • look into refactoring load_dbs_and_sigs
  • look into providing a new top-level load_database function to replace load_sbt etc. ref provide a unified loading API for databases #978
  • add to sourmash index
  • add to sourmash lca index
  • add to sourmash lca classify
  • add to sourmash lca summarize
  • add to sourmash compare
  • upgrade multigather
  • add --from-file to compare?
  • resolve @ctb notes
  • figure out right split b/t exit and not in sourmash_args.load_file_as_signatures
  • think about removing error etc calls in signature.load_signatures

Testing TODO

  • check where we do (and shouldn't) change --traverse-directory and --force behavior
  • gather db-as-query; --md5 behavior with query
  • search db-as-query; --md5 behavior with query
  • index --from-file; make issue re argparse modification stuff for 4.0 (adjust sourmash index argparse to accomodate --from-file #1066)
  • lca classify --query-from-file, with and without other --query; traverse tests
  • lca summarize --query-from-file, with and without other --query; traverse tests
  • lca index --from-file; make issue re argparse modification for 4.0; traverse tests
  • multigather --query-from-file
  • compare new behavior; traverse stuff, too
  • test for sbt.py fix / minhash vs sig, line 142
  • some tests (?) for stdin in various places - sourmash compute, sig submodule
  • sourmash sig submodule more generally, what to test? traversal behavior, databases as inputs
  • test for corrected load_signatures behavior/do_raise

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Jun 28, 2020

Codecov Report

Merging #1059 into master will increase coverage by 8.99%.
The diff coverage is 91.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
+ Coverage   83.51%   92.51%   +8.99%     
==========================================
  Files          99       74      -25     
  Lines        8885     5662    -3223     
==========================================
- Hits         7420     5238    -2182     
+ Misses       1465      424    -1041     
Flag Coverage Δ
#rusttests ?
Impacted Files Coverage Δ
sourmash/signature.py 90.36% <66.66%> (-1.77%) ⬇️
sourmash/lca/command_classify.py 85.71% <75.00%> (+0.38%) ⬆️
sourmash/lca/command_summarize.py 90.00% <83.33%> (+0.89%) ⬆️
sourmash/commands.py 86.64% <90.00%> (-0.24%) ⬇️
sourmash/sourmash_args.py 95.19% <91.91%> (-0.63%) ⬇️
sourmash/__init__.py 100.00% <100.00%> (ø)
sourmash/cli/compare.py 100.00% <100.00%> (ø)
sourmash/cli/gather.py 100.00% <100.00%> (ø)
sourmash/cli/index.py 100.00% <100.00%> (ø)
sourmash/cli/lca/classify.py 100.00% <100.00%> (ø)
... and 41 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 760daf6...8e565ba. Read the comment docs.

@ctb ctb changed the title deploy new load_signatures API a bit more broadly use new load_file_as_signatures function more broadly Jun 29, 2020
@ctb
Copy link
Contributor Author

ctb commented Jul 1, 2020

ok, this looks ready for a full review. ...any takers? it's not really THAT many changes, apart from the new tests 😁

@luizirber
Copy link
Member

Let's try...

sig_md5 = sig.md5sum()
if sig_md5.startswith(select_md5.lower()):
sl = [sig]
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for large collections (databases?), should we check if there is more than one sig starting with the same MD5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thanks for catching this!

I'm not sure what to do about this... for really large collections it could be quite slow. But then again, for large collections it's not clear you want to use this! So, yeah. I'll change the behavior and add a test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 11e519a

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what to do about this... for really large collections it could be quite slow. But then again, for large collections it's not clear you want to use this! So, yeah. I'll change the behavior and add a test.

Yeah... at this point there is no access to the databases anymore, where this info can be checked while loading or it's less expensive (at least for SBTs? I think it is for LCA too).

Copy link
Member

@luizirber luizirber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment about MD5 selectors, but otherwise LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment