[MRG] add sourmash sig grep #1864

Merged
merged 30 commits into from
Mar 7, 2022

Conversation

ctb
Contributor

@ctb ctb commented Mar 5, 2022

Implements sourmash sig grep as in #1075.

sig grep does substring matching on name, filename, and md5. It supports regular expressions, as well as the -v (exclude instead of include) and -i (case insensitive) options. It also supports CSV output in manifest format (for use as picklists), and the -c/--count option to just count the number of matching signatures in each file instead of outputting them.
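To make the matching rules concrete, here is a minimal Python sketch of the behavior described above. It is an illustration only, not the code in this PR: `grep_rows`, its parameters, and the sample rows are hypothetical, and rows are modeled as plain dicts with `name`, `filename`, and `md5` keys.

```python
import re


def grep_rows(rows, pattern, invert=False, ignore_case=False):
    """Yield rows whose name, filename, or md5 contains a match for pattern.

    Hypothetical helper for illustration; not part of sourmash.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    for row in rows:
        # re.search gives substring semantics: the pattern may match anywhere.
        hit = any(
            regex.search(row.get(col) or "")
            for col in ("name", "filename", "md5")
        )
        # invert mirrors -v: keep only the rows that do NOT match.
        if hit != invert:
            yield row


# Made-up rows, for illustration only.
rows = [
    {"name": "ACC_0001.1 Shewanella sp. example", "filename": "a.fna.gz", "md5": "0123abcd"},
    {"name": "ACC_0002.1 Escherichia sp. example", "filename": "b.fna.gz", "md5": "89efcdab"},
]

for row in grep_rows(rows, "shewanella", ignore_case=True):
    print(row["name"])
```

Because the pattern is compiled as a regular expression, an unescaped `.` matches any character, so a pattern like `ACC_0001.` behaves as a prefix match on the versioned accession.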

A key feature of grep is that it (by default) requires manifests. That way users won't get caught up accidentally in searches of very large databases w/o manifests, which has been happening to me a lot recently.

Fixes #1075.

Examples

Extract by accession:

% sourmash sig grep GCF_000019185. /group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.zip -o out.zip

...

loaded 258406 total that matched ksize & molecule type
extracted 1 signatures from 1 file(s)

Fail when no manifest present:

% sourmash sig grep GCF_000019185. /group/ctbrowngrp/gtdb/databases/gtdb-rs202.genomic.k31.zip -o out.zip

...

ERROR on filename '/group/ctbrowngrp/gtdb/databases/gtdb-rs202.genomic.k31.zip'.
sig grep requires a manifest by default, but no manifest present.
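The manifest requirement can be pictured with a small sketch along these lines. It is not how sourmash implements the check: the member name `SOURMASH-MANIFEST.csv` is an assumption about how zip collections store their manifest, and `has_manifest`, `check_searchable`, `require_manifest`, and `make_zip` are hypothetical names used only here.

```python
import io
import zipfile

# Assumed name of the manifest member inside a zip collection.
MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def has_manifest(zip_source):
    """Return True if the zip archive contains a manifest member."""
    with zipfile.ZipFile(zip_source) as zf:
        return MANIFEST_NAME in zf.namelist()


def check_searchable(zip_source, label, require_manifest=True):
    """Refuse to search a collection without a manifest unless told otherwise."""
    if require_manifest and not has_manifest(zip_source):
        raise ValueError(
            f"'{label}': a manifest is required by default, but none is present"
        )
    return True


def make_zip(member_names):
    """Build a small in-memory zip with empty members, for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in member_names:
            zf.writestr(name, "")
    buf.seek(0)
    return buf


with_manifest = make_zip([MANIFEST_NAME, "signatures/x.sig.gz"])
without_manifest = make_zip(["signatures/x.sig.gz"])
print(has_manifest(with_manifest), has_manifest(without_manifest))
```

Failing early like this is cheap because only the archive's table of contents is read; no signatures are loaded.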

Count some things and output a manifest:

% sourmash sig grep -ci shewanella /group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.zip --csv matches.csv

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

(no signatures will be output because of --silent/--count).
252 matches: /group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.zip
wrote manifest containing matches to CSV matches 'matches.csv'
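As a rough picture of what counting plus CSV output amounts to, the sketch below writes matching rows to a CSV file and returns the count. It is simplified and hypothetical: `write_matches_csv` is not a sourmash function, and real sourmash manifests carry more columns than the three used here.

```python
import csv
import io


def write_matches_csv(rows, fp, fieldnames=("name", "filename", "md5")):
    """Write matching rows to fp as CSV and return how many were written.

    Hypothetical helper; the column set is a simplification.
    """
    writer = csv.DictWriter(fp, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Made-up matches, for illustration only.
matches = [
    {"name": "ACC_0001.1 Shewanella sp. example", "filename": "a.fna.gz", "md5": "0123abcd"},
    {"name": "ACC_0003.1 Shewanella sp. example", "filename": "c.fna.gz", "md5": "4567cdef"},
]

out = io.StringIO()
n = write_matches_csv(matches, out)
print(f"{n} matches")
```

Only metadata rows are written, so no signature data needs to be loaded or copied to produce the count or the CSV.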

Notes

eventually we should probably add the default manifest requirement to several more sig subcommands, like extract, but this will require a major version bump.

TODO

  • upgrade PR description with example command
  • tests needed!
  • Visit "add a sig grep command?" (#1075) and extract non-grep stuff to a new issue
  • make issue to require manifests in various sig subcommands (sig extract, for example)
  • write docs

@codecov

codecov bot commented Mar 5, 2022

Codecov Report

Merging #1864 (76b2b02) into latest (cccd06c) will increase coverage by 8.07%.
The diff coverage is 98.29%.

@@            Coverage Diff             @@
##           latest    #1864      +/-   ##
==========================================
+ Coverage   82.38%   90.46%   +8.07%     
==========================================
  Files         119       91      -28     
  Lines       12944     8859    -4085     
  Branches     1729     1751      +22     
==========================================
- Hits        10664     8014    -2650     
+ Misses       2016      579    -1437     
- Partials      264      266       +2     
Flag     Coverage Δ
python   90.46% <98.29%> (+0.10%) ⬆️
rust     ?

Impacted Files                     Coverage Δ
src/sourmash/manifest.py           90.83% <93.33%> (+0.32%) ⬆️
src/sourmash/sig/grep.py           98.70% <98.70%> (ø)
src/sourmash/cli/sig/__init__.py   100.00% <100.00%> (ø)
src/sourmash/cli/sig/grep.py       100.00% <100.00%> (ø)
src/sourmash/sig/__init__.py       100.00% <100.00%> (ø)
src/core/tests/storage.rs
src/core/src/ffi/hyperloglog.rs
src/core/src/sketch/hyperloglog/mod.rs
src/core/src/ffi/nodegraph.rs
src/core/tests/minhash.rs
... and 25 more

Powered by Codecov. Last update cccd06c...76b2b02. Read the comment docs.

@ctb
Contributor Author

ctb commented Mar 5, 2022

@taylorreiter @bluegenes @luizirber any thoughts on additional functionality beyond regexps, -v, and -i? this is already an immediately useful command 😆

@bluegenes
Contributor

bluegenes commented Mar 5, 2022

> @taylorreiter @bluegenes @luizirber any thoughts on additional functionality beyond regexps, -v, and -i? this is already an immediately useful command 😆

Maybe also -c/--count for just counting results you would get (suppressing normal matched output)?

Base automatically changed from cleanup/misc to latest March 5, 2022 17:55
@ctb
Contributor Author

ctb commented Mar 5, 2022

> Maybe also -c/--count for just counting results you would get (suppressing normal matched output)?

yeah, I like this!

what about other outputs, like -l for showing what files things are in? I can see it maybe being useful.

and another brainstorming thought - I'm guessing it might be handy to support switching the output to a manifest-style CSV so you can use this to create picklists.

@ctb
Contributor Author

ctb commented Mar 6, 2022

Added -c/--count and --csv for outputting manifests of matches.

I think (modulo tests and some refactoring) this PR might be feature complete, at least until we get some experience in using the functionality and have more ideas on what to add. I'll push forward on the tests as I have time, but it's pretty usable already, so let me know if you have any thoughts or comments!

@ctb ctb changed the title [WIP] add sourmash sig grep [MRG] add sourmash sig grep Mar 7, 2022
@ctb
Contributor Author

ctb commented Mar 7, 2022

Welp, got the tests and docs done sooner than expected. Ready for review & merge @sourmash-bio/devs!

Contributor

@bluegenes bluegenes left a comment

It looks like sig grep works with picklist input as well - picklist is applied first, then pattern match. If yes, I think we need a test for this case (pattern + picklist input). Happy to review again with add'l test, or leave to your discretion.
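The ordering described here (picklist first, then pattern) can be sketched as a two-stage filter. This is only an illustration of the case a test would cover, not the PR's code: `select`, `picklist_md5s`, and the sample rows are hypothetical, and the picklist is modeled as a plain set of md5 strings.

```python
import re


def select(rows, pattern, picklist_md5s=None):
    """Apply an optional md5 picklist first, then a regex match on the name.

    Hypothetical helper for illustration; not part of sourmash.
    """
    if picklist_md5s is not None:
        # Stage 1: keep only rows named by the picklist.
        rows = [r for r in rows if r["md5"] in picklist_md5s]
    # Stage 2: pattern match on whatever survived the picklist.
    regex = re.compile(pattern)
    return [r for r in rows if regex.search(r["name"])]


# Made-up rows, for illustration only.
rows = [
    {"name": "sample A Escherichia", "md5": "aaaa"},
    {"name": "sample B Shewanella", "md5": "bbbb"},
    {"name": "sample C Shewanella", "md5": "cccc"},
]

# The picklist allows A and B; the pattern allows B and C; only B passes both.
selected = select(rows, "Shewanella", picklist_md5s={"aaaa", "bbbb"})
print([r["md5"] for r in selected])
```

Since both stages only remove rows, swapping them would give the same result; the order mainly determines how many rows the pattern match has to examine.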

src/sourmash/cli/sig/grep.py
Comment on lines +128 to +129
if picklist:
    sourmash_args.report_picklist(args, picklist)
Contributor

does this need a test for grep + picklist? I see one for when this will fail (LCA)...

Contributor Author

ooh, nice catch! test_sig_grep_7_picklist_md5_lca_fail tests the failure, but there's no test for the successful use of --picklist. Will add.

Contributor Author

added in 76b2b02

Contributor

looks good!

@ctb ctb merged commit ba38d14 into latest Mar 7, 2022
@ctb ctb deleted the add/sig_grep branch March 7, 2022 23:47
@ctb
Contributor Author

ctb commented Mar 7, 2022

🎉

Development

Successfully merging this pull request may close these issues.

add a sig grep command?
2 participants