[MRG] add basic picklist functionality to sourmash sig extract
#1587
The functionality seems reasonably well baked, so I'd love some UX review when people have bandwidth! Requests for comments/review:
Codecov Report
@@            Coverage Diff             @@
##           latest    #1587     +/-   ##
==========================================
+ Coverage   81.05%   89.34%   +8.28%
==========================================
  Files         102       76      -26
  Lines       10314     6717    -3597
  Branches     1172     1198      +26
==========================================
- Hits         8360     6001    -2359
+ Misses       1748      507    -1241
- Partials      206      209       +3
I think it's fine, but @luizirber or @bluegenes may have bigger/better thoughts about this
I don't hate I don't like |
👍
yeah :). maybe I'll support both...
kk, thx! Incidentally, as I've been working through picklists in the umpteen attached PRs and trying them out on real data, I'm feeling pretty good about this functionality. It is incredibly useful when working with Annoyingly Large Databases.
@bluegenes @mr-eyes @keyabarve this is ready for review and merge!
this lgtm!
...but I'm always concerned that I haven't stared at it long enough if I don't find anything to nitpick...
That's the academic in you speaking. Quash it! 🤣
This PR defines a `SignaturePicklist` class for subsetting collections of signatures, and adds associated functionality to `sourmash sig extract` with the `--picklist` argument.

So, for example, you can pass `--picklist` an argument that selects on md5sum, and `sig extract` will pick out only the subset of signatures whose md5sums perfectly match the column named `md5` in the CSV file `list.csv`.

The argument string is of the following format:
pickfile:column:coltype
Here, `pickfile` is the path to a CSV file; `column` is the name of the column to select from the CSV file; and `coltype` is the type of matching to do on that column.

`coltype`s that are currently supported:

- `name` - exact match to signature's name
- `md5` - exact match to signature's md5sum
- `md5prefix8` - match to 8-character prefix of signature's md5sum
- `md5short` - same as `md5prefix8`
- `ident` - exact match to signature's identifier
- `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space-delimited word in the signature name.
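The argument format and the coltype rules above can be sketched in plain Python. This is only an illustration of the described behavior, not the `SignaturePicklist` code from this PR: `COLTYPE_KEYS`, `parse_pickstr`, `load_pickvalues`, and `is_picked` are hypothetical names, signatures are stood in for by a plain name string and md5 string, and the sketch assumes the CSV column already holds values in the form being matched (e.g. 8-character prefixes for `md5prefix8`).

```python
import csv
import io

# Hypothetical table: coltype -> function deriving the comparison key
# from a signature's name and md5sum, following the list above.
COLTYPE_KEYS = {
    "name": lambda name, md5: name,
    "md5": lambda name, md5: md5,
    "md5prefix8": lambda name, md5: md5[:8],
    "md5short": lambda name, md5: md5[:8],  # same as md5prefix8
    # identifier = first space-delimited word of the name
    "ident": lambda name, md5: name.split(" ")[0],
    # identifier, truncated before the first '.'
    "identprefix": lambda name, md5: name.split(" ")[0].split(".")[0],
}


def parse_pickstr(pickstr):
    """Split 'pickfile:column:coltype' into its three parts."""
    # Split from the right so a ':' inside the file path is tolerated.
    parts = pickstr.rsplit(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected pickfile:column:coltype, got {pickstr!r}")
    pickfile, column, coltype = parts
    if coltype not in COLTYPE_KEYS:
        raise ValueError(f"unsupported coltype {coltype!r}")
    return pickfile, column, coltype


def load_pickvalues(fp, column):
    """Read one named column from an open CSV file into a set of values."""
    return {row[column] for row in csv.DictReader(fp)}


def is_picked(name, md5, coltype, pickset):
    """True if the key derived from this signature is in the picklist."""
    return COLTYPE_KEYS[coltype](name, md5) in pickset


# Small in-memory demo; the CSV content here is made up for illustration.
csv_text = (
    "name,md5\n"
    "GCA_001593925.1 Shewanella sp.,09a08691ce52952152f0e866a59f6261\n"
)
pickset = load_pickvalues(io.StringIO(csv_text), "md5")
picked = is_picked(
    "GCA_001593925.1 Shewanella sp.",
    "09a08691ce52952152f0e866a59f6261",
    "md5",
    pickset,
)
```

Keeping the per-coltype logic in a single lookup table makes it easy to see that `md5short` is just an alias, and that `identprefix` is `ident` with one extra truncation step.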
For now, picklists iterate across all of the signatures, which will be slow for large collections. But once the various APIs are defined, we can enable faster lookup for various Index types - e.g. Zipfile collections, SBTs, LCAs, and anything with a manifest should be able to pick things much faster than linear time. (See #1590 and beyond!)
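As a rough illustration of that trade-off, the sketch below contrasts a linear scan with a lookup through a prebuilt index. Everything here is an assumption made for the example: signatures are modelled as `(md5, name)` tuples, and the "manifest" is just a dict from md5sum to a list of records, which is not necessarily how manifests are designed in #1590. The function names are hypothetical.

```python
from collections import defaultdict


def pick_linear(signatures, pickset):
    """Stream over every signature and yield the matches.

    Cost grows with the size of the whole collection.
    """
    for md5, name in signatures:
        if md5 in pickset:
            yield (md5, name)


def build_manifest(signatures):
    """Build a hypothetical md5 -> [records] index in one pass.

    Lists are used because several signatures may share one md5sum.
    """
    manifest = defaultdict(list)
    for md5, name in signatures:
        manifest[md5].append((md5, name))
    return manifest


def pick_via_manifest(manifest, pickset):
    """Look up only the picked values.

    Cost grows with the size of the picklist, not the collection.
    """
    for md5 in pickset:
        yield from manifest.get(md5, [])


# Made-up records for illustration, including a duplicate md5.
sigs = [("aaa", "one"), ("bbb", "two"), ("aaa", "dup"), ("ccc", "three")]
picks = {"aaa", "ccc"}

linear_result = list(pick_linear(sigs, picks))
manifest_result = list(pick_via_manifest(build_manifest(sigs), picks))
```

Both functions are generators, so matches can be written out as they are found rather than collected in memory first. The manifest path returns the same records as the linear scan, though not necessarily in the same order.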
Additional notes:

- `sourmash sig describe --csv <out.csv>`; the CSV file now contains empty columns for empty name/filename, as opposed to `** no name **`, because that just makes more sense.
- `sourmash sig extract` streaming in both input and output, which could significantly reduce memory usage in certain circumstances (e.g. large collections being extracted/subsetted to zip files or directories)

This PR is on the path to some functionality discussed in more detail in a few places, wrt manifests, picklists, and lazy loading:

- save_pathlist_of_signatures method? #1365

TODO:
Example
building picklists with sourmash sig describe
Here I built a picklist for a collection of test signatures like so:
(In this case the picklist actually contains information for all the signatures, but it could be a subset.)
some example output
Here is some output of the picklist extraction:
Using exact md5sum matches; note, duplicate md5!
Using exact name matches; note, duplicate and empty names!
Using md5sum 8-character prefix:
Using identifiers:
Using identifiers with prefixes:
Picking signatures with combinatorial restrictions (e.g. ksize):