Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add basic picklist functionality to sourmash sig extract #1587

Merged
merged 19 commits into from
Jun 16, 2021
Merged

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jun 12, 2021

This PR defines a SignaturePicklist class for subsetting collections of signatures, and adds associated functionality to sourmash sig extract with the --picklist argument.

So, for example, you can do

sourmash sig extract --picklist list.csv:md5:md5sum

and sig extract will pick out only the subset of signatures whose md5sums perfectly match the column named md5 in the CSV file list.csv.

The argument string is of the following format: pickfile:column:coltype

Here, pickfile is the path to a CSV file; column is the name of the column to select from the CSV file; and coltype is the type of matching to do on that column.

coltypes that are currently supported:

  • name - exact match to signature's name
  • md5 - exact match to signature's md5sum
  • md5prefix8 - match to 8-character prefix of signature's md5sum
  • md5short - same as md5prefix8
  • ident - exact match to signature's identifier
  • identprefix - match to signature's identifier, before '.'

Identifiers are constructed by using the first space delimited word in the signature name.

For now, picklists iterate across all of the signatures, which will be slow for large collections. But, once the various APIs are defined we can enable faster lookup for various Index types types - e.g. Zipfile collections, SBTs, LCAs, and anything with a manifest should be able to pick things much faster than linear time. (See #1590 and beyond!)

Additional notes:

  • this PR breaks backwards compatibility for sourmash sig describe --csv <out.csv>; the CSV file now contains empty columns for empty name/filename, as opposed to ** no name **, because that just makes more sense.
  • this PR makes sourmash sig extract streaming in both input and output, which could significantly reduce memory usage in certain circumstances (e.g. large collections being extracted/subsetted to zip files or directories)

This PR is on the path to some functionality discussed in more detail in a few places, wrt manifests, picklists, and lazy loading:

TODO:

  • add tests for nomatch
  • add tests for extract w/--name, --md5
  • add tests for extract w/ksize, moltype selectors
  • add functionality for n_found, --require-all
  • maybe add functionality to negate picklists?
  • maybe disambiguate number of signatures from number of matches more, e.g. "found 64 of 64"
  • add some documentation

Example

building picklists with sourmash sig describe

Here I built a picklist for a collection of test signatures like so:

% sourmash sig describe podar-ref/ --csv out.csv -q

(In this case the picklist actually contains information for all the signatures, but it could be a subset.)

some example output

Here is some output of the picklist extraction:


Using exact md5sum matches; note, duplicate md5!

% sourmash sig extract podar-ref --picklist out.csv:md5:md5 -o xyz.zip
picking column 'md5' of type 'md5' from 'out.csv'
loaded 194 distinct values into picklist.
WARNING: 194 values in column 'md5' were not distinct
loaded 455 sigs from 'podar-ref'
loaded 455 total that matched ksize & molecule type
extracted 455 signatures from 1 file(s)
for given picklist, found 194 matches of 194 total
+ sourmash sig extract podar-ref --picklist out.csv:name:name -o xyz.zip

Using exact name matches; note, duplicate and empty names!

% sourmash sig extract podar-ref --picklist out.csv:name:name -o xyz.zip
picking column 'name' of type 'name' from 'out.csv'
loaded 64 distinct values into picklist.
WARNING: 196 empty values in column 'name' in CSV file
WARNING: 64 values in column 'name' were not distinct
loaded 455 sigs from 'podar-ref'
loaded 455 total that matched ksize & molecule type
extracted 259 signatures from 1 file(s)
for given picklist, found 64 matches of 64 total

Using md5sum 8-character prefix:

% sourmash sig extract podar-ref --picklist out.csv:md5:md5prefix8 -o xyz.zip
picking column 'md5' of type 'md5prefix8' from 'out.csv'
loaded 194 distinct values into picklist.
WARNING: 194 values in column 'md5' were not distinct
loaded 455 sigs from 'podar-ref'
loaded 455 total that matched ksize & molecule type
extracted 455 signatures from 1 file(s)
for given picklist, found 194 matches of 194 total

Using identifiers:

% sourmash sig extract podar-ref --picklist out.csv:name:ident -o xyz.zip
picking column 'name' of type 'ident' from 'out.csv'
loaded 64 distinct values into picklist.
WARNING: 196 empty values in column 'name' in CSV file
WARNING: 64 values in column 'name' were not distinct
loaded 455 sigs from 'podar-ref'
loaded 455 total that matched ksize & molecule type
extracted 259 signatures from 1 file(s)
for given picklist, found 64 matches of 64 total

Using identifiers with prefixes:

% sourmash sig extract podar-ref --picklist picking column 'name' of type 'ident.' from 'out.csv'
loaded 64 distinct values into picklist.
WARNING: 196 empty values in column 'name' in CSV file
WARNING: 64 values in column 'name' were not distinct
loaded 455 sigs from 'podar-ref'
loaded 455 total that matched ksize & molecule type
extracted 259 signatures from 1 file(s)
for given picklist, found 64 matches of 64 total

Picking signatures with combinatorial restrictions (e.g. ksize):

+ sourmash sig extract podar-ref --picklist out.csv:name:ident -o xyz.zip -k 31

== This is sourmash version 4.1.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'name' of type 'ident' from 'out.csv'
loaded 64 distinct values into picklist.
WARNING: 196 empty values in column 'name' in CSV file
WARNING: 64 values in column 'name' were not distinct
loaded 197 sigs from 'podar-ref'
loaded 197 total that matched ksize & molecule type
extracted 129 signatures from 1 file(s)
for given picklist, found 64 matches of 64 total

@ctb
Copy link
Contributor Author

ctb commented Jun 12, 2021

The functionality seems reasonably well baked, so I'd love some UX review when people have bandwidth!

Requests for comments/review:

  • is the picklist command-line argument format ok, or am I missing something that's better, simpler, clearer, and/or more flexible?
  • are there better names for the column types? I particularly dislike md5prefix8 and ident.; maybe md5short and identprefix would be better? or something?

@codecov
Copy link

codecov bot commented Jun 12, 2021

Codecov Report

Merging #1587 (14b87d4) into latest (ff75ec0) will increase coverage by 8.28%.
The diff coverage is 95.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1587      +/-   ##
==========================================
+ Coverage   81.05%   89.34%   +8.28%     
==========================================
  Files         102       76      -26     
  Lines       10314     6717    -3597     
  Branches     1172     1198      +26     
==========================================
- Hits         8360     6001    -2359     
+ Misses       1748      507    -1241     
- Partials      206      209       +3     
Flag Coverage Δ
python 89.34% <95.83%> (+0.10%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/sig/picklist.py 94.52% <94.52%> (ø)
src/sourmash/sig/__main__.py 91.68% <97.77%> (+0.44%) ⬆️
src/sourmash/cli/sig/extract.py 100.00% <100.00%> (ø)
src/core/src/ffi/nodegraph.rs
src/core/src/ffi/mod.rs
src/core/src/index/sbt/mod.rs
src/core/tests/test.rs
src/core/src/ffi/utils.rs
src/core/src/index/search.rs
src/core/src/sketch/hyperloglog/estimators.rs
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff75ec0...14b87d4. Read the comment docs.

@taylorreiter
Copy link
Contributor

  • is the picklist command-line argument format ok, or am I missing something that's better, simpler, clearer, and/or more flexible?

I think it's fine, but @luizirber or @bluegenes may have bigger/better thoughts about this

  • are there better names for the column types? I particularly dislike md5prefix8 and ident.; maybe md5short and identprefix would be better? or something?

I don't hate md5prefix8, I knew what it was as soon as I saw the column header. md5short I also know, but I like that md5prefix8 tells me how many characters are in it. Feels more hackable.

I don't like ident.. I thought it was an abbreviation of identifier until I looked at the column descriptions. identprefix is fine tho.

@ctb
Copy link
Contributor Author

ctb commented Jun 14, 2021

  • is the picklist command-line argument format ok, or am I missing something that's better, simpler, clearer, and/or more flexible?

I think it's fine, but @luizirber or @bluegenes may have bigger/better thoughts about this

👍

  • are there better names for the column types? I particularly dislike md5prefix8 and ident.; maybe md5short and identprefix would be better? or something?

I don't hate md5prefix8, I knew what it was as soon as I saw the column header. md5short I also know, but I like that md5prefix8 tells me how many characters are in it. Feels more hackable.

yeah :). maybe I'll support both...

I don't like ident.. I thought it was an abbreviation of identifier until I looked at the column descriptions. identprefix is fine tho.

kk, thx!

Incidentally, as I've been working through picklists in the umpteen attached PRs and trying them out on real data, I'm feeling pretty good about this functionality. It is incredibly useful when working with Annoyingly Large Databases.

@ctb ctb changed the title [WIP] add some basic picklist functionality to sourmash sig extract [MRG] add some basic picklist functionality to sourmash sig extract Jun 16, 2021
@ctb ctb changed the title [MRG] add some basic picklist functionality to sourmash sig extract [MRG] add basic picklist functionality to sourmash sig extract Jun 16, 2021
@ctb
Copy link
Contributor Author

ctb commented Jun 16, 2021

@bluegenes @mr-eyes @keyabarve this is ready for review and merge!

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this lgtm!

...but I'm always concerned that I haven't stared at it long enough if I don't find anything to nitpick...

@ctb
Copy link
Contributor Author

ctb commented Jun 16, 2021

this lgtm!

...but I'm always concerned that I haven't stared at it long enough if I don't find anything to nitpick...

That's the academic in you speaking. Quash it! 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants