[EXP] provide signature file loading function via HTTP #2256

Open
wants to merge 2 commits into
base: latest

@ctb (Contributor) commented Sep 3, 2022

This PR adds support for loading signatures directly from HTTP URLs via GET, i.e., the normal way of getting files from a Web server.

See discussion here, #2257.

NEXT STEPS: Per #2257, I should look into implementing this more generically using fsspec.

So, for example, this supports:

Loading JSON sig/sig.gz files directly from Web sites

If you make raw JSON signature files available via an Apache download link, you can Do Things With Them:

sourmash sig describe https://farm.cse.ucdavis.edu/~ctbrown/wort-data/wort-sra/ERR1040406.sig 

You can build manifests with HTTP URLs in them, too

For example,

sourmash sig collect https://farm.cse.ucdavis.edu/~ctbrown/wort-data/wort-sra/ERR1040406.sig -o mf.sqldb
sqlite3 mf.sqldb 'select internal_location from sourmash_sketches'

yields

https://farm.cse.ucdavis.edu/~ctbrown/wort-data/wort-sra/ERR1040406.sig
https://farm.cse.ucdavis.edu/~ctbrown/wort-data/wort-sra/ERR1040406.sig
https://farm.cse.ucdavis.edu/~ctbrown/wort-data/wort-sra/ERR1040406.sig

and then you can do things like

sourmash sig summarize mf.sqldb

In turn, this allows you to use the full machinery of picklists etc. on non-local signatures.

With the caveat that you might end up asking to download 13 TB of signature files if you make a mistake...
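
One way to guard against that kind of accident is to check sizes before fetching anything. The following is only a sketch, not part of this PR: it assumes the manifest is the SQLite file built above, and it relies only on the `sourmash_sketches` table and `internal_location` column used in the query above. The helper names (`remote_locations`, `content_length`, `check_download_budget`) are hypothetical, and servers that do not report `Content-Length` are simply not counted.

```python
import sqlite3
import urllib.request


def remote_locations(manifest_path):
    """Return the distinct http(s) locations recorded in a SQLite manifest."""
    conn = sqlite3.connect(manifest_path)
    try:
        rows = conn.execute(
            "SELECT DISTINCT internal_location FROM sourmash_sketches"
        ).fetchall()
    finally:
        conn.close()
    return [
        loc for (loc,) in rows
        if loc and loc.startswith(("http://", "https://"))
    ]


def content_length(url, timeout=30):
    """Ask the server for a file's size with a HEAD request (no body download)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    # Some servers omit the header; report that as "unknown".
    return int(value) if value is not None else None


def check_download_budget(urls, max_bytes, size_of=content_length):
    """Sum the known sizes of `urls`; raise ValueError once max_bytes is exceeded."""
    total = 0
    for url in urls:
        size = size_of(url)
        if size is not None:
            total += size
        if total > max_bytes:
            raise ValueError(
                f"remote signatures exceed the {max_bytes}-byte budget"
            )
    return total
```

Calling `check_download_budget(remote_locations("mf.sqldb"), max_bytes=10**9)` would then fail early instead of starting a very large download. The `size_of` parameter is there so the size lookup can be swapped out or cached.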

More on standalone manifests

So, for example, if you have a manifest containing a bunch of signatures, you can select just the signatures with a keyword in the name (via --include), or at a particular ksize, and build a local database just from them:

sourmash sig describe --include Sulfito farm.podar-ref.mf.csv

Summary thoughts

Anyway this is some useful mixture of oh-so-wrong and so-very-much-right...

This is probably most useful for things like genomes where the individual signatures are quite small; we could distribute just a manifest CSV file to support certain kinds of things. It's also a good reason to support better encoding formats than JSON per #1262.

NOTE: conflicts, maybe, with #1644, which also takes HTTP URLs.

Alternative/additional implementation thoughts

Instead of a custom loader fn for signatures only, we could generically support grabbing files and turning them into file handles for signature, pathlist, and manifest loading.
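
A rough sketch of what that generic approach could look like, using only the standard library. This is not the code in this PR: `open_location` is a hypothetical name, and the sketch assumes it is acceptable to read the whole file into memory (fine for small signatures, not for large databases). It returns a binary file handle for either a local path or an http(s) URL and gunzips transparently, so the same handle could feed signature, pathlist, or manifest loading.

```python
import gzip
import io
import urllib.request
from urllib.parse import urlparse


def open_location(location, timeout=30):
    """Return a binary file handle for a local path or an http(s) URL.

    Gzip-compressed content is detected by its magic bytes and decompressed,
    so .sig and .sig.gz files can be treated the same way by the caller.
    """
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        # Plain GET, i.e. the normal way of getting files from a Web server.
        with urllib.request.urlopen(location, timeout=timeout) as response:
            data = response.read()
    else:
        with open(location, "rb") as fp:
            data = fp.read()

    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    return io.BytesIO(data)
```

A caller could then do something like `json.load(open_location(path_or_url))` without caring where the file lives. A streaming version, or fsspec as discussed in #2257, would be the better choice for large files.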

@codecov (bot) commented Sep 3, 2022

Codecov Report

Merging #2256 (0d7e715) into latest (2e0175a) will increase coverage by 7.29%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           latest    #2256      +/-   ##
==========================================
+ Coverage   84.85%   92.14%   +7.29%     
==========================================
  Files         131      100      -31     
  Lines       15664    11397    -4267     
  Branches     2249     2251       +2     
==========================================
- Hits        13291    10502    -2789     
+ Misses       2082      603    -1479     
- Partials      291      292       +1     
Flag Coverage Δ
python 92.14% <33.33%> (-0.05%) ⬇️
rust ?

Impacted Files Coverage Δ
src/sourmash/sourmash_args.py 92.79% <33.33%> (-0.82%) ⬇️
src/core/src/index/search.rs
src/core/src/index/sbt/mod.rs
src/core/src/index/mod.rs
src/core/src/sketch/nodegraph.rs
src/core/src/ffi/nodegraph.rs
src/core/src/from.rs
src/core/src/ffi/utils.rs
src/core/src/ffi/mod.rs
src/core/src/ffi/signature.rs
... and 22 more

@ctb (Contributor, Author) commented Jan 1, 2023

implemented as a plug-in here: https://github.com/sourmash-bio/sourmash_plugin_load_urls
