[MRG] Move greyhound-core into sourmash #1238
Codecov Report
@@ Coverage Diff @@
## latest #1238 +/- ##
==========================================
+ Coverage 83.58% 89.94% +6.35%
==========================================
Files 113 88 -25
Lines 12185 8592 -3593
Branches 1684 1705 +21
==========================================
- Hits 10185 7728 -2457
+ Misses 1743 600 -1143
- Partials 257 264 +7
Hmm. 👎 I instead propose that we add an "experimental" subcommand in the sourmash CLI that is explicitly not stable. Alternatively, as long as it's exposed by the Rust API to Python, I can use it in scripts. So we could keep it away from the CLI, at the cost of not being able to evolve a better CLI as we play with use cases.

Motivation: I do most of my experimentation in Python, as do others, and I think sourmash has benefited substantially from this kind of experimentation. So I want to keep this available!
ab74189 to b37b51f
Fair point. Note that we can request "experimental" features from the Rust side on any sourmash PRs just fine (add "--features experimental" to the cargo invocation in |
is there value in combining this with #1131 at all? (I don't really know, am wondering.)

Probably... I see the LcaDB as a superset of RevIndex, and could probably extend the latter with the taxinfo bits from the former, but that becomes a larger PR than this one.
c52ab78 to 05f34f4
as I suspected, shoehorning greyhound into how sourmash does gather is more
First hurdle is that I did the greyhound demos using on-disk signatures (and paths),
Next, the gather issue discussed in #1263
4013982 to ef5c62f
Is fine by me!
add getset, wip parallel feature flag wip colors simpler impl first parallel hash to color construction wip Revert "wip" This reverts commit d65da76. must insert small_color into large_color before setting it trying out a small set impl try compressing colors inside reduce size test use new released vec-collections update cbindgen make parallel/sequential more maintainable some notes on partial serde start revindex in py start ffi first test passing second test passing modify colors.update to accept an iter instead of slices color count tracker update sourmash.h blanket implementation for counter_gather start working on memstorage niv update avoid a mut ref in save by using lots of mutexes fix codecov path fixes expose InnerStorage basic test passing in-memory sigs working revert counter_gather to gather in search.py lint cleanup cbindgen fixes moved MemStorage to #1463 implement signatures() fix initialization
🎉 🎉 🎉 🎉 🎉 🎉
After a frustrating 15 minutes of debugging, I feel like the reorganization of |
On-disk RevIndex based on RocksDB, initially implemented in https://github.com/luizirber/2022-06-26-rocksdb-eval

This is the index/core data structure backing https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to `0.12.0`. This is mostly not exposed through the FFI yet.

Tests from the in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is not well supported (doesn't allow saving/loading from disk, for example), and should be wholly replaced by `sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the future. It is confusing to have these different RevIndex impls in Rust, and I started converging them, but the work is not completely done yet.

#2727 is a better starting point for how the `Index` abc/trait should work across Python/Rust, and I started moving the Rust indices to start from a `LinearIndex` and later specialize into a `RevIndex`, which will make it easier to translate the work from #2727 for future indices across FFI.

A couple of new concepts introduced in this PR:

- a `Collection` is a `Manifest` + `Storage`. So a zip file like the ones for GTDB databases fits this easily (storage = `ZipStorage`, manifest is read from the zipfile), but a list of file paths does too (manifest built from the file paths, storage = `FSStorage`). This goes in a slightly different direction than #1901, which was extending `Storage` to support more functionality. I think `Storage` should stay pretty bare and mostly deal with loading/saving data, without much knowledge of **what** data is there (this is covered by `Manifest`).
- a `CollectionSet` is a consistent collection of signatures. Consistent here means: same k-size, downsample-compatible for scaled, same moltype. You can create a `CollectionSet` by running `.select()` on a `Collection`. `CollectionSet` is required for building indices (because we shouldn't be building indices mixing k-size/moltype), and greatly simplifies the logic in many places because it is not necessary to check for compatibility.
- `LinearIndex` was rewritten based on `Collection` (and also the `greyhound`/`branchwater` parallelism), and this supports the "parallel search without an index" use case. There is no index construction per se here, pretty much just a thin layer on top of `Collection` implementing functionality expected from indices.
- `Manifest`, `Selection`, and `Picklist` are still incomplete, but the relevant function definitions are in place. They still need a barrage of tests (and potentially exposing to Python and reusing the ones there in #2726).

## Feature

- Initial implementation for `Manifest`, `Selection`, and `Picklist` following the Python API.
- `Collection` is a new abstraction for working with a set of signatures. A collection needs a `Storage` for holding the signatures (on-disk, in-memory, or remotely), and a `Manifest` to describe the metadata for each signature.
- Expose CSV parsing and RocksDB errors.
- New module `sourmash::index::revindex::disk_revindex` with the on-disk RevIndex implementation based on RocksDB.
- Add `iter` and `iter_mut` methods for `Signature`.
- Add `load_sig` and `save_sig` methods to `Storage` trait for higher-level data manipulation and caching.
- Add `spec` method to `Storage` to allow constructing a concrete `Storage` from a string description.
- Add `InnerStorage` for synchronizing parallel access to `Storage` implementations.
- Add `MemStorage` for keeping signatures in-memory (mostly for debugging and testing).

## Refactor

- Rename `HashFunctions` variants to follow camel-case, so `Murmur64Protein` instead of `murmur64_protein`.
- `LinearIndex` is now implemented as a thin layer on top of `Collection`.
- Move `GatherResult` to `sourmash::index` module.
- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex` (this is the Greyhound version of revindex, in-memory only). It was also refactored internally to build a version of a `LinearIndex` that will be merged in the future with `sourmash::index::LinearIndex`.
- Move `select` method from `Index` trait into a separate `Select` trait, and implement it for `Signature` based on the new `Selection` API.
- Move `SigStore` into `sourmash::storage` module, and remove the generic. Now it always stores `Signature`. Also implement `Select` for it.

## Build

- Add new `branchwater` feature (enabled by default), which can be disabled by downstream projects to avoid bringing in heavy dependencies like rocksdb.
- Add new `rkyv` feature (disabled by default), making `MinHash` serializable with the `rkyv` crate.
- Add semver checks for CI (so we bump versions accordingly, or avoid breaking changes).
- Reduce feature combinations on Rust checks (takes much less time to run).
- Disable `musllinux` wheels (need to figure out how to build rocksdb for it).

---------

Co-authored-by: Tessa Pierce Ward
Co-authored-by: C. Titus Brown
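The `Collection` / `CollectionSet` split described above can be pictured with a small Python model. This is only a sketch of the idea, not the sourmash API: `Record`, the dict-backed `MemStorage` here, and the `select()` keyword arguments are all hypothetical names chosen for illustration, and "downsample-compatible" is assumed to mean that a record's scaled value is no larger than the requested one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One manifest row: metadata about a signature, not the signature itself."""
    name: str
    ksize: int
    moltype: str
    scaled: int
    location: str  # key used to fetch the signature bytes from storage


class MemStorage:
    """Bare storage (hypothetical): loads and saves bytes, knows nothing about them."""

    def __init__(self):
        self._data = {}

    def save(self, path, content):
        self._data[path] = content
        return path

    def load(self, path):
        return self._data[path]


class Collection:
    """A manifest (list of records) plus a storage backend."""

    def __init__(self, manifest, storage):
        self.manifest = list(manifest)
        self.storage = storage

    def select(self, *, ksize, moltype, scaled):
        """Keep only records that are mutually compatible for indexing."""
        picked = [
            r for r in self.manifest
            if r.ksize == ksize
            and r.moltype == moltype
            and r.scaled <= scaled  # assumption: can be downsampled to `scaled`
        ]
        return CollectionSet(picked, self.storage)


class CollectionSet(Collection):
    """A Collection guaranteed to hold a single k-size and moltype."""

    def __init__(self, manifest, storage):
        super().__init__(manifest, storage)
        kinds = {(r.ksize, r.moltype) for r in self.manifest}
        if len(kinds) > 1:
            raise ValueError(f"inconsistent records: {sorted(kinds)}")


storage = MemStorage()
records = [
    Record("a", 31, "DNA", 1000, storage.save("a.sig", b"sig-a")),
    Record("b", 21, "DNA", 1000, storage.save("b.sig", b"sig-b")),
    Record("c", 31, "DNA", 100, storage.save("c.sig", b"sig-c")),
    Record("d", 31, "protein", 1000, storage.save("d.sig", b"sig-d")),
]
consistent = Collection(records, storage).select(ksize=31, moltype="DNA", scaled=1000)
print([r.name for r in consistent.manifest])
```

Keeping the compatibility check in one constructor is what lets code that receives a `CollectionSet` skip per-signature checks, which is the simplification the description points at.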
This PR moves the `greyhound-core` (`RevIndex` and `gather`) into sourmash. It doesn't bring the CLI, web server or browser frontend. (The CLI should probably be exposed here at some point; the web server and frontend should go into `wort`.)

I created a new feature on the Rust side, "experimental". The idea is to allow experimentation without making stability guarantees or having to pass all the checks required for merging (like wasm support). #1221 is another example of an "experimental" feature.
In order to avoid piling up experimental features, I also propose a requirement that sourmash-Python CAN'T use the "experimental" feature. This keeps us honest, and forces stabilization on the Rust side =]
This is sort of equivalent to the `nightly` features in the Rust compiler.

Lots left to do, but a small list:

- `parallel` feature to expose rayon
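For readers new to the two pieces this PR moves, here is a rough Python sketch of the general idea behind a reverse index and a counter-based greedy `gather`. It is an illustration under assumptions, not the greyhound or sourmash implementation: the function names and the plain `dict`/`set` layout are hypothetical, and the real code works on MinHash sketches and encodes dataset sets as "colors", which this sketch omits.

```python
from collections import Counter, defaultdict


def build_revindex(datasets):
    """Map each hash to the set of dataset names that contain it."""
    revindex = defaultdict(set)
    for name, hashes in datasets.items():
        for h in hashes:
            revindex[h].add(name)
    return revindex


def gather(query, datasets, revindex, threshold=1):
    """Greedily pick the dataset covering the most still-unassigned query hashes."""
    remaining = set(query)

    # One pass over the query: count how many query hashes each dataset holds.
    counter = Counter()
    for h in remaining:
        for name in revindex.get(h, ()):
            counter[name] += 1

    results = []
    while counter:
        # Highest count wins; ties are broken by name to keep the order stable.
        best = min(counter, key=lambda name: (-counter[name], name))
        count = counter[best]
        if count < threshold:
            break
        results.append((best, count))

        # Hashes claimed by `best` no longer count for any other dataset.
        matched = remaining & datasets[best]
        for h in matched:
            for name in revindex[h]:
                counter[name] -= 1
        remaining -= matched
        counter = +counter  # drop datasets whose count fell to zero

    return results


datasets = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}}
revindex = build_revindex(datasets)
print(gather({1, 2, 3, 4, 5, 6, 7}, datasets, revindex))
# → [('A', 4), ('B', 1), ('C', 1)]
```

The point of the reverse index in this sketch is that the per-dataset counts come from a single pass over the query hashes, and each greedy round only updates counters instead of re-intersecting the query with every dataset.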