-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optionally enable full gather
csv output from fastmultigather
#187
Comments
or we could shortcut! |
This was referenced Jan 27, 2024
So a few lessons from calc-full-gather - basically, calc-full-gather works with fastgather results, but not with fastmultigather results. This is because fastmultigather for various reasons doesn't have the same columns as fastgather. SIGH. |
bluegenes
added a commit
to sourmash-bio/sourmash
that referenced
this issue
Feb 21, 2024
This PR adds a `calculate_gather_stats` function (and criterion benchmark) to calculate all gather statistics directly in rust. It makes use of this function in `disk_revindex.rs`, meaning that gather on a rocksdb database will now produce the full `GatherResult`. Note that this PR does not change the python layer to use these stats. This is just a step on the way to that :). ## New functions: - `Sketch::MinHash::sum_abunds()` - sum abundances in a MinHash - `Sketch::MinHash::n_unique_kmers()` - len(self) * self.scaled. - `Sketch::inflate(abunds_from)` - inflate abundances in self using abund values from `abunds_from` MinHash - `Sketch::inflated_abundances(abunds_from)` - same process as `inflate`, but just return the abundance vector and sum of abundances. - `ani_utils::ani_from_containment` - a streamlined function to get ANI from containment. I suspect a lot of the time added for ANI calculation (ref: testing over in branchwater) is the way we calculate it in python (including calculating a total number of k-mers). This excludes anything not really needed for basic ANI without confidence intervals. - `ani_utils::ani_from_containment_ci` - get high and low CI ANI from containment - `index::calculate_gather_stats` - calculate all gather statistics that can be separated from the gather iteration and return the full `GatherResult` The "expensive" calculations: - abundance-weighted values require re-calculating intersections and/or manipulating the abundance vector > Thankfully I think the new `inflate`/`inflated_abundances` code is pretty efficient? It uses the `itertools` `merge_join_by` strategy from https://github.com/sourmash-bio/sourmash/compare/lirber/itertools_merge. Thanks @luizirber :) - abundance `median` and `stddev` currently require cloning abunds, not sure how big of an issue that is. - ANI with confidence intervals ## Punted Issues - #3020: calculate all gather stats for all other database types. start with `linear.rs` since it uses `GatherResult` struct already. - #3021 try moving `calculate_gather_stats` outside of the `gather` functions to enable postprocessing in parallel. - #3022 : write ffi for the new functions? - `sketch::minhash::sum_abunds` - `sketch::minhash::n_unique_kmers`, - `sketch::minhash::inflate` - `sketch::minhash::inflated_abundances` - `ani_utils::ani_from_containment` - `ani_utils::ani_from_containment_ci` - `ani_utils::prob_nothing_in_common` ref sourmash-bio/sourmash_plugin_branchwater#187. --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Mohamed Abuelanin <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Luiz Irber <[email protected]> Co-authored-by: Luiz Irber <[email protected]>
4 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
after #134, I was hoping to provide all gather columns in the branchwater fastmultigather results csv by using the rust
GatherResult
and adding in the query-specific columns within branchwater.I think that's feasible, but might require a little more work in rust core:
disk_revindex.rs
: need to populate remainingGatherResult
cols https://github.com/sourmash-bio/sourmash/blob/latest/src/core/src/index/revindex/disk_revindex.rs#L327-L334linear.rs
: need to populate remainingGatherResult
cols https://github.com/sourmash-bio/sourmash/blob/ccb1d491807767ab9d3c9b005ff21ac953c34883/src/core/src/index/linear.rs#L186-L193mem_revindex.rs
: is this actually usingGatherResult
? https://github.com/sourmash-bio/sourmash/blob/latest/src/core/src/index/revindex/mem_revindex.rs#L201-L233Ref:
rust
GatherResult
: https://github.com/sourmash-bio/sourmash/blob/ccb1d491807767ab9d3c9b005ff21ac953c34883/src/core/src/index/mod.rs#L27-L59python
GatherResult
(all columns needed for output file parity). Tthe query-specific ones could/should remain outside of the rustGatherResult
, since they're just extra info that help us later, once we no longer have access to the query sketch. https://github.com/sourmash-bio/sourmash/blob/ccb1d491807767ab9d3c9b005ff21ac953c34883/src/sourmash/search.py#L428The text was updated successfully, but these errors were encountered: