should sourmash gather insist on uniform scaling? #2951

ctb · 2024-01-28T18:18:46Z

thinking through some of the gather issues revealed/discussed in #2950 and the bug in #2825, and also worrying that branchwater fastgather/fastmultigather don't handle adaptive downsampling properly, I'm wondering if we should insist that either all database sketches have a scaled no higher than the query's, or an explicit --scaled argument is provided?

so, if a query had scaled=1000 and a database sketch had scaled=10000, gather would refuse to run unless --scaled=10000 was specified.

It seems like an obvious UX improvement, and it deals nicely with the confusing issues revealed in #2825.
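
A minimal sketch of the proposed rule, to make the behavior concrete. This is not sourmash code: `resolve_gather_scaled` and its parameters are hypothetical names, and the check that an explicit `--scaled` cannot be lower than the query's scaled is an assumption (it follows from not being able to upsample a sketch).

```python
def resolve_gather_scaled(query_scaled, database_scaleds, requested_scaled=None):
    """Return the scaled value gather would run at, or raise ValueError.

    Hypothetical helper, not part of sourmash. `requested_scaled` stands in
    for an explicit --scaled argument; None means it was not given.
    """
    # Without --scaled, the query's scaled is the one everything must match.
    target = query_scaled if requested_scaled is None else requested_scaled

    # Assumption: sketches can be downsampled but never upsampled, so the
    # target cannot be lower than the query's own scaled.
    if target < query_scaled:
        raise ValueError(
            f"cannot run at scaled={target}: query has scaled={query_scaled}"
        )

    # Any database sketch with a higher scaled than the target is an error.
    too_high = sorted({s for s in database_scaleds if s > target})
    if too_high:
        raise ValueError(
            f"database sketches with scaled {too_high} exceed scaled={target}; "
            f"rerun with --scaled={max(too_high)} or higher"
        )
    return target
```

With the example above, `resolve_gather_scaled(1000, [10000])` raises, while `resolve_gather_scaled(1000, [10000], requested_scaled=10000)` returns 10000.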


bluegenes · 2024-02-08T00:14:35Z

I think I like this - it's clear on what is happening and the results are more straightforward to interpret than if we allow adaptive downsampling.

Two thoughts:

1. Would databases need to be at a consistent scaled? This should be straightforward for any prepared database and with manifests + select. Are there any database types where this would present an issue? e.g. sigs in a directory w/ no manifest?
2. This would mean that multiple queries in a fastmultigather run would all be run at the same scaled. Probably fine, could always run separate commands.

Ref: me encountering scaling mismatches while trying to update gather stats calculations :) sourmash-bio/sourmash_plugin_branchwater#205 using #2943 :)

ctb · 2024-02-08T14:24:53Z

I think you might be being too restrictive? I meant that the scaled would be established at the beginning of the gather, and it would be an error if it came across a sketch that had a scaled that was too high.

This could generally be done at the beginning for most of our database types (anything with a manifest can easily be inspected for a scaled factor). I think it would be something to implement at the select call stage.

IIRC the only two sketch types that support multiple scaled out of the box are signature JSON files and zip files.
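
A rough sketch of what the up-front check could look like for a database with a manifest. The rows here are plain dicts standing in for manifest entries; the `name` and `scaled` keys and the function `find_scaled_offenders` are assumptions for illustration, not the sourmash manifest API.

```python
def find_scaled_offenders(manifest_rows, target_scaled):
    """Return names of sketches whose scaled is too high for this gather.

    Hypothetical helper: `manifest_rows` is any iterable of dict-like rows
    with 'name' and 'scaled' entries (assumed layout).
    """
    offenders = []
    for row in manifest_rows:
        # A sketch with a higher scaled than the target cannot be used,
        # because it would have to be upsampled.
        if int(row["scaled"]) > target_scaled:
            offenders.append(row["name"])
    return offenders


def check_manifest_scaled(manifest_rows, target_scaled):
    """Raise ValueError before gather starts if any sketch is unusable."""
    offenders = find_scaled_offenders(manifest_rows, target_scaled)
    if offenders:
        raise ValueError(
            f"{len(offenders)} sketch(es) have scaled > {target_scaled}: "
            + ", ".join(offenders)
        )
```

Database types without a manifest would not be covered by a check like this and would presumably have to raise the error when the offending sketch is loaded.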

@luizirber

…ather` bug around `scaled`. (#3342)

This PR does five things:

First, it swaps the implementation of `KmerMinHash::downsample_max_hash` with `KmerMinHash::downsample_scaled`, and the same for `KmerMinHashBTree`. Previously a call to `downsample_scaled` calculated the right `max_hash` from `scaled`, then called `downsample_max_hash`, which then converted `max_hash` back to `scaled`. This reverses the logic so that (slightly) less work is done and, more importantly, the code is a bit more straightforward.

Second, it changes the `downsample_*` functions so that they do not downsample when no downsampling is needed. As part of this the method signatures are changed to take an object, rather than a reference. This lets the functions return an unmodified `KmerMinHash` when no downsampling is needed.

Third, it turns out the `downsample_*` functions didn't check to make sure that the new `scaled` value was larger than the old one, i.e. they didn't prevent upsampling. That check was added and a new error, `CannotUpsampleScaled`, was added to sourmash core.

Fourth, this uncovered a bug in `RevIndex::gather` where the query was downsampled to the match, even when the match was lower scaled. This PR rejiggers the code so that downsampling is done appropriately in `gather` and `calculate_gather_stats`. Since `RevIndex::gather` isn't used in the sourmash CLI, the bug only presented in the test suite and in the branchwater plugin; see sourmash-bio/sourmash_plugin_branchwater#468 and sourmash-bio/sourmash_plugin_branchwater#467, where a fastmultigather test had to be fixed because of the incorrect scaled values output by `RevIndex::gather`.

Fifth, it includes #3348 from @luizirber, which adds a `Signature::try_into()` to `KmerMinHash` to support the elimination of some clones.

Because of the method signature change for the `downsample_*` functions, the sourmash-core version needs to be bumped to a new major version, 0.16.0.

It's been a fun journey! 😅

Fixes #3343

Some notes on further changes and performance implications:

As a consequence of the `RevIndex::gather` changes, redundant downsampling has to be done in `RevIndex::gather` and `calculate_gather_stats`, unless we want to change the method signature of `calculate_gather_stats`. I decided the PR was big enough that I didn't want to do that in addition. It should not affect most use cases where `scaled` is the same, and we will see if it results in any slowdowns over in the branchwater plugin. See #3196 for an issue on all of this.

We could also just insist that the query scaled is the one to pay attention to, per #2951. This would simplify the code in Python-land as well.

Overall, the performance implications of this PR are not clear. Previously downsampling was being done even when it wasn't needed, so this may speed things up quite a lot for our typical use case! On the other hand, redundant downsampling will happen in cases where there are scaled mismatches. We just need to benchmark it, I think.

Some preliminary benchmarking reported in sourmash-bio/sourmash_plugin_branchwater#430 (comment) suggests that fastgather is now much more memory efficient 🎉 so that's good!

TODO:
- [x] resolve the scaled mismatch stuff. do we return an `Err` or what if the downsampling can't be performed?
- [x] update PR description
- [x] add more tests for downsampling, and maybe for gather
- [x] play with this code over in the branchwater plugin too! sourmash-bio/sourmash_plugin_branchwater#467

---------

Co-authored-by: Luiz Irber <[email protected]>
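
The downsampling rules described in the commit message (no work when `scaled` is unchanged, an error on upsampling, filtering by `max_hash` otherwise) can be sketched in Python. This is an illustration, not the Rust implementation: the functions operate on a bare list of hashes, `CannotUpsampleScaled` is modeled as a Python exception, and the `max_hash` cutoff is approximated as `(2**64 - 1) // scaled`, which may differ in rounding from what sourmash core computes.

```python
MAX_HASH = 2**64 - 1  # assumption: 64-bit hash space


class CannotUpsampleScaled(ValueError):
    """Stand-in for the error of the same name added to sourmash core."""


def max_hash_for_scaled(scaled):
    # Approximate cutoff: keep roughly 1/scaled of the hash space.
    return MAX_HASH // scaled


def downsample_scaled(hashes, old_scaled, new_scaled):
    """Downsample a list of hashes from old_scaled to new_scaled."""
    if new_scaled == old_scaled:
        # No downsampling needed: hand back the input unmodified.
        return hashes
    if new_scaled < old_scaled:
        # A lower scaled would require hashes that were never kept.
        raise CannotUpsampleScaled(
            f"cannot go from scaled={old_scaled} to scaled={new_scaled}"
        )
    cutoff = max_hash_for_scaled(new_scaled)
    return [h for h in hashes if h <= cutoff]
```

In a gather between a query and a match with different scaled values, both would be brought to the higher of the two values before comparison, so neither call would hit the upsampling error.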

ctb added the "5.0" label (issues to address for a 5.0 release) on Jan 28, 2024

ctb mentioned this issue Oct 12, 2024

MRG: improve downsampling behavior on KmerMinHash; fix RevIndex::gather bug around scaled. #3342

Merged

4 tasks
