[MRG] compute-optimized MinHash (for small scaled or large cardinalities) #1045
Conversation
Codecov Report
Coverage Diff (master vs. #1045):

|          | master | #1045  | +/-    |
|----------|--------|--------|--------|
| Coverage | 92.42% | 83.30% | -9.13% |
| Files    | 72     | 97     | +25    |
| Lines    | 5454   | 8749   | +3295  |
| Hits     | 5041   | 7288   | +2247  |
| Misses   | 413    | 1461   | +1048  |
Continue to review full report at Codecov.
Issues to punt from this PR:
This is ready for review @ctb @olgabot @bluegenes. It is still missing coverage on the Rust side (since most of those methods end up not being exposed to Python at all); I'll set them up as more oracle-based property testing, which will also raise the Vec-based MinHash coverage.
Huh. Those are a lot of issues to punt from this PR :). This doesn't touch the Python API or command-line interface. I'm not sure how to review it because of that! I'm fine with merging it, I guess?
I wanted to avoid making it even more massive with a bunch of random changes... And the refactors are easier later, because they will have a baseline that already works. (And the …)
I think this is the relevant info: https://github.com/luizirber/sourmash_resources/blob/03ca7cea8df4640f83fcfa3359ce0be9ce0abab1/README.md#compute

Performance didn't change for a regular use case, and it improved a lot for small scaled or large cardinalities. The memory consumption can be lowered (by using only a `BTreeMap`).

I'll bring up the coverage and test more on the Rust side, and then merge. And probably cut a release after that.
Sounds good to me. There are a few PRs I'd like to see make it into a release, but I guess we can always cut another one soon after :)
As discussed in #1010, for small scaled values or datasets with large cardinality the current implementation starts spending too much time reallocating the internal vectors used for keeping `mins` and `abundances`. This PR is a first try at creating a compute-optimized MinHash that solves that problem, using `BTree` structures in Rust (a `BTreeSet` for mins and a `BTreeMap` for abunds, but it could use only a `BTreeMap`, since the keys are the mins already).

Anecdotally, I used this to calculate signatures for some long-read samples from CAMI 2, and it took 15 minutes instead of the 2+ days of the current method (which hadn't finished when I stopped running it).
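A much-simplified sketch of the idea described above (illustrative only, not the actual sourmash `KmerMinHash` type or API; the type name, method names, and the scaled-style `max_hash` cutoff are assumptions):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical BTree-backed sketch: mins live in a BTreeSet and abundances in
/// a BTreeMap keyed by the same hashes (which is why a single BTreeMap would
/// also work), so inserting never shifts or reallocates a large sorted Vec.
struct BTreeMinHash {
    max_hash: u64, // derived from `scaled`; 0 means "no limit" in this sketch
    mins: BTreeSet<u64>,
    abunds: BTreeMap<u64, u64>,
}

impl BTreeMinHash {
    fn new(max_hash: u64) -> Self {
        BTreeMinHash {
            max_hash,
            mins: BTreeSet::new(),
            abunds: BTreeMap::new(),
        }
    }

    fn add_hash(&mut self, hash: u64) {
        // Scaled semantics: keep every hash at or below max_hash.
        if self.max_hash == 0 || hash <= self.max_hash {
            // O(log n) insert; no `Vec::insert`-style shifting of the tail.
            self.mins.insert(hash);
            *self.abunds.entry(hash).or_insert(0) += 1;
        }
    }

    fn iter_mins(&self) -> impl Iterator<Item = u64> + '_ {
        self.mins.iter().copied() // BTreeSet already iterates in sorted order
    }
}
```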
BUT! All the other operations (merge, similarity, etc.) are SLOWER; only insertion ends up being faster. That's why I'm calling it "compute-optimized": in the other cases it's better to use the current one. (Pending: analysis of `gather`, which does rebuild the query MinHash a lot...)

Fixes #1010
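One way to picture the trade-off (an illustration under assumed names, and only one plausible explanation, not a description of the actual sourmash internals): similarity and merge boil down to walking two sorted collections of mins, which is very cache-friendly over a contiguous sorted `Vec` but involves chasing tree nodes with a `BTreeSet`.

```rust
use std::cmp::Ordering;
use std::collections::BTreeSet;

/// Shared mins between two sorted, deduplicated slices (the current Vec
/// layout): a single two-pointer pass over contiguous memory.
fn intersection_len_vec(a: &[u64], b: &[u64]) -> usize {
    let (mut i, mut j, mut shared) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                shared += 1;
                i += 1;
                j += 1;
            }
        }
    }
    shared
}

/// The same count over BTreeSets: a similar number of comparisons, but each
/// step walks tree nodes instead of scanning a flat slice, so similarity- and
/// merge-style operations lose the locality advantage of the Vec layout.
fn intersection_len_btree(a: &BTreeSet<u64>, b: &BTreeSet<u64>) -> usize {
    a.intersection(b).count()
}
```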
TODO

- [ ] `proptest` using both impls, see if they give the same results (a rough sketch follows below)
- [ ] `build_templates` for compute
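The oracle-based property test mentioned above could look roughly like this, using the `proptest` crate. `VecSketch` and `BTreeSketch` are deliberately tiny, hypothetical stand-ins for the Vec-based and BTree-based implementations (the real tests would drive the actual sourmash types through the same API), and the `max_hash` value is only illustrative:

```rust
use std::collections::BTreeSet;

use proptest::prelude::*;

#[derive(Default)]
struct VecSketch {
    mins: Vec<u64>, // kept sorted, no duplicates
}

impl VecSketch {
    fn add_hash(&mut self, hash: u64, max_hash: u64) {
        if hash <= max_hash {
            if let Err(pos) = self.mins.binary_search(&hash) {
                self.mins.insert(pos, hash); // O(n): shifts everything after `pos`
            }
        }
    }
}

#[derive(Default)]
struct BTreeSketch {
    mins: BTreeSet<u64>,
}

impl BTreeSketch {
    fn add_hash(&mut self, hash: u64, max_hash: u64) {
        if hash <= max_hash {
            self.mins.insert(hash); // O(log n); duplicates handled by the set
        }
    }
}

proptest! {
    // Oracle property: for any sequence of hashes, both implementations
    // must end up with exactly the same mins.
    #[test]
    fn vec_and_btree_agree(hashes in prop::collection::vec(any::<u64>(), 0..500)) {
        let max_hash = u64::MAX / 1000; // stands in for a scaled value of 1000
        let mut oracle = VecSketch::default();
        let mut candidate = BTreeSketch::default();
        for h in &hashes {
            oracle.add_hash(*h, max_hash);
            candidate.add_hash(*h, max_hash);
        }
        let candidate_mins: Vec<u64> = candidate.mins.iter().copied().collect();
        prop_assert_eq!(oracle.mins, candidate_mins);
    }
}
```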
Checklist

- [ ] `make test` Did it pass the tests?
- [ ] `make coverage` Is the new code covered?
- [ ] Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- [ ] Was a spellchecker run on the source code and documentation after changes were made?