Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Add clear option to set_abundances(...) method #1046

Merged
merged 15 commits into from
Jul 14, 2020

Conversation

erikyoung85
Copy link
Contributor

@erikyoung85 erikyoung85 commented Jun 25, 2020

  • add a clear method to MinHash objects in Python and Rust, that clears all hashes (and abundances).
  • add a clear option to set_abundances that clears the MinHash before setting the abundances.
  • add a add_hash_with_abundance to the Python API for MinHash that adds the given hash with the given abundance. Of note, this sums the abundances if there is already a non-zero abundance
    Fixes add method to MinHash class to update abundances for a specific hash #1018.
    Fixes should we provide add_hash_with_abundance via Python MinHash API? #893.
  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@erikyoung85 erikyoung85 requested a review from ctb June 25, 2020 03:26
@erikyoung85 erikyoung85 marked this pull request as draft June 25, 2020 03:28
@codecov
Copy link

codecov bot commented Jun 25, 2020

Codecov Report

Merging #1046 into master will decrease coverage by 0.13%.
The diff coverage is 32.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1046      +/-   ##
==========================================
- Coverage   83.99%   83.86%   -0.14%     
==========================================
  Files          98       98              
  Lines        9086     9108      +22     
==========================================
+ Hits         7632     7638       +6     
- Misses       1454     1470      +16     
Flag Coverage Δ
#rusttests 67.99% <0.00%> (-0.33%) ⬇️
Impacted Files Coverage Δ
src/core/src/ffi/minhash.rs 0.00% <0.00%> (ø)
src/core/src/sketch/minhash.rs 93.18% <0.00%> (-0.94%) ⬇️
sourmash/_minhash.py 95.07% <100.00%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f5a55b...74b3de0. Read the comment docs.

@erikyoung85
Copy link
Contributor Author

Hey Titus, my guess is that there are at least a few errors with my PR but the tests all still pass. Let me know if there is anything I should change, as usual.

Also, is this supposed to give a command line option that sets no_clear to true? (I didn't attempt to add this feature so let me know if I should)

@erikyoung85 erikyoung85 marked this pull request as ready for review June 25, 2020 04:12
@ctb
Copy link
Contributor

ctb commented Jun 25, 2020

hi @erikyoung85 promising start!

I think only the set_abundances method needs a no_clear option; I don't see a use case for a MinHash object to have that property set permanently. Also, MinHash objects get saved into/loaded from JSON and so there'd be a lot more work involved to save that state, and we have to be careful about forwards and backwards compatibility, so without a strong use case I'm -1 on adding it.

If you want to expand into Rust territory, you could add #893 into this. I haven't looked closely at what needs to be done there, though.

I'll think about the tests more later. A good rule of thumb is that if you're worried about errors, first write some simple tests that check general behavior, and then start testing the edge cases that worry you!

@ctb
Copy link
Contributor

ctb commented Jun 26, 2020

so, some brainstorming about tests --

  • on an empty MinHash, no_clear should have no effect
  • on a non-empty MinHash, set_abundances should (by default) clear old {hash: abund} values
  • on a non-empty MinHash, set_abundances with no_clear should retain old {hash: abund} values.

One "interesting" question is, what happens when set_abundances tries to set the abundance of a hash that already has an abundance?

@erikyoung85
Copy link
Contributor Author

So my first attempt at adding #893 into this PR. CAUTION: I've never used rust before, and my python skills are still very new, so this is very much a rough draft :) please do give me any and all advice you can think of

sourmash/_minhash.py Outdated Show resolved Hide resolved
sourmash/_minhash.py Outdated Show resolved Hide resolved
@erikyoung85
Copy link
Contributor Author

So it looks like a MinHash with an already set hash of {10, 1}, when set_abundances(values, clear=False) is called it will add the hash abundances together, instead of resetting it. Correct me if I'm wrong, but this is exactly what add_hash_with_abundance(...) does, so I'm guessing you would prefer for set_abundances to reset the hash's abundance instead. Let me know, and I am definitely willing to try to implement this change in rust if you want me to!

@luizirber
Copy link
Member

So it looks like a MinHash with an already set hash of {10, 1}, when set_abundances(values, clear=False) is called it will add the hash abundances together, instead of resetting it. Correct me if I'm wrong, but this is exactly what add_hash_with_abundance(...) does, so I'm guessing you would prefer for set_abundances to reset the hash's abundance instead. Let me know, and I am definitely willing to try to implement this change in rust if you want me to!

I just noticed that .clear() is not exposed to Python, and then went to check what set_abundances was doing on the Rust side, and saw it is calling mh.clear(). So, it will be harder to solve this in the Python side...

You can add a new clear: bool argument to kmerminhash_set_abundances, and put an if clear check around this line.

(and if you feel like it, exposing .clear() to Python would be cool too =])

@erikyoung85
Copy link
Contributor Author

erikyoung85 commented Jul 3, 2020

Okay, so I wanted to simulate the mh.clear() line being passed on due to the new clear: bool rust argument you mentioned being set to false (I simulated this just by commenting out the mh.clear() line). However, this didn't seem to fix the problem. I found out that it sets the abundances by calling add_hash_with_abundance() in a loop with the new hashes and abunds as parameters.

So, I think I found a way (from google) to override the already set abundance if the hash already exists while merging the old and new dictionaries by doing new_values = {**current_values, **values}. I know it is not exactly the way you asked for but the tests seem to pass without any added extra time.

Let me know what you think or if my code can be improved or something of course!

@ctb
Copy link
Contributor

ctb commented Jul 3, 2020 via email

@luizirber
Copy link
Member

I know it is not exactly the way you asked for (...)

Just to be clear, I was suggesting, not asking. Because, as you can see...

Okay, so I wanted to simulate the mh.clear() line being passed on due to the new clear: bool rust argument you mentioned being set to false (I simulated this just by commenting out the mh.clear() line). However, this didn't seem to fix the problem. I found out that it sets the abundances by calling add_hash_with_abundance() in a loop with the new hashes and abunds as parameters.

... I didn't consider all the conditions, and you're right, add_hash_with_abundance sums the new abundance to the old one, which is not what this PR is about!

but the tests seem to pass without any added extra time.

I was suggesting moving the code into Rust because it is not true that it doesn't take extra time. We don't have many benchmarks, but if you check benchmarks/benchmarks.py there is one check for set_abundances near the end of the file. I added an extra one for set_abundance(clear=False), like this:

    def time_set_abundances_noclear(self):
        mh = self.mh
        mins = self.populated_mh.get_mins(with_abundance=True)
        for i in range(500):
            mh.set_abundances(mins, clear=False)

and then ran the benchmarks (you need to pip install asv): make build && asv dev

benchmarks.TimeMinAbundanceSuite.time_set_abundances          45.1±0ms
benchmarks.TimeMinAbundanceSuite.time_set_abundances_noclear  78.1±0ms

If you do it in Rust, then it takes about the same time.

(any time we pull data from Rust to Python with .get_mins() and then send it back to Rust you can be sure that it will be slower than doing the same operation in Rust...)

So, I think I found a way (from google) to override the already set abundance if the hash already exists while merging the old and new dictionaries by doing new_values = {**current_values, **values}.

That's a nice find, very cool trick!

Let me know what you think or if my code can be improved or something of course!

As you described, the main issue is that set_abundances in src/core/src/ffi/minhash.rs is calling .add_many_with_abund, which in turn calls.add_hash_with_abundance in a loop. It may be time to define a proper set_abundances method in src/core/src/sketch/minhash.rs that does the right thing: iterates over the mins and abundances, and sets the value directly instead of calling .add_hash_with_abundance.

BUT! There are drawbacks here too. .add_hash_with_abundances is preferable because it maintains a proper MinHash when called (it checks if the hash is within bounds, or if it replaces another one...), and set_abundances would have to do it too. Maybe do a .set_hash_with_abundance instead, that checks if the hash already exists in the MinHash, and if it does then it sends the abundance to the correct value, and if not then it calls .add_hash_with_abundance? Something like this:

     pub fn set_hash_with_abundance(&mut self, hash: u64, abundance: u64) {
         let mut found = false;
         if let Ok(pos) = self.mins.binary_search(&hash) {
             if self.mins[pos] == hash {
                 found = true;
                 if let Some(ref mut abunds) = self.abunds {
                     abunds[pos] = abundance;
                 }
             }
         }

         if found == false {
             self.add_hash_with_abundance(hash, abundance);
         }
     }

(and then adapt kmerminhash_set_abundances to call this method in a loop instead of calling .add_many_with_abund(&pairs))


And just to echo @ctb here: it is really cool how far you're getting with this PR! Congrats for getting deep in the bowels of sourmash =]

@ctb
Copy link
Contributor

ctb commented Jul 4, 2020

whoops sorry erik I just updated from master branch, which has changed due to a PR being merged. Bad habit of mine! You'll need to do a git pull origin before you can push again.

sourmash/_minhash.py Outdated Show resolved Hide resolved
assert a.get_mins(with_abundance=True) == {10: 3}

a.set_abundances({20: 5, 10: 2}, clear=False)
assert a.get_mins(with_abundance=True) == {10: 2, 20: 5}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I think this assert is maybe getting the wrong result - we want set_abundances(..., clear=False) to add the abundances, so here the abundance of 10 would be 5 instead of 2.

note that dictionary updates using the dictionary unpacking mechanism don't sum the values for colliding keys (as you would expect, because dictionaries store many types of values, and there's no reason to special case int in this way).

>>> x = { 1: 1, 2: 1 }
>>> y = { 1: 1, 3: 1 }
>>> {**x, **y}
{1: 1, 2: 1, 3: 1}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait so would it be better for set_abundances(..., clear=False) to add the abundances or to override and set new ones? Because I think I'm getting opposing opinions from you and @luizirber. Also, If it adds the abundances, wouldn't it have the exact same functionality as add_hash_with_abundance(...)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great discussion! I'm putting this in a fuller comment now, but add_hash_with_abundance(...) takes one hash at a time, while set_abundances(...) takes multiple hashes and is more efficient for large MinHashes.

@erikyoung85
Copy link
Contributor Author

erikyoung85 commented Jul 8, 2020

Thank you for telling me about the benchmarks test that is very cool and I've been playing around with it! So I went ahead with the idea that set_abundances(..., clear=False) would override the abundance of an already set hash, instead of adding to it (but this can be easily changed back if you decide it should be added instead).

I moved the code to rust, and it seemed to help the benchmarks a bit, but the noclear option was still pretty slow as you can see here:

benchmarks.TimeMinAbundanceSuite.time_set_abundances                       87.6±0ms
benchmarks.TimeMinAbundanceSuite.time_set_abundances_noclear        215±0ms

This is why I tried the idea that Luiz mentioned earlier of adding the clear argument to the kmerminhash_set_abundances_with_clear_option function, however google tells me that rust doesn't support optional arguments in functions. So instead of adding the clear argument to every other call of the function, I just made a new, almost identical function with the clear argument.

After doing this, the benchmarks look much better:

benchmarks.TimeMinAbundanceSuite.time_set_abundances                        97.1±0ms
benchmarks.TimeMinAbundanceSuite.time_set_abundances_noclear         103±0ms

Let me know if adding this function is okay, because If not I can definitely go through and add a clear=True option to all other calls of the function instead of adding a whole new function. Not sure which is better for readability and such.

Thanks again for all the help and guidance this has been fun learning more about sourmash as well as python and rust!!

@erikyoung85 erikyoung85 requested review from luizirber and ctb July 8, 2020 22:03
@erikyoung85
Copy link
Contributor Author

Scratch the comment above, turns out I didn't interpret my error messages correctly. No need to add any extra functions sorry for the confusion!

tests/test__minhash.py Outdated Show resolved Hide resolved
tests/test__minhash.py Outdated Show resolved Hide resolved
@ctb
Copy link
Contributor

ctb commented Jul 12, 2020

Congrats! This is looking good!

In the opening comment on this PR, can you edit things a bit pls?

First, I think to close multiple issues, you need to put the #893 issue on another line, i.e.

Fixes <issue 1>
Fixes <issue 2>

(you should see that on the lower right of the PR discussion window, it says something like "this PR will close the following issues...")

Second, MinHash is a class, not a function :). So MinHash(...) is the constructor.

Third, it'd be good to have a complete accounting in the top level note for the PR. By my count, you do the following in this PR --

  • add a clear method to MinHash objects in Python and Rust, that clears all hashes (and abundances).
  • add a clear option to set_abundances that clears the MinHash before setting the abundances.
  • add a add_hash_with_abundance to the Python API for MinHash that adds the given hash with the given abundance. Of note, this seems to sum the abundances if there is already a non-zero abundance.

The top level comment should reflect all of this so that you can see it "in a glance", so to speak. And then all the conversations etc below will be left there in case anyone cares about why we made the decisions we made.


In terms of the behavior of the functions, I'm not sure I like the behavior of set_abundances(..., clear=False). It looks like what it does is override the count of hashes that already exist, while ignoring counts that do exist. I don't really have a use case for this in mind. I'm thinking that set_abundances(..., clear=False) should add the new abundance to the existing abundance. (Yes, this was discussed in this comment and below - I think I've come down firmly on the side of adding the two abundances together :).

@luizirber any thoughts?

@erikyoung85 erikyoung85 changed the title Add no_clear option to MinHash(...) function Add no_clear option to MinHash(...) class Jul 12, 2020
@erikyoung85 erikyoung85 changed the title Add no_clear option to MinHash(...) class Add clear option to set_abundances(...) method Jul 12, 2020
@erikyoung85 erikyoung85 requested a review from ctb July 13, 2020 16:39
Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I'll wait a day or so for @luizirber to see if he has feedback before approving tho :)

@ctb ctb changed the title Add clear option to set_abundances(...) method [MRG] Add clear option to set_abundances(...) method Jul 13, 2020
@ctb ctb merged commit a6800f1 into master Jul 14, 2020
@ctb
Copy link
Contributor

ctb commented Jul 14, 2020

nice work! 🎉

@ctb ctb deleted the no_clear_option_MinHash branch July 14, 2020 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants