-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Add clear
option to set_abundances(...) method
#1046
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1046 +/- ##
==========================================
- Coverage 83.99% 83.86% -0.14%
==========================================
Files 98 98
Lines 9086 9108 +22
==========================================
+ Hits 7632 7638 +6
- Misses 1454 1470 +16
Continue to review full report at Codecov.
|
Hey Titus, my guess is that there are at least a few errors with my PR but the tests all still pass. Let me know if there is anything I should change, as usual. Also, is this supposed to give a command line option that sets no_clear to true? (I didn't attempt to add this feature so let me know if I should) |
hi @erikyoung85 promising start! I think only the If you want to expand into Rust territory, you could add #893 into this. I haven't looked closely at what needs to be done there, though. I'll think about the tests more later. A good rule of thumb is that if you're worried about errors, first write some simple tests that check general behavior, and then start testing the edge cases that worry you! |
60bf4b0
to
107d63d
Compare
so, some brainstorming about tests --
One "interesting" question is, what happens when set_abundances tries to set the abundance of a hash that already has an abundance? |
So my first attempt at adding #893 into this PR. CAUTION: I've never used rust before, and my python skills are still very new, so this is very much a rough draft :) please do give me any and all advice you can think of |
So it looks like a MinHash with an already set hash of {10, 1}, when |
I just noticed that You can add a new (and if you feel like it, exposing |
Okay, so I wanted to simulate the So, I think I found a way (from google) to override the already set abundance if the hash already exists while merging the old and new dictionaries by doing Let me know what you think or if my code can be improved or something of course! |
Hi Erik, skimmed it quickly -- this is definitely progress! I need to
sit down and go through it more thoroughly tho :). Nice work getting this
far!
best,
--titus
|
Just to be clear, I was suggesting, not asking. Because, as you can see...
... I didn't consider all the conditions, and you're right,
I was suggesting moving the code into Rust because it is not true that it doesn't take extra time. We don't have many benchmarks, but if you check def time_set_abundances_noclear(self):
mh = self.mh
mins = self.populated_mh.get_mins(with_abundance=True)
for i in range(500):
mh.set_abundances(mins, clear=False) and then ran the benchmarks (you need to
If you do it in Rust, then it takes about the same time. (any time we pull data from Rust to Python with
That's a nice find, very cool trick!
As you described, the main issue is that BUT! There are drawbacks here too. pub fn set_hash_with_abundance(&mut self, hash: u64, abundance: u64) {
let mut found = false;
if let Ok(pos) = self.mins.binary_search(&hash) {
if self.mins[pos] == hash {
found = true;
if let Some(ref mut abunds) = self.abunds {
abunds[pos] = abundance;
}
}
}
if found == false {
self.add_hash_with_abundance(hash, abundance);
}
} (and then adapt And just to echo @ctb here: it is really cool how far you're getting with this PR! Congrats for getting deep in the bowels of sourmash =] |
whoops sorry erik I just updated from master branch, which has changed due to a PR being merged. Bad habit of mine! You'll need to do a |
tests/test__minhash.py
Outdated
assert a.get_mins(with_abundance=True) == {10: 3} | ||
|
||
a.set_abundances({20: 5, 10: 2}, clear=False) | ||
assert a.get_mins(with_abundance=True) == {10: 2, 20: 5} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I think this assert is maybe getting the wrong result - we want set_abundances(..., clear=False)
to add the abundances, so here the abundance of 10 would be 5 instead of 2.
note that dictionary updates using the dictionary unpacking mechanism don't sum the values for colliding keys (as you would expect, because dictionaries store many types of values, and there's no reason to special case int in this way).
>>> x = { 1: 1, 2: 1 }
>>> y = { 1: 1, 3: 1 }
>>> {**x, **y}
{1: 1, 2: 1, 3: 1}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait so would it be better for set_abundances(..., clear=False)
to add the abundances or to override and set new ones? Because I think I'm getting opposing opinions from you and @luizirber. Also, If it adds the abundances, wouldn't it have the exact same functionality as add_hash_with_abundance(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great discussion! I'm putting this in a fuller comment now, but add_hash_with_abundance(...)
takes one hash at a time, while set_abundances(...)
takes multiple hashes and is more efficient for large MinHashes.
Thank you for telling me about the benchmarks test that is very cool and I've been playing around with it! So I went ahead with the idea that I moved the code to rust, and it seemed to help the benchmarks a bit, but the noclear option was still pretty slow as you can see here:
This is why I tried the idea that Luiz mentioned earlier of adding the After doing this, the benchmarks look much better:
Let me know if adding this function is okay, because If not I can definitely go through and add a Thanks again for all the help and guidance this has been fun learning more about sourmash as well as python and rust!! |
Scratch the comment above, turns out I didn't interpret my error messages correctly. No need to add any extra functions sorry for the confusion! |
Congrats! This is looking good! In the opening comment on this PR, can you edit things a bit pls? First, I think to close multiple issues, you need to put the #893 issue on another line, i.e. Fixes <issue 1> (you should see that on the lower right of the PR discussion window, it says something like "this PR will close the following issues...") Second, MinHash is a class, not a function :). So Third, it'd be good to have a complete accounting in the top level note for the PR. By my count, you do the following in this PR --
The top level comment should reflect all of this so that you can see it "in a glance", so to speak. And then all the conversations etc below will be left there in case anyone cares about why we made the decisions we made. In terms of the behavior of the functions, I'm not sure I like the behavior of @luizirber any thoughts? |
clear
option to set_abundances(...) method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! I'll wait a day or so for @luizirber to see if he has feedback before approving tho :)
clear
option to set_abundances(...) methodclear
option to set_abundances(...) method
nice work! 🎉 |
Fixes add method to MinHash class to update abundances for a specific hash #1018.
Fixes should we provide
add_hash_with_abundance
via Python MinHash API? #893.make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?