-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an implementation of alias method for weighted indices #692
Conversation
@huonw review? |
I apologise, I'm not in a position to review PRs to |
As noted in #601 the implementation is based on the pseudo code on http://www.keithschwarz.com/darts-dice-coins. Maybe that makes it easier to review. |
So the new method seems to be slower for small sets, but much faster for large sets. |
[WIP] is because of the missing docs and the open discussion on the exact API. The same API as the existing method is not possible. To show you why I have created a branch that implements this method for integer weights. zroug/rand@master...alias-method-integer-weights-demonstration. The main issue is this line https://github.com/zroug/rand/blob/9f83f8dd5c32797295c1d983d6d36260e8fe0b1d/src/distributions/weighted.rs#L164. This is a multiplication with an Another issue with that line is, that it limits how big an integer weight can be. (Because of overflows.) That might get problematic when you want to use smaller integer types like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
I see what you mean about generalising; there are a few tricky points:
- forcing remaining aliases to 1 is only necessary for floats (doesn't matter much since nothing happens otherwise)
- the float version normalises odds to 1 where the integer version normalises to
weight_sum
; in theory both versions could normalise toweight_sum
, likely without issue - the target type
T
must supportT: From<usize>
orT: TryFrom<usize>
... I don't believe there is a uniform solution for this yet (see Conversions:FromLossy
andTryFromLossy
traits rust-lang/rfcs#2484) - as you say, there is a chance of overflow when using too small weight type; this should be caught in debug builds anyway; I guess documenting this should be sufficient
I think none of these should actually stop us having a uniform implementation for all weight types however (but for now we may need our own trait abstracting over From
etc.; there may already be one elsewhere in Rand).
I made the implementation generic. I created a custom trait for now but that could be changed when rust has more traits built in. I also tried to keep checking for errors for types that behave like primitive types. For the benchmarks I have used the exact same ones, that the existing implementation uses. |
The last force-push was just a rebase. I would have thought that GitHub detects that but it has added it as new commits... |
Regarding the documentation: Can I copy relevant parts from the existing documentation? Also English isn't my native langue, so you may have to correct some things when I write the documentation. |
Yes, you can copy what you need. Sure, I'll be happy to review for English errors, but you seem to know the language quite well already! |
I removed the [WIP] because I have added everything I wanted to add. Of course I will still address any review issues. Should I squash the commits before merge? |
We don't require squashing commits, though if there are many small ones it can be nice to tidy them up a little. |
I cleaned up the commits a bit. |
Updated benchmarks:
So the alias method seems to be faster for integers and large float sets, but a bit slower for small float sets. I think it might make sense to drop our existing implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zroug!
src/distributions/mod.rs
Outdated
#[cfg(feature = "alloc")] | ||
pub use self::weighted::{ | ||
AliasMethodWeight, AliasMethodWeightedIndex, AliasMethodWeightedIndexError, WeightedError, | ||
WeightedIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively we could make the module public and name these weighted::{AliasMethodWeight, AliasMethod, Error, BinarySearch}
. Thoughts?
(obviously a breaking change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be better and more like it is done in the standard library (std::io::Error
). Again, let me know if you want to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But maybe the module must be renamed because with the proposed naming the 'index' part is lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too sure we need to keep Index
in the name anyway; it's the only type of weighted sampling we have. It's not the best idea to break this stuff again, but still better to get it right than leave a mess IMO.
But lets wait for @vks to comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned above, I think we might get rid of our current implementation and only have the alias method. Then there wouldn't be naming issues.
Alternatively, if we decide to keep both, I would prefer to drop the common AliasMethod
prefix and instead have an alias_method
module. This would make it much easier for users to switch between the two implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current benchmarks show the alias method to always be in the lead or only slightly behind, however if you move the set-up time into the measurement loop, then the binary-search method can be significantly faster (three times faster on the large_set bench with 1000 samples; nine times faster with 100 samples, and a little faster on the smaller sets).
Memory usage will be a little higher with the Alias method due to the extra Vec<usize>
; mostly this is unimportant I think (unless memory constrained and having a large set of weights in a small type).
The Alias method has some extra requirements on the type, notably Copy
. Should we use Clone
instead?
I think there is room for both implementations, though the current presentation and documentation is not ideal. So what do you think about the following structure?
distributions::{
weighted::{
alias::{WeightedIndex, Weight},
WeightedIndex, Error
},
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Alias method has some extra requirements on the type, notably Copy. Should we use Clone instead?
I missed that. In that case we should probably have both, and working with Clone
would be nicer.
So what do you think about the following structure?
This is what I would suggest as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. @zroug would you make these changes please?
The module documentation should give advice something like the following:
If you will sample from the
WeightedIndex
distribution only a few times, then the binary search method will be fastest, however, if you require many samples (thousands) then the Alias method may be faster. Both methods haveO(n)
set-up, however the cost factor for the Alias method is significantly higher enablingO(1)
sampling vsO(log(n))
for the binary-search method. For smalln
, sampling may also be faster with the binary-search method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will make these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did these changes in 5b29341 but I used alias_method
instead of alias
as module name. Just alias
didn't sound right to me and the word alias has a much broader meaning. Are you okay with this?
I wasn't sure if I should keep the reexports for WeightedIndex
. I have kept them.
There are still a few broken links in the documentation, and it would be nice to have some comparative information in the One caveat, as @zroug mentions, is that @vks do you agree? If so we can merge this as-is then prepare some doc link fixes. |
@dhardy Yes, I agree!
Note that |
Good point. Maybe it is best just to deprecate |
This is the pull request that was requested in #601. It adds an alternative implementation for
WeightedIndex
that is based on the alias method which hasO(1)
sampling speed.In comparison to the code I posted in #601 I changed the error handling to better match the existing
WeightedIndex
implementation and improved sampling performance.I'm not sure on the API though. Having this algorithm only for
f64
weights seems a bit restricting. Especially when I consider that for the use case I originally created this implementation for I usedf32
weights to reduce memory usage. While this algorithm can be adapted to work with integer weights, a multiplication with ausize
(the number of weights) would still be necessary which would make that potentially prone to overflows.