Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an implementation of alias method for weighted indices #692

Merged
merged 10 commits into from
Mar 27, 2019

Conversation

zroug
Copy link
Contributor

@zroug zroug commented Jan 14, 2019

This is the pull request that was requested in #601. It adds an alternative implementation for WeightedIndex that is based on the alias method which has O(1) sampling speed.

In comparison to the code I posted in #601 I changed the error handling to better match the existing WeightedIndex implementation and improved sampling performance.

I'm not sure on the API though. Having this algorithm only for f64 weights seems a bit restricting. Especially when I consider that for the use case I originally created this implementation for I used f32 weights to reduce memory usage. While this algorithm can be adapted to work with integer weights, a multiplication with a usize (the number of weights) would still be necessary which would make that potentially prone to overflows.

@dhardy
Copy link
Member

dhardy commented Jan 15, 2019

@huonw review?

@huonw
Copy link
Contributor

huonw commented Jan 15, 2019

I apologise, I'm not in a position to review PRs to rand at the moment; hopefully someone else can take up the task. (A brief glance over the code seems reasonable, but I haven't dug into it to understand it deeply.)

@zroug
Copy link
Contributor Author

zroug commented Jan 15, 2019

As noted in #601 the implementation is based on the pseudo code on http://www.keithschwarz.com/darts-dice-coins. Maybe that makes it easier to review.

@vks
Copy link
Collaborator

vks commented Jan 18, 2019

  • Doc comments are still missing. I presume this is the reason for the [WIP].
  • To be able to replace WeightedIndex, it needs to be implemented for weights that implement SampleUniform + PartialOrd instead of just f64.
  • Here are the benchmark results with my machine:
test distr_weighted_alias_method           ... bench:      10,912 ns/iter (+/- 351) = 733 MB/s
test distr_weighted_alias_method_large_set ... bench:      12,174 ns/iter (+/- 415) = 657 MB/s
test distr_weighted_f64                    ... bench:       7,745 ns/iter (+/- 374) = 1032 MB/s
test distr_weighted_i8                     ... bench:      10,587 ns/iter (+/- 287) = 755 MB/s
test distr_weighted_large_set              ... bench:      63,551 ns/iter (+/- 2,587) = 125 MB/s
test distr_weighted_u32                    ... bench:      10,874 ns/iter (+/- 536) = 735 MB/s

So the new method seems to be slower for small sets, but much faster for large sets.

@zroug
Copy link
Contributor Author

zroug commented Jan 18, 2019

[WIP] is because of the missing docs and the open discussion on the exact API.

The same API as the existing method is not possible. To show you why I have created a branch that implements this method for integer weights. zroug/rand@master...alias-method-integer-weights-demonstration.

The main issue is this line https://github.com/zroug/rand/blob/9f83f8dd5c32797295c1d983d6d36260e8fe0b1d/src/distributions/weighted.rs#L164. This is a multiplication with an usize, the number of weights. So every weight type must have an implementation for that.

Another issue with that line is, that it limits how big an integer weight can be. (Because of overflows.) That might get problematic when you want to use smaller integer types like u16 as weights.

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I see what you mean about generalising; there are a few tricky points:

  • forcing remaining aliases to 1 is only necessary for floats (doesn't matter much since nothing happens otherwise)
  • the float version normalises odds to 1 where the integer version normalises to weight_sum; in theory both versions could normalise to weight_sum, likely without issue
  • the target type T must support T: From<usize> or T: TryFrom<usize>... I don't believe there is a uniform solution for this yet (see Conversions: FromLossy and TryFromLossy traits rust-lang/rfcs#2484)
  • as you say, there is a chance of overflow when using too small weight type; this should be caught in debug builds anyway; I guess documenting this should be sufficient

I think none of these should actually stop us having a uniform implementation for all weight types however (but for now we may need our own trait abstracting over From etc.; there may already be one elsewhere in Rand).

src/distributions/weighted.rs Outdated Show resolved Hide resolved
@zroug
Copy link
Contributor Author

zroug commented Jan 24, 2019

I made the implementation generic. I created a custom trait for now but that could be changed when rust has more traits built in. I also tried to keep checking for errors for types that behave like primitive types. For the benchmarks I have used the exact same ones, that the existing implementation uses.

@zroug
Copy link
Contributor Author

zroug commented Jan 24, 2019

The last force-push was just a rebase. I would have thought that GitHub detects that but it has added it as new commits...

src/distributions/weighted.rs Outdated Show resolved Hide resolved
@zroug
Copy link
Contributor Author

zroug commented Jan 30, 2019

Regarding the documentation: Can I copy relevant parts from the existing documentation? Also English isn't my native langue, so you may have to correct some things when I write the documentation.

@dhardy
Copy link
Member

dhardy commented Jan 31, 2019

Yes, you can copy what you need. Sure, I'll be happy to review for English errors, but you seem to know the language quite well already!

@zroug zroug changed the title [WIP] Add an implementation of alias method for weighted indices Add an implementation of alias method for weighted indices Feb 22, 2019
@zroug
Copy link
Contributor Author

zroug commented Feb 22, 2019

I removed the [WIP] because I have added everything I wanted to add. Of course I will still address any review issues. Should I squash the commits before merge?

@dhardy
Copy link
Member

dhardy commented Feb 23, 2019

We don't require squashing commits, though if there are many small ones it can be nice to tidy them up a little.

@zroug
Copy link
Contributor Author

zroug commented Feb 23, 2019

I cleaned up the commits a bit.

@vks
Copy link
Collaborator

vks commented Feb 25, 2019

Updated benchmarks:

test distr_weighted_alias_method_f64       ... bench:      10,417 ns/iter (+/- 238) = 767 MB/s
test distr_weighted_alias_method_i8        ... bench:       9,937 ns/iter (+/- 364) = 805 MB/s
test distr_weighted_alias_method_large_set ... bench:      11,027 ns/iter (+/- 299) = 725 MB/s
test distr_weighted_alias_method_u32       ... bench:       9,904 ns/iter (+/- 336) = 807 MB/s
test distr_weighted_f64                    ... bench:       8,598 ns/iter (+/- 289) = 930 MB/s
test distr_weighted_i8                     ... bench:      10,597 ns/iter (+/- 432) = 754 MB/s
test distr_weighted_large_set              ... bench:      63,551 ns/iter (+/- 4,722) = 125 MB/s
test distr_weighted_u32                    ... bench:      11,003 ns/iter (+/- 561) = 727 MB/s

So the alias method seems to be faster for integers and large float sets, but a bit slower for small float sets. I think it might make sense to drop our existing implementation.

src/distributions/weighted.rs Outdated Show resolved Hide resolved
src/distributions/weighted.rs Outdated Show resolved Hide resolved
src/distributions/weighted.rs Outdated Show resolved Hide resolved
src/distributions/weighted.rs Outdated Show resolved Hide resolved
src/distributions/weighted.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@vks vks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zroug!

src/distributions/weighted.rs Outdated Show resolved Hide resolved
#[cfg(feature = "alloc")]
pub use self::weighted::{
AliasMethodWeight, AliasMethodWeightedIndex, AliasMethodWeightedIndexError, WeightedError,
WeightedIndex,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could make the module public and name these weighted::{AliasMethodWeight, AliasMethod, Error, BinarySearch}. Thoughts?

(obviously a breaking change)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be better and more like it is done in the standard library (std::io::Error). Again, let me know if you want to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe the module must be renamed because with the proposed naming the 'index' part is lost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too sure we need to keep Index in the name anyway; it's the only type of weighted sampling we have. It's not the best idea to break this stuff again, but still better to get it right than leave a mess IMO.

But lets wait for @vks to comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned above, I think we might get rid of our current implementation and only have the alias method. Then there wouldn't be naming issues.

Alternatively, if we decide to keep both, I would prefer to drop the common AliasMethod prefix and instead have an alias_method module. This would make it much easier for users to switch between the two implementations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current benchmarks show the alias method to always be in the lead or only slightly behind, however if you move the set-up time into the measurement loop, then the binary-search method can be significantly faster (three times faster on the large_set bench with 1000 samples; nine times faster with 100 samples, and a little faster on the smaller sets).

Memory usage will be a little higher with the Alias method due to the extra Vec<usize>; mostly this is unimportant I think (unless memory constrained and having a large set of weights in a small type).

The Alias method has some extra requirements on the type, notably Copy. Should we use Clone instead?


I think there is room for both implementations, though the current presentation and documentation is not ideal. So what do you think about the following structure?

distributions::{
    weighted::{
        alias::{WeightedIndex, Weight},
        WeightedIndex, Error
    },
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Alias method has some extra requirements on the type, notably Copy. Should we use Clone instead?

I missed that. In that case we should probably have both, and working with Clone would be nicer.

So what do you think about the following structure?

This is what I would suggest as well.

Copy link
Member

@dhardy dhardy Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. @zroug would you make these changes please?

The module documentation should give advice something like the following:

If you will sample from the WeightedIndex distribution only a few times, then the binary search method will be fastest, however, if you require many samples (thousands) then the Alias method may be faster. Both methods have O(n) set-up, however the cost factor for the Alias method is significantly higher enabling O(1) sampling vs O(log(n)) for the binary-search method. For small n, sampling may also be faster with the binary-search method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will make these changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did these changes in 5b29341 but I used alias_method instead of alias as module name. Just alias didn't sound right to me and the word alias has a much broader meaning. Are you okay with this?

I wasn't sure if I should keep the reexports for WeightedIndex. I have kept them.

@dhardy
Copy link
Member

dhardy commented Mar 23, 2019

There are still a few broken links in the documentation, and it would be nice to have some comparative information in the weighted module doc, but these can come later, so I think it's time to merge this.

One caveat, as @zroug mentions, is that distributions::WeightedIndex is now an alias for distributions::weighted::WeightedIndex. The best option may be to deprecate the re-exports, however we should probably hold off on this for now since we might instead move this to another crate like rand_distr::weighted::*.

@vks do you agree? If so we can merge this as-is then prepare some doc link fixes.

@vks
Copy link
Collaborator

vks commented Mar 26, 2019

@dhardy Yes, I agree!

The best option may be to deprecate the re-exports, however we should probably hold off on this for now since we might instead move this to another crate like rand_distr::weighted::*.

Note that rand::seq depends on WeightedIndex. Should this also be moved to a different crate?

@dhardy dhardy merged commit aabc596 into rust-random:master Mar 27, 2019
@dhardy
Copy link
Member

dhardy commented Mar 27, 2019

Good point. Maybe it is best just to deprecate distributions::WeightedIndex then?

This was referenced Mar 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants