Add an implementation of alias method for weighted indices #692

zroug · 2019-01-14T21:47:32Z

This is the pull request that was requested in #601. It adds an alternative implementation for WeightedIndex based on the alias method, which samples in O(1) time.

In comparison to the code I posted in #601 I changed the error handling to better match the existing WeightedIndex implementation and improved sampling performance.

I'm not sure about the API, though. Offering this algorithm only for f64 weights seems a bit restrictive, especially since the use case I originally wrote this implementation for used f32 weights to reduce memory usage. The algorithm can be adapted to integer weights, but it would still require a multiplication by a usize (the number of weights), which makes it prone to overflow.
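For readers unfamiliar with the technique, here is a minimal sketch of Vose's variant of the alias method for f64 weights. It is not the code from this PR: `AliasTable`, `keep`, `alias` and `sample_with` are illustrative names, input validation is omitted, and the two uniform random values are passed in as parameters so the sketch does not depend on any RNG crate.

```rust
/// Alias table built with Vose's variant of the alias method.
/// All names here are illustrative; this is not the type added by the PR.
struct AliasTable {
    /// Probability of keeping column `i` rather than jumping to `alias[i]`.
    keep: Vec<f64>,
    /// Index returned when column `i` is not kept.
    alias: Vec<usize>,
}

impl AliasTable {
    /// O(n) set-up. Assumes `weights` is non-empty, finite, non-negative and
    /// has a positive sum; real code would return an error instead.
    fn new(weights: &[f64]) -> Self {
        let n = weights.len();
        let sum: f64 = weights.iter().sum();
        // Scale the weights so that the average column height is exactly 1.
        let mut scaled: Vec<f64> = weights.iter().map(|w| w * n as f64 / sum).collect();
        let (mut small, mut large): (Vec<usize>, Vec<usize>) =
            (0..n).partition(|&i| scaled[i] < 1.0);

        // Columns that are never paired keep probability 1.0, which also
        // absorbs floating-point rounding error.
        let mut keep = vec![1.0; n];
        let mut alias: Vec<usize> = (0..n).collect();

        while !small.is_empty() && !large.is_empty() {
            let s = small.pop().unwrap();
            let l = large.pop().unwrap();
            keep[s] = scaled[s];
            alias[s] = l;
            // The large column donates the mass needed to fill column `s`.
            scaled[l] = (scaled[l] + scaled[s]) - 1.0;
            if scaled[l] < 1.0 {
                small.push(l);
            } else {
                large.push(l);
            }
        }
        AliasTable { keep, alias }
    }

    /// O(1) sampling: `column` must be uniform in `0..n` and `frac` uniform
    /// in `[0, 1)`.
    fn sample_with(&self, column: usize, frac: f64) -> usize {
        if frac < self.keep[column] {
            column
        } else {
            self.alias[column]
        }
    }
}
```

With weights `[1.0, 3.0]`, column 0 is kept with probability 0.5 and otherwise redirects to index 1, so index 0 is drawn with probability 0.5 × 0.5 = 0.25, as the weights require.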

dhardy · 2019-01-15T10:49:27Z

@huonw review?

huonw · 2019-01-15T12:50:10Z

I apologise, I'm not in a position to review PRs to rand at the moment; hopefully someone else can take up the task. (At a brief glance the code seems reasonable, but I haven't dug into it to understand it deeply.)

zroug · 2019-01-15T14:26:08Z

As noted in #601, the implementation is based on the pseudocode at http://www.keithschwarz.com/darts-dice-coins. Maybe that makes it easier to review.

vks · 2019-01-18T09:51:51Z

Doc comments are still missing. I presume this is the reason for the [WIP].
To be able to replace WeightedIndex, it needs to be implemented for weights that implement SampleUniform + PartialOrd instead of just f64.
Here are the benchmark results on my machine:

test distr_weighted_alias_method           ... bench:      10,912 ns/iter (+/- 351) = 733 MB/s
test distr_weighted_alias_method_large_set ... bench:      12,174 ns/iter (+/- 415) = 657 MB/s
test distr_weighted_f64                    ... bench:       7,745 ns/iter (+/- 374) = 1032 MB/s
test distr_weighted_i8                     ... bench:      10,587 ns/iter (+/- 287) = 755 MB/s
test distr_weighted_large_set              ... bench:      63,551 ns/iter (+/- 2,587) = 125 MB/s
test distr_weighted_u32                    ... bench:      10,874 ns/iter (+/- 536) = 735 MB/s

So the new method seems to be slower for small sets, but much faster for large sets.

zroug · 2019-01-18T15:25:51Z

[WIP] is because of the missing docs and the open discussion on the exact API.

The same API as the existing method is not possible. To show why, I have created a branch that implements this method for integer weights: zroug/rand@master...alias-method-integer-weights-demonstration.

The main issue is this line: https://github.com/zroug/rand/blob/9f83f8dd5c32797295c1d983d6d36260e8fe0b1d/src/distributions/weighted.rs#L164. It is a multiplication by a usize, the number of weights, so every weight type must have an implementation for that.

Another issue with that line is that it limits how big an integer weight can be (because of overflows). That might become a problem when you want to use smaller integer types like u16 as weights.
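To make the overflow concern concrete, the following sketch multiplies a u16 weight by the number of weights using checked arithmetic. `scale_weight` is a hypothetical helper written for this illustration, not a function from the linked branch.

```rust
use std::convert::TryFrom;

/// Scales one integer weight by the number of weights, as the integer variant
/// of the alias method needs to do. Returns `None` if the count does not fit
/// in `u16` or the product overflows.
/// `scale_weight` is a hypothetical helper, not code from the PR.
fn scale_weight(weight: u16, n: usize) -> Option<u16> {
    let n = u16::try_from(n).ok()?;
    weight.checked_mul(n)
}
```

With 300 weights, a weight of 200 still fits (200 × 300 = 60000), but a weight of 300 does not (300 × 300 = 90000 exceeds `u16::MAX` = 65535), so the usable weight range shrinks as the number of weights grows.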

dhardy

Looks good.

I see what you mean about generalising; there are a few tricky points:

- forcing remaining aliases to 1 is only necessary for floats (it doesn't matter much, since nothing happens otherwise)
- the float version normalises odds to 1, whereas the integer version normalises to weight_sum; in theory both versions could normalise to weight_sum, likely without issue
- the target type T must support T: From<usize> or T: TryFrom<usize>... I don't believe there is a uniform solution for this yet (see "Conversions: FromLossy and TryFromLossy traits", rust-lang/rfcs#2484)
- as you say, there is a chance of overflow when the weight type is too small; this should be caught in debug builds anyway, so I guess documenting it should be sufficient

I think none of these should actually stop us having a uniform implementation for all weight types however (but for now we may need our own trait abstracting over From etc.; there may already be one elsewhere in Rand).
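One possible shape for such a trait is sketched below. It is an assumption for illustration only: the trait name `Weight`, the method `try_from_count` and the helper `scale_by_count` are hypothetical and are not the trait this PR ends up defining.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a trait abstracting over `From<usize>` and
/// `TryFrom<usize>` for weight types.
trait Weight: Copy + PartialOrd + std::ops::Mul<Output = Self> {
    /// Converts a count to the weight type, or returns `None` if the count
    /// cannot be represented.
    fn try_from_count(n: usize) -> Option<Self>;
}

impl Weight for u16 {
    fn try_from_count(n: usize) -> Option<Self> {
        u16::try_from(n).ok()
    }
}

impl Weight for f64 {
    // The cast is lossy for counts above 2^53; this sketch accepts that.
    fn try_from_count(n: usize) -> Option<Self> {
        Some(n as f64)
    }
}

/// Generic scaling step. For integer types the multiplication itself can
/// still overflow, which panics in debug builds.
fn scale_by_count<W: Weight>(w: W, n: usize) -> Option<W> {
    Some(w * W::try_from_count(n)?)
}
```

The same generic function then serves both integer and float weights, with the fallible conversion isolated in one trait method.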

zroug · 2019-01-24T13:42:00Z

I made the implementation generic. I created a custom trait for now, but that could change once Rust has more built-in traits. I also tried to keep the error checks for types that behave like primitive types. For the benchmarks I used exactly the same ones the existing implementation uses.

zroug · 2019-01-24T14:09:53Z

The last force-push was just a rebase. I would have thought that GitHub detects that but it has added it as new commits...

zroug · 2019-01-30T18:44:19Z

Regarding the documentation: can I copy relevant parts from the existing documentation? Also, English isn't my native language, so you may have to correct some things when I write the documentation.

dhardy · 2019-01-31T09:32:10Z

Yes, you can copy what you need. Sure, I'll be happy to review for English errors, but you seem to know the language quite well already!

zroug · 2019-02-22T19:53:56Z

I removed the [WIP] because I have added everything I wanted to add. Of course I will still address any review issues. Should I squash the commits before merge?

dhardy · 2019-02-23T08:55:49Z

We don't require squashing commits, though if there are many small ones it can be nice to tidy them up a little.

zroug · 2019-02-23T12:29:15Z

I cleaned up the commits a bit.

vks · 2019-02-25T13:30:43Z

Updated benchmarks:

test distr_weighted_alias_method_f64       ... bench:      10,417 ns/iter (+/- 238) = 767 MB/s
test distr_weighted_alias_method_i8        ... bench:       9,937 ns/iter (+/- 364) = 805 MB/s
test distr_weighted_alias_method_large_set ... bench:      11,027 ns/iter (+/- 299) = 725 MB/s
test distr_weighted_alias_method_u32       ... bench:       9,904 ns/iter (+/- 336) = 807 MB/s
test distr_weighted_f64                    ... bench:       8,598 ns/iter (+/- 289) = 930 MB/s
test distr_weighted_i8                     ... bench:      10,597 ns/iter (+/- 432) = 754 MB/s
test distr_weighted_large_set              ... bench:      63,551 ns/iter (+/- 4,722) = 125 MB/s
test distr_weighted_u32                    ... bench:      11,003 ns/iter (+/- 561) = 727 MB/s

So the alias method seems to be faster for integers and large float sets, but a bit slower for small float sets. I think it might make sense to drop our existing implementation.

vks

Looks good to me, thanks!

dhardy

Thanks @zroug!

dhardy · 2019-02-27T15:21:05Z

src/distributions/mod.rs

+#[cfg(feature = "alloc")]
+pub use self::weighted::{
+    AliasMethodWeight, AliasMethodWeightedIndex, AliasMethodWeightedIndexError, WeightedError,
+    WeightedIndex,


Alternatively we could make the module public and name these weighted::{AliasMethodWeight, AliasMethod, Error, BinarySearch}. Thoughts?

(obviously a breaking change)

zroug · 2019-02-27

I think that would be better, and closer to how it is done in the standard library (std::io::Error). Again, let me know if you want to do that.

zroug · 2019-02-27

But maybe the module would have to be renamed, because with the proposed naming the 'index' part is lost.

dhardy · 2019-02-27

I'm not too sure we need to keep Index in the name anyway; it's the only type of weighted sampling we have. It's not the best idea to break this stuff again, but still better to get it right than leave a mess IMO.

But let's wait for @vks to comment.

vks · 2019-02-27

As I mentioned above, I think we might get rid of our current implementation and only have the alias method. Then there wouldn't be naming issues.

Alternatively, if we decide to keep both, I would prefer to drop the common AliasMethod prefix and instead have an alias_method module. This would make it much easier for users to switch between the two implementations.

dhardy · 2019-03-01

The current benchmarks show the alias method always in the lead or only slightly behind. However, if you move the set-up time into the measurement loop, the binary-search method can be significantly faster: three times faster on the large_set bench with 1000 samples, nine times faster with 100 samples, and a little faster on the smaller sets.

Memory usage will be a little higher with the Alias method due to the extra Vec<usize>; mostly this is unimportant I think (unless memory constrained and having a large set of weights in a small type).
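As a rough back-of-the-envelope sketch of this point, the functions below estimate heap usage under a simplifying assumption: one `Vec<W>` for the binary-search method, and one `Vec<W>` plus one `Vec<usize>` for the alias method. `Vec` headers and spare capacity are ignored, and both function names are made up for this illustration.

```rust
use std::mem::size_of;

/// Approximate heap bytes for `n` weights of type `W` with the binary-search
/// method, assuming a single `Vec<W>`.
fn binary_search_bytes<W>(n: usize) -> usize {
    n * size_of::<W>()
}

/// Approximate heap bytes with the alias method, assuming a `Vec<W>` plus an
/// extra `Vec<usize>` of aliases.
fn alias_bytes<W>(n: usize) -> usize {
    n * (size_of::<W>() + size_of::<usize>())
}
```

On a 64-bit target the extra cost is 8 bytes per weight. That doubles the footprint for f64 weights but multiplies it by nine for u8 weights, which is the "large set of weights in a small type" case.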

The Alias method has some extra requirements on the type, notably Copy. Should we use Clone instead?

I think there is room for both implementations, though the current presentation and documentation is not ideal. So what do you think about the following structure?

distributions::{ weighted::{ alias::{WeightedIndex, Weight}, WeightedIndex, Error }, }
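Written out as Rust modules, the proposed layout might look like the sketch below. The bodies are empty placeholders rather than rand's actual types, and the `Empty` variant is invented for the example.

```rust
// Sketch of the proposed layout; all items are placeholders.
pub mod weighted {
    /// Binary-search implementation (placeholder).
    pub struct WeightedIndex;

    /// Shared error type (placeholder variant).
    #[derive(Debug)]
    pub enum Error {
        Empty,
    }

    pub mod alias {
        /// Alias-method implementation (placeholder).
        pub struct WeightedIndex;

        /// Trait for supported weight types (placeholder).
        pub trait Weight {}
    }
}

// Switching implementations is then a one-line change of the import path:
use weighted::alias::WeightedIndex;
// use weighted::WeightedIndex;
```

Because both types share the name WeightedIndex, only the `use` line changes when moving between the two methods.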

vks · 2019-03-01

> The Alias method has some extra requirements on the type, notably Copy. Should we use Clone instead?

I missed that. In that case we should probably have both, and working with Clone would be nicer.

> So what do you think about the following structure?

This is what I would suggest as well.

dhardy · 2019-03-01

Good. @zroug would you make these changes please?

The module documentation should give advice something like the following:

> If you will sample from the WeightedIndex distribution only a few times, the binary-search method will be fastest; if you require many samples (thousands), the Alias method may be faster. Both methods have O(n) set-up, but the cost factor for the Alias method is significantly higher; in exchange it enables O(1) sampling vs O(log(n)) for the binary-search method. For small n, sampling may also be faster with the binary-search method.
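For contrast with the alias method, here is a minimal sketch of the binary-search approach over cumulative weights. `CumulativeTable` and `sample_with` are illustrative names rather than rand's API, the weights are fixed to u32, and the uniform random point is passed in as a parameter.

```rust
/// Cumulative-weight table for the binary-search method.
/// `CumulativeTable` is an illustrative name, not a type from rand.
struct CumulativeTable {
    cumulative: Vec<u32>,
}

impl CumulativeTable {
    /// O(n) set-up: a single running sum. Assumes the total fits in `u32`
    /// and is non-zero.
    fn new(weights: &[u32]) -> Self {
        let mut total = 0u32;
        let cumulative = weights
            .iter()
            .map(|w| {
                total += w;
                total
            })
            .collect();
        CumulativeTable { cumulative }
    }

    /// O(log n) sampling: `point` must be uniform in `0..total`. Returns the
    /// first index whose cumulative weight exceeds `point`.
    fn sample_with(&self, point: u32) -> usize {
        self.cumulative.partition_point(|&c| c <= point)
    }
}
```

For weights `[2, 1, 3]` the table is `[2, 3, 6]`: points 0 and 1 map to index 0, point 2 to index 1, and points 3 to 5 to index 2. Set-up is a single pass with no pairing step, which is consistent with the lower set-up cost described above.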

zroug · 2019-03-01

Yes, I will make these changes.

zroug · 2019-03-06

I made these changes in 5b29341, but I used alias_method instead of alias as the module name. Just alias didn't sound right to me, and the word alias has a much broader meaning. Are you okay with this?

I wasn't sure if I should keep the reexports for WeightedIndex. I have kept them.

dhardy · 2019-03-23T11:58:00Z

There are still a few broken links in the documentation, and it would be nice to have some comparative information in the weighted module doc, but these can come later, so I think it's time to merge this.

One caveat, as @zroug mentions, is that distributions::WeightedIndex is now an alias for distributions::weighted::WeightedIndex. The best option may be to deprecate the re-exports, however we should probably hold off on this for now since we might instead move this to another crate like rand_distr::weighted::*.

@vks do you agree? If so we can merge this as-is then prepare some doc link fixes.

vks · 2019-03-26T11:43:22Z

@dhardy Yes, I agree!

> The best option may be to deprecate the re-exports, however we should probably hold off on this for now since we might instead move this to another crate like rand_distr::weighted::*.

Note that rand::seq depends on WeightedIndex. Should this also be moved to a different crate?

dhardy · 2019-03-27T09:00:43Z

Good point. Maybe it is best just to deprecate distributions::WeightedIndex then?

zroug force-pushed the master branch from 9e90df3 to 50d3f4b Compare January 14, 2019 22:13

dhardy reviewed Jan 22, 2019

zroug force-pushed the master branch from 14feb99 to ade22bb Compare January 24, 2019 14:05

vks reviewed Jan 24, 2019

dhardy reviewed Jan 24, 2019

huonw mentioned this pull request Jan 27, 2019

WeightedChoice could use the Walker or Vose Alias method for O(1) sampling (instead of O(log n)) #601

Closed

zroug changed the title from "[WIP] Add an implementation of alias method for weighted indices" to "Add an implementation of alias method for weighted indices" Feb 22, 2019

zroug added 6 commits February 23, 2019 13:25

Added an implementation of alias method for weighted indices

c2bed15

Added tests for AliasMethodWeightedIndex

1feb633

Get rid of the extra VecDeque during creation of AliasMethodWeightedIndex

002a001

Made implementation details of AliasMethodWeightedIndex more generic

ea83974

Made AliasMethodWeightedIndex generic

f392fb7

Added documentation for AliasMethodWeightedIndex

2af10fa

zroug force-pushed the master branch from 164b3ab to 2af10fa Compare February 23, 2019 12:27

vks reviewed Feb 25, 2019

zroug added 2 commits February 25, 2019 22:35

Addressed documentation issues from review

9c44b6a

Use pairwise sum only for floating point weights

8641a9b

vks approved these changes Feb 26, 2019

dhardy reviewed Feb 27, 2019

zroug added 2 commits March 6, 2019 17:49

Reorganized distributions::weighted module

5b29341

Use a instead of an when appropriate

950afb3

dhardy merged commit aabc596 into rust-random:master Mar 27, 2019

This was referenced Mar 28, 2019

New rand_distr crate #761

Merged

WeightedIndex distribution naming #763

Closed

dhardy mentioned this pull request May 14, 2019

Merging Rust Random Choice #110

Closed
