
Look into this alleged optimal algorithm for bounded random integers #1172

Closed
ctsrc opened this issue Sep 2, 2021 · 17 comments
Labels
A-new Propose usage of a new algorithm

Comments

ctsrc commented Sep 2, 2021

Background

What is your motivation?

I came across this interesting pull request that someone made on the official Swift programming language repository:

swiftlang/swift#39143

It makes some very intriguing claims, such as:

reduces the worst-case expected number of random bits required from O(log₂(upperBound)) to log₂(upperBound) + O(1), which is optimal

and

this algorithm can be made unconditional by removing the early out, so that every value computed requires word size + 64 bits from the stream, which breaks the loop-carried dependency for fast generators, unlocking vectorization and parallelization where it was previously impossible

I was wondering whether this might be similarly useful in the rand crate.

What type of application is this? (E.g. cryptography, game, numerical simulation)

All applications of random number generation. From what I understand, it could also make SIMD improvements to random number generation possible.

Feature request

I request that the rand crate authors have a look at this algorithm (swiftlang/swift#39143) and determine whether it could be implemented in the rand crate to improve the performance of bounded random number generation.

Alternatively, it may be something that needs language support in Rust itself; I am not able to determine on my own whether that is the case.

@TheIronBorn

Lemire seems to think it is correct https://twitter.com/lemire/status/1433523690696757251


dhardy commented Sep 3, 2021

Indeed, this looks nice, especially the practical option of fixing the maximum number of random bits used. Some notes:

  • A limitation of RngCore is that we don't know whether the RNG generates 32 or 64 bits natively. Sampling a 32-bit number using this method, with the same cut-off threshold of ≤2⁻⁶⁴ bias, would require up to three 32-bit samples or up to two 64-bit samples; possibly the best approach here would be different logic for 32-bit vs 64-bit CPUs.
  • In theory sampling a u128 requires up to three u64 values which may be preferable to sampling up to two u128 values, though in practice the difference is unlikely to be important.
  • For SIMD implementations, we could take the same approach, terminating early only if all lanes are within the bound.

Then of course it would be nice to compare to #1154. This approach looks good, but it can require up to two widening multiplies (or one and a half). Still worth benchmarking.

@TheIronBorn would you be interested (again)?

@TheIronBorn

Sure. Working on constant-range benchmarks as well


TheIronBorn commented Sep 5, 2021

I put together a quick-and-dirty implementation and a benchmark. Here's the branch: https://github.com/TheIronBorn/rand/tree/unif-benches. I added a modulo to Uniform::sample to improve the early exit. I think benchmarking on several different machines would be good (especially 32-bit machines with different PRNGs); you can just run cargo bench --bench distributions.

I didn't implement 128-bit integers (what do we do if range > 2^64? just generate 128 bits?)

Using https://github.com/BurntSushi/cargo-benchcmp: cargo benchcmp oneill canon <saved_bench_output>

 name (w/ Xoshiro256PlusPlus)    oneill ns/iter      canon ns/iter        diff ns/iter   diff %  speedup
+_distr_uniform_high_reject_i16  37,595 (1063 MB/s)  37,065 (1079 MB/s)           -530   -1.41%   x 1.01
+_distr_uniform_high_reject_i32  200,248 (399 MB/s)  110,348 (724 MB/s)        -89,900  -44.89%   x 1.81
+_distr_uniform_high_reject_i64  201,781 (792 MB/s)  110,587 (1446 MB/s)       -91,194  -45.19%   x 1.82
+_distr_uniform_high_reject_i8   37,887 (527 MB/s)   37,231 (537 MB/s)            -656   -1.73%   x 1.02
-_distr_uniform_low_reject_i16   37,451 (1068 MB/s)  40,636 (984 MB/s)           3,185    8.50%   x 0.92
-_distr_uniform_low_reject_i32   37,554 (2130 MB/s)  40,591 (1970 MB/s)          3,037    8.09%   x 0.93
+_distr_uniform_low_reject_i64   35,988 (4445 MB/s)  35,958 (4449 MB/s)            -30   -0.08%   x 1.00
-_distr_uniform_low_reject_i8    37,605 (531 MB/s)   40,507 (493 MB/s)           2,902    7.72%   x 0.93
+_gen_range_i16_high             47,645 (839 MB/s)   47,396 (843 MB/s)            -249   -0.52%   x 1.01
+_gen_range_i16_low              48,485 (824 MB/s)   43,836 (912 MB/s)          -4,649   -9.59%   x 1.11
+_gen_range_i32_high             221,542 (361 MB/s)  133,900 (597 MB/s)        -87,642  -39.56%   x 1.65
+_gen_range_i32_low              40,666 (1967 MB/s)  40,596 (1970 MB/s)            -70   -0.17%   x 1.00
+_gen_range_i64_high             239,002 (669 MB/s)  131,817 (1213 MB/s)      -107,185  -44.85%   x 1.81
+_gen_range_i64_low              40,780 (3923 MB/s)  39,284 (4072 MB/s)         -1,496   -3.67%   x 1.04
-_gen_range_i8_high              40,472 (494 MB/s)   47,257 (423 MB/s)           6,785   16.76%   x 0.86
-_gen_range_i8_low               40,575 (492 MB/s)   43,446 (460 MB/s)           2,871    7.08%   x 0.93

It's a lot faster when rejections are likely and about the same otherwise. 8-bit and 16-bit have less speedup because the internal 32/64 bit rejections are less likely and perhaps because of the more expensive 64->128 bit multiplication.

Note: I used SmallRng (Xoshiro256PlusPlus) as Pcg64Mcg was significantly slower, likely due to the heavier use of 64->128 bit multiplication. Such a significant contextual slowdown is concerning for a general use API.

 name (w/ Pcg64Mcg)              oneill ns/iter      canon ns/iter        diff ns/iter   diff %  speedup
-_distr_uniform_high_reject_i16  43,841 (912 MB/s)   126,301 (316 MB/s)         82,460  188.09%   x 0.35
+_distr_uniform_high_reject_i32  244,767 (326 MB/s)  127,360 (628 MB/s)       -117,407  -47.97%   x 1.92
+_distr_uniform_high_reject_i64  247,989 (645 MB/s)  139,844 (1144 MB/s)      -108,145  -43.61%   x 1.77
-_distr_uniform_high_reject_i8   43,864 (455 MB/s)   127,611 (156 MB/s)         83,747  190.92%   x 0.34
-_distr_uniform_low_reject_i16   47,197 (847 MB/s)   139,575 (286 MB/s)         92,378  195.73%   x 0.34
-_distr_uniform_low_reject_i32   47,280 (1692 MB/s)  139,337 (574 MB/s)         92,057  194.71%   x 0.34
-_distr_uniform_low_reject_i64   40,512 (3949 MB/s)  131,778 (1214 MB/s)        91,266  225.28%   x 0.31
-_distr_uniform_low_reject_i8    47,263 (423 MB/s)   139,405 (143 MB/s)         92,142  194.96%   x 0.34
-_gen_range_i16_high             54,133 (738 MB/s)   60,703 (658 MB/s)           6,570   12.14%   x 0.89
-_gen_range_i16_low              48,056 (832 MB/s)   60,850 (657 MB/s)          12,794   26.62%   x 0.79
+_gen_range_i32_high             259,113 (308 MB/s)  165,511 (483 MB/s)        -93,602  -36.12%   x 1.57
-_gen_range_i32_low              45,581 (1755 MB/s)  55,439 (1443 MB/s)          9,858   21.63%   x 0.82
+_gen_range_i64_high             266,040 (601 MB/s)  184,480 (867 MB/s)        -81,560  -30.66%   x 1.44
-_gen_range_i64_low              47,443 (3372 MB/s)  64,619 (2476 MB/s)         17,176   36.20%   x 0.73
-_gen_range_i8_high              47,831 (418 MB/s)   63,884 (313 MB/s)          16,053   33.56%   x 0.75
-_gen_range_i8_low               45,598 (438 MB/s)   63,041 (317 MB/s)          17,443   38.25%   x 0.72


dhardy commented Sep 5, 2021

Thanks. I believe sample_single_inclusive_canon is correct (code), but the comments indicate you don't quite understand the maths. This method doesn't use rejection sampling on biased results; instead it step-wise approaches the result of multiplying a true-real (infinite-precision) random value r in [0, 1) by range. The extra step(s) are not about bias, but about performing as many steps as required until we know that further steps cannot change the result after integer rounding — except that we stop after one extra step since the bounds on bias are insignificantly small. (So the change in sample_constant_inclusive_canon doesn't make sense.)
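
For reference, here is a minimal sketch of that stepwise approach for a u64 range, assuming a 64-bit generator and the 2⁻⁶⁴ bias cut-off; the name sample_bounded_u64 and the closure-based RNG parameter are illustrative only, not rand's API:

// Illustrative sketch: sample uniformly from [0, range) with range > 0,
// taking at most one extra step (residual bias at most 2^-64).
fn sample_bounded_u64<R: FnMut() -> u64>(mut next_u64: R, range: u64) -> u64 {
    // Step 1: widening multiply of a random word by `range`. The high half
    // is the candidate result in [0, range); the low half is the fraction
    // that further random bits could still carry into.
    let prod = (next_u64() as u128) * (range as u128);
    let mut hi = (prod >> 64) as u64;
    let lo = prod as u64;

    // Early out: the next step adds a value < range to `lo`; if that cannot
    // overflow, no amount of further precision can change the result.
    if lo > u64::MAX - range {
        // Step 2: add the high half of another widening multiply and
        // propagate any carry into the result.
        let prod2 = (next_u64() as u128) * (range as u128);
        let (_, carry) = lo.overflowing_add((prod2 >> 64) as u64);
        hi += carry as u64;
    }
    hi
}

Removing the early-out branch gives the unconditional variant quoted at the top of the issue (always two multiplies, a fixed 128 bits consumed), and the SIMD adaptation mentioned earlier is the same shape per lane, with the branch taken only if some lane could still carry.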

I didn't implement 128-bit integers (what do we do if range > 2^64? just generate 128 bits?)

The Swift impl just uses generics (T: unsigned-fixed-width-int, where bits(T) ≥ 64) if I read it correctly. We could do better: conceptually, three steps with u64 random values multiplied into a 192-bit target. Practically though it makes sense to combine the first two steps (construct a 128-bit random value and use widening multiply into 128-bit high and low parts, with optional early exit). Next, multiply the third u64 by the range into a 192-bit target, take the upper 128 bits and add to the 128-bit low part of the first step, checking for overflow.
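
A rough sketch of that layout, assuming range > 0; Rust has no u256, so the 128×128 widening multiply is written out from 64-bit halves, and all names here are illustrative rather than rand's actual code:

// Illustrative sketch of the u128 variant: the first two 64-bit steps are
// combined into one 128-bit widening multiply, then a third u64 step adds
// the top 128 bits of a 64x128 -> 192-bit product, carrying into the result.
fn sample_bounded_u128<R: FnMut() -> u64>(mut next_u64: R, range: u128) -> u128 {
    // 128 x 128 -> 256-bit widening multiply, built from 64-bit halves.
    fn wmul(a: u128, b: u128) -> (u128, u128) {
        const MASK: u128 = u64::MAX as u128;
        let (ah, al) = (a >> 64, a & MASK);
        let (bh, bl) = (b >> 64, b & MASK);
        let (ll, lh, hl, hh) = (al * bl, al * bh, ah * bl, ah * bh);
        let mid = (ll >> 64) + (lh & MASK) + (hl & MASK);
        let lo = ((mid & MASK) << 64) | (ll & MASK);
        let hi = hh + (lh >> 64) + (hl >> 64) + (mid >> 64);
        (hi, lo)
    }

    // Steps 1 and 2 combined: a 128-bit random value times `range`.
    let x = ((next_u64() as u128) << 64) | next_u64() as u128;
    let (mut hi, lo) = wmul(x, range);

    // Optional early exit: the third step adds a value < range to `lo`.
    if lo > u128::MAX - range {
        // Step 3: the top 128 bits of (fresh u64) * range, added to `lo`
        // with the carry propagated into the result.
        let x3 = next_u64() as u128;
        let top = x3 * (range >> 64) + ((x3 * (range & (u64::MAX as u128))) >> 64);
        let (_, carry) = lo.overflowing_add(top);
        hi += carry as u128;
    }
    hi
}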

Pcg64Mcg was significantly slower

Interesting. Ideally we'd want to benchmark across several RNGs each across several different CPUs. But using sample_single_inclusive_canon instead of sample_constant_inclusive_canon (see above) I don't think the results look so bad.


Another question here is: exactly how much bias is acceptable for the cut-off? 2⁻⁶⁴ seems reasonable, but e.g. if we accept 2⁻⁴⁸ then we could yield u8 and u16 results from a single u64. @vks, others?


TheIronBorn commented Sep 5, 2021

The change in constant is incorrect

You’re referring to the modulo in constant? The author recommended that (here) though perhaps I’m not applying it correctly.

It’s not intended to reduce bias, just to make the second generation and multiplication less likely

@TheIronBorn

And perhaps we could change the comments there to discourage that understanding

@TheIronBorn

Added 128-bit implementation

// The extra step for the 128-bit case: widening multiply of a fresh u64
// (zero-extended) by the range, keeping only the high-order half.
let (new_hi_order, _) = (rng.gen::<u64>() as $u_extra_large).wmul(range as $u_extra_large);
 name (w/ Xoshiro256PlusPlus)     oneill ns/iter       canon ns/iter        diff ns/iter   diff %  speedup
+_distr_uniform_high_reject_i128  460,530 (694 MB/s)   157,020 (2037 MB/s)      -303,510  -65.90%   x 2.93
+_distr_uniform_low_reject_i128   368,029 (869 MB/s)   108,908 (2938 MB/s)      -259,121  -70.41%   x 3.38
+_gen_range_i128_high             515,440 (620 MB/s)   363,085 (881 MB/s)       -152,355  -29.56%   x 1.42
-_gen_range_i128_low              233,882 (1368 MB/s)  238,505 (1341 MB/s)         4,623    1.98%   x 0.98

@TheIronBorn

our shuffle benchmark is almost 2x faster

 name              oneill ns/iter  canon ns/iter  diff ns/iter   diff %  speedup
 _seq_shuffle_100  597             311                    -286  -47.91%   x 1.92

(with Xoshiro128PlusPlus, 1.5x with Pcg32)

@vks vks mentioned this issue Sep 7, 2021

TheIronBorn commented Sep 9, 2021

SIMD performance is surprising. I thought the lack of an easy SIMD 64->128 multiplication would slow things down too much, but it seems that for sample_single the lack of threshold/bitmask generation and the fewer random bits needed help a lot. (I did the math once for the expected iterations of the blend method but lost it.)
In some cases the SIMD version is slower than plain scalar, thanks to the scalar improvements.

 name                              oneill ns/iter      canon ns/iter       diff ns/iter   diff %  speedup
-_distr_uniform_high_reject_i16x8  100,478 (796 MB/s)  172,737 (463 MB/s)        72,259   71.92%   x 0.58
+_distr_uniform_high_reject_i32x4  212,538 (376 MB/s)  152,175 (525 MB/s)       -60,363  -28.40%   x 1.40
+_distr_uniform_high_reject_i64x2  155,429 (514 MB/s)  69,997 (1142 MB/s)       -85,432  -54.97%   x 2.22
-_distr_uniform_low_reject_i16x8   25,793 (3101 MB/s)  226,827 (352 MB/s)       201,034  779.41%   x 0.11
-_distr_uniform_low_reject_i32x4   26,227 (3050 MB/s)  145,611 (549 MB/s)       119,384  455.20%   x 0.18
-_distr_uniform_low_reject_i64x2   32,843 (2435 MB/s)  95,787 (835 MB/s)         62,944  191.65%   x 0.34
+_gen_range_i16x8_high_reject      301,268 (265 MB/s)  112,707 (709 MB/s)      -188,561  -62.59%   x 2.67
+_gen_range_i16x8_low_reject       817,562 (97 MB/s)   168,693 (474 MB/s)      -648,869  -79.37%   x 4.85
+_gen_range_i32x4_high_reject      150,803 (530 MB/s)  75,148 (1064 MB/s)       -75,655  -50.17%   x 2.01
+_gen_range_i32x4_low_reject       475,460 (168 MB/s)  138,031 (579 MB/s)      -337,429  -70.97%   x 3.44
+_gen_range_i64x2_high_reject      267,958 (298 MB/s)  42,753 (1871 MB/s)      -225,205  -84.04%   x 6.27
+_gen_range_i64x2_low_reject       453,003 (176 MB/s)  103,087 (776 MB/s)      -349,916  -77.24%   x 4.39


dhardy commented Sep 11, 2021

Interesting results. But you didn't push your SIMD code?

Why does distr_uniform look so bad (I assume compared to the old method) for i16x8 and i32x4? A possible alternative would be to drop the test and always do the second step for x4 and x8 SIMD impls. Another, which might be acceptable (see note above), would be to never do a second step for integers ≤16 bits. It would be nice to see comparisons between all four (the previous method, the new method with a test for early exit, and the variants without the test in 1-step and 2-step form)... but considering multiple ranges and multiple RNGs then requires quite a lot more testing.

Of course, the new method requires generating u64 or u128 for each lane, which is potentially a lot more random data than with the rejection method (although that could potentially have rather poor performance in some cases).

SIMD performance is surprising. I thought the lack of an easy SIMD 64->128 multiplication

It might end up being better using 32-bit steps for (some) SIMD impls since it reduces the amount of data up-front. The drawback is that we may need up to two extra steps to get bias below 1-in-2^64 samples.


If there are significant performance implications, we could consider variants with differing bias limits. 2^-32 should be enough for many things (though potentially with detectable bias in large studies) while 2^-16 is probably fine for most games. Using 2^-64 is a nice "one size fits all" approach, but probably does leave significant room for improvement in SIMD impls — should anyone care.

In fact, providing it is well documented, we could probably simply reduce the bias tolerance to 2^-32 for selected SIMD impls.


TheIronBorn commented Sep 11, 2021

I did push SIMD commits? https://github.com/TheIronBorn/rand/blob/9e91003d5cdb429bc2a53e4f8a33a34613534be5/src/distributions/uniform.rs#L725-L759

I've since also added benchmarks for those comparisons (branchless, branchy, mod_branchy) you mentioned but haven't run them because it would take a while. I think I should just worry about the 128-bit (XMM) sized types for now.

i16x8 "constant" performance is so much slower because it requires generating and full-width multiplying a u64x8, at least 4x as much "work" as a 16x8 multiplication. gen_range is so much better because doing 8 modulos is very slow.

As you say two steps of u32x8 or even just one (if the bias is small enough) could significantly speed things up. I'll see about two-step 32xN benchmarks.

@TheIronBorn

If we had math for exactly how many bits are required we might be able to provide guarantees for 8-bit widths. I don't understand the proposal well enough for that though.


dhardy commented Sep 12, 2021

We never know in advance how many bits are required. The method works by adding bits until either it is known that further precision cannot change the result or the expectation of an incorrect result is sufficiently small (no more than one biased result in 2^64 samples).

The reason the algorithm works well for single results is that on a 64-bit CPU, throwing 64 bits at a problem expected to require far less is cheap. But throwing 64×8 = 512 bits at i16x8 is less cheap; however, 32×8 = 256 bits is still very likely enough. Makes sense? Sorry, you probably already understood most of that.

We could even use 16-bit steps, but I'm less convinced, because we'd require up to four extra steps and testing is probably also a significant cost.

It may simply be better to keep the old method for some of these SIMD impls.


TheIronBorn commented Sep 23, 2021

I couldn't figure out how to do a two-step method for the extra bits (without just doing a 64-bit multiply using 32-bit operations, that is).

But I put together a comparison of branchless/branchy/mod here:

 name                               canon_branchless ns/iter  canon_branchy_no_mod ns/iter  diff ns/iter   diff %  speedup
-_gen_range_i16x8_high_reject       370,063 (864 MB/s)        381,762 (838 MB/s)                  11,699    3.16%   x 0.97
+_gen_range_i16x8_low_reject        419,860 (762 MB/s)        330,372 (968 MB/s)                 -89,488  -21.31%   x 1.27
-_gen_range_i32x4_high_reject       230,870 (1386 MB/s)       237,903 (1345 MB/s)                  7,033    3.05%   x 0.97
+_gen_range_i32x4_low_reject        257,390 (1243 MB/s)       142,934 (2238 MB/s)               -114,456  -44.47%   x 1.80
-_gen_range_i64x2_high_reject       206,744 (1547 MB/s)       264,210 (1211 MB/s)                 57,466   27.80%   x 0.78
+_gen_range_i64x2_low_reject        224,032 (1428 MB/s)       146,566 (2183 MB/s)                -77,466  -34.58%   x 1.53
-_gen_range_i8x16_high_reject       665,987 (480 MB/s)        691,560 (462 MB/s)                  25,573    3.84%   x 0.96
+_gen_range_i8x16_low_reject        739,970 (432 MB/s)        623,880 (512 MB/s)                -116,090  -15.69%   x 1.19

 name                               canon_branchless ns/iter  canon_branchy_mod ns/iter  diff ns/iter   diff %  speedup
-_distr_uniform_i16x8_high_reject   363,250 (880 MB/s)        365,540 (875 MB/s)                2,290    0.63%   x 0.99
+_distr_uniform_i16x8_low_reject    353,415 (905 MB/s)        68,905 (4644 MB/s)             -284,510  -80.50%   x 5.13
-_distr_uniform_i32x4_high_reject   210,531 (1519 MB/s)       234,356 (1365 MB/s)              23,825   11.32%   x 0.90
+_distr_uniform_i32x4_low_reject    246,272 (1299 MB/s)       83,541 (3830 MB/s)             -162,731  -66.08%   x 2.95
-_distr_uniform_i64x2_high_reject   146,966 (2177 MB/s)       199,202 (1606 MB/s)              52,236   35.54%   x 0.74
+_distr_uniform_i64x2_low_reject    157,588 (2030 MB/s)       75,960 (4212 MB/s)              -81,628  -51.80%   x 2.07
+_distr_uniform_i8x16_high_reject   715,660 (447 MB/s)        669,060 (478 MB/s)              -46,600   -6.51%   x 1.07
+_distr_uniform_i8x16_low_reject    706,100 (453 MB/s)        134,419 (2380 MB/s)            -571,681  -80.96%   x 5.25

The modulo seems to make little difference though:

 name                               canon_branchy_no_mod ns/iter  canon_branchy_mod ns/iter  diff ns/iter   diff %  speedup
+_distr_uniform_i16x8_high_reject   367,512 (870 MB/s)            365,540 (875 MB/s)               -1,972   -0.54%   x 1.01
+_distr_uniform_i16x8_low_reject    76,440 (4186 MB/s)            68,905 (4644 MB/s)               -7,535   -9.86%   x 1.11
-_distr_uniform_i32x4_high_reject   224,735 (1423 MB/s)           234,356 (1365 MB/s)               9,621    4.28%   x 0.96
+_distr_uniform_i32x4_low_reject    83,820 (3817 MB/s)            83,541 (3830 MB/s)                 -279   -0.33%   x 1.00
-_distr_uniform_i64x2_high_reject   177,215 (1805 MB/s)           199,202 (1606 MB/s)              21,987   12.41%   x 0.89
+_distr_uniform_i64x2_low_reject    83,113 (3850 MB/s)            75,960 (4212 MB/s)               -7,153   -8.61%   x 1.09
+_distr_uniform_i8x16_high_reject   674,635 (474 MB/s)            669,060 (478 MB/s)               -5,575   -0.83%   x 1.01
+_distr_uniform_i8x16_low_reject    183,900 (1740 MB/s)           134,419 (2380 MB/s)             -49,481  -26.91%   x 1.37

I thought at first a series of 64->128 mulx would perform better for wmul (used in previous posts) but did some analysis and the LLVM version we already had was better. It improved things vs. the Oneill method:

 name                               oneill ns/iter        canon_branchy_no_mod ns/iter  diff ns/iter   diff %  speedup
+_distr_uniform_i16x8_high_reject   552,320 (579 MB/s)    367,512 (870 MB/s)                -184,808  -33.46%   x 1.50
+_distr_uniform_i16x8_low_reject    79,120 (4044 MB/s)    76,440 (4186 MB/s)                  -2,680   -3.39%   x 1.04
+_distr_uniform_i32x4_high_reject   582,245 (549 MB/s)    224,735 (1423 MB/s)               -357,510  -61.40%   x 2.59
-_distr_uniform_i32x4_low_reject    77,865 (4109 MB/s)    83,820 (3817 MB/s)                   5,955    7.65%   x 0.93
+_distr_uniform_i64x2_high_reject   720,860 (443 MB/s)    177,215 (1805 MB/s)               -543,645  -75.42%   x 4.07
+_distr_uniform_i64x2_low_reject    130,595 (2450 MB/s)   83,113 (3850 MB/s)                 -47,482  -36.36%   x 1.57
+_distr_uniform_i8x16_high_reject   851,180 (375 MB/s)    674,635 (474 MB/s)                -176,545  -20.74%   x 1.26
-_distr_uniform_i8x16_low_reject    111,113 (2879 MB/s)   183,900 (1740 MB/s)                 72,787   65.51%   x 0.60

(the slowdown in i8x16_low_reject may be code size or the relatively greater probability of the 64->128 mul)

Here are the full bench logs: https://github.com/TheIronBorn/rand/blob/unif-benches/benches/unif_bench_logs
Also a look at the best method for each lane width: https://github.com/TheIronBorn/rand/blob/unif-benches/benches/bench_sorts. 128-bit is generally faster in scalar, and for 8/16-bit with high rejections and constant ranges; not sure what's up with gen_range_i64_low_reject.


dhardy commented Sep 28, 2021

You’re referring to the modulo in constant? The author recommended that (here) though perhaps I’m not applying it correctly.

Aha, this is the stop condition for Lemire's method. Regardless, it's only a useful optimisation where the range is known at compile-time, which may be the case in your benchmarks but would require an API extension to expose it properly (e.g. we don't want to use it for shuffling), so it's probably only worth considering if there is a significant advantage (only really seen in i8x16 low reject in your latest benchmarks).
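
For context, the stop condition being referred to is (presumably) the threshold test from Lemire's nearly-divisionless rejection method, where the expensive part is the modulo used to compute the threshold — hence it mainly pays off when the range is a compile-time constant. A sketch, not rand's actual code:

// Lemire's nearly-divisionless method for [0, range), range > 0.
// The `range.wrapping_neg() % range` threshold (= 2^64 mod range) is the
// costly part; with a constant range the compiler can fold it away.
fn lemire_u64<R: FnMut() -> u64>(mut next_u64: R, range: u64) -> u64 {
    let mut m = (next_u64() as u128) * (range as u128);
    let mut lo = m as u64;
    if lo < range {
        // Only pay for the modulo when rejection is possible at all.
        let threshold = range.wrapping_neg() % range;
        while lo < threshold {
            m = (next_u64() as u128) * (range as u128);
            lo = m as u64;
        }
    }
    (m >> 64) as u64
}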


dhardy commented Sep 28, 2021

There are some issues with your benchmarks unfortunately:

  • the oneill_distr_uniform benchmarks are not actually oneill, since these use the sample method, which was not updated in #1154 (switch sample_single int to O'Neill's modulo method & update gen_rang…)
  • comparison between sample (pre-constructed distribution) vs sample_single_inclusive without noting this
  • some of the benches increment the high bound between each iteration, others don't (judging by your selected bounds, I guess you don't expect any of them to)

I'm still going over stuff and will push a draft PR when done.

dhardy added commits referencing this issue on Oct 20, 2021, Feb 24, 2022, and Feb 28, 2022.
dhardy closed this as completed Jul 20, 2024
dhardy added the A-new (Propose usage of a new algorithm) label Jul 20, 2024