-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
shuffle() modulo bias and extra iteration #57
Comments
This is relevant to this: #18 |
FYI, I played around with optimization 1 (only use the minimum amount of bits) and it doesn't seem to be feasible/sensible. The "optimization" ran orders of magnitude slower than the original version in Python. The |
FYI, this was discussed in the last Eth 2.0 Implementers Call (approx. 2 mins ago) and the spec has been changed. The spec still does an extra shuffle, however that doesn't affect the outcome of the shuffle, so it's not particularly important. |
TLDR; I think there are two bugs in the shuffling function. I suggest how to fix them. I also built a "sandbox" for testing shuffling functions.
I recently went on a deep-dive into shuffling functions to figure out why the beacon_chain one gets stuck in a loop. I have some findings which I will report here.
I believe there are two errors in the present implementation:
I'll give a quick intro to the the Fisher-Yates shuffle and modulo bias for some background. I found it interesting, so you might too.
Background
Durstenfeld-Knuth-Fisher-Yates Shuffle
This is a method of shuffling a list "introduced" by Richard Durstenfeld in 1964 and "popularized" by Donald E. Knuth in The Art of Computer Programming. Apparently it's similar to work done by Ronald Fisher and Frank Yates in 1938. More @ wikipedia.
History aside, it looks like this:
It seems that this is the primitive for the shuffling algorithm used in the spec, so lets assume this is what we're going to use. Seems optimal.
Modulo-Bias
For background, modulo bias is introduced when using modulo to "fit" an random integer of a large range into a smaller range when the smaller range is not a clean multiple of the larger range. In other words,
large_range % small_range > 0
.As an example, consider a function
rng()
which generates an integer between0..3
(inclusive) and an application which requires a integer of0..2
(inclusive). A developer might simply callrng() % 3
, however this introduces a bias because both0 % 3 = 0
and3 % 3 = 0
. This skews the post-modulo result in favor of0
. This is a Bad Thing™.A method to fix this is to simply filter out all "biased" results, that is any results which are greater than
RAND_MAX - (RAND_MAX % n)
, whereRAND_MAX
defines the upper bound of an rng (0..RAND_MAX
) andn
is the upper bound of the desired range (0..n
) andRAND_MAX > n
. Lets apply this knowledge to the previous example and define the following function:The developer can now call
filtered_rng(3)
. This will give them results without a modulo bias. The downside is that it's now possible for the function to run indeterminately, because it's relying onrng
not to generate certain numbers. We can just assume that this won't actually run forever, instead it'll just slow us down a every now and then.Proposed changes
I propose the following code to fix these issues (the unused
config
var stays for simplicity):You can find this code here in a "shuffling sandbox" I made to test the conclusions I have come to.
First, to address error 1 (modulo-bias), the
rand_max
calculation has been moved to where it can dynamically filter the "rng" results on each call to the "rng". If you look at the shuffling psuedo-code (above) you can see that the rng is required to be within dynamic boundsi ≤ j < n
, therefore the "modulo-biased" rng results need to be filtered specific to these bounds, not as per the overall list result (as in the current implementation).Secondly, to address error 2 (unnecessary iteration), the
while
loop has been changed to do one less iteration. The reason this last iteration is unnecessary because we're attempting to swap the last element in the list,n-1
, with any other integer greater or equal ton - 1
but less thann
. The only possible outcome here is for the last element to be swapped with itself which is meaningless.Extra Reading
During the process of this, I made another shuffling implementation here. It is slower than the other implementations (due to abstraction overhead in Python), however it separates the logic into "shuffler" and "rng" components. This is useful in gaining an understanding of the fundamental structures involved.
Optimizations
There are a couple of potential speed-ups I thought of whilst doing this, will test them at some point.
Only use the minimum required amount of bits from the seed. E.g., for each swap consume<-- After some brief testing I no longer think this is a sensible optimization.ceil(log2(lst_count - i))
bits. This should reduce the amount of hashes at the cost of doing a little extra math.Point (2) is not very important, it would not change the outcome of the shuffle. Worth noting for client devs.
However, point (1) would change the results of the shuffling function and would therefore need to be included in the spec. I'll come back to this once I have some benchmarks on speed increases.As always, happy for feedback. Sorry about the essay.
The text was updated successfully, but these errors were encountered: