0-31 Byte Random Length Benchmark #113
base: master
What are the use cases for this? I think it is to avoid cache effects, right? It could be a good speed indicator, but we already have two better ones, and this is pretty random. Better would be to create an initial random vector for all functions to test with, but we don't do that yet. Either dump it to disk, or change the cmdline API.
Well, it depends on you. It is just an additional benchmark that comes for free.
Any hash that returns zero on a zero-length input will pass this with flying colors, if I am reading this correctly, as the length will always be zero.
No, a PRNG serves as a PRNG. Some hash functions can make a decent PRNG, but they serve entirely different purposes and should not be mixed up.
I say this benchmark is no good. 👎 Its speed and accuracy rely on what values the hash returns, and it does better if the hash is less random and has clear patterns causing the low bits to favor zero or near-zero values. If the hash returns zero on a zero-length input, it will always run at zero length and the result will be meaningless. Additionally, you can't rely on a random enough result with an array of zeroes. If the "random" lengths were the same each time, that would be a decent benchmark.
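The lock-in described above can be sketched in a few lines. This is only an illustration, assuming the benchmark takes each next key length from the low 5 bits of the previous hash value over a fixed all-zero buffer; that scheme, and the names `fnv1a_64`, `zero_on_empty_hash` and `length_series`, are assumptions made for the example, not code from the pull request.

```python
MASK64 = (1 << 64) - 1

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a, used here only as a stand-in hash."""
    h = 0xCBF29CE484222325
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & MASK64
    return h

def zero_on_empty_hash(data: bytes) -> int:
    """Hypothetical hash that returns 0 for empty input."""
    return 0 if not data else fnv1a_64(data)

def length_series(hash_fn, steps: int, buf: bytes = bytes(32)) -> list[int]:
    """Assumed scheme: next length = previous hash value & 31, starting at 0."""
    lengths = []
    length = 0
    for _ in range(steps):
        lengths.append(length)
        length = hash_fn(buf[:length]) & 31
    return lengths

# The zero-on-empty hash never leaves length 0.
print(length_series(zero_on_empty_hash, 8))
# A different hash walks through a different series of lengths.
print(length_series(fnv1a_64, 8))
```

Under this assumed scheme the next length depends only on the current length when the buffer is fixed, so any hash must fall into a repeating cycle of lengths within at most 32 steps.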
@Cyan4973 thoughts?
The point that the benchmark outcome is driven by the produced series of lengths seems valid,
A simple modification is to use (sum&32)+1 as the length. Since the seed is different each time, the length cannot get locked to one value; otherwise the hash function would fail the later quality tests.
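It may help to check which lengths the proposed expression can actually produce. The sketch below enumerates `(sum & mask) + 1` for two masks; `lengths_reachable` is a hypothetical helper, and reading the proposal as aiming for a 1 to 32 byte range is an assumption, not something stated in the thread.

```python
def lengths_reachable(mask: int) -> set[int]:
    """All values of (s & mask) + 1 as the low 6 bits of s vary."""
    return {(s & mask) + 1 for s in range(64)}

# Mask 32 keeps a single bit, so only two lengths are possible.
print(sorted(lengths_reachable(32)))
# Mask 31 keeps the low 5 bits, giving every length from 1 to 32.
print(sorted(lengths_reachable(31)))
```

Either way the `+1` rules out a zero length, which is the part that prevents the benchmark from getting stuck on empty input.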
This wouldn't change the issue that the series of lengths produced by each hash would be different,
Also, what if
It will be fine. sum will be ~0ull, i.e. 0xffffffffffffffff, due to the property of uint64_t.
At least we need some realistic, consensus benchmarks with some randomness to reflect real situations. The current benchmarks are flawed, e.g. not inlined, or deterministic.
It still requires that hashes be PRNGs for a fair test. They are not. They serve a different purpose. A hash function may work as a PRNG, but that has nothing to do with how strong it is.
You do have a good point, I just don't think that non-deterministic "randomness" is a fair test.
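One way to keep random-looking lengths while making the test identical for every hash is to generate the length sequence once from a fixed seed and reuse it. The following is a minimal sketch of that idea, not the project's implementation; `fixed_lengths`, `time_hash`, the seed value and the all-zero buffer are assumptions made for the example.

```python
import random
import time

def fixed_lengths(count: int, max_len: int = 31, seed: int = 12345) -> list[int]:
    """Reproducible key lengths in [0, max_len], independent of any hash."""
    rng = random.Random(seed)
    return [rng.randint(0, max_len) for _ in range(count)]

def time_hash(hash_fn, lengths: list[int], buf: bytes = bytes(32)) -> float:
    """Average seconds per call over the shared length sequence."""
    acc = 0
    start = time.perf_counter()
    for n in lengths:
        acc ^= hash_fn(buf[:n])  # accumulate so the result is actually used
    elapsed = time.perf_counter() - start
    return elapsed / len(lengths)

lengths = fixed_lengths(10_000)
# Python's built-in hash() stands in for the function under test.
print(time_hash(hash, lengths))
```

Because the lengths no longer depend on the hash output, every function sees the same workload, and a hash with weak low bits gains no advantage.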
https://github.com/backtrace-labs/umash/wiki/Execution-traces has some production traces of calls to the umash hash function. It might make sense to hardcode some of that call sequence.
@pkhuong I like the idea and I'm implementing something somewhat similar. Thanks for providing a reference to the dataset.
Weights coming from two datasets are hard-coded: DNS domain lengths and UMASH traces. A custom one can be passed via ENV{SMHASHER_SMALLKEY_WEIGHTS}. It partly addresses the question at rurban#113: what is the "real" average cycles/hash value for a given hash function? We can't know, but we can estimate it better if we assume that the function's timing does not depend on the input (which is not true for hashes based on multiplication) and that we know the distribution of key lengths in advance (it might be roughly known for certain classes of inputs, but the distribution varies measurably across classes).
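The estimate described above amounts to a weighted mean of per-length timings. A minimal sketch, assuming one cycles/hash measurement per key length and one non-negative weight per key length; the function name and the numbers are made up for illustration and do not come from the DNS or UMASH datasets.

```python
def weighted_cycles_per_hash(cycles_by_len: list[float],
                             weights: list[float]) -> float:
    """Weighted mean of per-length timings; index i is key length i."""
    if len(cycles_by_len) != len(weights):
        raise ValueError("need one weight per key length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(c * w for c, w in zip(cycles_by_len, weights))
    return weighted / total

# Illustrative values only: timings for key lengths 0, 1, 2 and their weights.
print(weighted_cycles_per_hash([10.0, 20.0, 30.0], [1.0, 1.0, 2.0]))
```

The result is only as good as the two stated assumptions: timing that does not depend on the key contents, and a length distribution that matches the workload of interest.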
Dear rurban:
This is a simple function to benchmark hash functions on random-length data (0-31 bytes, tunable) without a PRNG, because the hash function itself serves as a PRNG. The function is very fast (<1s).