0-31 Byte Random Length Benchmark #113

Open
Conversation

wangyi-fudan
Contributor

Dear rurban:
This is a simple function to benchmark hash functions on random-length data (0-31 bytes, tunable) without a PRNG. Because hash function itself serves as a PRNG. This function is very fast (<1s).

@rurban
Owner

rurban commented Mar 10, 2020

What are the use cases for this? I think to avoid cache effects, right? It could be a good speed indicator, but we already have two better ones, and this is pretty random.

Better would be to create an initial random vector for all functions to test with, but we don't do that yet. Either dump to disk, or change the cmdline API.

@wangyi-fudan
Contributor Author

well, it depends on you. just an additional benchmark that comes for free.

@easyaspi314

easyaspi314 commented Mar 11, 2020

Any hash that returns zero for a zero-length input will pass this with flying colors, if I am reading this correctly, as the length will always be zero.

Because hash function itself serves as a PRNG.

No, a PRNG serves as a PRNG.

Some hash functions can make a decent PRNG, but they serve entirely different purposes and should not be mixed up.

@easyaspi314

easyaspi314 commented Mar 11, 2020

I say this benchmark is no good. 👎

Its speed and accuracy rely on what values the hash returns, and it does better if the hash is less random and has clear patterns that bias the low bits toward zero or near-zero values.

If the hash returns zero for a zero-length input, it will always run at zero length and the result will be meaningless.

Additionally, you can't rely on a random enough result with an array of zeroes.

If the "random" lengths were the same each time, that would be a decent benchmark.
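
One way to read this suggestion is to draw the lengths once from a seeded PRNG and replay the same list for every hash under test. The sketch below is in Python for brevity and is only an illustration: `fixed_lengths`, its parameters, and the default seed are hypothetical names, not part of smhasher.

```python
import random


def fixed_lengths(count, max_len=32, seed=0x5EED):
    """Return `count` pseudo-random lengths in [0, max_len).

    The generator is seeded explicitly, so every call with the same
    arguments yields the same list, and each hash function can be
    timed against an identical sequence of input lengths.
    """
    rng = random.Random(seed)
    return [rng.randrange(max_len) for _ in range(count)]


# The same list is reused for every candidate hash.
lengths = fixed_lengths(1000)
```

Because the lengths no longer depend on the hash output, a weak or patterned hash cannot steer itself toward cheaper inputs.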

@easyaspi314

@Cyan4973 thoughts?

@Cyan4973
Contributor

The point that the benchmark outcome is driven by the produced series of lengths seems valid,
notably in the situation where hash(0) % 32 == 0, which "locks" the series to 0, resulting in a very favorable scenario compared to truly random lengths.
Note that any hash(n) % 32 == n will also lock the series to this specific length, defeating the purpose of this benchmark.
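
The locking effect can be reproduced with a small model. This is a sketch under assumptions: it takes the next length to be `hash(previous input) % 32` over an all-zero buffer, which is how the thread describes the benchmark, not a copy of the PR's code. Both toy hashes are hypothetical and exist only to trigger the two cases.

```python
MASK64 = (1 << 64) - 1


def length_series(hash_fn, start_len, rounds, modulus=32):
    """Feed each hash value back in as the next input length."""
    zeros = bytes(modulus)  # all-zero input buffer
    lengths = [start_len]
    for _ in range(rounds):
        digest = hash_fn(zeros[:lengths[-1]])
        lengths.append(digest % modulus)
    return lengths


def zero_on_empty(data):
    """Toy hash (hypothetical): returns 0 for empty input."""
    h = 0
    for byte in data:
        h = (h * 31 + byte + 1) & MASK64
    return h


def length_echo(data):
    """Toy hash (hypothetical): hash(n bytes) % 32 == n for n < 32."""
    return len(data)


print(length_series(zero_on_empty, 0, 5))  # [0, 0, 0, 0, 0, 0]
print(length_series(length_echo, 7, 5))    # [7, 7, 7, 7, 7, 7]
```

In both runs the series never leaves its starting length, so the benchmark would time a single input size.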

@wangyi-fudan
Contributor Author

A simple modification is to use (sum&32)+1 as the length. Since the seed is different each time, it is not lockable; otherwise the hash function will fail later quality tests.
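
A quick check of the arithmetic in that expression, as a sketch: masking with 32 keeps only one bit, so `(sum & 32) + 1` can produce just two lengths, while a mask of 31 covers 1 through 32. Whether 31 was the intended mask is an assumption; the thread does not say. `lengths_for` is a hypothetical helper.

```python
def lengths_for(mask, samples=range(1024)):
    """Return the distinct lengths produced by (s & mask) + 1."""
    return sorted({(s & mask) + 1 for s in samples})


print(lengths_for(32))  # [1, 33]
print(lengths_for(31))  # every integer from 1 through 32
```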

@Cyan4973
Contributor

This wouldn't change the issue that the series of lengths produced by each hash would be different,
making a speed comparison between them more difficult.

@easyaspi314

Also, what if len == 0 returns -1?

@wangyi-fudan
Contributor Author

Also, what if len == 0 returns -1?

It will be fine. sum will be ~0ull, or 0xffffffffffffffff, due to the property of uint64_t.
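
In C, converting -1 to `uint64_t` is well defined and yields 2^64 - 1. The sketch below emulates that conversion in Python with a 64-bit mask and shows which lengths would follow from it; `as_uint64` is a hypothetical helper, and the two length formulas are the ones discussed in this thread.

```python
MASK64 = (1 << 64) - 1


def as_uint64(value):
    """Emulate C's conversion of an integer to uint64_t (modulo 2**64)."""
    return value & MASK64


wrapped = as_uint64(-1)
print(hex(wrapped))        # 0xffffffffffffffff
print(wrapped % 32)        # 31
print((wrapped & 31) + 1)  # 32
```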

@wangyi-fudan
Contributor Author

This wouldn't change the issue that the series of lengths produced by each hash would be different,
making a speed comparison between them more difficult.

At least we need some realistic and consensus benchmarks with some randomness to reflect real situations. Current benchmarks are flawed, e.g. not inlined or deterministic.

@easyaspi314

It still requires that hashes be PRNGs for a fair test.

They are not. They serve a different purpose. A hash function may work as a PRNG, but that has nothing to do with how strong they are.

@easyaspi314

At least we need some realistic and consensus benchmarks with some randomness to reflect real situations. Current benchmarks are flawed, e.g. not inlined or deterministic.

You do have a good point, I just don't think that non-deterministic "randomness" is a fair test.

@pkhuong
Contributor

pkhuong commented Mar 3, 2022

https://github.com/backtrace-labs/umash/wiki/Execution-traces has some production traces of calls to the umash hash function. It might make sense to hardcode some of that call sequence.

@rurban rurban changed the title from "0-31 Btyes Random Length Benchmark" to "0-31 Byte Random Length Benchmark" on Jun 21, 2023
@darkk
Contributor

darkk commented Aug 28, 2024

@pkhuong I like the idea and I'm implementing something somewhat similar. Thanks for providing a reference to the dataset.

darkk added a commit to darkk/smhasher that referenced this pull request Aug 30, 2024
It addresses the question at rurban#113

What is the "real" average cycles/hash value for a given hash function?

We can't know, but we can estimate it better if we assume that the
function timing does not depend on input (that's not true for hashes
based on multiplication) and we know distribution of key length in
advance (that might be somewhat known for certain classes of inputs,
but the distribution varies across classes measurably).
darkk added a commit to darkk/smhasher that referenced this pull request Sep 2, 2024
It partly addresses the question at rurban#113

What is the "real" average cycles/hash value for a given hash function?

We can't know, but we can estimate it better if we assume that the
function timing does not depend on input (that's not true for hashes
based on multiplication) and we know distribution of key length in
advance (that might be somewhat known for certain classes of inputs,
but the distribution varies across classes measurably).
darkk added a commit to darkk/smhasher that referenced this pull request Sep 6, 2024
Weights coming from two datasets are hard-coded: DNS domain lengths and
UMASH traces.  Custom one might be passed via ENV{SMHASHER_SMALLKEY_WEIGHTS}

It partly addresses the question at rurban#113

What is the "real" average cycles/hash value for a given hash function?

We can't know, but we can estimate it better if we assume that the
function timing does not depend on input (that's not true for hashes
based on multiplication) and we know distribution of key length in
advance (that might be somewhat known for certain classes of inputs,
but the distribution varies across classes measurably).
rurban pushed a commit that referenced this pull request Sep 28, 2024
Weights coming from two datasets are hard-coded: DNS domain lengths and
UMASH traces.  Custom one might be passed via ENV{SMHASHER_SMALLKEY_WEIGHTS}

It partly addresses the question at #113

What is the "real" average cycles/hash value for a given hash function?

We can't know, but we can estimate it better if we assume that the
function timing does not depend on input (that's not true for hashes
based on multiplication) and we know distribution of key length in
advance (that might be somewhat known for certain classes of inputs,
but the distribution varies across classes measurably).
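
The estimate described in the commit message amounts to a weighted mean: per-length timings averaged by how often each key length occurs. The sketch below shows only that arithmetic. The function name, the dictionary layout, and the numbers are all hypothetical; they are not the committed implementation, and they say nothing about the format of `SMHASHER_SMALLKEY_WEIGHTS`.

```python
def weighted_cycles_per_hash(cycles_by_len, weight_by_len):
    """Average cycles/hash, weighting each key length by its frequency.

    cycles_by_len: measured cycles/hash for each key length.
    weight_by_len: relative frequency of each key length in a workload.
    Assumes timing depends only on length, as the commit message states.
    """
    total_weight = sum(weight_by_len.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(cycles_by_len[length] * weight
                       for length, weight in weight_by_len.items())
    return weighted_sum / total_weight


# Made-up numbers: 4-byte keys are three times as common as 8-byte keys.
cycles = {4: 10.0, 8: 20.0}
weights = {4: 3, 8: 1}
print(weighted_cycles_per_hash(cycles, weights))  # 12.5
```

Swapping in a different weight table changes the estimate without re-running the timing loop, which is why the key-length distribution matters as much as the raw measurements.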
6 participants