0-31 Byte Random Length Benchmark #113

wangyi-fudan · 2020-03-10T14:08:03Z

Dear rurban:
This is a simple function to benchmark hash functions on random-length data (0-31 bytes, tunable) without a PRNG, because the hash function itself serves as a PRNG. The function is very fast (<1s).
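For readers without the patch at hand, the idea can be sketched roughly as follows. This is a minimal sketch, not the code from this PR: `hash_fn`, `toy_hash`, and `self_driven_lengths` are hypothetical names, the FNV-1a-style hash only stands in for whatever hash is under test, and varying the seed per round is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for the hash under test: (data, length, seed) -> 64 bits. */
typedef uint64_t (*hash_fn)(const void *data, size_t len, uint64_t seed);

/* Toy FNV-1a-style stand-in, present only so the sketch is self-contained. */
static uint64_t toy_hash(const void *data, size_t len, uint64_t seed) {
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 0xcbf29ce484222325ULL ^ seed;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Each hash result picks the next input length (0-31).
   Returns the total number of bytes hashed; a real benchmark would time the loop. */
static uint64_t self_driven_lengths(hash_fn h, uint64_t rounds) {
    static const unsigned char buf[32] = {0};  /* all-zero input buffer */
    uint64_t sum = 0, total = 0;
    for (uint64_t i = 0; i < rounds; i++) {
        size_t len = (size_t)(sum & 31);       /* previous result chooses the length */
        sum = h(buf, len, i);                  /* seed changes every round (assumption) */
        total += len;
    }
    return total;
}
```

Calling `self_driven_lengths(toy_hash, 1000000)` inside a timer would give one speed figure; the series of lengths it walks through depends entirely on the hash being measured.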

rurban · 2020-03-10T18:53:13Z

What are the use cases for this? I think to avoid cache effects, right? It could be a good speed indicator, but we already have two better ones, and this is pretty random.

Better would be to create an initial random vector for all functions to test with, but we don't do that yet. Either dump to disk, or change the cmdline API.

wangyi-fudan · 2020-03-10T23:48:29Z

well, it depends on you. just an additional benchmark that comes for free.

easyaspi314 · 2020-03-11T14:31:25Z

Any hash that returns zero on a zero length hash will pass this with flying colors, if I am reading this correctly, as the length will always be zero.

> Because hash function itself serves as a PRNG.

No, a PRNG serves as a PRNG.

Some hash functions can make a decent PRNG, but they serve entirely different purposes and should not be mixed up.

easyaspi314 · 2020-03-11T16:56:36Z

I say this benchmark is no good. 👎

Its speed and accuracy rely on what values the hash returns, and it does better if the hash is less random and has clear patterns causing the low bits to favor zero or near-zero values.

If the hash returns zero on a zero-length hash, it will always run at zero-length and the result will be meaningless.

Additionally, you can't rely on a random enough result with an array of zeroes.

If the "random" lengths were the same each time, that would be a decent benchmark.
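One way to get the same "random" lengths for every hash is to draw them once from a fixed-seed generator that is independent of the hash under test. The sketch below uses SplitMix64 for that; `fill_lengths` is a hypothetical helper, not an existing SMHasher function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One SplitMix64 step: a small, well-known PRNG unrelated to the hash being tested. */
static uint64_t splitmix64(uint64_t *state) {
    uint64_t z = (*state += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Fill `lens` with n lengths in 0..31 derived from a fixed seed.
   The same seed always yields the same series, whatever hash is timed later. */
static void fill_lengths(uint8_t *lens, size_t n, uint64_t seed) {
    assert(lens != NULL || n == 0);
    uint64_t state = seed;
    for (size_t i = 0; i < n; i++)
        lens[i] = (uint8_t)(splitmix64(&state) & 31);
}
```

Every hash would then be timed over the identical `lens` array, so differences in speed come from the hash and not from the lengths it happened to pick.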

easyaspi314 · 2020-03-11T17:16:56Z

@Cyan4973 thoughts?

Cyan4973 · 2020-03-11T17:36:15Z

The point that the benchmark outcome is driven by the produced series of lengths seems valid,
notably in the situation where hash(0) % 32 == 0, which "locks" the series to 0, resulting in a very favorable scenario compared to truly random lengths.
Note that any hash(n) % 32 == n will also lock the series to that specific length, defeating the purpose of this benchmark.
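The locking case is easy to reproduce with a toy example. In this sketch, `zero_on_empty` is a hypothetical hash that returns 0 for empty input, and `count_zero_lengths` is a hypothetical driver that feeds each result back as the next length; neither is taken from the PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unseeded hash signature, for illustration only. */
typedef uint64_t (*unseeded_hash_fn)(const void *data, size_t len);

/* Toy hash that returns 0 when len == 0 (the accumulator is never touched). */
static uint64_t zero_on_empty(const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 31 + p[i] + 1;
    return h;
}

/* Run the feedback loop and count how many rounds hashed a zero-length input. */
static uint64_t count_zero_lengths(unseeded_hash_fn h, uint64_t rounds) {
    static const unsigned char buf[32] = {0};
    uint64_t sum = 0, zeros = 0;
    for (uint64_t i = 0; i < rounds; i++) {
        size_t len = (size_t)(sum % 32);  /* length comes from the previous result */
        if (len == 0)
            zeros++;
        sum = h(buf, len);
    }
    return zeros;
}
```

With this toy hash, `count_zero_lengths(zero_on_empty, 1000)` reports that all 1000 rounds ran at length 0, so the measured speed says nothing about lengths 1 to 31.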

wangyi-fudan · 2020-03-11T23:25:52Z

A simple modification is to use (sum&31)+1 as the length. Since the seed is different each time, it cannot lock; a hash function that did lock would fail later quality tests.

Cyan4973 · 2020-03-12T01:08:33Z

This wouldn't change the issue that the series of lengths produced by each hash would be different,
making a speed comparison between them more difficult.

easyaspi314 · 2020-03-12T01:13:44Z

Also, what if len == 0 returns -1?

wangyi-fudan · 2020-03-12T04:56:47Z

> Also, what if len == 0 returns -1?

It will be fine. sum will be ~0ull or 0xffffffffffffffff due to the properties of uint64_t.
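The conversion referred to here is well defined in C: a negative value stored into a `uint64_t` is reduced modulo 2^64. A minimal sketch, with `as_sum` and `next_length` as hypothetical helper names and a 0-31 mask assumed:

```c
#include <stdint.h>

/* Store a signed hash result in the unsigned accumulator type. */
static uint64_t as_sum(int64_t hash_result) {
    return (uint64_t)hash_result;  /* well defined: value is reduced modulo 2^64 */
}

/* Length derived from the accumulator, assuming a 0-31 mask. */
static unsigned next_length(uint64_t sum) {
    return (unsigned)(sum & 31);
}
```

So a hash returning -1 for empty input produces an all-ones accumulator, and under this mask the next length would be 31 rather than 0.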

wangyi-fudan · 2020-03-12T05:03:02Z

> This wouldn't change the issue that the series of lengths produced by each hash would be different,
> making a speed comparison between them more difficult.

At least we need some realistic, consensus benchmarks with some randomness to reflect real situations. Current benchmarks are flawed, e.g. not inlined, or deterministic.

easyaspi314 · 2020-03-12T05:40:39Z

It still requires that hashes be PRNGs for a fair test.

They are not. They serve a different purpose. A hash function may work as a PRNG, but that has nothing to do with how strong they are.

easyaspi314 · 2020-03-12T19:46:56Z

> At least we need some realistic, consensus benchmarks with some randomness to reflect real situations. Current benchmarks are flawed, e.g. not inlined, or deterministic.

You do have a good point, I just don't think that non-deterministic "randomness" is a fair test.

pkhuong · 2022-03-03T21:48:26Z

https://github.com/backtrace-labs/umash/wiki/Execution-traces has some production traces of calls to the umash hash function. It might make sense to hardcode some of that call sequence.

darkk · 2024-08-28T14:00:29Z

@pkhuong I like the idea and I'm implementing something somewhat similar. Thanks for providing a reference to the dataset.

It partly addresses the question raised in #113: what is the "real" average cycles/hash value for a given hash function? We can't know, but we can estimate it better if we assume that the function's timing does not depend on the input (which is not true for hashes based on multiplication) and that we know the distribution of key lengths in advance (it may be roughly known for certain classes of inputs, but the distribution varies measurably across classes).

Weights from two datasets are hard-coded: DNS domain lengths and UMASH traces. A custom one can be passed via ENV{SMHASHER_SMALLKEY_WEIGHTS}.
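The estimate described above amounts to a weighted mean of per-length timings. The sketch below shows only the arithmetic; `weighted_cycles` is a hypothetical name, and neither the actual weights nor the format expected by SMHASHER_SMALLKEY_WEIGHTS is reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted mean of per-length timings.
   cycles[i]  : measured cycles/hash for key length i
   weights[i] : relative frequency of key length i (need not sum to 1) */
static double weighted_cycles(const double *cycles, const double *weights, size_t n) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += cycles[i] * weights[i];
        den += weights[i];
    }
    assert(den > 0.0);  /* at least one length must carry nonzero weight */
    return num / den;
}
```

For example, with made-up timings of 10, 20 and 30 cycles and weights 1, 1 and 2, the result is (10 + 20 + 60) / 4 = 22.5 cycles/hash. Swapping in a different length distribution changes the figure without re-running the timings.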

Add files via upload

6753bd3

rurban changed the title ~~0-31 Btyes Random Length Benchmark~~ 0-31 Byte Random Length Benchmark Jun 21, 2023

darkk mentioned this pull request Aug 30, 2024

Split test=Speed into SpeedBulk and SpeedSmall and report weighted average for Small key speed test #293

Merged
