Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full Station Name Hash Requirement? #118

Closed
synopse opened this issue Apr 22, 2024 · 23 comments
Closed

Full Station Name Hash Requirement? #118

synopse opened this issue Apr 22, 2024 · 23 comments

Comments

@synopse
Copy link
Contributor

synopse commented Apr 22, 2024

We need to discuss about the requirement of "full station name hash".

In (most of) my entries I use the "perfect hash" trick, i.e. only compare the 32-bit of the hash to check for a given station name. With a good enough hash function (e.g. crc32c), it works perfectly fine with our current dataset of 10K stations, and give the correct output results. BUT we may be able to add a line to the dataset with a forged name triggering a hash collision. Then the results would be inaccurate...

In the original 1BRC challenge, this trick was disallowed, and they rejected any solution not explicitly comparing the station names char by char.
gunnarmorling/1brc#495 (reply in thread)

So in my entry, I made this process flow available, and we can compare plain ./abouchez and ./abouchez -f - the later making a full name comparison, but lower (1.96s vs 1.10s on my Intel PC).

To be fair with the original comparison, I would recommend to require a full station name comparison.
It makes numbers lower, but is IMHO more accurate with what we expect on real work.

@gcarreno
Copy link
Collaborator

Hey Arnaud(@synopse),

I was not aware of this. Thank you very much for making me aware of it.

I think most of the entries are using the full name comparison.
But it's not a bad idea to make it a rule.

I'm quite tired right now, so I'll think about the wording after I wake up.
In the mean time, if you have any suggestions on how this needs to be worded, please give me your opinion!

Cheers,
Gus

@synopse
Copy link
Contributor Author

synopse commented Apr 22, 2024

https://github.com/gunnarmorling/1brc?tab=readme-ov-file#rules-and-limits
Seems a good point to start. ;)

Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported.

Note that we are not consistent with line ending (we use CR+LF whereas the original only uses LF).
So we can't compare with other languages entries - which is a shame. Perhaps a future new issue. :D

@synopse
Copy link
Contributor Author

synopse commented Apr 22, 2024

So if you add this requirement, the following would be valid:

./abouchez measurements.txt -a -f
./abouchezfullcheck measurements.txt
./abouchezold measurements.txt

But other won't.

@synopse
Copy link
Contributor Author

synopse commented Apr 22, 2024

And if you add this requirements, it seems that the following entries won't pass any more, because they also use the "perfect hash" trick and don't compare the station names:

  • ghatem
  • hgrosser
  • ocoddo
  • sbalazs

@georges-hatem
Copy link
Contributor

Mmm I was looking to implement fletch32 as alternative to crc32, to see if it might be faster (and if it works at all without collisions).

Will await your decisions before proceeding.

@synopse
Copy link
Contributor Author

synopse commented Apr 22, 2024

Mmm I was looking to implement fletch32 as alternative to crc32, to see if it might be faster (and if it works at all without collisions).

BTW crc32() from plain FPC crc.pp unit is likely to be slow, because it uses no HW acceleration, and only a single table lookup array.
You may consider crc32c() or xxhash32() from mormot.core.base instead, since you are already using mORMot.
On my PC: crc32c single table: 583.6MB/s multi table:2.8GB/s sse42:17.8GB/s - note this is crc32c() not crc32().
And xxhash32: 8.1GB/s

@georges-hatem
Copy link
Contributor

georges-hatem commented Apr 22, 2024

BTW crc32() from plain FPC crc.pp unit is likely to be slow, because it uses no HW acceleration, and only a single table lookup array. You may consider crc32c() or xxhash32() from mormot.core.base instead, since you are already using mORMot. On my PC: crc32c single table: 583.6MB/s multi table:2.8GB/s sse42:17.8GB/s - note this is crc32c() not crc32(). And xxhash32: 8.1GB/s

Hello Arnaud,
Thank you for pointing these out.
Indeed, crc32c improves my overall performance by another 10%. However, the only thing holding me back is that I'm shamelessly using mORMot for the MemMap, when I failed to make a manual CreateFileMapping for files larger than 3.x GB (due to the limit of the Cardinal type used internally).

Using even more of mORMot's powerful (and fast!) features almost feels like a cheat, at least towards those not using it at all. I don't really know what to think of it...

@synopse
Copy link
Contributor Author

synopse commented Apr 22, 2024

@georges-hatem
About hashes, see what ocoodo did for its tests:
https://github.com/gcarreno/1brc-ObjectPascal/blob/5ba244d18c14230d2d1b7154474f119f046cba9c/entries/ocoddo/src/UHashes.pas
You have some stand-alone code here. ;)
edit/warning: by looking at the code, I realized that most of this asm is for Win64 only - but crc32csse42()

@georges-hatem
Copy link
Contributor

@synopse
hah, even OCoddo's crc32csse42() is from mORMot :)
more specifically, the time taken by plain crc32 is 4.5 seconds, while mORMot's crc32c takes 3 seconds, so around 33% faster on my system.
will go ahead and commit that change, thanks Arnaud!

@gcarreno
Copy link
Collaborator

Hey Arnaud(@synopse),

Since we have most entries matching the SHA256 for the output, I'm guessing that collision is not our problem.

Is your ask related to comparison with the Java challenge.
And if we don't, completely, give a rat's behind about the Java challenge, does it even matter?

I would like to hear your opinion on this.

Cheers,
Gus

@synopse
Copy link
Contributor Author

synopse commented Apr 22, 2024

It is not about Java. The 1BRC challenge went far beyond that point, and went to involve a lot of other languages, and some C, DotNet and Rust proposals have overstepped the best of Java solutions. As it should. ;)

This 1BRC is a comparison benchmark, and to be fair, it should play with the same rules for every competitor.
We could stay in our own "pure pascal" very small town, or we could compete with the whole universe of languages.
Usually, pascal as a language is seen as old, slow and outdated. Only C or Rust could be fast, and Java and DotNet could be universal.
If we can compete with the others, we open our mind and challenge ourselves even more.
If we compete with others, others may see that pascal is still worth investigating... and investing... 🗡️
This is why we compete in the TFB challenge with good ranking https://www.techempower.com/benchmarks/#section=test&runid=41fc2005-66fd-42be-bf67-1982f8c2a134&hw=ph&test=composite

So I would like our 1BRC challenge to follow the initial requirements, which were fair and plain to my understanding.
Currently, I see at least some missing points:

  • require full station name hashing;
  • end lines with LF only (not CR + LF) to be able to compete with other challengers on the same PC;
  • include the best Java, C, Rust and DotNet results on the same HW to our list, for comparison/information.

@hg747
Copy link
Contributor

hg747 commented Apr 23, 2024

I would not be happy, if we add a new rule about full station name hashing now, because we are at more than half of the challenge period. And I would not understand, why we accept to have 41343 different city names and 32 threads, while the original challenge has only 400 different cities and only 8 threads.

So here comes my suggestion for a compromise:

Gus suggested in the forum a new command line switch "-4 or --400-stations" for our entries to be compatible with the original challenge and suggested to use an alternative results table for their timing values. My idea is: let us combine the new rule about full station name hashing with this command line switch.

Then we have "our" result table for 41343 different city names and 32 threads and "perfect hash" allowed and an alternative results table, which can be 100% compatible to the original challenge (including the CR/LF issue).

It should be free to the competitors to install the new command line switch "-4 or --400-stations" or not. Who does not, will only appear in "our" result table and not in the alternative results table.

I think this would be a good solution.

@georges-hatem
Copy link
Contributor

Understandable. For those who wish, an alternate result 100% compatible with the Java requirements, to be compared with the best implementations of other languages.

Alternatively for those who wish, the results as they are right now.

At the end of the day, it offers the choice of rework/not, at the cost of extra management in the running of tests + automation

@synopse
Copy link
Contributor Author

synopse commented Apr 23, 2024

Instead of -4, I would suggest --compatibility or --legacy because it is not only about 400 stations (which is pointless in fact), but mostly about the line feeds itself.

Or even better, instead of a command line switch, a new Legacy compilation mode (in addition to Debug and Release), perhaps with a conditional like -DLEGACY1BRC with proper {$ifdef LEGACY1BRC} .. {$endif} in the source code, because it would not affect the performance by adapting to the other challenge.

If we got two tables, one with our requirements and one with alternative/java comparison, I would be happy enough. ;)

@paweld
Copy link
Contributor

paweld commented Apr 23, 2024

@gcarreno I think, however, that we should go to the end of the line \n instead of \r\n - in addition to maintaining compatibility, it will also reduce the file by about 0.931 GB.
I know it's closer to the end, but you should have no problem considering both solutions - using \r\n or \n as a line split

Edit: I added PR with the ability to change the end of the line #120

@gcarreno
Copy link
Collaborator

Hey Y'all,

I'm glad I took my time to answer some more on this issue. This gave time for other members of the challenge to voice their opinions and give me a bit more of a thumb on the pulse of things.

One thing I agree with Hartmut(@hg747) is that we are too far into this to add a rule for the full name comparison. Nonetheless, if any person wants to go that way and have a switch that will trigger that code path, I'm quite happy to do some private testing in a not so quiet system.

On the -4 or --400-stations switch, this was only a suggestion. You are completely free to use whatever name for the switch you want, of course.

One thing I completely agree with is Pawel(@paweld)'s proposal to switch from CRLF to LF only.

I'll have a look at is mentioned PR and then I'll post the change everywhere.
I'll also re-generate the official input file and the 400 stations one.
Finally I'll update the SHA256 of the input file on the README.md

Cheers,
Gus

@hg747
Copy link
Contributor

hg747 commented Apr 24, 2024

Hello Gus,

did I understand this correctly, that the next official run on Saturday will be with a new input file, which has only LF instead of CR/LF, so that everyone must update his code to match this (if neccessary)?

Cheers,
Hartmut

@gcarreno
Copy link
Collaborator

Hey Hartmut(@hg747),

You're absolutely correct!!
I've merged @paweld code that has LF as the default.
After I've had some sleep, I'll generate our 41K and the 400 stations file again.
Then I'll put up the new SHA256 for us all to be on the same page.

Dunno if I'm gonna enforce LF for next Saturday. Maybe a bit of a short notice.
I'll give it a bit more time for people to react.
But definitely for the Saturday after the next!!

Cheers,
Gus

@synopse
Copy link
Contributor Author

synopse commented Apr 24, 2024

FYI the hash of the measurements file with only LF is:
2b48bc2fa0b82d748925a820f43f75df01cc06df7447c7571e52d3962e675960

Surprisingly, it is slower in absolute/wall time to process this smaller file with my entry than the bigger CR+LF version.

@synopse
Copy link
Contributor Author

synopse commented Apr 24, 2024

I have update the hash in the main README.md in my pending pull request.
#124

@gcarreno
Copy link
Collaborator

Hey Arnaud(@synopse),

Merci bien!!
Thanks for that !!!

Cheers,
Gus

@synopse
Copy link
Contributor Author

synopse commented Apr 24, 2024

So could we close this issue?

TL;WR:

  • No need to add the full station name hash requirement;
  • Move to single LF instead of CR+LF between measurements rows.

@gcarreno
Copy link
Collaborator

Hey Y'all,

TL;WR:

  • No need to add the full station name hash requirement;
  • Move to single LF instead of CR+LF between measurements rows.

Pretty much!!
Closing!

Cheers,
Gus

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants