Full Station Name Hash Requirement? #118
Hey Arnaud (@synopse), I was not aware of this. Thank you very much for making me aware of it. I think most of the entries are using the full name comparison. I'm quite tired right now, so I'll think about the wording after I wake up. Cheers,
https://github.com/gunnarmorling/1brc?tab=readme-ov-file#rules-and-limits
Note that we are not consistent with line endings (we use CR+LF whereas the original only uses LF).
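As a minimal sketch of how a parser could tolerate both conventions, the snippet below splits a buffer on LF and drops an optional trailing CR. It is written in Python purely for illustration; the function name `split_records` and the sample data are hypothetical and not taken from any entry.

```python
def split_records(buf: bytes):
    """Yield (station, temperature_text) pairs from 'name;temp' lines.

    Accepts both LF and CR+LF line endings.
    """
    for raw in buf.split(b"\n"):
        # Drop the CR left behind when the file uses CR+LF.
        line = raw[:-1] if raw.endswith(b"\r") else raw
        if not line:
            continue  # skip the empty chunk after the final newline
        name, _, temp = line.partition(b";")
        yield name.decode("utf-8"), temp.decode("ascii")


if __name__ == "__main__":
    mixed = b"Oslo;1.5\r\nRome;30.0\n"
    print(list(split_records(mixed)))
```

An entry that hard-codes a two-byte line terminator would need a change of this kind before it could read an LF-only file.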
So if you add this requirement, the following would be valid:
But others won't.
And if you add this requirement, it seems that the following entries won't pass any more, because they also use the "perfect hash" trick and don't compare the station names:
Mmm, I was looking to implement fletch32 as an alternative to crc32, to see if it might be faster (and if it works at all without collisions). I will await your decision before proceeding.
BTW
Hello Arnaud, Using even more of mORMot's powerful (and fast!) features almost feels like a cheat, at least towards those not using it at all. I don't really know what to think of it...
@georges-hatem
@synopse
Hey Arnaud (@synopse), Since we have most entries matching the Is your ask related to comparison with the Java challenge. I would like to hear your opinion on this. Cheers,
It is not about Java. The 1BRC challenge went far beyond that point and went on to involve a lot of other languages, and some C, DotNet and Rust proposals have surpassed the best Java solutions. As it should. ;) This 1BRC is a comparison benchmark, and to be fair, it should apply the same rules to every competitor. So I would like our 1BRC challenge to follow the initial requirements, which were fair and plain to my understanding.
I would not be happy if we added a new rule about full station name hashing now, because we are more than halfway through the challenge period. And I would not understand why we accept 41343 different city names and 32 threads, while the original challenge has only 400 different cities and only 8 threads.

So here comes my suggestion for a compromise: Gus suggested in the forum a new command line switch "-4 or --400-stations" for our entries to be compatible with the original challenge, and suggested using an alternative results table for their timing values. My idea is: let us combine the new rule about full station name hashing with this command line switch. Then we have "our" results table, for 41343 different city names and 32 threads with "perfect hash" allowed, and an alternative results table, which can be 100% compatible with the original challenge (including the CR/LF issue).

Competitors should be free to implement the new command line switch "-4 or --400-stations" or not. Those who do not will only appear in "our" results table and not in the alternative results table. I think this would be a good solution.
Understandable. For those who wish, an alternate result 100% compatible with the Java requirements, to be compared with the best implementations in other languages. Alternatively, for those who wish, the results as they are right now. At the end of the day, it offers the choice to rework or not, at the cost of extra management in running the tests and automation.
Instead of Or even better, instead of a command line switch, a new If we got two tables, one with our requirements and one with alternative/java comparison, I would be happy enough. ;)
@gcarreno I think, however, that we should go to the end of the line
Edit: I added a PR with the ability to change the end of the line: #120
Hey Y'all, I'm glad I took my time to answer some more on this issue. This gave time for other members of the challenge to voice their opinions and give me a bit more of a thumb on the pulse of things. One thing I agree with Hartmut (@hg747) on is that we are too far into this to add a rule for the full name comparison. Nonetheless, if any person wants to go that way and have a switch that will trigger that code path, I'm quite happy to do some private testing in a not so quiet system. On the One thing I completely agree with is Pawel (@paweld)'s proposal to switch from I'll have a look at is mentioned Cheers,
Hello Gus, did I understand this correctly, that the next official run on Saturday will be with a new input file, which has only LF instead of CR/LF, so that everyone must update their code to match this (if necessary)? Cheers,
Hey Hartmut (@hg747), You're absolutely correct!! Dunno if I'm gonna enforce Cheers,
FYI the hash of the measurements file with only LF is: Surprisingly, it is slower in absolute/wall time to process this smaller file with my entry than the bigger CR+LF version.
I have updated the hash in the main README.md in my pending pull request.
Hey Arnaud (@synopse), Merci bien!! Cheers,
So could we close this issue? TL;WR:
Hey Y'all,
Pretty much!! Cheers,
We need to discuss the requirement of "full station name hash".
In (most of) my entries I use the "perfect hash" trick, i.e. I only compare the 32 bits of the hash to check for a given station name. With a good enough hash function (e.g. crc32c), it works perfectly fine with our current dataset of 10K stations, and gives the correct output results. BUT someone could add a line to the dataset with a forged name triggering a hash collision. Then the results would be inaccurate...
In the original 1BRC challenge, this trick was disallowed, and they rejected any solution not explicitly comparing the station names char by char.
gunnarmorling/1brc#495 (reply in thread)
So in my entry, I made this process flow available, and we can compare plain ./abouchez and ./abouchez -f, the latter making a full name comparison, but slower (1.96s vs 1.10s on my Intel PC).
To be fair with the original comparison, I would recommend requiring a full station name comparison.
It makes the timings slower, but is IMHO closer to what we expect from real-world work.
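To make the difference between the two code paths concrete, here is a minimal sketch in Python (for illustration only, not the code of any entry). It assumes an open-addressing table with linear probing, uses `zlib.crc32` as a stand-in for crc32c, and the names `aggregate`, `full_compare` and `hash_fn` are hypothetical. With `full_compare=False` a slot is accepted as soon as the 32-bit hash matches; with `full_compare=True` the stored name must match too.

```python
import zlib


def aggregate(lines, full_compare, hash_fn=zlib.crc32, capacity=1 << 12):
    """Aggregate 'name;temp' lines into {name: (min, max, sum, count)}.

    capacity must exceed the number of distinct stations,
    otherwise the probing loop below never terminates.
    """
    slots = [None] * capacity  # slot layout: [hash, name, min, max, sum, count]
    for line in lines:
        name, _, value = line.partition(";")
        temp = float(value)
        h = hash_fn(name.encode("utf-8")) & 0xFFFFFFFF
        i = h % capacity
        while True:
            slot = slots[i]
            if slot is None:
                slots[i] = [h, name, temp, temp, temp, 1]
                break
            # Hash-only mode trusts the 32-bit hash alone;
            # full mode additionally compares the station name.
            if slot[0] == h and (not full_compare or slot[1] == name):
                slot[2] = min(slot[2], temp)
                slot[3] = max(slot[3], temp)
                slot[4] += temp
                slot[5] += 1
                break
            i = (i + 1) % capacity  # linear probing
    return {s[1]: tuple(s[2:]) for s in slots if s is not None}


if __name__ == "__main__":
    # A deliberately weak hash (name length) forces a collision
    # between two different stations, standing in for a forged name.
    weak = lambda b: len(b)
    data = ["Oslo;1.0", "Rome;30.0"]
    print(aggregate(data, full_compare=False, hash_fn=weak))
    print(aggregate(data, full_compare=True, hash_fn=weak))
```

With the colliding hash, the hash-only path silently merges both stations under the first name it saw, while the full comparison keeps them apart at the cost of one extra string comparison per lookup.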