Hash now uses an open addressing algorithm #8017
Conversation
Using `#as` will provide more type information in case of errors compared to `#not_nil!`.
This improves its performance, both in time and memory.
(force-pushed from 368e083 to d4c35de)
I seem to recall @funny-falcon ran into some issues with the GC finding false positives while he was working on this. That is not something you have seen?
This gives a nice speedup on our internal app for log processing! @bcardiff will this be considered for 0.30? Time in seconds:

Details:
```
Using compiled compiler at `.build/crystal'
Crystal 0.30.0-dev [ec2a26f9a] (2019-07-31)
LLVM: 8.0.0
Default target: x86_64-apple-macosx

Crystal 0.29.0 (2019-06-06)
LLVM: 6.0.1
Default target: x86_64-apple-macosx
```
I wish GitHub had properly threaded review comments. This is a comment on my own review comment: I note that you get lots of repeated allocation warnings and one CI failure due to out of memory.
```crystal
# Computes the next index in `@indices`, needed when an index is not empty.
private def next_index(index : Int32) : Int32
  fit_in_indices(index + 1)
end
```
"linear probing" is a bad idea.
Use "quadratic probing" instead. It has same good cache locality on first three probes as for linear probing.
In Ruby they tried quadratic probing and it turned out to be a bit slower, so they use linear probing. That's why I also chose it, and it's simpler.
Ref: https://github.com/ruby/ruby/blob/c94cc6d968a7241a487591a9753b171d8081d335/st.c#L869-L872
But if you want we can try it. I know nothing about quadratic probing. Is it just `index`, `index + 1`, `index + 4`, `index + 9`, `index + 16`, etc.?
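For what it's worth, a sketch of what quadratic (triangular) probing could look like on top of the PR's `fit_in_indices` (the `probe` parameter is an assumption, not existing code):

```crystal
# Sketch of quadratic probing via triangular numbers: offsets from the
# home bucket grow as 1, 3, 6, 10, ... With a power-of-two table size
# this sequence still visits every slot, and the first few probes stay
# in the same cache lines as linear probing would.
private def next_index(index : Int32, probe : Int32) : Int32
  fit_in_indices(index + probe)
end

# The caller would increment `probe` by one on each collision:
#   probe += 1
#   index = next_index(index, probe)
```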
According to the ref they aren't using linear probing now; they are using double hashing (see "secondary hash" in the comments).
It's true! However I don't understand how that works:
> Still other bits of the hash value are used when the mapping
> results in a collision. In this case we use a secondary hash
> value which is a result of a function of the collision bin
> index and the original hash value. The function choice
> guarantees that we can traverse all bins and finally find the
> corresponding bin as after several iterations the function
> becomes a full cycle linear congruential generator because it
> satisfies requirements of the Hull-Dobell theorem.
I don't know that theorem, so it's harder for me to implement something I don't understand. But I'll definitely keep it in mind for further optimizing this (it'll take me some more time, but PRs are welcome!).
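For the record, a sketch of the secondary-hash recurrence as I read it in Ruby's st.c (the Crystal names here are hypothetical): the probe sequence is `ind = 5 * ind + perturb + 1 (mod 2^k)`, where `perturb` starts as the full hash value and is shifted away over time.

```crystal
# Hypothetical Crystal rendering of Ruby st.c's secondary hash.
# Mixing in `perturb` spreads the first probes using extra hash bits;
# once `perturb` reaches 0 the recurrence degenerates to
# ind = 5 * ind + 1 (mod 2^k), which the Hull-Dobell theorem makes a
# full-cycle LCG (increment 1 is odd, multiplier - 1 = 4 is divisible
# by 4), so every bin is eventually visited.
private def secondary_hash(index : Int32, perturb : UInt64) : {Int32, UInt64}
  perturb >>= 11
  index = fit_in_indices((index &* 5) &+ perturb.to_i32! &+ 1)
  {index, perturb}
end
```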
The OOM failure on test_linux32 is pretty regular and unrelated to this PR.
```crystal
# If we have less than 8 elements we avoid computing the hash
# code and directly compare the keys (might be cheaper than
# computing a hash code of a complex structure).
if entries_size <= 8
```
I doubt it is useful. I could be mistaken.
Was it benchmarked?
Yes. I just benchmarked it again because I introduced that optimization some time ago, before introducing more optimizations and changing other things.
I ran the same benchmarks as in my main post but only for String. The benchmarks for insert don't show a difference. For read it gives this (old is without the optimization, new is with it):
```
old, read (type: String, size: 1)  70.65M ( 14.15ns) (± 2.49%)  0.0B/op  2.68× slower
new, read (type: String, size: 1) 189.05M (  5.29ns) (± 3.89%)  0.0B/op  fastest
old, read (type: String, size: 2)  39.07M ( 25.59ns) (± 2.05%)  0.0B/op  2.26× slower
new, read (type: String, size: 2)  88.16M ( 11.34ns) (± 4.01%)  0.0B/op  fastest
old, read (type: String, size: 3)  26.56M ( 37.65ns) (± 1.84%)  0.0B/op  2.00× slower
new, read (type: String, size: 3)  53.24M ( 18.78ns) (± 1.89%)  0.0B/op  fastest
old, read (type: String, size: 4)  19.98M ( 50.06ns) (± 2.17%)  0.0B/op  1.72× slower
new, read (type: String, size: 4)  34.35M ( 29.11ns) (± 3.70%)  0.0B/op  fastest
old, read (type: String, size: 5)  16.00M ( 62.49ns) (± 1.61%)  0.0B/op  1.49× slower
new, read (type: String, size: 5)  23.78M ( 42.05ns) (± 2.05%)  0.0B/op  fastest
old, read (type: String, size: 6)  13.11M ( 76.25ns) (± 1.49%)  0.0B/op  1.39× slower
new, read (type: String, size: 6)  18.17M ( 55.02ns) (± 2.68%)  0.0B/op  fastest
old, read (type: String, size: 7)  11.21M ( 89.24ns) (± 1.37%)  0.0B/op  1.22× slower
new, read (type: String, size: 7)  13.71M ( 72.93ns) (± 3.94%)  0.0B/op  fastest
old, read (type: String, size: 8)   9.57M (104.47ns) (± 3.21%)  0.0B/op  1.16× slower
new, read (type: String, size: 8)  11.14M ( 89.77ns) (± 3.77%)  0.0B/op  fastest
```
So you can see it's a big optimization. You can also see that comparing the hash code first catches up as there are more and more elements: starting from 9 elements it's almost the same.
I wonder if this optimization could also be applied in the Ruby code... 🤔
I might give it a try later (in Ruby), now I'm curious!
Actually, I think if the key type is more complex and `==` takes more time then it will be slower. So maybe it should only be applied in a few cases... maybe just do it if the key is a String, because we know it's true there.
I just tried it with keys being `Array(String)` and the differences are actually much bigger, meaning that it's faster to avoid checking the hash code altogether for some reason... probably because to compute the hash code you need to consider the entire structure, but to compare elements you bail out as soon as you find a difference.
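A minimal sketch of the shape of this fast path (`each_entry`, `Entry`, and the method name are assumptions, not the PR's exact API):

```crystal
# Sketch of the small-hash fast path: skip hashing the key entirely and
# linearly compare against each live entry. `==` can bail out at the
# first difference, while a hash code must consume the whole key.
private def find_entry_linear(key) : Entry(K, V)?
  each_entry do |entry|
    return entry if entry.key == key
  end
  nil
end
```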
@straight-shoota Are all the warnings about large allocations also old?
src/hash.cr (Outdated)

```diff
   end
 end

 private module BaseIterator
-  def initialize(@hash, @current)
+  def initialize(@hash)
+    @index = 0
```
I'd rather track the index of the first element (and update it in `delete_entry_and_update_counts`).
If one calls `hash.shift` in a loop, they will suffer from O(n^2) behavior without that (queue or LRU use cases).
If `@indices_size_pow2` is changed to `UInt8` and placed near `@indices_bytesize`, then adding `@first : UInt32` will not change the size of the hash structure.
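A sketch of the suggestion (assumed shapes; `entries_size` and `deleted?` are stand-ins for whatever the PR actually exposes):

```crystal
# Sketch: `@first` remembers the index of the first live entry, so a
# `shift` loop skips already-deleted slots instead of rescanning from 0,
# making repeated shifts amortized O(1) instead of O(n) each.
private def first_entry?
  @first.upto(entries_size - 1) do |i|
    entry = @entries[i]
    unless entry.deleted? # stand-in for the PR's tombstone check
      @first = i          # remember where we found it for next time
      return entry
    end
  end
  nil
end
```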
I'm sorry but I don't understand what you mean.
Ah, yes, I see what you mean. I think we should either remove `Hash#shift` or implement your suggestion. We copied `Hash#shift` from Ruby but I don't know what a real use for it is.
Done in 01e072c!
@funny-falcon I'll be happy if you can review the commit. The change turned out to be pretty simple thanks to the abstractions already in place.
Rather clean code! I enjoyed reading it.
Thank you! 😊 And thank you for the thorough review, you found many good things to fix.
Yes, but I think the errors were there before. I think…
Thank you @funny-falcon for the idea ❤️
Previously we didn't try to compact small hashes when a resize was needed. However, we can do that when there are many deleted elements, to avoid a reallocation. This should improve performance when shifting or deleting elements and then inserting more elements.
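A sketch of the decision this commit message describes (the helper names and the threshold are hypothetical):

```crystal
# Sketch: when a resize is triggered but many entries are tombstones,
# compact in place instead of growing the allocation.
private def resize
  if deleted_count >= entries_size // 2 # hypothetical threshold
    compact_entries!                    # reuse the existing buffer
  else
    double_entries_capacity             # grow as usual
  end
end
```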
@funny-falcon thank you! The above articles on RH hashing weren't entirely clear on this. Memory overhead of Hash is important too, and I'd be interested in a graph of the allocation size of…
@funny-falcon Thank you! I implemented all your suggestions. Let me know if I got something wrong 😊 Although I think that many things can still be optimized (the representation, the probing, the load factor, etc.), I think this is already a good improvement and we can continue refining things after this PR.
Absolutely wonderful, thank you so much for your hard work on this, and all you do for Crystal, Ary.
Could someone explain to me with some ASCII art or graphics how Robin Hood hashing works?
I can't understand this. Or... I can understand what it says but not what it means. Let's suppose we start with empty buckets:
Now we want to insert 10, which hashes to bucket 1:
Say we want to insert 20, which hashes to bucket 1 too... What do we do? The probe count for 10 is 0 (I guess?) and for 20 we don't know yet but it's definitely higher than 0 because it's occupied. So we swap them...?
Next comes 30, again with bucket 1. So 20 has probe count 0, 10 has probe count 1, and 30 will have probe count 2... so we swap everything again?
I can't see the benefit of this, but I'm sure there's something I'm not understanding.
In Robin Hood hashing, we compare the existing element's probe count with the current probe count of the element being inserted.
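To illustrate (a worked example, not from the original comment):

```
Insert 10 (home bucket 1):   [_, 10, _, _]    probe counts: [-, 0, -, -]
Insert 20 (home bucket 1):
  bucket 1: resident 10 has probe 0, incoming 20 has probe 0 -> no swap
  bucket 2 is free:          [_, 10, 20, _]   probe counts: [-, 0, 1, -]
Insert 30 (home bucket 1):
  bucket 1: 10 has probe 0, 30 has probe 0 -> no swap
  bucket 2: 20 has probe 1, 30 has probe 1 -> no swap
  bucket 3 is free:          [_, 10, 20, 30]  probe counts: [-, 0, 1, 2]
```

A swap only happens when the incoming element's probe count exceeds the resident's (take from the "rich" short-probe entries, give to the "poor"), which keeps the variance of probe lengths low.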
@konovod Thanks, I understand now!
Thank you everyone for the suggestions! If you find optimizations please send PRs and benchmarks.
Some optimization ideas:
Those are good points @asterite; better to create an issue to track them. It could even be an RFC.
No. If someone finds an optimization please send a PR with a benchmark. I don't plan to change anything else otherwise, and there's no need to track anything.
Well said.
I think it would be better to have proper RFCs about the design decisions, instead of PRs, just a thought.
The default initial_capacity was changed to 8 in #8017
This improves its performance, both in time (always) and memory (generally).
Fixes #4557
The algorithm
The algorithm is basically the one that Ruby uses.
How to review this
There's just a single commit. My recommendation is to first look at the final code, which is thoroughly documented (the important bits are at the top), then at the diff.
I tried to document this really well because otherwise in a month I won't remember any of this. It also enables anyone to understand how it works and possibly keep optimizing things.
Why?
The old Hash implementation uses closed addressing: buckets with linked lists. This apparently isn't great because of pointer chasing and poor cache locality. Using open addressing removes the pointer chasing and improves cache locality. That may just be theory, but empirically it performs much better.
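Roughly, the difference in layout (illustrative types, not the actual ones in the PR):

```crystal
# Closed addressing (the old Hash): each bucket heads a linked list of
# entries, so a lookup may chase pointers scattered across the heap.
class ClosedEntry(K, V)
  getter key : K
  getter value : V
  property next_entry : ClosedEntry(K, V)?

  def initialize(@key, @value, @next_entry = nil)
  end
end

# Open addressing (the new Hash): entries live in one contiguous buffer;
# a collision probes a neighboring slot, which is usually already in
# cache.
record OpenEntry(K, V), hash : UInt32, key : K, value : V
# entries : Pointer(OpenEntry(K, V))  # a single flat allocation
```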
Give me some benchmarks!
I used this code as a benchmark:
Code for benchmarking old Hash vs. new Hash
Results:
As you can see the new implementation is always faster than the old one. Sometimes more memory is used, sometimes less.
I also ran some of @kostya's benchmarks that use Hash in their implementation. Here are the results:
Havlak:
Havlak seems to be a benchmark measuring how well a language performs in general algorithmic tasks... the new results look good! 😊
Brainfuck:
JSON (when using `JSON.parse`):
Knucleotide:
Then some more benchmarks...
There's `HTTP::Request#from_io`, which I recently optimized (using `wrk` against the sample HTTP server increases the requests/sec from 118355.73 to about 122000).
Also, out of curiosity, I compared creating a Hash with 1_000_000 elements and seeing how it compares to Ruby and the old Hash.
Code for creating a Hash with 1_000_000 elements in Ruby and Crystal
Results:
Ruby was faster than Crystal! Ruby is simply amazing ❤️. But now Crystal is even faster!
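(The original snippet is collapsed above; an illustrative minimal version of such a benchmark, on the Crystal side, might look like this:)

```crystal
# Illustrative only, not the collapsed snippet: build a Hash with
# 1_000_000 entries and time it.
time = Time.measure do
  hash = {} of Int32 => Int32
  1_000_000.times do |i|
    hash[i] = i
  end
end
puts "#{time.total_seconds}s"
```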
The compiler uses Hash all over the place!
So compile times now should go down! Right? ... Right?
Well, unfortunately no. I think the main reason is that the times are bound by the number of method instances, not by the performance of the many hashes used.
Memory consumption did seem to go down a bit.
When will I have this?
We could push this to 0.31.0. Or... we could have it in 0.30.0 if we delay the release a bit more (note: I don't manage releases, but if the community doesn't mind waiting a bit to get a huge performance boost in their apps then I think we could relax the release date).

In 0.31.0.
Final thoughts
Avoiding `remainder` or `%` to fit a number inside a range (masking against a power-of-two size instead)... I did that almost as the last thing because I didn't believe it would improve performance a lot... and it doubled the performance! 😮
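To make the trick concrete, a small sketch (assuming a power-of-two size, as the PR's indices use; `fit_with_mod`/`fit_with_mask` are illustrative names):

```crystal
# `%` on integers compiles down to a division; with a power-of-two size
# the same fit is a single AND instruction.
def fit_with_mod(index : Int32, size : Int32) : Int32
  index % size
end

def fit_with_mask(index : Int32, size : Int32) : Int32
  index & (size - 1) # valid only when size is a power of two
end

size = 1024
puts fit_with_mod(1030, size)  # => 6
puts fit_with_mask(1030, size) # => 6
```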