Hash now uses an open addressing algorithm #8017

Merged
merged 21 commits into from
Aug 4, 2019

asterite
Member

@asterite asterite commented Jul 30, 2019

This improves its performance, both in time (always) and memory (generally).

Fixes #4557

The algorithm

The algorithm is basically the one that Ruby uses.

How to review this

There's just a single commit. My recommendation is to first look at the final code, which is thoroughly documented (the important bits are at the top), then at the diff.

I tried to document this really well because otherwise in a month I won't remember any of this. It also enables anyone to understand how it works and possibly keep optimizing things.

Why?

The old Hash implementation uses closed addressing: buckets with linked lists. This apparently isn't great because of pointer chasing and not having good cache locality. Using open addressing removes pointer chasing and improves cache locality. It could just be a theory but empirically it performs much better.
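
To make the contrast concrete, here is a minimal sketch of open addressing with linear probing, written in Ruby (the language this thread already uses for its comparison script). It is not the layout this PR implements: the class name `OpenTable`, the fixed capacity, and the lack of resizing and deletion are all simplifying assumptions.

```ruby
# Illustrative only: a fixed-size open-addressing table with linear probing.
# Assumptions: keys are never nil, nothing is deleted, the table never grows.
class OpenTable
  def initialize(capacity = 8)
    @keys = Array.new(capacity)    # flat arrays instead of per-bucket linked lists
    @values = Array.new(capacity)
    @size = 0
  end

  def []=(key, value)
    index = slot_for(key)
    if @keys[index].nil?
      # Always keep one slot empty so that probing for a missing key terminates.
      raise "table is full" if @size + 1 >= @keys.size
      @keys[index] = key
      @size += 1
    end
    @values[index] = value
  end

  def [](key)
    index = slot_for(key)
    @keys[index].nil? ? nil : @values[index]
  end

  private

  # Starts at the key's home slot and steps to the next slot until it finds
  # either the key itself or an empty slot. No linked-list nodes are followed.
  def slot_for(key)
    index = key.hash % @keys.size
    index = (index + 1) % @keys.size until @keys[index].nil? || @keys[index] == key
    index
  end
end

table = OpenTable.new
table["a"] = 1
table["b"] = 2
table["a"] = 3
p table["a"], table["b"], table["missing"]
```

A collision here costs a step to the neighbouring slot of the same array, where a chained table would follow a pointer to a separately allocated node.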

Give me some benchmarks!

I used this code as a benchmark:

Code for benchmarking old Hash vs. new Hash
require "benchmark"
require "./old_hash"
require "random/secure"

def benchmark_create_empty
  Benchmark.ips do |x|
    x.report("old, create") do
      OldHash(Int32, Int32).new
    end
    x.report("new, create") do
      Hash(String, String).new
    end
  end
end

def benchmark_insert_strings(sizes)
  sizes.each do |size|
    values = Array.new(size) { Random::Secure.hex }

    Benchmark.ips do |x|
      x.report("old, insert (type: String, size: #{size})") do
        hash = OldHash(String, String).new
        values.each do |value|
          hash[value] = value
        end
      end
      x.report("new, insert (type: String, size: #{size})") do
        hash = Hash(String, String).new
        values.each do |value|
          hash[value] = value
        end
      end
    end
  end
end

def benchmark_insert_ints(sizes)
  sizes.each do |size|
    values = Array.new(size) { rand(Int32) }

    Benchmark.ips do |x|
      x.report("old, insert (type: Int32, size: #{size})") do
        hash = OldHash(Int32, Int32).new
        values.each do |value|
          hash[value] = value
        end
      end
      x.report("new, insert (type: Int32, size: #{size})") do
        hash = Hash(Int32, Int32).new
        values.each do |value|
          hash[value] = value
        end
      end
    end
  end
end

def benchmark_read_strings(sizes)
  sizes.each do |size|
    values = Array.new(size) { Random::Secure.hex }

    old_hash = OldHash(String, String).new
    new_hash = Hash(String, String).new

    values.each do |value|
      old_hash[value] = value
      new_hash[value] = value
    end

    Benchmark.ips do |x|
      x.report("old, read (type: String, size: #{size})") do
        values.each do |value|
          old_hash[value]
        end
      end
      x.report("new, read (type: String, size: #{size})") do
        values.each do |value|
          new_hash[value]
        end
      end
    end
  end
end

def benchmark_read_ints(sizes)
  sizes.each do |size|
    values = Array.new(size) { rand(100_000) }

    old_hash = OldHash(Int32, Int32).new
    new_hash = Hash(Int32, Int32).new

    values.each do |value|
      old_hash[value] = value
      new_hash[value] = value
    end

    Benchmark.ips do |x|
      x.report("old, read (type: Int32, size: #{size})") do
        values.each do |value|
          old_hash[value]
        end
      end
      x.report("new, read (type: Int32, size: #{size})") do
        values.each do |value|
          new_hash[value]
        end
      end
    end
  end
end

sizes = [5, 10, 15, 20, 30, 50, 100, 200, 500, 1_000, 10_000, 100_000]

benchmark_create_empty()
puts
benchmark_insert_strings(sizes)
puts
benchmark_insert_ints(sizes)
puts
benchmark_read_strings(sizes)
puts
benchmark_read_ints(sizes)

Results:

old, create  18.50M ( 54.06ns) (± 1.82%)   160B/op   2.34× slower
new, create  43.32M ( 23.08ns) (± 0.88%)  64.0B/op        fastest

old, insert (type: String, size: 5)   3.60M (277.58ns) (± 7.01%)  480B/op   1.38× slower
new, insert (type: String, size: 5)   4.97M (201.28ns) (± 4.77%)  384B/op        fastest
old, insert (type: String, size: 10)   1.95M (513.74ns) (± 1.64%)  801B/op   1.33× slower
new, insert (type: String, size: 10)   2.59M (386.02ns) (± 3.01%)  832B/op        fastest
old, insert (type: String, size: 15)   1.27M (786.17ns) (± 3.56%)  1.09kB/op   1.55× slower
new, insert (type: String, size: 15)   1.98M (505.71ns) (± 4.48%)    832B/op        fastest
old, insert (type: String, size: 20) 922.03k (  1.08µs) (± 0.99%)  1.41kB/op   1.37× slower
new, insert (type: String, size: 20)   1.27M (788.95ns) (± 2.86%)  1.67kB/op        fastest
old, insert (type: String, size: 30) 639.38k (  1.56µs) (± 5.63%)  2.03kB/op   1.58× slower
new, insert (type: String, size: 30)   1.01M (992.38ns) (± 6.01%)  1.67kB/op        fastest
old, insert (type: String, size: 50) 355.49k (  2.81µs) (± 5.77%)  3.28kB/op   1.49× slower
new, insert (type: String, size: 50) 530.85k (  1.88µs) (± 3.95%)  3.81kB/op        fastest
old, insert (type: String, size: 100) 154.93k (  6.45µs) (± 4.04%)  7.06kB/op   1.81× slower
new, insert (type: String, size: 100) 280.47k (  3.57µs) (± 3.92%)  7.09kB/op        fastest
old, insert (type: String, size: 200)  71.71k ( 13.94µs) (± 3.27%)  13.3kB/op   1.91× slower
new, insert (type: String, size: 200) 136.71k (  7.31µs) (± 2.21%)  14.5kB/op        fastest
old, insert (type: String, size: 500)  23.46k ( 42.62µs) (± 3.53%)  36.1kB/op   2.46× slower
new, insert (type: String, size: 500)  57.82k ( 17.30µs) (± 4.78%)  28.4kB/op        fastest
old, insert (type: String, size: 1000)  12.78k ( 78.27µs) (± 1.72%)  67.4kB/op   2.20× slower
new, insert (type: String, size: 1000)  28.12k ( 35.56µs) (± 4.20%)  56.4kB/op        fastest
old, insert (type: String, size: 10000)   1.00k (997.78µs) (± 6.41%)  662kB/op   2.12× slower
new, insert (type: String, size: 10000)   2.13k (469.91µs) (± 5.39%)  896kB/op        fastest

old, insert (type: Int32, size: 5)   4.83M (207.19ns) (± 0.80%)  400B/op   1.78× slower
new, insert (type: Int32, size: 5)   8.60M (116.23ns) (± 1.76%)  240B/op        fastest
old, insert (type: Int32, size: 10)   2.78M (359.30ns) (± 0.91%)  640B/op   1.72× slower
new, insert (type: Int32, size: 10)   4.79M (208.73ns) (± 1.11%)  448B/op        fastest
old, insert (type: Int32, size: 15)   1.90M (525.28ns) (± 2.96%)  880B/op   1.75× slower
new, insert (type: Int32, size: 15)   3.33M (300.60ns) (± 5.98%)  448B/op        fastest
old, insert (type: Int32, size: 20)   1.30M (769.02ns) (± 6.52%)  1.09kB/op   1.54× slower
new, insert (type: Int32, size: 20)   2.00M (500.54ns) (± 4.76%)    976B/op        fastest
old, insert (type: Int32, size: 30) 972.78k (  1.03µs) (± 4.94%)  1.56kB/op   1.72× slower
new, insert (type: Int32, size: 30)   1.67M (597.26ns) (± 4.29%)    976B/op        fastest
old, insert (type: Int32, size: 50) 582.06k (  1.72µs) (± 5.23%)   2.5kB/op   1.50× slower
new, insert (type: Int32, size: 50) 872.34k (  1.15µs) (± 7.97%)  1.88kB/op        fastest
old, insert (type: Int32, size: 100) 234.48k (  4.26µs) (± 7.09%)   5.5kB/op   2.08× slower
new, insert (type: Int32, size: 100) 488.20k (  2.05µs) (± 6.79%)  4.14kB/op        fastest
old, insert (type: Int32, size: 200) 126.98k (  7.88µs) (± 4.74%)  10.2kB/op   1.84× slower
new, insert (type: Int32, size: 200) 233.78k (  4.28µs) (± 7.45%)  8.47kB/op        fastest
old, insert (type: Int32, size: 500)  40.86k ( 24.47µs) (± 4.14%)  28.3kB/op   2.55× slower
new, insert (type: Int32, size: 500) 104.34k (  9.58µs) (± 7.55%)  16.5kB/op        fastest
old, insert (type: Int32, size: 1000)  17.87k ( 55.97µs) (± 3.24%)  51.8kB/op   2.57× slower
new, insert (type: Int32, size: 1000)  45.98k ( 21.75µs) (± 7.72%)  32.5kB/op        fastest
old, insert (type: Int32, size: 10000)   1.52k (658.91µs) (± 3.88%)  506kB/op   2.16× slower
new, insert (type: Int32, size: 10000)   3.28k (305.04µs) (± 4.09%)  513kB/op        fastest

old, read (type: String, size: 5)  11.64M ( 85.90ns) (± 6.51%)  0.0B/op   2.25× slower
new, read (type: String, size: 5)  26.20M ( 38.16ns) (± 2.98%)  0.0B/op        fastest
old, read (type: String, size: 10)   5.69M (175.80ns) (± 4.64%)  0.0B/op   1.39× slower
new, read (type: String, size: 10)   7.93M (126.07ns) (± 6.97%)  0.0B/op        fastest
old, read (type: String, size: 15)   4.15M (240.69ns) (± 2.01%)  0.0B/op   1.16× slower
new, read (type: String, size: 15)   4.82M (207.47ns) (± 3.30%)  0.0B/op        fastest
old, read (type: String, size: 20)   2.93M (341.71ns) (± 1.47%)  0.0B/op   1.52× slower
new, read (type: String, size: 20)   4.46M (224.21ns) (± 1.25%)  0.0B/op        fastest
old, read (type: String, size: 30)   1.87M (534.24ns) (± 2.90%)  0.0B/op   1.45× slower
new, read (type: String, size: 30)   2.71M (369.30ns) (± 1.90%)  0.0B/op        fastest
old, read (type: String, size: 50) 992.08k (  1.01µs) (± 4.51%)  0.0B/op   1.71× slower
new, read (type: String, size: 50)   1.70M (589.77ns) (± 1.65%)  0.0B/op        fastest
old, read (type: String, size: 100) 596.10k (  1.68µs) (± 1.60%)  0.0B/op   1.40× slower
new, read (type: String, size: 100) 832.91k (  1.20µs) (± 1.82%)  0.0B/op        fastest
old, read (type: String, size: 200) 260.30k (  3.84µs) (± 1.63%)  0.0B/op   1.53× slower
new, read (type: String, size: 200) 398.84k (  2.51µs) (± 1.84%)  0.0B/op        fastest
old, read (type: String, size: 500) 111.03k (  9.01µs) (± 2.89%)  0.0B/op   1.40× slower
new, read (type: String, size: 500) 155.71k (  6.42µs) (± 3.33%)  0.0B/op        fastest
old, read (type: String, size: 1000)  41.59k ( 24.04µs) (± 3.07%)  0.0B/op   1.71× slower
new, read (type: String, size: 1000)  71.27k ( 14.03µs) (± 3.81%)  0.0B/op        fastest
old, read (type: String, size: 10000)   2.66k (376.32µs) (± 4.55%)  0.0B/op   2.46× slower
new, read (type: String, size: 10000)   6.54k (152.95µs) (± 4.03%)  0.0B/op        fastest

old, read (type: Int32, size: 5)  23.67M ( 42.24ns) (± 1.64%)  0.0B/op   2.30× slower
new, read (type: Int32, size: 5)  54.35M ( 18.40ns) (± 3.14%)  0.0B/op        fastest
old, read (type: Int32, size: 10)  11.66M ( 85.77ns) (± 4.63%)  0.0B/op   1.85× slower
new, read (type: Int32, size: 10)  21.58M ( 46.33ns) (± 3.20%)  0.0B/op        fastest
old, read (type: Int32, size: 15)   7.81M (128.00ns) (± 4.15%)  0.0B/op   1.49× slower
new, read (type: Int32, size: 15)  11.63M ( 86.00ns) (± 3.99%)  0.0B/op        fastest
old, read (type: Int32, size: 20)   5.80M (172.48ns) (± 5.34%)  0.0B/op   1.73× slower
new, read (type: Int32, size: 20)  10.02M ( 99.77ns) (± 4.54%)  0.0B/op        fastest
old, read (type: Int32, size: 30)   3.67M (272.51ns) (± 2.99%)  0.0B/op   1.84× slower
new, read (type: Int32, size: 30)   6.74M (148.30ns) (± 1.97%)  0.0B/op        fastest
old, read (type: Int32, size: 50)   2.05M (488.04ns) (± 5.57%)  0.0B/op   1.89× slower
new, read (type: Int32, size: 50)   3.87M (258.41ns) (± 4.24%)  0.0B/op        fastest
old, read (type: Int32, size: 100)   1.09M (921.59ns) (± 8.72%)  0.0B/op   1.70× slower
new, read (type: Int32, size: 100)   1.84M (542.80ns) (± 5.52%)  0.0B/op        fastest
old, read (type: Int32, size: 200) 535.83k (  1.87µs) (± 5.84%)  0.0B/op   1.66× slower
new, read (type: Int32, size: 200) 891.31k (  1.12µs) (± 5.49%)  0.0B/op        fastest
old, read (type: Int32, size: 500) 236.68k (  4.23µs) (± 3.61%)  0.0B/op   1.52× slower
new, read (type: Int32, size: 500) 360.85k (  2.77µs) (± 4.31%)  0.0B/op        fastest
old, read (type: Int32, size: 1000) 106.13k (  9.42µs) (± 4.88%)  0.0B/op   1.66× slower
new, read (type: Int32, size: 1000) 175.92k (  5.68µs) (± 4.08%)  0.0B/op        fastest
old, read (type: Int32, size: 10000)   4.60k (217.62µs) (± 2.94%)  0.0B/op   3.00× slower
new, read (type: Int32, size: 10000)  13.80k ( 72.47µs) (± 1.67%)  0.0B/op        fastest

As you can see the new implementation is always faster than the old one. Sometimes more memory is used, sometimes less.

I also ran some @kostya benchmarks that used Hash in their implementation. Here are the results:

Havlak:

old: 12.49s, 375.1Mb
new: 7.58s, 215.7Mb

Havlak seems to be a benchmark measuring how well a language performs in general algorithmic tasks... the new results look good! 😊

Brainfuck:

old: 5.20s, 1.8Mb
new: 4.22s, 1.8Mb

JSON (when using JSON.parse):

old: 2.20s, 1137.0Mb
new: 2.07s, 961.3Mb

Knucleotide:

old: 1.63s, 26.5Mb
new: 1.01s, 32.4Mb

Then some more benchmarks...

There's HTTP::Request#from_io which I recently optimized:

old: from_io 498.47k (  2.01µs) (± 0.89%)  816B/op  fastest
new: from_io 549.22k (  1.82µs) (± 1.89%)  720B/op  fastest

(using wrk against the sample http server increases the requests/sec from 118355.73 to about 122000)

Also, out of curiosity, I created a Hash with 1_000_000 elements to see how the new implementation compares to Ruby and to the old Hash.

Code for creating a Hash with 1_000_000 elements in Ruby and Crystal
size = 1_000_000

h = {0 => 0}

time = Time.now
size.times do |i|
  h[i] = i
end
puts "Insert: #{Time.now - time}"

time = Time.now
size.times do |i|
  h[i]
end
puts "Read: #{Time.now - time}"

Results:

Ruby 2.7-dev:
  Insert: 0.151813
  Read:   0.13749
Crystal old:
  Insert: 0.238662
  Read:   0.129462
Crystal new:
  Insert: 0.070804
  Read:   0.041008

Ruby was faster than Crystal! Ruby is simply amazing ❤️. But now Crystal is even faster!

The compiler uses Hash all over the place!

So compile times now should go down! Right? ... Right?

Well, unfortunately no. I think the main reason is that the times are bound by the number of method instances, not by the performance of the many hashes used.

Memory consumption did seem to go down a bit.

When will I have this?

We could push this to 0.31.0. Or... we could have it in 0.30.0 if we delay the release a bit more (note: I don't manage releases, but if the community doesn't mind waiting a bit to get a huge performance boost in their apps then I think we could relax the release date).

In 0.31.0.

Final thoughts

  1. Thank you Ruby! ❤️ ❤️ ❤️
  2. Thank you Vladimir Makarov! ❤️ ❤️ ❤️
  3. Algorithms are cool!
  4. There's an optimization in the original Ruby algorithm which uses bitmasks instead of remainder or % to fit a number inside a range... I did that as almost the last thing because I didn't believe it would improve performance a lot... and it doubled the performance! 😮
  5. Feel free to target this PR and benchmark your code and post any interesting speedups here!
  6. I hope CI 32 bits passes! 😊
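
The bitmask trick from point 4 can be sketched as follows, in Ruby. It only works when the table size is a power of two; the helper names and the `size_pow2` parameter are hypothetical, chosen for illustration rather than copied from `src/hash.cr`.

```ruby
# Hypothetical helpers: both map a hash code to a slot in a table whose
# size is a power of two (size == 1 << size_pow2).
def fit_with_remainder(hash_code, size_pow2)
  hash_code % (1 << size_pow2)
end

def fit_with_mask(hash_code, size_pow2)
  # (1 << size_pow2) - 1 has the low size_pow2 bits set, so the AND keeps
  # exactly the bits that the remainder would keep.
  hash_code & ((1 << size_pow2) - 1)
end

size_pow2 = 4 # a table of 16 slots
[0, 5, 16, 37, 1_000_003].each do |hash_code|
  puts "#{hash_code}: #{fit_with_remainder(hash_code, size_pow2)} == #{fit_with_mask(hash_code, size_pow2)}"
end
```

Both helpers return the same slot for every non-negative hash code; the mask version simply avoids a division.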

Contributor

@j8r j8r left a comment

Using #as will provide more type information in case of errors compared to #not_nil!

This improves its performance, both in time and memory.
Contributor

@yxhuvud yxhuvud left a comment

I seem to recall @funny-falcon ran into some issues with the GC finding false positives while he was working on this. That is not something you have seen?

@lribeiro

This gives a nice speedup on our internal app for log processing!

@bcardiff will this be considered for 0.30?

Time in seconds

test      0.29   0.30dev  speedup
forti     7.353  6.722     9.4%
forti_kv  5.462  4.867    12.2%
dns_grt   4.714  4.466     5.6%

Details

~/Work/crystal-1/bin/crystal version

Using compiled compiler at `.build/crystal'
Crystal 0.30.0-dev [ec2a26f9a] (2019-07-31)

LLVM: 8.0.0
Default target: x86_64-apple-macosx
~/Work/crystal-1/bin/crystal build --release -o bin/log_clean src/log_clean.cr 
Using compiled compiler at `.build/crystal'
bin/log_clean -c test/forti/forti.yml >    /dev/null  6.83s user 1.17s system 119% cpu 6.722 total
bin/log_clean -c test/forti/forti_kv.yml > /dev/null  5.11s user 2.02s system 146% cpu 4.867 total
bin/log_clean -c config/dns_grt.yml >      /dev/null  4.50s user 0.74s system 117% cpu 4.466 total

crystal version

Crystal 0.29.0 (2019-06-06)

LLVM: 6.0.1
Default target: x86_64-apple-macosx
bin/log_clean -c test/forti/forti.yml >    /dev/null  7.49s user 1.26s system 119% cpu 7.353 total
bin/log_clean -c test/forti/forti_kv.yml > /dev/null  5.80s user 2.08s system 144% cpu 5.462 total
bin/log_clean -c config/dns_grt.yml >      /dev/null  4.78s user 0.79s system 118% cpu 4.714 total

@yxhuvud
Contributor

yxhuvud commented Jul 31, 2019

I wish GitHub had properly threaded review comments. This is a comment to my own review comment: I note that you get lots of repeated allocation warnings and that one CI job fails due to out of memory.


# Computes the next index in `@indices`, needed when an index is not empty.
private def next_index(index : Int32) : Int32
  fit_in_indices(index + 1)
Contributor

@funny-falcon funny-falcon Jul 31, 2019

"linear probing" is a bad idea.
Use "quadratic probing" instead. It has the same good cache locality on the first three probes as linear probing.

Member Author

In Ruby they tried quadratic probing and it turned out to be a bit slower. They use linear probing. That's why I also chose that, and it's simpler.

Ref: https://github.com/ruby/ruby/blob/c94cc6d968a7241a487591a9753b171d8081d335/st.c#L869-L872

But if you want we can try it. I know nothing about quadratic probing; is it just index, index + 1, index + 4, index + 9, index + 16, etc.?
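
For reference, the probe sequences under discussion can be compared with a small Ruby sketch. The method names are hypothetical, and which quadratic variant the reviewer had in mind is not stated in the thread; the squares are the sequence asked about above, and triangular offsets are another common variant.

```ruby
# Each method returns the slots visited by `size` probes starting at `start`.
def linear_probes(start, size)
  (0...size).map { |i| (start + i) % size }
end

# Offsets 0, 1, 4, 9, 16, ... as asked in the comment above.
def square_probes(start, size)
  (0...size).map { |i| (start + i * i) % size }
end

# Offsets 0, 1, 3, 6, 10, ... (triangular numbers), another common variant.
def triangular_probes(start, size)
  (0...size).map { |i| (start + i * (i + 1) / 2) % size }
end

p linear_probes(3, 8)
p square_probes(3, 8)
p triangular_probes(3, 8)
```

With a power-of-two table size, the triangular offsets visit every slot exactly once, while plain squares revisit some slots and never reach others, so the choice of offsets matters for termination as well as for speed.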

Contributor

@konovod konovod Jul 31, 2019

According to the ref they aren't using linear probing now, they are using Double Hashing (see secondary hash in comments).

Member Author

It's true! However I don't understand how that works:

 Still other bits of the hash value are used when the mapping
 results in a collision.  In this case we use a secondary hash
 value which is a result of a function of the collision bin
 index and the original hash value.  The function choice
 guarantees that we can traverse all bins and finally find the
 corresponding bin as after several iterations the function
 becomes a full cycle linear congruential generator because it
 satisfies requirements of the Hull-Dobell theorem.

I don't know about that theorem so it's harder for me to implement something I don't understand. But I'll definitely keep that in mind for further optimizing this (it'll take me some more time, but PRs are welcome!).
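
For readers in the same position: the Hull-Dobell theorem says that the recurrence x → (a·x + c) mod m visits every value in 0...m before repeating exactly when c and m are coprime, a − 1 is divisible by every prime factor of m, and a − 1 is divisible by 4 whenever m is. With m a power of two, a = 5 and c = 1 satisfy all three conditions. The Ruby sketch below checks that property; the constants and the method name are assumptions picked for illustration, and the sketch leaves out the mixing-in of extra hash bits that the quoted comment describes.

```ruby
# Illustrative recurrence: next = (5 * index + 1) mod bins, with bins a power of two.
# Returns the order in which the bins are visited, starting from `start`.
def lcg_probe_order(start, bins)
  order = []
  index = start
  bins.times do
    order << index
    index = (5 * index + 1) % bins
  end
  order
end

p lcg_probe_order(0, 16)
```

Because every bin shows up exactly once in the returned order, a probe loop built on such a recurrence is guaranteed to reach an empty bin as long as one exists.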

@straight-shoota
Member

The OOM failure on test_linux32 is pretty regular and unrelated to this PR.

# If we have less than 8 elements we avoid computing the hash
# code and directly compare the keys (might be cheaper than
# computing a hash code of a complex structure).
if entries_size <= 8
Contributor

I doubt it is useful. I could be mistaken.
Was it benchmarked?

Member Author

Yes. I just benchmarked it again because I introduced that optimization some time ago, before introducing more optimizations and changing other things.

I ran the same benchmarks as in my main post but only for String. The insert benchmarks show no difference. For read it gives this (old is without the optimization, new is with it):

old, read (type: String, size: 1)  70.65M ( 14.15ns) (± 2.49%)  0.0B/op   2.68× slower
new, read (type: String, size: 1) 189.05M (  5.29ns) (± 3.89%)  0.0B/op        fastest
old, read (type: String, size: 2)  39.07M ( 25.59ns) (± 2.05%)  0.0B/op   2.26× slower
new, read (type: String, size: 2)  88.16M ( 11.34ns) (± 4.01%)  0.0B/op        fastest
old, read (type: String, size: 3)  26.56M ( 37.65ns) (± 1.84%)  0.0B/op   2.00× slower
new, read (type: String, size: 3)  53.24M ( 18.78ns) (± 1.89%)  0.0B/op        fastest
old, read (type: String, size: 4)  19.98M ( 50.06ns) (± 2.17%)  0.0B/op   1.72× slower
new, read (type: String, size: 4)  34.35M ( 29.11ns) (± 3.70%)  0.0B/op        fastest
old, read (type: String, size: 5)  16.00M ( 62.49ns) (± 1.61%)  0.0B/op   1.49× slower
new, read (type: String, size: 5)  23.78M ( 42.05ns) (± 2.05%)  0.0B/op        fastest
old, read (type: String, size: 6)  13.11M ( 76.25ns) (± 1.49%)  0.0B/op   1.39× slower
new, read (type: String, size: 6)  18.17M ( 55.02ns) (± 2.68%)  0.0B/op        fastest
old, read (type: String, size: 7)  11.21M ( 89.24ns) (± 1.37%)  0.0B/op   1.22× slower
new, read (type: String, size: 7)  13.71M ( 72.93ns) (± 3.94%)  0.0B/op        fastest
old, read (type: String, size: 8)   9.57M (104.47ns) (± 3.21%)  0.0B/op   1.16× slower
new, read (type: String, size: 8)  11.14M ( 89.77ns) (± 3.77%)  0.0B/op        fastest

So you can see it's a big optimization. You can also see that the advantage shrinks as the number of elements grows: comparing the hash first becomes relatively faster and faster, and starting from 9 elements the two are almost the same.

I wonder if this optimization could also be applied in the Ruby code... 🤔

I might give it a try later (in Ruby), now I'm curious!

Member Author

Actually, I think if the key type is more complex and == takes more time then it will be slower. So maybe it should only be applied in a few cases... maybe just do it if the key is a String, because we know it helps there.

Member Author

I just tried it with keys being Array(String) and the differences are actually much bigger, meaning that it's faster to avoid checking the hash code altogether for some reason... probably because to compute the hash code you need to consider the entire structure, but to compare elements you bail out as soon as you find a difference.
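
The idea discussed in this thread can be sketched in Ruby as follows. The threshold of 8 comes from the snippet under review (`entries_size <= 8`); the names `lookup`, `build_buckets` and `LINEAR_SCAN_LIMIT`, and the use of a Ruby Hash as a stand-in for the real index, are assumptions made for illustration.

```ruby
LINEAR_SCAN_LIMIT = 8 # threshold taken from the snippet under review

# entries: array of [key, value] pairs in insertion order.
# buckets: hash code => positions in `entries`; a stand-in for a real hashed index.
def lookup(entries, buckets, key)
  if entries.size <= LINEAR_SCAN_LIMIT
    # Small table: compare keys directly and never call key.hash.
    pair = entries.find { |candidate, _value| candidate == key }
  else
    positions = buckets.fetch(key.hash, [])
    position = positions.find { |i| entries[i][0] == key }
    pair = position && entries[position]
  end
  pair && pair[1]
end

# Builds the stand-in index used by the large-table path.
def build_buckets(entries)
  buckets = Hash.new { |hash, code| hash[code] = [] }
  entries.each_with_index { |(key, _value), i| buckets[key.hash] << i }
  buckets
end

small = [["a", 1], ["b", 2]]
p lookup(small, {}, "b") # the empty index is never consulted

large = (1..20).map { |i| ["k#{i}", i] }
p lookup(large, build_buckets(large), "k17")
```

On the small path an equality check can stop at the first differing element, whereas computing a hash code has to walk the whole key, which matches the Array(String) observation above.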

@yxhuvud
Contributor

yxhuvud commented Jul 31, 2019

@straight-shoota Are all the warnings about large allocations also old?

src/hash.cr Outdated
end
end

private module BaseIterator
def initialize(@hash, @current)
def initialize(@hash)
@index = 0
Contributor

I'd rather track the index of the first element (and update it in delete_entry_and_update_counts).
If one calls hash.shift in a loop, they will suffer from O(n^2) behavior without that (queue or LRU use cases).

If @indices_size_pow2 is changed to UInt8 and placed near @indices_bytesize, then the addition of @first : UInt32 will not change the size of the hash structure.
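
The cost difference described here can be illustrated with a Ruby sketch. It models the entries as an array in which deleted slots become nil; the class name `EntryList`, the `track_first` switch and the `scanned` counter are hypothetical, and only the `@first` idea comes from the comment above.

```ruby
# Entries live in insertion order; a deleted entry leaves a nil behind.
# Assumption: stored entries are never nil or false.
class EntryList
  attr_reader :scanned

  def initialize(items, track_first:)
    @entries = items.dup
    @first = 0            # index of the first slot that may still hold an entry
    @track_first = track_first
    @scanned = 0          # counts how many slots shift has inspected in total
  end

  # Removes and returns the oldest entry, or nil when none is left.
  def shift
    index = @track_first ? @first : 0
    while index < @entries.size
      @scanned += 1
      entry = @entries[index]
      if entry
        @entries[index] = nil
        @first = index + 1 if @track_first
        return entry
      end
      index += 1
    end
    nil
  end
end

n = 1_000
[true, false].each do |track_first|
  list = EntryList.new((0...n).to_a, track_first: track_first)
  n.times { list.shift }
  puts "track_first=#{track_first}: #{list.scanned} slots scanned"
end
```

Without the tracked index, every call re-walks the growing run of deleted slots at the front, so draining n entries inspects n(n + 1)/2 slots instead of n.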

Member Author

I'm sorry but I don't understand what you mean.

Member Author

Ah, yes, I see what you mean. I think we should either remove Hash#shift or implement your suggestion. We copied Hash#shift from Ruby but I don't know what a real use for it is.

Member Author

Done in 01e072c !

@funny-falcon I'll be happy if you can review the commit. The change turned out to be pretty simple thanks to the abstractions already in place.

@funny-falcon
Contributor

Rather clean code! I liked to read it.

@asterite
Member Author

@funny-falcon

Rather clean code! I liked to read it.

Thank you! 😊 And thank you for the thorough review, you found many good things to fix.

@asterite
Member Author

@yxhuvud

This is a comment to my own review comment: I note that you get lots of repeated allocation warnings and that one CI job fails due to out of memory.

Yes, but I think the errors were there before. I think Hash might require a bit more memory because now it always doubles its size when a resize is needed. Also we keep adding tests to the spec suite and right now the compiler specs leak memory. I'll open another issue to track this problem. The solution for now is to just run specs on CI in smaller chunks instead of in two big chunks. This is a problem that only affects CI; regular machines usually have a lot of memory and don't need to compile Crystal's entire spec suite.

Previously we didn't try to compact small hashes when a resize was
needed. However we can do that if we have many deleted elements
to avoid a reallocation. This should improve performance when shifting
or deleting elements and then inserting more elements.
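
A rough Ruby sketch of the decision this commit message describes: when the entries buffer is full, reuse it if enough slots are deleted, and only grow otherwise. The method name `make_room` and the "at least half deleted" threshold are assumptions for illustration; the commit message does not state the actual condition.

```ruby
# entries: array where deleted entries are nil; capacity: allocated slot count.
# Returns [entries, capacity] with room for at least one more entry.
def make_room(entries, capacity)
  return [entries, capacity] if entries.size < capacity

  live = entries.compact # drop the deleted (nil) slots
  if live.size <= capacity / 2
    # Assumed threshold: enough slots were freed, so keep the same capacity.
    [live, capacity]
  else
    # Too few deletions to be worth it: double the capacity, as a resize would.
    [live, capacity * 2]
  end
end

p make_room([1, nil, nil, 2], 4)
p make_room([1, 2, 3, nil], 4)
```

The first call keeps the capacity at 4 because half the slots were deleted; the second call doubles it to 8.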
@RX14
Contributor

RX14 commented Aug 1, 2019

@funny-falcon thank you! The above articles on RH hashing weren't entirely clear on this.

Memory overhead of Hash is important too, and I'd be interested in a graph of the allocation size of @indices and @entries in bytes for varying Hash(Reference, Reference) sizes. I suspect that the load factor in @indices is fairly insignificant compared to the size of @entries.

@asterite
Member Author

asterite commented Aug 1, 2019

@funny-falcon Thank you! I implemented all your suggestions. Let me know if I got something wrong 😊

Although I think that many things can be still optimized (representation of things, the probing, the load factor, etc.) I think this is already a good improvement and we can continue refining things after this PR.

Contributor

@RX14 RX14 left a comment

Absolutely wonderful, thank you so much for your hard work on this, and all you do for Crystal, Ary.

@asterite
Member Author

asterite commented Aug 3, 2019

Could someone explain to me, with some ASCII art or graphics, how Robin Hood hashing works?

When a collision occur, compare the two items’ probing count, the one with larger probing number stays and the other continue to probe. Repeat until the probing item finds an empty spot.

I can't understand this. Or... I can understand what it says but not what it means.

Let's suppose we start with empty buckets:

[_, _, _, _, _, _] 

Now we want to insert 10, which hashes to bucket 1:

[_, 10, _, _, _, _] 

Say we want to insert 20, which hashes to bucket 1 too... What do we do? The probe count for 10 is 0 (I guess?) and for 20 we don't know yet but it's definitely higher than 0 because it's occupied. So we swap them...?

[_, 20, 10, _, _, _] 

Next comes 30, again with bucket 1. So 20 has probe count 0, 10 has probe count 1, and 30 will have probe count 2... so we swap everything again?

[_, 30, 20, 10, _, _] 

I can't see the benefit of this, but I'm sure there's something I'm not understanding.

@konovod
Contributor

konovod commented Aug 4, 2019

In Robin Hood hashing, we compare the probe count of the element already in the slot with the current probe count of the element being inserted.
So in your example nothing will be shifted (the current probe count will equal the existing element's probe count at every step). The hash table will be
[_, 10, 20, 30, _, _]
Now if we insert element 21, which hashes to 2, we compare it to 20 (0 vs 1) and 30 (1 vs 2) and still insert it at 4:
[_, 10, 20, 30, 21, _]
But now if we insert 40, which hashes to 1:
it will skip 10, 20, 30 as they have the same probe count, but will steal the place from 21, because 40 will have probe count 3 there and 21 has just 2:
[_, 10, 20, 30, 40, _]
and then we continue searching for a place for 21, finding it at the next position:
[_, 10, 20, 30, 40, 21]
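
The walk-through above can be replayed with a short Ruby sketch. The method name `robin_hood_insert` and the `home` map (which stands in for a hash function by listing each key's home bucket) are hypothetical; the sketch assumes keys are never nil and handles neither deletion nor resizing.

```ruby
# Inserts `key` into `slots` using Robin Hood probing.
# home: key => home bucket, standing in for a real hash function.
def robin_hood_insert(slots, home, key)
  raise "table is full" unless slots.include?(nil)

  size = slots.size
  index = home.fetch(key)
  distance = 0 # how far the element being placed is from its home bucket
  loop do
    occupant = slots[index]
    if occupant.nil?
      slots[index] = key
      return slots
    end
    occupant_distance = (index - home.fetch(occupant)) % size
    if occupant_distance < distance
      # The occupant is closer to home than we are: take its slot and
      # carry on probing on behalf of the displaced occupant.
      slots[index] = key
      key = occupant
      distance = occupant_distance
    end
    index = (index + 1) % size
    distance += 1
  end
end

home = { 10 => 1, 20 => 1, 30 => 1, 21 => 2, 40 => 1 }
slots = Array.new(6)
[10, 20, 30, 21, 40].each { |key| robin_hood_insert(slots, home, key) }
p slots
```

Running it ends with `[nil, 10, 20, 30, 40, 21]`, the same final table as in the explanation: 40 displaces 21 at index 4, and 21 moves on to index 5.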

@asterite
Member Author

asterite commented Aug 4, 2019

@konovod Thanks, I understand now!

@asterite asterite merged commit 3d8609b into crystal-lang:master Aug 4, 2019
@asterite asterite deleted the open-addressing-hash branch August 4, 2019 16:07
@asterite
Member Author

asterite commented Aug 4, 2019

Thank you everyone for the suggestions!

If you find optimizations please send PRs and benchmarks.

@asterite
Member Author

asterite commented Aug 4, 2019

Some optimization ideas:

  • Robin Hood hashing
  • Double hashing
  • Quadratic probing
  • Reduce the allocated size of @entries (it could be possible to allocate less than @indices_size / 2)
  • Other ideas...?

@j8r
Contributor

j8r commented Aug 4, 2019

Those are good points @asterite; it would be better to create an issue to track them. It could even be an RFC.

@asterite
Member Author

asterite commented Aug 4, 2019

No. If someone finds an optimization please send a PR with benchmark. I don't plan to change anything else otherwise, and there's no need to track anything.

@asterite asterite added this to the 0.31.0 milestone Aug 4, 2019
@funny-falcon
Contributor

funny-falcon commented Aug 5, 2019 via email

@j8r
Contributor

j8r commented Aug 5, 2019

I think it would be better to have proper RFCs about the design decisions, instead of PRs, just a thought.
Of course that's up to the core members to organize their project as they wish, and what they consider the best.

straight-shoota pushed a commit that referenced this pull request Jan 7, 2020
The default initial_capacity was changed to 8 in #8017
dnamsons pushed a commit to dnamsons/crystal that referenced this pull request Jan 10, 2020
This improves its performance, both in time and memory.
Development

Successfully merging this pull request may close these issues.

Reimplementation a Hash using open addressing
10 participants