Hash now uses an open addressing algorithm #8017
Conversation
Using `#as` will provide more type information in case of errors compared to `#not_nil!`.
This improves its performance, both in time and memory.
(force-pushed from 368e083 to d4c35de)
I seem to recall @funny-falcon ran into some issues with the GC finding false positives while he was working on this. That is not something you have seen?
This gives a nice speedup on our internal app for log processing! @bcardiff will this be considered for 0.30? Time in seconds:

Details:
```
Using compiled compiler at `.build/crystal'
Crystal 0.30.0-dev [ec2a26f9a] (2019-07-31)
LLVM: 8.0.0
Default target: x86_64-apple-macosx

Crystal 0.29.0 (2019-06-06)
LLVM: 6.0.1
Default target: x86_64-apple-macosx
```
I wish GitHub had properly threaded review comments. This is a comment on my own review comment: I note that you get lots of repeated allocation warnings and one CI failure due to out of memory.
```crystal
# Computes the next index in `@indices`, needed when an index is not empty.
private def next_index(index : Int32) : Int32
  fit_in_indices(index + 1)
end
```
"linear probing" is a bad idea.
Use "quadratic probing" instead. It has same good cache locality on first three probes as for linear probing.
In Ruby they tried quadratic probing and it turned out to be a bit slower, so they use linear probing. That's why I also chose it, and it's simpler.
Ref: https://github.com/ruby/ruby/blob/c94cc6d968a7241a487591a9753b171d8081d335/st.c#L869-L872
But if you want we can try it. I know nothing about quadratic probing. Is it just `index`, `index + 1`, `index + 4`, `index + 9`, `index + 16`, etc.?
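For what it's worth, a sketch of what quadratic (triangular) probing could look like on top of the PR's `fit_in_indices` (the `probe` parameter is an assumption, not existing code):

```crystal
# Sketch of quadratic probing via triangular numbers: offsets from the
# home bucket grow as 1, 3, 6, 10, ... With a power-of-two table size
# this sequence still visits every slot, and the first few probes stay
# in the same cache lines as linear probing would.
private def next_index(index : Int32, probe : Int32) : Int32
  fit_in_indices(index + probe)
end

# The caller would increment `probe` by one on each collision:
#   probe += 1
#   index = next_index(index, probe)
```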
According to the ref they aren't using linear probing now; they are using double hashing (see "secondary hash" in the comments).
It's true! However I don't understand how that works:
> Still other bits of the hash value are used when the mapping
> results in a collision. In this case we use a secondary hash
> value which is a result of a function of the collision bin
> index and the original hash value. The function choice
> guarantees that we can traverse all bins and finally find the
> corresponding bin as after several iterations the function
> becomes a full cycle linear congruential generator because it
> satisfies requirements of the Hull-Dobell theorem.
I don't know that theorem, so it's harder for me to implement something I don't understand. But I'll definitely keep it in mind for further optimizing this (it'll take me some more time, but PRs are welcome!).
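For the record, a sketch of the secondary-hash recurrence as I read it in Ruby's st.c (the Crystal names here are hypothetical): the probe sequence is `ind = 5 * ind + perturb + 1 (mod 2^k)`, where `perturb` starts as the full hash value and is shifted away over time.

```crystal
# Hypothetical Crystal rendering of Ruby st.c's secondary hash.
# Mixing in `perturb` spreads the first probes using extra hash bits;
# once `perturb` reaches 0 the recurrence degenerates to
# ind = 5 * ind + 1 (mod 2^k), which the Hull-Dobell theorem makes a
# full-cycle LCG (increment 1 is odd, multiplier - 1 = 4 is divisible
# by 4), so every bin is eventually visited.
private def secondary_hash(index : Int32, perturb : UInt64) : {Int32, UInt64}
  perturb >>= 11
  index = fit_in_indices((index &* 5) &+ perturb.to_i32! &+ 1)
  {index, perturb}
end
```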
The OOM failure on test_linux32 is pretty regular and unrelated to this PR.
```crystal
# If we have less than 8 elements we avoid computing the hash
# code and directly compare the keys (might be cheaper than
# computing a hash code of a complex structure).
if entries_size <= 8
```
I doubt it is useful. I could be mistaken.
Was it benchmarked?
Yes. I just benchmarked it again because I introduced that optimization some time ago, before introducing more optimizations and changing other things.
I ran the same benchmarks as in my main post but only for String. The benchmarks for insert don't show a difference. For read it gives this (old is without the optimization, new is with it):
```
old, read (type: String, size: 1)  70.65M ( 14.15ns) (± 2.49%)  0.0B/op  2.68× slower
new, read (type: String, size: 1) 189.05M (  5.29ns) (± 3.89%)  0.0B/op  fastest
old, read (type: String, size: 2)  39.07M ( 25.59ns) (± 2.05%)  0.0B/op  2.26× slower
new, read (type: String, size: 2)  88.16M ( 11.34ns) (± 4.01%)  0.0B/op  fastest
old, read (type: String, size: 3)  26.56M ( 37.65ns) (± 1.84%)  0.0B/op  2.00× slower
new, read (type: String, size: 3)  53.24M ( 18.78ns) (± 1.89%)  0.0B/op  fastest
old, read (type: String, size: 4)  19.98M ( 50.06ns) (± 2.17%)  0.0B/op  1.72× slower
new, read (type: String, size: 4)  34.35M ( 29.11ns) (± 3.70%)  0.0B/op  fastest
old, read (type: String, size: 5)  16.00M ( 62.49ns) (± 1.61%)  0.0B/op  1.49× slower
new, read (type: String, size: 5)  23.78M ( 42.05ns) (± 2.05%)  0.0B/op  fastest
old, read (type: String, size: 6)  13.11M ( 76.25ns) (± 1.49%)  0.0B/op  1.39× slower
new, read (type: String, size: 6)  18.17M ( 55.02ns) (± 2.68%)  0.0B/op  fastest
old, read (type: String, size: 7)  11.21M ( 89.24ns) (± 1.37%)  0.0B/op  1.22× slower
new, read (type: String, size: 7)  13.71M ( 72.93ns) (± 3.94%)  0.0B/op  fastest
old, read (type: String, size: 8)   9.57M (104.47ns) (± 3.21%)  0.0B/op  1.16× slower
new, read (type: String, size: 8)  11.14M ( 89.77ns) (± 3.77%)  0.0B/op  fastest
```
So you can see it's a big optimization. You can also see that comparing the hash code first catches up as there are more and more elements: starting from 9 elements it's almost the same.
I wonder if this optimization could also be applied in the Ruby code... 🤔
I might give it a try later (in Ruby), now I'm curious!
Actually, I think if the key type is more complex and `==` takes more time then it will be slower. So maybe it should only be applied in a few cases... maybe just do it if the key is a String, because we know it's true there.
I just tried it with keys being `Array(String)` and the differences are actually much bigger, meaning that it's faster to avoid checking the hash code altogether for some reason... probably because to compute the hash code you need to consider the entire structure, but to compare elements you bail out as soon as you find a difference.
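A minimal sketch of the shape of this fast path (`each_entry`, `Entry`, and the method name are assumptions, not the PR's exact API):

```crystal
# Sketch of the small-hash fast path: skip hashing the key entirely and
# linearly compare against each live entry. `==` can bail out at the
# first difference, while a hash code must consume the whole key.
private def find_entry_linear(key) : Entry(K, V)?
  each_entry do |entry|
    return entry if entry.key == key
  end
  nil
end
```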
@straight-shoota Are all the warnings about large allocations also old?
src/hash.cr (Outdated)

```diff
   end
 end

 private module BaseIterator
-  def initialize(@hash, @current)
+  def initialize(@hash)
+    @index = 0
```
I'd rather track the index of the first element (and update it in `delete_entry_and_update_counts`).
If one calls `hash.shift` in a loop, they will suffer from O(n^2) behavior without that (queue or LRU use cases).
If `@indices_size_pow2` is changed to `UInt8` and placed near `@indices_bytesize`, then adding `@first : UInt32` will not change the size of the hash structure.
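A sketch of the suggestion (assumed shapes; `entries_size` and `deleted?` are stand-ins for whatever the PR actually exposes):

```crystal
# Sketch: `@first` remembers the index of the first live entry, so a
# `shift` loop skips already-deleted slots instead of rescanning from 0,
# making repeated shifts amortized O(1) instead of O(n) each.
private def first_entry?
  @first.upto(entries_size - 1) do |i|
    entry = @entries[i]
    unless entry.deleted? # stand-in for the PR's tombstone check
      @first = i          # remember where we found it for next time
      return entry
    end
  end
  nil
end
```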
I'm sorry but I don't understand what you mean.
Ah, yes, I see what you mean. I think we should either remove `Hash#shift` or implement your suggestion. We copied `Hash#shift` from Ruby but I don't know what a real use for it is.
Done in 01e072c!
@funny-falcon I'll be happy if you can review the commit. The change turned out to be pretty simple thanks to the abstractions already in place.
Rather clean code! I enjoyed reading it.
Thank you! 😊 And thank you for the thorough review, you found many good things to fix.
Yes, but I think the errors were there before. I think…
Thank you @funny-falcon for the idea ❤️
Previously we didn't try to compact small hashes when a resize was needed. However, we can do that when there are many deleted elements, to avoid a reallocation. This should improve performance when shifting or deleting elements and then inserting more elements.
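A sketch of the decision this commit message describes (the helper names and the threshold are hypothetical):

```crystal
# Sketch: when a resize is triggered but many entries are tombstones,
# compact in place instead of growing the allocation.
private def resize
  if deleted_count >= entries_size // 2 # hypothetical threshold
    compact_entries!                    # reuse the existing buffer
  else
    double_entries_capacity             # grow as usual
  end
end
```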
@funny-falcon thank you! The above articles on RH hashing weren't entirely clear on this. Memory overhead of Hash is important too, and I'd be interested in a graph of the allocation size of…
@funny-falcon Thank you! I implemented all your suggestions. Let me know if I got something wrong 😊 Although I think that many things can still be optimized (the representation, the probing, the load factor, etc.), I think this is already a good improvement and we can continue refining things after this PR.
Absolutely wonderful, thank you so much for your hard work on this, and all you do for Crystal, Ary.
Could someone explain to me with some ASCII art or graphics how Robin Hood hashing works?
I can't understand this. Or... I can understand what it says but not what it means. Let's suppose we start with empty buckets:
Now we want to insert 10, which hashes to bucket 1:
Say we want to insert 20, which hashes to bucket 1 too... What do we do? The probe count for 10 is 0 (I guess?) and for 20 we don't know yet but it's definitely higher than 0 because it's occupied. So we swap them...?
Next comes 30, again with bucket 1. So 20 has probe count 0, 10 has probe count 1, and 30 will have probe count 2... so we swap everything again?
I can't see the benefit of this, but I'm sure there's something I'm not understanding.
In Robin Hood hashing, we compare the existing element's probe count with the current probe count of the element being inserted.
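To illustrate (a worked example, not from the original comment):

```
Insert 10 (home bucket 1):   [_, 10, _, _]    probe counts: [-, 0, -, -]
Insert 20 (home bucket 1):
  bucket 1: resident 10 has probe 0, incoming 20 has probe 0 -> no swap
  bucket 2 is free:          [_, 10, 20, _]   probe counts: [-, 0, 1, -]
Insert 30 (home bucket 1):
  bucket 1: 10 has probe 0, 30 has probe 0 -> no swap
  bucket 2: 20 has probe 1, 30 has probe 1 -> no swap
  bucket 3 is free:          [_, 10, 20, 30]  probe counts: [-, 0, 1, 2]
```

A swap only happens when the incoming element's probe count exceeds the resident's (take from the "rich" short-probe entries, give to the "poor"), which keeps the variance of probe lengths low.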
@konovod Thanks, I understand now!
Thank you everyone for the suggestions! If you find optimizations please send PRs and benchmarks.
Some optimization ideas:
Those are good points @asterite; better to create an issue to track them. It could even be an RFC.
No. If someone finds an optimization please send a PR with a benchmark. I don't plan to change anything else otherwise, and there's no need to track anything.
Well said.
I think it would be better to have proper RFCs about the design decisions, instead of PRs, just a thought.
The default initial_capacity was changed to 8 in #8017
This improves its performance, both in time (always) and memory (generally).
Fixes #4557
The algorithm
The algorithm is basically the one that Ruby uses.
How to review this
There's just a single commit. My recommendation is to first look at the final code, which is thoroughly documented (the important bits are at the top), then at the diff.
I tried to document this really well because otherwise in a month I won't remember any of this. It also enables anyone to understand how it works and possibly keep optimizing things.
Why?
The old Hash implementation uses closed addressing: buckets with linked lists. This apparently isn't great because of pointer chasing and poor cache locality. Using open addressing removes the pointer chasing and improves cache locality. That may just be theory, but empirically it performs much better.
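Roughly, the difference in layout (illustrative types, not the actual ones in the PR):

```crystal
# Closed addressing (the old Hash): each bucket heads a linked list of
# entries, so a lookup may chase pointers scattered across the heap.
class ClosedEntry(K, V)
  getter key : K
  getter value : V
  property next_entry : ClosedEntry(K, V)?

  def initialize(@key, @value, @next_entry = nil)
  end
end

# Open addressing (the new Hash): entries live in one contiguous buffer;
# a collision probes a neighboring slot, which is usually already in
# cache.
record OpenEntry(K, V), hash : UInt32, key : K, value : V
# entries : Pointer(OpenEntry(K, V))  # a single flat allocation
```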
Give me some benchmarks!
I used this code as a benchmark:
Code for benchmarking old Hash vs. new Hash
Results:
As you can see the new implementation is always faster than the old one. Sometimes more memory is used, sometimes less.
I also ran some of @kostya's benchmarks that use Hash in their implementation. Here are the results:
Havlak:
Havlak seems to be a benchmark measuring how well a language performs in general algorithmic tasks... the new results look good! 😊
Brainfuck:
JSON (when using `JSON.parse`):
Knucleotide:
Then some more benchmarks...
There's `HTTP::Request#from_io`, which I recently optimized (using `wrk` against the sample HTTP server increases the requests/sec from 118355.73 to about 122000).
Also, out of curiosity, I compared creating a Hash with 1_000_000 elements and seeing how it compares to Ruby and the old Hash.
Code for creating a Hash with 1_000_000 elements in Ruby and Crystal
Results:
Ruby was faster than Crystal! Ruby is simply amazing ❤️. But now Crystal is even faster!
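(The original snippet is collapsed above; an illustrative minimal version of such a benchmark, on the Crystal side, might look like this:)

```crystal
# Illustrative only, not the collapsed snippet: build a Hash with
# 1_000_000 entries and time it.
time = Time.measure do
  hash = {} of Int32 => Int32
  1_000_000.times do |i|
    hash[i] = i
  end
end
puts "#{time.total_seconds}s"
```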
The compiler uses Hash all over the place!
So compile times now should go down! Right? ... Right?
Well, unfortunately no. I think the main reason is that the times are bound by the number of method instances, not by the performance of the many hashes used.
Memory consumption did seem to go down a bit.
When will I have this?
We could push this to 0.31.0. Or... we could have it in 0.30.0 if we delay the release a bit more (note: I don't manage releases, but if the community doesn't mind waiting a bit to get a huge performance boost in their apps then I think we could relax the release date).

In 0.31.0.
Final thoughts
Avoiding `remainder` or `%` to fit a number inside a range (masking against a power-of-two size instead)... I did that almost as the last thing because I didn't believe it would improve performance a lot... and it doubled the performance! 😮
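To make the trick concrete, a small sketch (assuming a power-of-two size, as the PR's indices use; `fit_with_mod`/`fit_with_mask` are illustrative names):

```crystal
# `%` on integers compiles down to a division; with a power-of-two size
# the same fit is a single AND instruction.
def fit_with_mod(index : Int32, size : Int32) : Int32
  index % size
end

def fit_with_mask(index : Int32, size : Int32) : Int32
  index & (size - 1) # valid only when size is a power of two
end

size = 1024
puts fit_with_mod(1030, size)  # => 6
puts fit_with_mask(1030, size) # => 6
```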