Add type-specialized vectorized hash table container #100386

Merged: 113 commits merged into dotnet:main on Apr 15, 2024
Conversation

@kg (Member) commented Mar 28, 2024

This PR adds a type-specialized, vectorized (using 128-bit SIMD) hash table container. At present its feature set is limited but I plan to expand it to cover most of our performance-sensitive hashing scenarios in the mono runtime. It's in native/containers because ideally it will be possible to consume this container (or at least its core parts) from the C++ side of things as well.

See #100386 (comment) for a size analysis.

As a test case I migrated the MonoImage namespace cache from GHashTable to a simple string-ptr specialization of this container and validated that it works on wasm (locally, at least). The performance will be bad since I'm not sure how to set -msimd128 properly for this part of the runtime.

General notes:

  • This table's design is inspired by the Folly F14 set of containers, but not based on their code. Reading the blog post gave me enough information to figure out how to write a similar container.
  • This is based on a prototype written in C# that I benchmarked against S.C.G.Dictionary as a way to validate the performance characteristics of the data structure. https://github.com/kg/SimdDictionary/
  • The table does not store pre-computed hash codes. The lowest bits of the hash code form the bucket index, and the highest bits form the 'suffix' which is used to find individual keys quickly inside a given bucket. This reduces the amount of memory needed to store items compared to hashtables that would normally store a complete hash. The downside to this is that rehashing becomes more expensive - but I don't consider that a defect in this design.
  • The table groups keys into "buckets" of N (typically 11 or 12), where we store a group of up to 14 suffixes along with some data inside of a 16-byte-wide SIMD vector, and then store all the keys immediately after the suffixes in the bucket. For correctly-chosen bucket sizes, this allows an entire bucket to fit neatly into 1-4 cache lines.
  • After we select a bucket using the low bits of a hash code, we can locate the first potential match (if any) by suffix using a vectorized equality comparison - the index of the match is the number of trailing zeroes in the result. This is similar to how vectorized IndexOf works for strings in the BCL. (A minimal sketch of this step follows this list.)
  • Once we've found a bucket that contains potential matches, we scan through the keys sequentially to look for one that matches. We usually find the correct item on the first try, thanks to the match vector.
  • If a bucket fills up, we cascade additional items into its neighbors, after setting a flag to indicate that the bucket has spilled over. This means that lookups don't have to search neighboring buckets unless an overflow has occurred on the target bucket.
  • In the event of hash collisions, performance doesn't degrade until you start overflowing buckets frequently. In the worst case scenario where all keys have the same hashcode, performance is still better than some alternative hashtables.
  • The table stores values in a linear buffer next to the buffer that contains all the buckets, with one value slot for each key slot. If your values are enormous, this could be a problem, but I've chosen not to solve that yet since there are a few different ways to address it and the right choice is situational.
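
To make the bucket-probe steps above concrete, here is a minimal C sketch of the suffix match, written against SSE2 intrinsics for readability; the actual code sits behind the portability layer in dn-simdhash-arch.h, and the names and the exact hash-bit split here are illustrative, not the PR's.

#include <emmintrin.h>
#include <stdint.h>

/* Derive the 1-byte suffix from the high bits of the hash (illustrative split). */
static inline uint8_t example_suffix (uint32_t hash) {
    return (uint8_t)(hash >> 24);
}

/* Compare the suffix against all 16 lanes of a bucket's suffix vector at once.
   Each set bit in the returned mask marks a key slot worth comparing for real. */
static inline uint32_t example_find_candidates (__m128i bucket_suffixes, uint8_t suffix) {
    __m128i needle  = _mm_set1_epi8 ((char)suffix);
    __m128i matches = _mm_cmpeq_epi8 (bucket_suffixes, needle);
    return (uint32_t)_mm_movemask_epi8 (matches);
}

/* The first candidate's slot index is the number of trailing zeroes in the mask. */
static inline int example_first_candidate (uint32_t mask) {
    return mask ? __builtin_ctz (mask) : -1; /* -1: no suffix matched this bucket */
}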

The implementation is split across four main files:

dn-simdhash.h

Declares the public API, configuration constants, inlined helpers, and some public/private data structures. I don't go to any effort to hide internal state from the user.

dn-simdhash-arch.h

Architecture-specific code is kept in this header. It uses clang/gcc vector intrinsics where possible, which will compile down to appropriate intrinsics for the target and simplifies writing correct code. It currently supports x64, wasm* and ARM* on clang/gcc, and x64 on msvc. Unsupported architecture/compiler combos use a scalar fallback.

  • wasm will perform suboptimally unless you enable simd128, and ARM intrinsics are currently blocked on having access to system headers (plus it's untested).
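
For illustration, the vector-extension approach mentioned above looks roughly like the following; this is a sketch with assumed names, not the actual contents of dn-simdhash-arch.h.

#include <stdint.h>

/* A single portable 16-lane byte vector; clang/gcc lower operations on it to
   SSE2, NEON, or wasm simd128 depending on the target (wasm needs -msimd128). */
typedef uint8_t example_u8x16 __attribute__ ((vector_size (16), aligned (16)));

/* Lanewise equality: each lane becomes 0xFF on match, 0x00 otherwise.
   The same source compiles to pcmpeqb, cmeq, or i8x16.eq as appropriate. */
static inline example_u8x16 example_eq (example_u8x16 a, example_u8x16 b) {
    return (example_u8x16)(a == b);
}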

dn-simdhash.c

Contains implementations for the public API. The public API relies on specialized implementations generated by the specialization header, and it accesses those via a vtable. I'll call those the "private API" in this description.
I've kept as much logic in this file as possible, to reduce the amount of binary bloat created by specializations.
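
As a rough illustration of the vtable split described above (the struct and member names here are hypothetical, not the PR's actual declarations):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shape of the dispatch: the generic code in dn-simdhash.c only
   sees function pointers; each specialization fills them in. */
typedef struct example_simdhash_vtable {
    bool (*try_get) (void *hash, const void *key, void **out_value);
    bool (*try_add) (void *hash, const void *key, void *value);
    void (*rehash)  (void *hash, uint32_t new_bucket_count);
} example_simdhash_vtable;

typedef struct example_simdhash {
    const example_simdhash_vtable *vtable; /* points at a specialization's table */
    /* ... buffers, sizes, allocator ... */
} example_simdhash;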

dn-simdhash-specialization.h

Contains implementations of the private API, specialized for a given key type, value type, hash function, and key comparer.
You configure the implementation by setting various defines before including it. You should use this by creating a unique .c file for each type of simdhash you want, like the example included in the PR. Since the private API is type-specialized, certain types of misuse become compile errors.
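
A per-type instantiation unit might look roughly like this; the macro names are hypothetical stand-ins for whatever dn-simdhash-specialization.h actually expects, so check the header rather than copying this verbatim.

/* my_string_ptr_hash.c - hypothetical specialization unit */
#include <string.h>

#define DN_SIMDHASH_T                    my_string_ptr_hash
#define DN_SIMDHASH_KEY_T                const char *
#define DN_SIMDHASH_VALUE_T              void *
#define DN_SIMDHASH_KEY_HASHER(key)      my_string_hash (key)        /* hypothetical hash fn */
#define DN_SIMDHASH_KEY_EQUALS(lhs, rhs) (strcmp ((lhs), (rhs)) == 0)

#include "dn-simdhash-specialization.h"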

@kg added the NO-MERGE (The PR is not ready for merge yet; see discussion for detailed reasons) and NO-REVIEW (Experimental/testing PR, do NOT review it) labels on Mar 28, 2024
@dotnet-issue-labeler bot added the needs-area-label (An area label is needed to ensure this gets routed to the appropriate area owners) label on Mar 28, 2024
@lambdageek added the area-VM-meta-mono label and removed the needs-area-label label on Mar 28, 2024
@kg (Member Author) commented Mar 28, 2024

Weird build problems I can't reproduce locally:

  • <arm_neon.h> is missing for no obvious reason on our CI. I don't know why. Does anyone know? It should be present when building for arm64 with clang, shouldn't it? It's guarded properly with an ifdef, so I don't understand how this is happening.
  • <emmintrin.h> is getting double-included, and I can't tell where the first one is coming from. I'm 99.9% certain it's not my fault, since the test suite builds locally.

@kg (Member Author) commented Mar 28, 2024

The emmintrin.h double-include is our fault, we have a weird mangled file with that name in our tree that has a broken include guard. Still stumped on NEON.

@danmoseley (Member) commented:

Maybe I missed it above, but you mentioned you benchmarked against S.C.G.Dictionary. I'm curious about the results.

Not sure whether anyone has experimented with a vectorized variant of that, nor whether it would be sufficiently general purpose that we'd ever consider putting such a thing in the core libraries.

@jkotas (Member) commented Mar 29, 2024

Still stumped on NEON.

If you see this problem with CoreCLR, it is likely caused by CoreCLR PAL. CoreCLR PAL explicitly disables system includes.

@kg (Member Author) commented Mar 29, 2024

Still stumped on NEON.

If you see this problem with CoreCLR, it is likely caused by CoreCLR PAL. CoreCLR PAL explicitly disables system includes.

That explains it! I was wondering why we had a weird version of emmintrin. So I would need to provide a custom version of the neon header inside the PAL. I don't have easy access to an arm64 development environment to test this on at the moment.

@jkotas (Member) commented Mar 29, 2024

Or ignore CoreCLR for now and wait for #98336 to get merged. #98336 should enable including system headers across CoreCLR.

@kg (Member Author) commented Mar 29, 2024

Maybe I missed it above but you mentioned you benchmarked against S.C.G.Dictionary.. I'm curious of the results.

Not sure whether anyone has experimented with a vectorized variant of that, nor whether it would be sufficiently general purpose that we'd ever consider putting such a thing in the core libraries.

From my testing, a C# version of this algorithm tends to perform in the range of 90-110% of S.C.G.Dictionary in my BDN measurements. There are specific scenarios where it's worse (in part because I couldn't aggressively optimize parts of it easily) and where it's better (it has significantly better performance for hash collisions). Depending on how you tune the bucket sizes and allocation rules, it uses less memory too. S.C.G's performance here impressed me overall, especially considering there's still room left in that .cs file for micro-optimizations.

Expressing this properly in C# is very difficult because the size of buckets needs to be conditional on the byte width of the items inside the buckets, and I couldn't find a clean way to express that - InlineArray takes a constant argument, etc. I'm not convinced we could offer a general-purpose generic version of this in the BCL without changes to the language and type system. A version limited to unmanaged keys and values would probably be possible, but it would be awkward to write and of limited utility. I originally prototyped a C# version that used 3 arrays - suffixes / keys / values - which avoids the InlineArray problem, but the performance is worse due to a lack of cache locality + more address calculations and bounds checks from the extra array.

My main target with this PR is to replace some of the hot path hash tables in mono (typically but not always GHashTable) - they are 10-20 years old and extremely generic, which adds a lot of overhead from indirect function calls, etc.

Here are some measurements from a BDN run just now:

| Type | Method | Mean | Error | StdDev | Gen2 | Allocated |
|---|---|---|---|---|---|---|
| BCLCollisions | AddSameRepeatedlyThenClear | 9.564 us | 0.0295 us | 0.0261 us | - | - |
| BCLCollisions | FillThenClear | 3,570.406 us | 70.2765 us | 80.9305 us | - | 24 B |
| BCLCollisions | FindExistingWithCollisions | 5,503.151 us | 109.3294 us | 125.9039 us | - | 48 B |
| BCLCollisions | FindMissingWithCollisions | 6,845.631 us | 37.9069 us | 33.6035 us | - | 48 B |
| SimdCollisions | AddSameRepeatedlyThenClear | 9.468 us | 0.0314 us | 0.0278 us | - | - |
| SimdCollisions | FillThenClear | 3,002.598 us | 55.3918 us | 51.8135 us | - | 24 B |
| SimdCollisions | FindExistingWithCollisions | 4,155.909 us | 68.9603 us | 61.1315 us | - | 48 B |
| SimdCollisions | FindMissingWithCollisions | 5,528.713 us | 109.1915 us | 169.9979 us | - | 48 B |
| BCLInsertion | ClearThenRefill | 83.497 us | 2.0610 us | 6.0770 us | - | 1 B |
| BCLInsertion | InsertExisting | 39.677 us | 0.3416 us | 0.3195 us | - | - |
| SimdInsertion | ClearThenRefill | 117.608 us | 0.6924 us | 0.5782 us | - | 1 B |
| SimdInsertion | InsertExisting | 39.880 us | 0.2150 us | 0.1906 us | - | - |
| BCLIterate | EnumerateKeys | 36.013 us | 0.3891 us | 0.3640 us | - | 65632 B |
| BCLIterate | EnumeratePairs | 95.678 us | 0.7668 us | 0.7173 us | 41.6260 | 131199 B |
| BCLIterate | EnumerateValues | 37.589 us | 0.6046 us | 0.5656 us | - | 65632 B |
| SimdIterate | EnumerateKeys | 45.804 us | 0.8974 us | 1.0335 us | - | 65792 B |
| SimdIterate | EnumeratePairs | 91.639 us | 0.4401 us | 0.4117 us | 41.6260 | 131319 B |
| SimdIterate | EnumerateValues | 50.679 us | 0.3528 us | 0.3128 us | - | 65792 B |
| BCLLookup | FindExisting | 45.083 us | 0.5422 us | 0.5072 us | NA | NA |
| BCLLookup | FindMissing | 39.561 us | 0.1423 us | 0.1111 us | NA | NA |
| SimdLookup | FindExisting | 45.684 us | 0.1538 us | 0.1438 us | NA | NA |
| SimdLookup | FindMissing | 39.728 us | 0.6022 us | 0.5633 us | NA | NA |
| BCLRemoval | RemoveItemsThenRefill | 130.944 us | 0.9038 us | 0.8012 us | - | 2 B |
| BCLRemoval | RemoveMissing | 35.589 us | 0.0606 us | 0.0537 us | - | - |
| SimdRemoval | RemoveItemsThenRefill | 186.603 us | 0.3341 us | 0.2962 us | - | 2 B |
| SimdRemoval | RemoveMissing | 41.530 us | 0.5148 us | 0.4816 us | - | - |
| BCLResize | CreateDefaultSizeAndFill | 341.014 us | 1.7947 us | 1.6787 us | 222.1680 | 942132 B |
| SimdResize | CreateDefaultSizeAndFill | 401.432 us | 4.1715 us | 3.4834 us | 230.4688 | 981742 B |

@kg (Member Author) commented Mar 29, 2024

Current status:

  • REVISED Blocked on getting SIMD enabled for mono/metadata on WASM. Should make progress on that next week once the team is back from vacation. Performance seems to be adequate using the scalar fallback; we still need to figure out how to enable SIMD for mono/metadata on WASM, but it doesn't block this PR.
  • FIXED ARM NEON is blocked on system headers (Remove remaining CRT PAL wrappers and enable including standard headers in the CoreCLR build #98336), but since it only uses a small number of intrinsics, I could imitate our custom vendored header and do the same for NEON. Not sure what to do there; I don't have ARM development hardware to test on anyway.
  • NEON on MSVC will be harder since there are no clang/gcc vector intrinsics there; it seems I would have to hand-write a version of this that operates on 8 lanes at a time (clang helpfully generates that code for me). I don't know if we actually care about mono on this configuration, but if we ever consume this container from coreclr it would matter.
  • DONE The scalar non-SIMD fallback could stand to be hand-optimized (the generated WASM for it is terrible), but I don't know if we want to invest any time in that. It could still use more improvement, but I hand-optimized it and the generated code from clang on x64 and wasm looks okay.
  • Verified on MSVC. I've manually verified the test suite locally using x64 gcc, x64 clang, and wasm clang, and I've inspected the generated code from ARM clang, x64 clang, x64 gcc, wasm clang, and x64 msvc using godbolt. (Originally I hadn't manually verified on MSVC at all.) I can't get arm64 cross-compilation to work in my development environment (weird issues with system headers).
  • I did some startup profiles using the non-vectorized wasm version and performance looks okay. (The cached hashcode in the example string_ptr specialization was added based on profiling).
  • DONE It would be nice to replace the murmurhash3-32 used for strings with something that can operate on null-terminated strings, so we don't have to waste time calling strlen before hashing. The old g_str_hash was good at this, but it wasn't particularly strong as an avalanching hash function, nor was it pipelined/vectorized at all. Someone who knows hash functions better might be able to suggest a solution for this; I picked murmurhash3 since we already have it in-tree elsewhere, and AFAIK it's a proven non-cryptographic hash function with good performance. Right now, in startup profiles with this active, make_key (strlen + murmurhash3-32) is where most of the time is spent.
  • FIXED the string_ptr implementation is more complex due to the need to handle strings longer than 4gb on x64. I'd like to just drop support for that, since the idea of using a 4GB null-terminated string as a hashtable key is kind of ridiculous. Would people be OK with me making that a runtime failure instead?
  • FIXED asserts have been causing me a lot of trouble with this PR, since it seems like in our release builds they're compiled out. DN_ASSERT is the same. What's the right way to do runtime assertions in this part of the codebase? I'm used to g_assert, which is enabled in all mono builds.
  • This container hasn't been tuned for memory usage. I'm currently allocating space for 120% of requested capacity, and rounding bucket counts up to the next power of two. These choices are both inspired by the f14 blog post + my local testing with my C# prototype, but they may not be optimal for our use cases. POT bucket counts make bucket selection faster (bitwise & instead of integer %; see the sketch after this list) and reduce the number of rehash operations, but the increased memory waste for large numbers of items might be undesirable.
  • FIXED It would be cool if we could deterministically enforce cache line alignment for bucket sizes at compile time, but I wasn't able to figure out a way to do it. I'm not convinced that everything in here is alignment-safe, though I was unable to spot any issues.
  • The code size per-specialization seems okay, though x64 clang seems to eagerly inline the get/add operations into their callers when compiling the test suite even though they're not marked static or inline, and that makes me slightly concerned.
  • Rehashing performance is sub-optimal, though I don't know if it's meaningfully worse than GHashTable. I have a general idea for how to do in-place rehashing, but I don't think it's worth doing that yet - it's definitely harder to get right.
  • FIXED If you remove lots of items in a table that's had buckets overflow, the performance degradation from bucket overflow doesn't go away. I have a general sense of how to fix this, but it complicates the algorithm and is tricky to get right so I haven't done it yet. Many of our hash tables are insert-only or don't live long enough for this to be an issue.
  • While I replaced a couple GHashTables in mono/metadata to prove that this works, I didn't update the code to pre-reserve capacity for the tables, since it wasn't obvious how big they should be. Doing that will probably reduce the number of rehash operations and meaningfully improve startup performance on WASM, since we spend a sizable amount of time rehashing there.
  • I believe this data structure can also be used to implement a hashset by using a 0-byte value type, but I haven't tested that. It would be good to support that as well since it's a scenario that appears in startup profiles.
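
As referenced above, the difference between power-of-two and arbitrary bucket counts for bucket selection comes down to this (a trivial sketch with assumed names):

#include <stdint.h>

/* Power-of-two bucket count: selection is a single AND. */
static inline uint32_t select_bucket_pot (uint32_t hash, uint32_t bucket_count) {
    return hash & (bucket_count - 1); /* only valid when bucket_count is a power of two */
}

/* Arbitrary (e.g. prime) bucket count: selection needs an integer modulo,
   which is noticeably slower on most targets. */
static inline uint32_t select_bucket_any (uint32_t hash, uint32_t bucket_count) {
    return hash % bucket_count;
}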

@danmoseley (Member) commented:

thanks for the data!

S.C.G's performance here impressed me overall, especially considering there's still room left in that .cs file for micro-optimizations.

Curious what comes to mind? As I thought we'd drained the ones we were aware of.

@kg (Member Author) commented Mar 29, 2024

thanks for the data!

S.C.G's performance here impressed me overall, especially considering there's still room left in that .cs file for micro-optimizations.

Curious what comes to mind? As I thought we'd drained the ones we were aware of.

Off the top of my head, FindValue does bounds-checked array indexing on the _entries table, and so does TryInsert. Thoroughly a micro-optimization though. I found that the bounds checks were a problem for my C# prototype, but my guess is that S.C.G.Dictionary benefits from JIT optimizations that my code wasn't arranged properly to exploit.

@kg (Member Author) commented Mar 31, 2024

I did a size analysis of the main container I aim to replace, GHashTable, and compared it with dn_simdhash. Summary first, details after.

Estimated best-case memory usage on 32-bit, assuming minimum dlmalloc overhead

For 366 items (ghashtable favored):

  • GHashTable would use (20*367) = ~7340 bytes, or 20.05b/item
  • dn_simdhash would use ((32*16) + (384*8)) = ~3584 bytes, or 09.79b/item

For 767 items (simdhash favored):

  • GHashTable would use (20*823) = ~16460 bytes, or 21.46b/item
  • dn_simdhash would use ((64*16) + (768*8)) = ~7168 bytes, or 09.34b/item

GHashTable

  • 3 pointers per slot (12-24 bytes), keys and values are thus always pointer-sized
  • Slot count is prime (11, 19, 37, 73, 109, 163, 251, 367, 557, 823, 1237...), default initial capacity is 11
  • Table header stores hash_func, key_equal_func, value_destroy_func, key_destroy_func
  • Table body is an array of Slot*
  • Each slot is independently allocated with g_new -> g_malloc -> G_MALLOC_INTERNAL -> malloc
    • Emscripten appears to use dlmalloc to service our mallocs
    • Dlmalloc aligns allocations to 8 bytes
    • Dlmalloc has 4-8 bytes of overhead per allocation
  • Amortized cost per item on 32-bit (ignoring waste from alignment):
    • (sizeof(Slot):12) + (sizeof(Slot*):4) + (dlmalloc-overhead:4-8) = 20-24 bytes/item
  • Amortized cost per item on 64-bit (ignoring waste from alignment):
    • (sizeof(Slot):24) + (sizeof(Slot*):8) + (dlmalloc-overhead:8-16) = 40-48 bytes/item

dn_simdhash

  • 2-14 keys per bucket, plus a 16-byte suffix table. Our buckets currently contain either 11 or 12 key slots. Buckets are 16-byte aligned*.
    • At present this means each bucket is either 64 bytes (12 4-byte keys) or 128 bytes (11 8-byte keys) on a 32-bit arch (see the hypothetical layout sketch after this list)
    • On 64-bit arch, string_ptr keys are 16 bytes (12 bytes of data + 4 bytes of padding), and we pack them into 192-byte buckets
  • Values live in a sequentially allocated table with one slot for every key slot
  • Bucket count is a power of two (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048...), default bucket count is 1
    • This means default initial capacity is 11 or 12
    • Capacity steps are (for 11-item buckets) (11, 22, 44, 88, 176, 352, 704, 1408...), not much worse than GHashTable primes
    • Capacity steps are (for 12-item buckets) (12, 24, 48, 96, 192, 384, 768, 1536...)
  • Table header stores metadata (key/bucket/value sizes), allocator, and buffer sizes/pointers. A bit bigger than GHashTable's
  • Amortized cost per item for 12-item buckets (like-for-like with GHashTable on 32-bit):
    • (sizeof(KEY_T):4) + (sizeof(VALUE_T):4) + (bucket_overhead/12:1.33~) = 9.33~ bytes/item
  • Amortized cost per item for 11-item buckets (like-for-like with GHashTable on 64-bit):
    • (sizeof(KEY_T):8) + (sizeof(VALUE_T):8) + (bucket_overhead/11:1.45~) = 17.45~ bytes/item
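
To make the numbers above concrete, here is a hypothetical layout for the 64-byte, 32-bit bucket case; the field names and exact packing are illustrative, not the PR's actual definitions.

#include <stdint.h>

typedef struct example_bucket_32 {
    /* 16-byte suffix area, loaded as one SIMD vector during lookups */
    uint8_t  suffixes[14]; /* one 1-byte suffix per key slot (up to 14 usable) */
    uint8_t  count;        /* number of occupied key slots in this bucket */
    uint8_t  flags;        /* e.g. "items cascaded into a neighboring bucket" */
    /* keys stored inline, immediately after the suffix vector */
    uint32_t keys[12];     /* 12 4-byte keys -> 16 + 48 = 64 bytes, one cache line */
} example_bucket_32;

/* Values live in a separate parallel buffer, one slot per key slot:
   VALUE_T values[bucket_count * 12]; */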

Additional performance thoughts on dn_simdhash

  • dn_simdhash does less pointer chasing than GHashTable, due to the lack of Slot** indirection
  • keys are grouped together into buckets, so cache locality is improved for hash collisions
  • For common key types the hash and equality functions are inlined into the body of find/add/remove operations, removing an indirect call
    • Indirect calls are significantly more expensive on WASM than native, in addition to their effect of preventing inlining
    • We can still implement indirect hash and equality functions for a "ghash compatible" table

Code size increase

I don't know how to measure this with any accuracy. We monitor the size of dotnet.wasm on an ongoing basis, so it would manifest there once we merge changes. FWIW, the test suite + its dn_simdhash and dn_vector dependencies combined generate a 20550-byte WASM file or a 27072-byte ELF binary from x64 clang. Looking at disassembly of the ELF binary, the dn_simdhash_* function bodies are approximately 28% of the total character count, which would put the cost of one unique simdhash instantiation on x64 at ~8KB. Clang loves to inline simdhash functions into their callers, so I don't completely trust this estimate.

@@ -42,6 +42,11 @@ else()
set(metadata_platform_sources ${metadata_unix_sources})
endif()

set(imported_native_sources
Review comment (Member):

Blocked on getting SIMD enabled for mono/metadata on WASM.

@kg I /think/ just using set_source_file_properties to set compile flags would work:

set_source_file_properties(${imported_native_sources} PROPERTIES COMPILE_FLAGS -msimd128)

under the appropriate if()/endif().

It's possible this won't work if we're doing something weird with paths.

@kg (Member Author) replied:

The thing is we need to conditionally build with/without simd based on the msbuild property. Right now that's handled by building simd and non simd modules that get linked in dynamically at the end.

Reply (Member):

oh, sorry, I misunderstood what you're doing. I thought this was predicated on requiring simd support in all wasm builds.

If you need conditional builds, you will need to create two separate .a files for the whole runtime (or maybe just two different .a files for anything that depends on the hash containers) and then let the logic in src/mono/wasm/build select which one to link in.

conceptually our build here is:

  1. build src/mono/mono.proj. That invokes cmake to compile src/mono/mono/mini (this also compiles metadata/ utils/, eglib/ etc)
  2. build src/mono/browser/browser.proj once to create a kind of default build of dotnet.native.wasm for users who don't need to do native compilation on their dev machines
  3. for people who do native compilation on their dev machines, instead distribute '.a' files of the src/mono/mono bits and allow them to do the final compilation/linking on their own (this is the job of src/mono/browser/build/ .targets files)

If you need step 3 or step 2 to do different things based on msbuild properties, but you need to choose different outputs from step 1, then step 1 needs to produce multiple artifacts

@kg (Member Author) replied:

do you think we could move the hash containers and stuff into their own object that gets linked late? i tried putting these things in our existing simd/no-simd objects, but i got link errors earlier in the build because it seems like we have a linking tree, i.e.

ab = a + b
cd = c + d
mono-sgen = ab + cd

Reply (Member):

I think it should be possible to build the hash containers in their own object library (modulo anything inlined in headers). I suspect it wouldn't work if you just do it in src/native/containers but src/mono/mono could probably do something.


if (image->name_cache)
return;

the_name_cache = g_hash_table_new (g_str_hash, g_str_equal);
// TODO: Figure out a good initial capacity for this table by doing a scan,
@lambdageek (Member) commented Apr 1, 2024:

I think we can estimate a very good guess for corelib (by the time this is first called, I think mono_defaults.corlib will be set already) and just set some kind of default for everything else.

@kg (Member Author) replied:

That makes sense. I suspect the current default (11-12 items) might be right for smaller images

@kg (Member Author) commented Apr 15, 2024

Found a bug that caused simdhash to build incorrectly under MSVC. While fixing it, I did a benchmark suite run (these numbers aren't comparable to the linux ones; I had to reduce the iteration count because the rand() in msvc's libc is lower quality than clang's.)

baseline: Warmed 124 time(s). Running 32768 iterations... 19 step(s): avg 3.231ns min 3.173ns max 3.269ns
dn_clear_then_fill_sequential: Warmed 13 time(s). Running 4096 iterations... 15 step(s): avg 32.620ns min 31.967ns max 32.992ns
dn_clear_then_fill_random: Warmed 13 time(s). Running 4096 iterations... 15 step(s): avg 32.824ns min 32.110ns max 33.588ns
dn_fill_then_remove_every_item: Warmed 8 time(s). Running 2048 iterations... 16 step(s): avg 62.134ns min 59.765ns max 64.947ns
ght_find_random_keys: Warmed 17 time(s). Running 8192 iterations... 11 step(s): avg 22.362ns min 21.720ns max 23.031ns
dn_find_random_keys: Warmed 26 time(s). Running 8192 iterations... 16 step(s): avg 15.637ns min 15.148ns max 16.191ns
dn_find_missing_key: Warmed 28 time(s). Running 8192 iterations... 17 step(s): avg 14.768ns min 14.603ns max 15.012ns
ght_clear_then_fill_sequential: Warmed 4 time(s). Running 1024 iterations... 17 step(s): avg 115.914ns min 114.911ns max 117.402ns
ght_clear_then_fill_random: Warmed 3 time(s). Running 1024 iterations... 13 step(s): avg 156.626ns min 155.166ns max 160.343ns
ght_find_missing_key: Warmed 15 time(s). Running 4096 iterations... 19 step(s): avg 26.059ns min 24.940ns max 27.191ns

Interesting that running it on clang x64 shows ght lookup being around 2x as slow as dn_simdhash, but on msvc x64 ght lookup is a bit faster. Maybe a difference in the quality of the SIMD codegen or something, not sure it's worth looking into that deeply yet. Insertion and removal are still faster, as you'd expect.

EDIT: The reason ght was faster on Windows is that the "random" keys were sequential in the range 0-32767. :-)

@kg merged commit 3c10707 into dotnet:main on Apr 15, 2024
161 of 163 checks passed
matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
This PR adds a type-specialized, vectorized (using 128-bit SIMD) hash table container, and migrates one part of the mono runtime to use it instead of GHashTable. It also adds a basic test suite and basic benchmark suite. Vectorization is not enabled for it in the WASM build yet because we need to make changes to the build there. It is also not vectorized for ARM MSVC.
@github-actions bot locked and limited conversation to collaborators on May 16, 2024