Add type-specialized vectorized hash table container #100386

kg · 2024-03-28T04:34:20Z

This PR adds a type-specialized, vectorized (using 128-bit SIMD) hash table container. At present its feature set is limited but I plan to expand it to cover most of our performance-sensitive hashing scenarios in the mono runtime. It's in native/containers because ideally it will be possible to consume this container (or at least its core parts) from the C++ side of things as well.

See #100386 (comment) for a size analysis.

As a test case I migrated the MonoImage namespace cache from GHashTable to a simple string-ptr specialization of this container and validated that it works on wasm (locally, at least). The performance will be bad since I'm not sure how to set -msimd128 properly for this part of the runtime.

General notes:

This table's design is inspired by the Folly F14 set of containers, but not based on their code. Reading the blog post gave me enough information to figure out how to write a similar container.
This is based on a prototype written in C# that I benchmarked against S.C.G.Dictionary as a way to validate the performance characteristics of the data structure. https://github.com/kg/SimdDictionary/
The table does not store pre-computed hash codes. The lowest bits of the hash code form the bucket index, and the highest bits form the 'suffix' which is used to find individual keys quickly inside a given bucket. This reduces the amount of memory needed to store items compared to hashtables that would normally store a complete hash. The downside to this is that rehashing becomes more expensive - but I don't consider that a defect in this design.
The table groups keys into "buckets" of N (typically 11 or 12), where we store a group of up to 14 suffixes along with some data inside of a 16-byte-wide SIMD vector, and then store all the keys immediately after the suffixes in the bucket. For correctly-chosen bucket sizes, this allows an entire bucket to fit neatly into 1-4 cache lines.
After we select a bucket using the low bits of a hash code, we can locate the first (if any) potential match by suffix using a vectorized equality comparison - the index is determined by the number of trailing zeroes. This is similar to how vectorized IndexOf works for strings in the BCL.
Once we've found a bucket that contains potential matches, we scan through the keys sequentially to look for one that matches. We usually find the correct item on the first try, thanks to the match vector.
If a bucket fills up, we cascade additional items into its neighbors, after setting a flag to indicate that the bucket has spilled over. This means that lookups don't have to search neighboring buckets unless an overflow has occurred on the target bucket.
In the event of hash collisions, performance doesn't degrade until you start overflowing buckets frequently. In the worst case scenario where all keys have the same hashcode, performance is still better than some alternative hashtables.
The table stores values in a linear buffer next to the buffer that contains all the buckets, with one value slot for each key slot. If your values are enormous, this could be a problem, but I've chosen not to solve that yet since there are a few different ways to address it and the right choice is situational.
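
The lookup path described in these notes can be sketched in scalar C as follows. This is a minimal illustration rather than the code in this PR: `bucket_t`, `insert_key`, `find_key`, the 8-bit suffix width, and the placement of the count and cascade flag in the last two bytes of the 16-byte header are all assumptions made for the example, and `insert_key` does not check for duplicate keys.

```c
#include <assert.h>
#include <stdint.h>

enum { SLOTS = 12 }; /* key slots per bucket (hypothetical layout) */

typedef struct {
    uint8_t  suffixes[14]; /* one suffix per occupied slot */
    uint8_t  count;        /* number of occupied slots */
    uint8_t  cascaded;     /* set once this bucket has spilled into a neighbor */
    uint32_t keys[SLOTS];  /* keys stored right after the 16-byte header */
} bucket_t;

_Static_assert(sizeof(bucket_t) == 64, "12 4-byte keys + 16-byte header = 64 bytes");

/* Low bits of the hash select the bucket; bucket_count must be a power of two. */
static uint32_t bucket_index(uint32_t hash, uint32_t bucket_count) {
    return hash & (bucket_count - 1);
}

/* High bits of the hash form the suffix (8 bits assumed here). */
static uint8_t suffix_of(uint32_t hash) {
    return (uint8_t)(hash >> 24);
}

/* Returns the global slot index used, or -1 if every bucket is full. */
static int insert_key(bucket_t *buckets, uint32_t bucket_count, uint32_t key, uint32_t hash) {
    assert((bucket_count & (bucket_count - 1)) == 0);
    uint32_t first = bucket_index(hash, bucket_count), i = first;
    do {
        bucket_t *b = &buckets[i];
        if (b->count < SLOTS) {
            b->suffixes[b->count] = suffix_of(hash);
            b->keys[b->count] = key;
            return (int)(i * SLOTS + b->count++);
        }
        b->cascaded = 1; /* full: remember that items spilled past this bucket */
        i = (i + 1) & (bucket_count - 1);
    } while (i != first);
    return -1;
}

/* Returns the global slot index of key, or -1 if it is not present. */
static int find_key(const bucket_t *buckets, uint32_t bucket_count, uint32_t key, uint32_t hash) {
    uint32_t first = bucket_index(hash, bucket_count), i = first;
    uint8_t suffix = suffix_of(hash);
    do {
        const bucket_t *b = &buckets[i];
        for (uint8_t s = 0; s < b->count; s++)
            if (b->suffixes[s] == suffix && b->keys[s] == key)
                return (int)(i * SLOTS + s);
        if (!b->cascaded)
            return -1; /* no overflow here, so neighbors cannot hold the key */
        i = (i + 1) & (bucket_count - 1);
    } while (i != first);
    return -1;
}
```

With this layout the value for a key would live at the same global slot index in a separate values array, which matches the "one value slot for each key slot" arrangement described above.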

The implementation is split across four main files:

dn-simdhash.h

Declares the public API, configuration constants, inlined helpers, and some public/private data structures. I don't go to any effort to hide internal state from the user.

dn-simdhash-arch.h

Architecture-specific code is kept in this header. It uses clang/gcc vector intrinsics where possible, which will compile down to appropriate intrinsics for the target and simplifies writing correct code. It currently supports x64, wasm* and ARM* on clang/gcc, and x64 on msvc. Unsupported architecture/compiler combos use a scalar fallback.

wasm will perform suboptimally unless you enable simd128, and ARM intrinsics are currently blocked on having access to system headers (plus it's untested).
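
As a rough illustration of the split between a vectorized path and a scalar fallback, here is a sketch of a suffix search using SSE2 intrinsics when they are available on clang/gcc. It is not the code in dn-simdhash-arch.h (which uses clang/gcc vector intrinsics where possible); the function name `find_first_suffix` and the "return 16 when nothing matches" convention are assumptions made for this example.

```c
#include <stdint.h>

#if defined(__SSE2__) && (defined(__GNUC__) || defined(__clang__))
#include <emmintrin.h>
#define EXAMPLE_USE_SSE2 1
#else
#define EXAMPLE_USE_SSE2 0
#endif

/* Returns the index of the first byte equal to needle, or 16 if none matches. */
static int find_first_suffix(const uint8_t suffixes[16], uint8_t needle) {
#if EXAMPLE_USE_SSE2
    /* Compare all 16 bytes at once, then collapse the result to a 16-bit mask. */
    __m128i haystack = _mm_loadu_si128((const __m128i *)suffixes);
    __m128i matches  = _mm_cmpeq_epi8(haystack, _mm_set1_epi8((char)needle));
    unsigned mask    = (unsigned)_mm_movemask_epi8(matches);
    /* The number of trailing zero bits is the index of the first match. */
    return mask ? __builtin_ctz(mask) : 16;
#else
    /* Scalar fallback for unsupported architecture/compiler combinations. */
    for (int i = 0; i < 16; i++)
        if (suffixes[i] == needle)
            return i;
    return 16;
#endif
}
```

A caller would still need to ignore any match index at or beyond the bucket's item count, since the trailing bytes of the vector hold bookkeeping data rather than suffixes.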

dn-simdhash.c

Contains implementations for the public API. The public API relies on specialized implementations generated by the specialization header, and it accesses those via a vtable. I'll call those the "private API" in this description.
I've kept as much logic in this file as possible, to reduce the amount of binary bloat created by specializations.

dn-simdhash-specialization.h

Contains implementations of the private API, specialized for a given key type, value type, hash function, and key comparer.
You configure the implementation by setting various defines before including it. You should use this by creating a unique .c file for each type of simdhash you want, like the example included in the PR. Since the private API is type-specialized, certain types of misuse become compile errors.
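
To show the general idea of type specialization through the preprocessor, here is a self-contained sketch. Note that it uses a different mechanism from this PR: a function-generating macro in a single file instead of defines set before including a header. `DEFINE_FIND`, `U32_EQUALS`, `STR_EQUALS`, and `cstr_t` are hypothetical names invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Generates a linear search specialized for one key type and one comparer.
   The comparer is expanded inline, so no indirect call is involved. */
#define DEFINE_FIND(NAME, KEY_T, KEY_EQUALS)                          \
    static int NAME(const KEY_T *keys, int count, KEY_T needle) {     \
        for (int i = 0; i < count; i++)                               \
            if (KEY_EQUALS(keys[i], needle))                          \
                return i;                                             \
        return -1;                                                    \
    }

typedef const char *cstr_t;

#define U32_EQUALS(a, b) ((a) == (b))
#define STR_EQUALS(a, b) (strcmp((a), (b)) == 0)

/* Two independent instantiations, each with its own typed signature. */
DEFINE_FIND(find_u32, uint32_t, U32_EQUALS)
DEFINE_FIND(find_str, cstr_t, STR_EQUALS)
```

Because each generated function has a concrete key type in its signature, passing the wrong kind of key (for example a string to `find_u32`) is diagnosed by the compiler instead of being silently accepted through a `void *` parameter.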

kg · 2024-03-28T20:28:50Z

Weird build problems I can't reproduce locally:

<arm_neon.h> is missing for no obvious reason on our CI. I don't know why. Does anyone know? It should be present when building for arm64 with clang, shouldn't it? It's guarded properly with an ifdef, so I don't understand how this is happening.
<emmintrin.h> is getting double-included, and I can't tell where the first one is coming from. I'm 99.9% certain it's not my fault, since the test suite builds locally.

kg · 2024-03-28T23:14:08Z

The emmintrin.h double-include is our fault, we have a weird mangled file with that name in our tree that has a broken include guard. Still stumped on NEON.

danmoseley · 2024-03-29T01:46:37Z

Maybe I missed it above but you mentioned you benchmarked against S.C.G.Dictionary.. I'm curious of the results.

Not sure whether anyone has experimented with a vectorized variant of that, nor whether it would be sufficiently general purpose that we'd ever consider putting such a thing in the core libraries.

jkotas · 2024-03-29T02:02:37Z

> Still stumped on NEON.

If you see this problem with CoreCLR, it is likely caused by CoreCLR PAL. CoreCLR PAL explicitly disables system includes.

kg · 2024-03-29T02:16:39Z

> Still stumped on NEON.
>
> If you see this problem with CoreCLR, it is likely caused by CoreCLR PAL. CoreCLR PAL explicitly disables system includes.

That explains it! I was wondering why we had a weird version of emmintrin. So I would need to provide a custom version of the neon header inside the PAL. I don't have easy access to an arm64 development environment to test this on at the moment.

jkotas · 2024-03-29T02:20:53Z

Or ignore CoreCLR for now and wait for #98336 to get merged. #98336 should enable including system headers across CoreCLR.

kg · 2024-03-29T02:33:57Z

> Maybe I missed it above but you mentioned you benchmarked against S.C.G.Dictionary.. I'm curious of the results.
>
> Not sure whether anyone has experimented with a vectorized variant of that, nor whether it would be sufficiently general purpose that we'd ever consider putting such a thing in the core libraries.

From my testing, a C# version of this algorithm tends to perform in the range of 90-110% of S.C.G.Dictionary in my BDN measurements. There are specific scenarios where it's worse (in part because I couldn't aggressively optimize parts of it easily) and where it's better (it has significantly better performance for hash collisions). Depending on how you tune the bucket sizes and allocation rules, it uses less memory too. S.C.G's performance here impressed me overall, especially considering there's still room left in that .cs file for micro-optimizations.

Expressing this properly in C# is very difficult because the size of buckets needs to be conditional on the byte width of the items inside the buckets, and I couldn't find a clean way to express that - InlineArray takes a constant argument, etc. I'm not convinced we could offer a general-purpose generic version of this in the BCL without changes to the language and type system. A version limited to unmanaged keys and values would probably be possible, but it would be awkward to write and of limited utility. I originally prototyped a C# version that used 3 arrays - suffixes / keys / values - which avoids the InlineArray problem, but the performance is worse due to a lack of cache locality + more address calculations and bounds checks from the extra array.

My main target with this PR is to replace some of the hot path hash tables in mono (typically but not always GHashTable) - they are 10-20 years old and extremely generic, which adds a lot of overhead from indirect function calls, etc.

Here are some measurements from a BDN run just now:

Type	Method	Mean	Error	StdDev	Gen2	Allocated
BCLCollisions	AddSameRepeatedlyThenClear	9.564 us	0.0295 us	0.0261 us	-	-
BCLCollisions	FillThenClear	3,570.406 us	70.2765 us	80.9305 us	-	24 B
BCLCollisions	FindExistingWithCollisions	5,503.151 us	109.3294 us	125.9039 us	-	48 B
BCLCollisions	FindMissingWithCollisions	6,845.631 us	37.9069 us	33.6035 us	-	48 B
SimdCollisions	AddSameRepeatedlyThenClear	9.468 us	0.0314 us	0.0278 us	-	-
SimdCollisions	FillThenClear	3,002.598 us	55.3918 us	51.8135 us	-	24 B
SimdCollisions	FindExistingWithCollisions	4,155.909 us	68.9603 us	61.1315 us	-	48 B
SimdCollisions	FindMissingWithCollisions	5,528.713 us	109.1915 us	169.9979 us	-	48 B
BCLInsertion	ClearThenRefill	83.497 us	2.0610 us	6.0770 us	-	1 B
BCLInsertion	InsertExisting	39.677 us	0.3416 us	0.3195 us	-	-
SimdInsertion	ClearThenRefill	117.608 us	0.6924 us	0.5782 us	-	1 B
SimdInsertion	InsertExisting	39.880 us	0.2150 us	0.1906 us	-	-
BCLIterate	EnumerateKeys	36.013 us	0.3891 us	0.3640 us	-	65632 B
BCLIterate	EnumeratePairs	95.678 us	0.7668 us	0.7173 us	41.6260	131199 B
BCLIterate	EnumerateValues	37.589 us	0.6046 us	0.5656 us	-	65632 B
SimdIterate	EnumerateKeys	45.804 us	0.8974 us	1.0335 us	-	65792 B
SimdIterate	EnumeratePairs	91.639 us	0.4401 us	0.4117 us	41.6260	131319 B
SimdIterate	EnumerateValues	50.679 us	0.3528 us	0.3128 us	-	65792 B
BCLLookup	FindExisting	45.083 us	0.5422 us	0.5072 us	NA	NA
BCLLookup	FindMissing	39.561 us	0.1423 us	0.1111 us	NA	NA
SimdLookup	FindExisting	45.684 us	0.1538 us	0.1438 us	NA	NA
SimdLookup	FindMissing	39.728 us	0.6022 us	0.5633 us	NA	NA
BCLRemoval	RemoveItemsThenRefill	130.944 us	0.9038 us	0.8012 us	-	2 B
BCLRemoval	RemoveMissing	35.589 us	0.0606 us	0.0537 us	-	-
SimdRemoval	RemoveItemsThenRefill	186.603 us	0.3341 us	0.2962 us	-	2 B
SimdRemoval	RemoveMissing	41.530 us	0.5148 us	0.4816 us	-	-
BCLResize	CreateDefaultSizeAndFill	341.014 us	1.7947 us	1.6787 us	222.1680	942132 B
SimdResize	CreateDefaultSizeAndFill	401.432 us	4.1715 us	3.4834 us	230.4688	981742 B

kg · 2024-03-29T21:41:48Z

Current status:

REVISED: Blocked on getting SIMD enabled for mono/metadata on WASM. Should make progress on that next week once the team is back from vacation. Performance seems to be adequate using the scalar fallback, so we still need to figure out how to enable SIMD for mono/metadata on WASM, but it doesn't block this PR.
FIXED: ARM NEON is blocked on system headers (#98336, "Remove remaining CRT PAL wrappers and enable including standard headers in the CoreCLR build"), but since it only uses a small number of intrinsics, I could imitate our custom vendored header and do the same for NEON. Not sure what to do there; I don't have ARM development hardware to test on anyway.
NEON on MSVC will be harder since there are no clang/gcc vector intrinsics there, it seems I would have to hand-write a version of this that operates on 8 lanes at a time (clang helpfully generates the code for me.) I don't know if we actually care about mono on this configuration, but if we ever consume this container from coreclr it would matter.
DONE: ~~The scalar non-SIMD fallback could stand to be hand-optimized (the generated WASM for it is terrible), but I don't know if we want to invest any time in that.~~ It could use more improvement, but I hand-optimized it; the generated code in clang on x64 and wasm looks okay.
Verified on MSVC: I've manually verified the test suite locally using x64 gcc, x64 clang, and wasm clang. I've inspected the generated code from ARM clang, x64 clang, x64 gcc, wasm clang, and x64 msvc using godbolt. I haven't manually verified at all using MSVC. I can't get arm64 cross-compilation to work in my development environment (weird issues with system headers).
I did some startup profiles using the non-vectorized wasm version and performance looks okay. (The cached hashcode in the example string_ptr specialization was added based on profiling).
DONE: It would be nice to replace the murmurhash3-32 used for strings with something that can operate on null-terminated strings, so we don't have to waste time calling strlen before hashing. The old g_str_hash was good at this, but it wasn't particularly strong as an avalanching hash function, nor was it pipelined/vectorized at all. Someone who knows hash functions better might be able to suggest a solution for this; I picked murmurhash3 since we already have it in-tree elsewhere, and AFAIK it's a proven non-cryptographic hash function with good performance. Right now in startup profiles with this active, make_key (strlen + murmurhash3-32) is where most of the time is spent.
FIXED: The string_ptr implementation is more complex due to the need to handle strings longer than 4GB on x64. I'd like to just drop support for that, since the idea of using a 4GB null-terminated string as a hashtable key is kind of ridiculous. Would people be OK with me making that a runtime failure instead?
FIXED: Asserts have been causing me a lot of trouble with this PR, since it seems like in our release builds they're compiled out. DN_ASSERT is the same. What's the right way to do runtime assertions in this part of the codebase? I'm used to g_assert, which is enabled in all mono builds.
This container hasn't been tuned for memory usage. I'm currently allocating space for 120% of requested capacity, and rounding bucket counts up to the next power of two. These choices are both inspired by the f14 blog post + my local testing with my C# prototype, but they may not be optimal for our use cases. POT bucket counts make bucket selection faster (bitwise & instead of integer %) and reduce the number of rehash operations, but the increased memory waste for large numbers of items might be undesirable.
FIXED: It would be cool if we could deterministically enforce cache line alignment for bucket sizes at compile time, but I wasn't able to figure out a way to do it. I'm not convinced that everything in here is alignment-safe, though I was unable to spot any issues.
The code size per-specialization seems okay, though x64 clang seems to eagerly inline the get/add operations into their callers when compiling the test suite even though they're not marked static or inline, and that makes me slightly concerned.
Rehashing performance is sub-optimal, though I don't know if it's meaningfully worse than GHashTable. I have a general idea for how to do in-place rehashing, but I don't think it's worth doing that yet - it's definitely harder to get right.
FIXED: If you remove lots of items in a table that's had buckets overflow, the performance degradation from bucket overflow doesn't go away. I have a general sense of how to fix this, but it complicates the algorithm and is tricky to get right so I haven't done it yet. Many of our hash tables are insert-only or don't live long enough for this to be an issue.
While I replaced a couple GHashTables in mono/metadata to prove that this works, I didn't update the code to pre-reserve capacity for the tables, since it wasn't obvious how big they should be. Doing that will probably reduce the number of rehash operations and meaningfully improve startup performance on WASM, since we spend a sizable amount of time rehashing there.
I believe this data structure can also be used to implement a hashset by using a 0-byte value type, but I haven't tested that. It would be good to support that as well since it's a scenario that appears in startup profiles.
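
The sizing policy mentioned above (120% of requested capacity, power-of-two bucket counts, bitwise AND for bucket selection) could look roughly like the sketch below. The exact rounding used in the PR isn't shown in this thread, so the arithmetic here is an assumption, and `next_power_of_two`, `bucket_count_for`, and `select_bucket` are hypothetical names.

```c
#include <stdint.h>

/* Rounds v up to the next power of two (returns 1 for 0 and 1). */
static uint32_t next_power_of_two(uint32_t v) {
    if (v <= 1)
        return 1;
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}

/* Pads the requested capacity to 120%, converts it to a bucket count
   (rounding up), then rounds that up to a power of two. */
static uint32_t bucket_count_for(uint32_t requested, uint32_t slots_per_bucket) {
    uint64_t padded  = ((uint64_t)requested * 120 + 99) / 100;
    uint64_t buckets = (padded + slots_per_bucket - 1) / slots_per_bucket;
    return next_power_of_two((uint32_t)buckets);
}

/* With a power-of-two bucket count, selection is a mask instead of a modulo. */
static uint32_t select_bucket(uint32_t hash, uint32_t bucket_count) {
    return hash & (bucket_count - 1);
}
```

The mask in `select_bucket` gives the same result as `hash % bucket_count` whenever `bucket_count` is a power of two, which is the speed benefit described above; the cost is that capacity can only double, so large tables may over-allocate.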

danmoseley · 2024-03-29T21:46:03Z

thanks for the data!

> S.C.G's performance here impressed me overall, especially considering there's still room left in that .cs file for micro-optimizations.

Curious what comes to mind? As I thought we'd drained the ones we were aware of.

kg · 2024-03-29T21:49:20Z

> thanks for the data!
>
> S.C.G's performance here impressed me overall, especially considering there's still room left in that .cs file for micro-optimizations.
>
> Curious what comes to mind? As I thought we'd drained the ones we were aware of.

Off the top of my head, FindValue does bounds-checked array indexing on the _entries table, and so does TryInsert. Thoroughly a micro-optimization though. I found that the bounds checks were a problem for my C# prototype, but my guess is that S.C.G.Dictionary benefits from JIT optimizations that my code wasn't arranged properly to exploit.

kg · 2024-03-31T20:35:01Z

I did a size analysis of the main container I aim to replace, GHashTable, and compared it with dn_simdhash. Summary first, details after.

Estimated best-case memory usage on 32-bit, assuming minimum dlmalloc overhead

For 366 items (ghashtable favored):

GHashTable would use (20*367) = ~7340 bytes, or 20.05b/item
dn_simdhash would use ((32*16) + (384*8)) = ~3584 bytes, or 09.79b/item

For 767 items (simdhash favored):

GHashTable would use (20*823) = ~16460 bytes, or 21.46b/item
dn_simdhash would use ((64*16) + (768*8)) = ~7168 bytes, or 09.34b/item
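
The arithmetic behind these estimates can be reproduced with a small sketch. The helper names are hypothetical; the constants (20 bytes per GHashTable slot, a 16-byte suffix table per bucket, 12 slots per bucket, 4-byte keys and values) are the 32-bit figures used in this comment.

```c
#include <stddef.h>

/* GHashTable estimate: every slot costs a fixed amortized number of bytes. */
static size_t ghashtable_bytes(size_t slot_count, size_t bytes_per_slot) {
    return slot_count * bytes_per_slot;
}

/* dn_simdhash estimate: one suffix table per bucket, plus one key and one
   value for every slot in every bucket. */
static size_t simdhash_bytes(size_t bucket_count, size_t slots_per_bucket,
                             size_t suffix_table_bytes,
                             size_t key_size, size_t value_size) {
    size_t slots = bucket_count * slots_per_bucket;
    return bucket_count * suffix_table_bytes + slots * (key_size + value_size);
}
```

For example, `simdhash_bytes(32, 12, 16, 4, 4)` gives the 3584 bytes quoted for 366 items, and `ghashtable_bytes(367, 20)` gives the 7340 bytes quoted for GHashTable.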

GHashTable

3 pointers per slot (12-24 bytes), keys and values are thus always pointer-sized
Slot count is prime (11, 19, 37, 73, 109, 163, 251, 367, 557, 823, 1237...), default initial capacity is 11
Table header stores hash_func, key_equal_func, value_destroy_func, key_destroy_func
Table body is an array of Slot*
Each slot is independently allocated with g_new -> g_malloc -> G_MALLOC_INTERNAL -> malloc
- Emscripten appears to use dlmalloc to service our mallocs
- Dlmalloc aligns allocations to 8 bytes
- Dlmalloc has 4-8 bytes of overhead per allocation
Amortized cost per item on 32-bit (ignoring waste from alignment):
- (sizeof(Slot):12) + (sizeof(Slot*):4) + (dlmalloc-overhead:4-8) = 20-24 bytes/item
Amortized cost per item on 64-bit (ignoring waste from alignment):
- (sizeof(Slot):24) + (sizeof(Slot*):8) + (dlmalloc-overhead:8-16) = 40-48 bytes/item

dn_simdhash

2-14 keys per bucket, plus 16 byte suffix table. Our buckets currently contain either 11 or 12 key slots. Buckets are 16-byte aligned*.
- At present this means each bucket is either 64 (12 4-byte keys) or 128 (11 8-byte keys) bytes on 32 bit arch
- On 64-bit arch, string_ptr keys are 16 bytes (12 bytes of data + 4 bytes of padding), and we pack them into 192-byte buckets
Values live in a sequentially allocated table with one slot for every key slot
Bucket count is a power of two (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048...), default bucket count is 1
- This means default initial capacity is 11 or 12
- Capacity steps are (for 11-item buckets) (11, 22, 44, 88, 176, 352, 704, 1408...), not much worse than GHashTable primes
- Capacity steps are (for 12-item buckets) (12, 24, 48, 96, 192, 384, 768, 1536...)
Table header stores metadata (key/bucket/value sizes), allocator, and buffer sizes/pointers. A bit bigger than GHashTable's
Amortized cost per item for 12-item buckets (like-for-like with GHashTable on 32-bit):
- (sizeof(KEY_T):4) + (sizeof(VALUE_T):4) + (bucket_overhead/12:1.33~) = 9.33~ bytes/item
Amortized cost per item for 11-item buckets (like-for-like with GHashTable on 64-bit):
- (sizeof(KEY_T):8) + (sizeof(VALUE_T):8) + (bucket_overhead/11:1.45~) = 17.45~ bytes/item

Additional performance thoughts on dn_simdhash

dn_simdhash does less pointer chasing than GHashTable, due to the lack of Slot** indirection
keys are grouped together into buckets, so cache locality is improved for hash collisions
For common key types the hash and equality functions are inlined into the body of find/add/remove operations, removing an indirect call
- Indirect calls are significantly more expensive on WASM than native, in addition to their effect of preventing inlining
- We can still implement indirect hash and equality functions for a "ghash compatible" table

Code size increase

I don't know how to measure this with any accuracy. We monitor the size of dotnet.wasm on an ongoing basis, so it would manifest there once we merge changes. FWIW, the test suite + its dn_simdhash and dn_vector dependencies combined generate a 20550-byte WASM file or a 27072-byte ELF binary from x64 clang. Looking at disassembly of the ELF binary, the dn_simdhash_* function bodies are approximately 28% of the total character count, which would put the cost of one unique simdhash instantiation on x64 at ~8KB. Clang loves to inline simdhash functions into their callers, so I don't completely trust this estimate.

lambdageek · 2024-04-01T15:58:09Z

src/mono/mono/metadata/CMakeLists.txt

@@ -42,6 +42,11 @@ else()
 set(metadata_platform_sources ${metadata_unix_sources})
 endif()

+set(imported_native_sources


> Blocked on getting SIMD enabled for mono/metadata on WASM.

@kg I /think/ just using set_source_files_properties to set compile flags would work:

set_source_files_properties(${imported_native_sources} PROPERTIES COMPILE_FLAGS -msimd128)

under the appropriate if()/endif().

It's possible this won't work if we're doing something weird with paths.

The thing is we need to conditionally build with/without simd based on the msbuild property. Right now that's handled by building simd and non simd modules that get linked in dynamically at the end.

oh, sorry, I misunderstood what you're doing. I thought this was predicated on requiring simd support in all wasm builds.

If you need conditional builds, you will need to create two separate .a files for the whole runtime (or maybe just two different .a files for anything that depends on the hash containers) and then letting the logic in src/mono/wasm/build select which one to link in.

conceptually our build here is:

1. build src/mono/mono.proj. That invokes cmake to compile src/mono/mono/mini (this also compiles metadata/, utils/, eglib/, etc.)

2. build src/mono/browser/browser.proj once to create a kind of default build of dotnet.native.wasm for users who don't need to do native compilation on their dev machines

3. for people who do native compilation on their dev machines, instead distribute '.a' files of the src/mono/mono bits and allow them to do the final compilation/linking on their own (this is the job of src/mono/browser/build/ .targets files)

If you need step 3 or step 2 to do different things based on msbuild properties, but you need to choose different outputs from step 1, then step 1 needs to produce multiple artifacts

do you think we could move the hash containers and stuff into their own object that gets linked late? i tried putting these things in our existing simd/no-simd objects, but i got link errors earlier in the build because it seems like we have a linking tree, i.e.

ab = a + b
cd = c + d
mono-sgen = ab + cd

I think it should be possible to build the hash containers in their own object library (modulo anything inlined in headers). I suspect it wouldn't work if you just do it in src/native/containers but src/mono/mono could probably do something.

lambdageek · 2024-04-01T16:03:08Z

src/mono/mono/metadata/class.c


 	if (image->name_cache)
 		return;

-	the_name_cache = g_hash_table_new (g_str_hash, g_str_equal);
+	// TODO: Figure out a good initial capacity for this table by doing a scan,


I think we can estimate a very good guess for corelib (by the time this is first called, I think mono_defaults.corlib will be set already) and just set some kind of default for everything else.

That makes sense. I suspect the current default (11-12 items) might be right for smaller images

kg · 2024-04-15T09:32:18Z

Found a bug that caused simdhash to build incorrectly under MSVC. While fixing it, I did a benchmark suite run (these numbers aren't comparable to the linux ones; I had to reduce the iteration count because the rand() in msvc's libc is lower quality than clang's.)

baseline: Warmed 124 time(s). Running 32768 iterations... 19 step(s): avg 3.231ns min 3.173ns max 3.269ns
dn_clear_then_fill_sequential: Warmed 13 time(s). Running 4096 iterations... 15 step(s): avg 32.620ns min 31.967ns max 32.992ns
dn_clear_then_fill_random: Warmed 13 time(s). Running 4096 iterations... 15 step(s): avg 32.824ns min 32.110ns max 33.588ns
dn_fill_then_remove_every_item: Warmed 8 time(s). Running 2048 iterations... 16 step(s): avg 62.134ns min 59.765ns max 64.947ns
ght_find_random_keys: Warmed 17 time(s). Running 8192 iterations... 11 step(s): avg 22.362ns min 21.720ns max 23.031ns
dn_find_random_keys: Warmed 26 time(s). Running 8192 iterations... 16 step(s): avg 15.637ns min 15.148ns max 16.191ns
dn_find_missing_key: Warmed 28 time(s). Running 8192 iterations... 17 step(s): avg 14.768ns min 14.603ns max 15.012ns
ght_clear_then_fill_sequential: Warmed 4 time(s). Running 1024 iterations... 17 step(s): avg 115.914ns min 114.911ns max 117.402ns
ght_clear_then_fill_random: Warmed 3 time(s). Running 1024 iterations... 13 step(s): avg 156.626ns min 155.166ns max 160.343ns
ght_find_missing_key: Warmed 15 time(s). Running 4096 iterations... 19 step(s): avg 26.059ns min 24.940ns max 27.191ns

Interesting that running it on clang x64 shows ght lookup being around 2x as slow as dn_simdhash, but on msvc x64 ght lookup is a bit faster. Maybe a difference in the quality of the SIMD codegen or something, not sure it's worth looking into that deeply yet. Insertion and removal are still faster, as you'd expect.

EDIT: The reason ght was faster on Windows is that the "random" keys were sequential in the range 0-32767. :-)

This PR adds a type-specialized, vectorized (using 128-bit SIMD) hash table container, and migrates one part of the mono runtime to use it instead of GHashTable. It also adds a basic test suite and basic benchmark suite. Vectorization is not enabled for it in the WASM build yet because we need to make changes to the build there. It is also not vectorized for ARM MSVC.

kg added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it labels Mar 28, 2024

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Mar 28, 2024

dotnet-policy-service bot assigned kg Mar 28, 2024

lambdageek added area-VM-meta-mono and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Mar 28, 2024

This was referenced Mar 29, 2024

mono_os_mutex_destroy: pthread_mutex_destroy failed with "Device or resource busy" #99609

Closed

[browser][mt] SignalRClientTests timeouts #100388

Closed

[wasm] SignalR tests timing out #100445

Closed

kg removed the NO-REVIEW Experimental/testing PR, do NOT review it label Mar 29, 2024

kg marked this pull request as ready for review March 29, 2024 23:05

kg requested review from lambdageek, steveisok and thaystg as code owners March 29, 2024 23:05

lambdageek reviewed Apr 1, 2024

kg added 24 commits April 15, 2024 01:08

Optimize out unaligned 16-byte copy in scalar wasm find_value

0a1936e

Workaround for weird msimd128 codegen

e388a1d

Update comments

b4d07d7

Simplify assertions

703a607

Optimize x64 codegen for bucket scans

Benchmark harness

73c505d

Make ght_compatible usable outside of mono

Checkpoint

97a3b25

Checkpoint

391fdeb

Better missing key measurement

ae69695

Fix sequential/random measurements being meaningfully different

f543c20

Add fill-then-remove measurement

Adjustment based on benchmarking

07c398b

Add a baseline measurement

44a7bd3

Basic ghashtable comparison measurements

9572863

Update makefile

bf03cca

Check in missing changes; fix mono link error

57e6aa2

Add missing license headers

9107045

Partially unroll scalar search for better wasm performance

3d066cc

Fully unroll

06efeeb

Cleanup whitespace / add comment

8d798f8

Only type-check simdhash instances in debug builds

253d920

Update makefile switches

d90828a

Fix and add comments

5f49036

Make test support windows

1f2b89f

Fix typo in simdhash-arch MSVC implementation

Make it possible to build benchmark suite using MSVC

5083f40

Improve MSVC codegen

af63c5c

kg force-pushed the simdhash branch from ea0c1f8 to af63c5c Compare April 15, 2024 09:22

kg merged commit 3c10707 into dotnet:main Apr 15, 2024
161 of 163 checks passed

matouskozak mentioned this pull request Apr 24, 2024

[Perf] Linux/arm64: 32 Regressions on 4/16/2024 12:58:48 AM dotnet/perf-autofiling-issues#33133

Closed

github-actions bot locked and limited conversation to collaborators May 16, 2024
