[Proposal] Vectorized System.Collections.Generic.Dictionary<K, V> #108098

Open · kg opened this issue Sep 21, 2024 · 3 comments

Labels: api-suggestion (Early API idea and discussion, it is NOT ready for implementation), area-System.Collections, needs-further-triage (Issue has been initially triaged, but needs deeper consideration or reconsideration)
Milestone: Future

Comments

kg (Member) commented Sep 21, 2024

Proposal: Vectorized System.Collections.Generic.Dictionary<K, V>

See #107830 for some context/early measurements, https://github.com/kg/SimdDictionary/ for a prototype, and https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 for a blog post describing a similar hashmap design.

Background and motivation

As part of work to improve startup time for the Mono runtime, I introduced a new vectorized pure-C hashtable implementation called dn_simdhash. We migrated many uses of Mono's old GHashTable to this new container, which delivered sizable reductions in both CPU usage and memory usage. During and after this work, I've had multiple people ask about the feasibility of doing similar work to vectorize the CoreCLR native C++ hash containers or vectorize SCG.Dictionary. I believe that we can do the latter and get improvements to throughput and memory usage, which may also translate to reductions in startup time.

API Proposal / Usage

No public API changes, unless we decide to expose new functionality as a part of this work. I can't think of anything I'd expose offhand.

Enhancements to InlineArray could enable higher performance for this container, but I think that is best left for a separate proposal.

Risks

  • If not implemented correctly we could introduce some new memory safety issues, but I believe we have what it takes to do this right. It doesn't expose any specific new risk factors that the old dictionary lacked.
  • Performance could be slightly worse on targets without Vector128 acceleration - on x86, SSE2 is required, and on ARM64 it needs AdvSimd.
  • It would be difficult to deliver 1:1 identical-or-better performance in every scenario, so customers might experience degraded performance for corner cases.
  • The current design uses more memory for tiny (< 14 items) dictionaries.
  • If optimized for performance, code size per-instantiation may increase.
  • The new design relies on JIT features like inlining of static interface methods and struct decomposition, which could increase time spent in JIT or make it more sensitive to changes in the JIT.
  • NativeAOT may have trouble producing optimal code for all instantiations.
  • The number of times GetHashCode or Equals are invoked during key operations (like rehashing) may change.
  • Iteration order (CopyTo, Enumerator, etc) will change.
  • Performance may degrade for specific types of bad hash functions.

BDN measurements for 4096-element <Int64, Int64> (on Ryzen unless noted)

  • Memory usage reduced by ~25%, with accompanying GC pause time improvements.
  • Failed/successful lookup times reduced by 20-30% on Ryzen and 50-75% on Intel for optimal hashes.
  • Insertion improved by 10% on Ryzen and 45% on Intel. Removal improved by 45% on Intel.
  • Initialization and lookup times reduced by 10% for all-hashcodes-are-zero.
  • ContainsValue execution times reduced by 20%.
  • KVP Enumerator performance improved by 10%.
  • Clear performance on full dictionary improved by 40% on Intel, while clearing a single-item dictionary got 6x worse. This is a good example of the tradeoffs you encounter with vectorization.

High-level design and notes

This dictionary is vectorized by splitting items into 14-entry buckets, where each bucket contains a Vector128 of 'hash suffixes' (8-bit slices of each item's hashcode) alongside the key-value pairs in an InlineArray. Once a bucket is selected based on the hash modulo the bucket count, we do a vectorized scan of the 14 suffixes to identify the most likely matching item in three instructions: vpcmpeqb, vpmovmskb, tzcnt. With an optimal hash, the odds of a suffix collision (more than one match in a bucket) are approximately 8% and the odds of a false positive are approximately 1%. This means that for a non-degraded table, most lookups will exit after a single vectorized suffix check or after calling Equals on 1-2 items. The 2 remaining bytes in the suffix vector are used to store the item count and the 'cascade count', respectively; cascade counts are explained further below. The layout looks like this:

// Nested in the dictionary type; CountSlot and CascadeSlot index the two
// trailing lanes of the suffix vector.
internal struct Pair {
    public K Key;
    public V Value;
}

[InlineArray(14)]
internal struct InlinePairArray {
    public Pair Pair0;
}

internal struct Bucket {
    public const uint CountSlot = 14, CascadeSlot = 15;

    public Vector128<byte> Suffixes;
    public InlinePairArray Pairs;

    public ref byte Count =>
        ref Unsafe.AddByteOffset(ref Unsafe.As<Vector128<byte>, byte>(ref Unsafe.AsRef(in Suffixes)), CountSlot);

    public ref byte CascadeCount =>
        ref Unsafe.AddByteOffset(ref Unsafe.As<Vector128<byte>, byte>(ref Unsafe.AsRef(in Suffixes)), CascadeSlot);
}
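
To make the three-instruction scan concrete, here is a minimal sketch of what the FindSuffixInBucket helper (used by the FindKey example below) could look like; the method name comes from the prototype, but this body is illustrative and assumes System.Runtime.Intrinsics plus System.Numerics.BitOperations:

internal static int FindSuffixInBucket (ref Bucket bucket, byte suffix, int bucketCount) {
    // Compare all 16 lanes against the suffix at once (vpcmpeqb), then
    // collapse the result into a 16-bit mask (vpmovmskb).
    uint mask = Vector128.Equals(bucket.Suffixes, Vector128.Create(suffix))
        .ExtractMostSignificantBits();
    // Ignore lanes past the occupied items; this also excludes the two
    // metadata lanes (Count and CascadeCount).
    mask &= (1u << bucketCount) - 1u;
    // tzcnt: index of the first candidate, or 32 (i.e. >= 14) if none match.
    return BitOperations.TrailingZeroCount(mask);
}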

Unlike the current SCG.Dictionary, all data lives in a single buckets array, which delivers better cache locality for scans. We can locate the appropriate bucket with a single imul once the hash has been reduced modulo the bucket count (a single bitwise and for power-of-two bucket counts; the existing FastMod for prime counts). Scanning buckets after this uses nothing other than add/inc operations and address comparisons.
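
As a sketch of that bucket selection (the power-of-two case shown here; names are illustrative):

// One bitwise and to reduce the hash, then the single imul implicit in
// computing the element address within the buckets array.
internal static ref Bucket SelectBucket (Bucket[] buckets, uint hashCode) {
    int bucketIndex = (int)(hashCode & (uint)(buckets.Length - 1));
    return ref buckets[bucketIndex];
}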

Small numbers of hash collisions have virtually no negative impact, as each bucket can contain up to 14 items with the same hash. Once a bucket fills, it 'cascades' into the next bucket, which is tracked efficiently (for up to 254 cascaded items) with virtually no degradation. Once a single bucket cascades 255 or more times, it remains degraded until the table is rehashed (it is easy to detect this and grow the table in response). In the degraded state, failed lookups have to search neighboring buckets, since the 'cascaded' flag remains set until the bucket is cleared.
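
A sketch of the saturating counter behavior described above (illustrative, not the prototype's exact bookkeeping):

// When an insert overflows a bucket and spills into its neighbor, the origin
// bucket's cascade counter is bumped, saturating at 255. A saturated counter
// can't be decremented accurately on remove, so the bucket stays marked as
// cascaded until the next rehash rebuilds the table.
if (bucket.CascadeCount < 255)
    bucket.CascadeCount++;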

It is theoretically possible to rehash this container in-place without a temporary array, at the cost of some buckets becoming erroneously degraded.

Null checks and checks for length-0 are optimized out by having empty dictionaries share a common 1-element empty Buckets array.
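
For example (field names assumed):

// All empty dictionaries share one 1-bucket array, so lookups need no null
// or length-0 checks: the shared bucket's Count is 0, so any search falls
// through to 'not found' naturally.
private static readonly Bucket[] EmptyBuckets = new Bucket[1];
private Bucket[] _buckets = EmptyBuckets;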

Find, Insert and Remove operations can be expressed in terms of a common FindKeyInBucket static interface method and a LoopingBucketEnumerator struct, which means that, for example, TryInsert is a total of 55 lines of C#. Portions of this search logic do not depend on the types of K or V, so they can be shared between instantiations. Example FindKey implementation:

internal ref Pair FindKey<TKeySearcher> (K key, IEqualityComparer<K>? comparer)
    where TKeySearcher : struct, IKeySearcher
{
    var hashCode = TKeySearcher.GetHashCode(comparer, key);
    // It's important to construct the enumerator before computing the suffix, to avoid stalls
    var enumerator = new LoopingBucketEnumerator(this, hashCode);
    // Pre-creating the search vector here using Vector128.Create is possible, but it gets stored into xmm6 and causes
    //  an unconditional stack spill at entry, which outweighs any performance benefit from creating it early.
    // vpbroadcastb is 1op/1c latency on most x86-64 chips, so doing it early isn't very useful either.
    // Unfortunately in some cases Vector128.Create(suffix) gets LICM'd into xmm6 too... https://github.com/dotnet/runtime/issues/108092
    var suffix = GetHashSuffix(hashCode);
    do {
        // Eagerly load the bucket count early for pipelining purposes, so we don't stall when using it later.
        int bucketCount = enumerator.bucket.Count,
            startIndex = FindSuffixInBucket(ref enumerator.bucket, suffix, bucketCount);
        ref var pair = ref TKeySearcher.FindKeyInBucket(ref enumerator.bucket, startIndex, bucketCount, comparer, key, out _);
        if (Unsafe.IsNullRef(ref pair)) {
            if (enumerator.bucket.CascadeCount == 0)
                return ref Unsafe.NullRef<Pair>();
        } else
            return ref pair;
    } while (enumerator.Advance());

    return ref Unsafe.NullRef<Pair>();
}
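
For reference, the static interface this builds on presumably has a shape along these lines (an assumed sketch inferred from the calls above, not the prototype's exact declaration):

internal interface IKeySearcher {
    static abstract uint GetHashCode (IEqualityComparer<K>? comparer, K key);
    // Returns a null ref when no key in the bucket matches.
    static abstract ref Pair FindKeyInBucket (
        ref Bucket bucket, int startIndex, int bucketCount,
        IEqualityComparer<K>? comparer, K key, out int matchIndexInBucket
    );
}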

Once a matching pair has been found in a bucket (for find/remove operations) or a bucket with space has been found (for inserts) we can complete the relevant operation in one step without additional address calculations (imul/idiv/mod, etc). If we cascaded out of previous buckets during insert/remove, we scan backwards to update their cascade counters to keep the table in a consistent state, but we don't pay this cost when there are no collisions.

Clear, copy and enumeration operations have to scan every bucket, then check its count to determine whether to touch its contents. This can produce very different performance from SCG depending on the distribution of items and number of items.
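
The shared scan pattern is roughly (illustrative):

// Clear/CopyTo/enumeration all walk every bucket and consult Count before
// touching pairs. Cost is proportional to capacity rather than to the number
// of items, which is why a mostly-empty table can compare unfavorably to SCG.
for (int i = 0; i < buckets.Length; i++) {
    ref var bucket = ref buckets[i];
    int count = bucket.Count;
    for (int j = 0; j < count; j++) {
        ref var pair = ref bucket.Pairs[j];
        // ... visit pair.Key / pair.Value ...
    }
}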

Because of the 14-wide buckets and vectorized handling of hash collisions, we can omit caching each item's hashcode in the buckets (which makes each item smaller), and we can allow the load factor to approach 100% without meaningfully degrading lookup performance. This results in reduced memory usage.
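
As a rough back-of-the-envelope for <Int64, Int64>: SCG's Entry struct holds a uint hashCode, an int next link, and the 16-byte pair (24 bytes), plus a 4-byte slot in its separate buckets array, roughly 28 bytes per item; here each item costs the 16-byte pair plus its 1/14 share of the 16-byte suffix vector, roughly 17 bytes, before accounting for differences in load factor and growth policy.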

Because the entire container's state is a Count field, a GrowAtCount field, a Buckets array, and potentially a FastMod divisor, concurrent modifications are less hazardous than in SCG.Dictionary. With power-of-two bucket counts instead of prime bucket counts, there is no concurrent modification hazard at all for lookup operations. (Prime bucket counts expose the hazard of FastMod producing an out-of-bounds bucket index; SCG.Dictionary doesn't currently handle this so the array access would throw.)

This container does not use any sort of freelist when performing removes or inserts; values are stored in the appropriate bucket rather than sequentially from index 0 in a separate entries array. This makes certain whole-table operations potentially slower, e.g. Clear or CopyTo on a mostly-empty dictionary.

Potential runtime/type system improvements

An improvement to InlineArray would allow intelligently sizing buckets so their size is always a power of two. This would turn some imuls in this container into bitshifts and improve cache-line alignment for buckets. The most obvious way to do this would be an enhanced version of InlineArray where you request 'as many items as can fit into X bytes' or, even better, '1 <= N <= 14 items to produce the best-aligned structure'. It's possible to manually align buckets for the most common key/value size (8 bytes, for reference-typed keys/values or int64s), but it's not possible to generically specialize the presence/size of padding, so that optimization doesn't generalize in the current type system. This optimization is possible in dn_simdhash and F14 today; a hypothetical API shape is sketched below.
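
One hypothetical shape for such an enhancement (not a real .NET attribute today; purely illustrative):

// Hypothetical: 'give me as many Pair elements as fit in 256 bytes, clamped
// to 1..14', letting the runtime pick N per instantiation so sizeof(Bucket)
// stays a power of two and buckets stay cache-line aligned.
[InlineArrayBestFit(maxLength: 14, targetSizeBytes: 256)]
internal struct InlinePairArray {
    public Pair Pair0;
}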

kg added the api-suggestion and area-System.Collections labels on Sep 21, 2024
dotnet-policy-service bot added the untriaged label on Sep 21, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

jkotas (Member) commented Sep 21, 2024

If we go with this, we should validate the perf on real-world workload (e.g. Bing) by measuring total memory and inclusive time taken by Dictionary before/after.

jeffhandley added the needs-further-triage label and removed the untriaged label on Sep 21, 2024
jeffhandley added this to the Future milestone on Sep 21, 2024
jeffhandley (Member) commented

@eiriktsarpalis / @kg -- I suggest the two of you connect to chat about this and capture notes for feasibility and potential value, optionally including @tannergooding in the conversation as well.
