
Additional cross-platform hardware intrinsic APIs for loading/storing, reordering, and extracting a per-element "mask" #63331

Open
Tracked by #79005
tannergooding opened this issue Jan 4, 2022 · 8 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics
Milestone

Comments

@tannergooding
Member

tannergooding commented Jan 4, 2022

Summary

With #49397 we approved and exposed cross platform APIs on Vector64/128/256 to help developers more easily support multiple platforms.

This was done by mirroring the surface area exposed by Vector<T>. However, due to their fixed sizes, there are some additional APIs that would be beneficial to expose. Likewise, there are a few load/store APIs commonly used with hardware intrinsics that would benefit from cross-platform helpers.

The APIs exposed would include the following:

  • ExtractMostSignificantBits
    • On x86/x64 this would be emitted as MoveMask and performs exactly as expected
    • On ARM64, this would be emulated via and, element-wise shift-right, 64-bit pairwise add, and extract. The JIT could optionally detect when the input is the result of a Compare instruction and elide the shift-right.
    • On WASM, this is called bitmask and works identically to MoveMask
    • This API and its emulation are used throughout the BCL
  • Load/Store
    • These are the basic load/store operations already in use for x86, x64, and ARM64
  • LoadAligned/StoreAligned
    • These work exactly like the same-named APIs on x86/x64
    • When optimizations are disabled the alignment is verified
    • When optimizations are enabled, this alignment checking may be skipped due to the load being folded into an instruction on modern hardware
    • This enables efficient usage of the instruction on both older (pre-AVX) hardware as well as newer (post-AVX) or ARM64 hardware (where no load/store aligned instructions exist)
  • LoadAlignedNonTemporal/StoreAlignedNonTemporal
    • These behave like LoadAligned/StoreAligned but may optionally treat the memory access as non-temporal, avoiding cache pollution
  • LoadUnsafe/StoreUnsafe
    • These are "new" APIs; they cover a gap in the API surface that has been encountered and worked around in the BCL, and which is semi-regularly requested by the community
    • The API that just takes a ref T behaves exactly like the version that takes a pointer, just without requiring pinning
    • The API that additionally takes an nuint index behaves like ref Unsafe.Add(ref value, index) and avoids needing to further bloat IL and hinder readability
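
The semantics of ExtractMostSignificantBits can be sketched with a small reference model (Python here for illustration only; the function name, element width parameter, and list-based vector are all hypothetical, not part of the proposed C# surface):

```python
# Hypothetical reference model of ExtractMostSignificantBits over unsigned
# 8-bit elements: bit i of the result is the most significant bit of
# element i of the input vector.
def extract_most_significant_bits(elements, element_bits=8):
    mask = 0
    for i, e in enumerate(elements):
        msb = (e >> (element_bits - 1)) & 1  # take each element's top bit
        mask |= msb << i                     # pack it into result bit i
    return mask

# A compare result (all-ones vs. all-zeros elements) collapses to one bit each:
print(extract_most_significant_bits([0xFF, 0x00, 0xFF, 0x00]))  # 5 (0b0101)
```

This is why the ARM64 emulation can elide the shift-right when the input is a compare result: every bit of each element already equals the element's most significant bit.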

API Proposal

namespace System.Runtime.Intrinsics
{
    public static partial class Vector64
    {
        public static uint ExtractMostSignificantBits<T>(Vector64<T> vector);

        public static Vector64<T> Load<T>(T* address);
        public static Vector64<T> LoadAligned<T>(T* address);
        public static Vector64<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector64<T> LoadUnsafe<T>(ref T address);
        public static Vector64<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector64<T> source);
        public static void StoreAligned<T>(T* address, Vector64<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector64<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector64<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector64<T> source);
    }

    public static partial class Vector128
    {
        public static uint ExtractMostSignificantBits<T>(Vector128<T> vector);

        public static Vector128<T> Load<T>(T* address);
        public static Vector128<T> LoadAligned<T>(T* address);
        public static Vector128<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector128<T> LoadUnsafe<T>(ref T address);
        public static Vector128<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector128<T> source);
        public static void StoreAligned<T>(T* address, Vector128<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector128<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector128<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector128<T> source);
    }

    public static partial class Vector256
    {
        public static uint ExtractMostSignificantBits<T>(Vector256<T> vector);

        public static Vector256<T> Load<T>(T* address);
        public static Vector256<T> LoadAligned<T>(T* address);
        public static Vector256<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector256<T> LoadUnsafe<T>(ref T address);
        public static Vector256<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector256<T> source);
        public static void StoreAligned<T>(T* address, Vector256<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector256<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector256<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector256<T> source);
    }
}

Additional Notes

Ideally we would also expose "shuffle" APIs allowing the elements of a single or multiple vectors to be reordered:

  • On x86/x64 these are referred to as Shuffle or Permute (generally taking two inputs and one input, respectively, though that isn't always the case)
  • On ARM64, these are referred to as VectorTableLookup (only takes two elements)
  • On WASM, these are referred to as Shuffle (takes two elements) and Swizzle (takes one element).
  • On LLVM, these are referred to as VectorShuffle and only take two elements

Due to the complexities of these APIs, they can't trivially be exposed as a single generic API. Likewise, while the behavior for Vector128<T> is consistent on all platforms, Vector64<T> is ARM64-specific and Vector256<T> is x86/x64-specific. The former behaves like Vector128<T>, while the latter generally behaves like 2x Vector128<T> (outside a few APIs called Permute#x#). For consistency, the Vector256<T> APIs exposed here would behave identically to Vector128<T> and allow "cross lane" permutation.

For the single-vector reordering, the APIs are "trivial":

public static Vector128<byte>   Shuffle(Vector128<byte>   vector, Vector128<byte>   indices)
public static Vector128<sbyte>  Shuffle(Vector128<sbyte>  vector, Vector128<sbyte>  indices)

public static Vector128<short>  Shuffle(Vector128<short>  vector, Vector128<short>  indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices)

public static Vector128<int>    Shuffle(Vector128<int>    vector, Vector128<int>    indices)
public static Vector128<uint>   Shuffle(Vector128<uint>   vector, Vector128<uint>   indices)
public static Vector128<float>  Shuffle(Vector128<float>  vector, Vector128<int>    indices)

public static Vector128<long>   Shuffle(Vector128<long>   vector, Vector128<long>   indices)
public static Vector128<ulong>  Shuffle(Vector128<ulong>  vector, Vector128<ulong>  indices)
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long>   indices)
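
Assuming out-of-range indices produce zero (an assumption in this sketch, matching how hardware table lookups such as ARM64 TBL behave, rather than something stated in the proposal), the single-vector semantics can be modeled as:

```python
# Sketch of single-vector Shuffle semantics: result[i] = vector[indices[i]],
# with out-of-range indices assumed to yield a zero element.
def shuffle(vector, indices):
    n = len(vector)
    return [vector[i] if 0 <= i < n else 0 for i in indices]

print(shuffle([10, 20, 30, 40], [3, 2, 1, 0]))  # [40, 30, 20, 10] (reverse)
print(shuffle([10, 20, 30, 40], [0, 9, 1, 9]))  # [10, 0, 20, 0]
```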

For the two-vector reordering, the APIs are generally the same:

public static Vector128<byte>   Shuffle(Vector128<byte>  lower,  Vector128<byte>   upper, Vector128<byte>   indices)
public static Vector128<sbyte>  Shuffle(Vector128<sbyte> lower,  Vector128<sbyte>  upper, Vector128<sbyte>  indices)

public static Vector128<short>  Shuffle(Vector128<short>  lower, Vector128<short>  upper, Vector128<short>  indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices)

public static Vector128<int>    Shuffle(Vector128<int>    lower, Vector128<int>    upper, Vector128<int>    indices)
public static Vector128<uint>   Shuffle(Vector128<uint>   lower, Vector128<uint>   upper, Vector128<uint>   indices)
public static Vector128<float>  Shuffle(Vector128<float>  lower, Vector128<float>  upper, Vector128<int>    indices)

public static Vector128<long>   Shuffle(Vector128<long>   lower, Vector128<long>   upper, Vector128<long>   indices)
public static Vector128<ulong>  Shuffle(Vector128<ulong>  lower, Vector128<ulong>  upper, Vector128<ulong>  indices)
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long>   indices)
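
For the two-vector form, the natural reading is that indices address the concatenation of lower and upper, so index n (the element count of one vector) selects upper's first element. A sketch under that assumption (again with out-of-range indices zeroing, which is this sketch's convention rather than text from the proposal):

```python
# Sketch of two-vector Shuffle semantics: indices address the concatenation
# of lower and upper; out-of-range indices are assumed to produce zero.
def shuffle2(lower, upper, indices):
    combined = lower + upper  # lower occupies indices [0, n), upper [n, 2n)
    n = len(combined)
    return [combined[i] if 0 <= i < n else 0 for i in indices]

print(shuffle2([1, 2, 3, 4], [5, 6, 7, 8], [7, 0, 4, 9]))  # [8, 1, 5, 0]
```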

An upside of these APIs is that, for common scenarios involving constant indices, the generated code can be massively simplified.
A downside is that non-constant indices on older hardware, or certain Vector256<T> shuffles involving byte, sbyte, short, or ushort that cross the 128-bit lane boundary, can take a couple of instructions rather than a single instruction.

This is ultimately no worse than a few other scenarios on each platform where one platform may have slightly better instruction generation due to the instructions it provides.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels Jan 4, 2022
@ghost

ghost commented Jan 4, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.


@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation area-System.Runtime.Intrinsics and removed area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels Jan 4, 2022

@terrajobst
Member

terrajobst commented Jan 4, 2022

Video

  • Looks good as proposed but let's make ExtractMostSignificantBits and StoreXxx extension methods
  • We decided to keep the Shuffle APIs as non-extensions because the two-argument shuffle would be potentially confusing and we didn't want a mix of extensions/non-extensions.
namespace System.Runtime.Intrinsics
{
    public static partial class Vector64
    {
        public static uint ExtractMostSignificantBits<T>(this Vector64<T> vector);

        public static Vector64<T> Load<T>(T* source);
        public static Vector64<T> LoadAligned<T>(T* source);
        public static Vector64<T> LoadAlignedNonTemporal<T>(T* source);
        public static Vector64<T> LoadUnsafe<T>(ref T source);
        public static Vector64<T> LoadUnsafe<T>(ref T source, nuint index);

        public static void Store<T>(this Vector64<T> source, T* destination);
        public static void StoreAligned<T>(this Vector64<T> source, T* destination);
        public static void StoreAlignedNonTemporal<T>(this Vector64<T> source, T* destination);
        public static void StoreUnsafe<T>(this Vector64<T> source, ref T destination);
        public static void StoreUnsafe<T>(this Vector64<T> source, ref T destination, nuint index);

        public static Vector64<byte>   Shuffle(Vector64<byte>   vector, Vector64<byte>   indices);
        public static Vector64<sbyte>  Shuffle(Vector64<sbyte>  vector, Vector64<sbyte>  indices);

        public static Vector64<short>  Shuffle(Vector64<short>  vector, Vector64<short>  indices);
        public static Vector64<ushort> Shuffle(Vector64<ushort> vector, Vector64<ushort> indices);

        public static Vector64<int>    Shuffle(Vector64<int>    vector, Vector64<int>    indices);
        public static Vector64<uint>   Shuffle(Vector64<uint>   vector, Vector64<uint>   indices);
        public static Vector64<float>  Shuffle(Vector64<float>  vector, Vector64<int>    indices);

        public static Vector64<byte>   Shuffle(Vector64<byte>   lower, Vector64<byte>   upper, Vector64<byte>   indices);
        public static Vector64<sbyte>  Shuffle(Vector64<sbyte>  lower, Vector64<sbyte>  upper, Vector64<sbyte>  indices);

        public static Vector64<short>  Shuffle(Vector64<short>  lower, Vector64<short>  upper, Vector64<short>  indices);
        public static Vector64<ushort> Shuffle(Vector64<ushort> lower, Vector64<ushort> upper, Vector64<ushort> indices);

        public static Vector64<int>    Shuffle(Vector64<int>    lower, Vector64<int>    upper, Vector64<int>    indices);
        public static Vector64<uint>   Shuffle(Vector64<uint>   lower, Vector64<uint>   upper, Vector64<uint>   indices);
        public static Vector64<float>  Shuffle(Vector64<float>  lower, Vector64<float>  upper, Vector64<int>    indices);
    }

    public static partial class Vector128
    {
        public static uint ExtractMostSignificantBits<T>(this Vector128<T> vector);

        public static Vector128<T> Load<T>(T* source);
        public static Vector128<T> LoadAligned<T>(T* source);
        public static Vector128<T> LoadAlignedNonTemporal<T>(T* source);
        public static Vector128<T> LoadUnsafe<T>(ref T source);
        public static Vector128<T> LoadUnsafe<T>(ref T source, nuint index);

        public static void Store<T>(this Vector128<T> source, T* destination);
        public static void StoreAligned<T>(this Vector128<T> source, T* destination);
        public static void StoreAlignedNonTemporal<T>(this Vector128<T> source, T* destination);
        public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination);
        public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination, nuint index);

        public static Vector128<byte>   Shuffle(Vector128<byte>   vector, Vector128<byte>   indices);
        public static Vector128<sbyte>  Shuffle(Vector128<sbyte>  vector, Vector128<sbyte>  indices);

        public static Vector128<short>  Shuffle(Vector128<short>  vector, Vector128<short>  indices);
        public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices);

        public static Vector128<int>    Shuffle(Vector128<int>    vector, Vector128<int>    indices);
        public static Vector128<uint>   Shuffle(Vector128<uint>   vector, Vector128<uint>   indices);
        public static Vector128<float>  Shuffle(Vector128<float>  vector, Vector128<int>    indices);

        public static Vector128<long>   Shuffle(Vector128<long>   vector, Vector128<long>   indices);
        public static Vector128<ulong>  Shuffle(Vector128<ulong>  vector, Vector128<ulong>  indices);
        public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long>   indices);

        public static Vector128<byte>   Shuffle(Vector128<byte>   lower, Vector128<byte>   upper, Vector128<byte>   indices);
        public static Vector128<sbyte>  Shuffle(Vector128<sbyte>  lower, Vector128<sbyte>  upper, Vector128<sbyte>  indices);

        public static Vector128<short>  Shuffle(Vector128<short>  lower, Vector128<short>  upper, Vector128<short>  indices);
        public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices);

        public static Vector128<int>    Shuffle(Vector128<int>    lower, Vector128<int>    upper, Vector128<int>    indices);
        public static Vector128<uint>   Shuffle(Vector128<uint>   lower, Vector128<uint>   upper, Vector128<uint>   indices);
        public static Vector128<float>  Shuffle(Vector128<float>  lower, Vector128<float>  upper, Vector128<int>    indices);

        public static Vector128<long>   Shuffle(Vector128<long>   lower, Vector128<long>   upper, Vector128<long>   indices);
        public static Vector128<ulong>  Shuffle(Vector128<ulong>  lower, Vector128<ulong>  upper, Vector128<ulong>  indices);
        public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long>   indices);
    }

    public static partial class Vector256
    {
        public static uint ExtractMostSignificantBits<T>(this Vector256<T> vector);

        public static Vector256<T> Load<T>(T* source);
        public static Vector256<T> LoadAligned<T>(T* source);
        public static Vector256<T> LoadAlignedNonTemporal<T>(T* source);
        public static Vector256<T> LoadUnsafe<T>(ref T source);
        public static Vector256<T> LoadUnsafe<T>(ref T source, nuint index);

        public static void Store<T>(this Vector256<T> source, T* destination);
        public static void StoreAligned<T>(this Vector256<T> source, T* destination);
        public static void StoreAlignedNonTemporal<T>(this Vector256<T> source, T* destination);
        public static void StoreUnsafe<T>(this Vector256<T> source, ref T destination);
        public static void StoreUnsafe<T>(this Vector256<T> source, ref T destination, nuint index);

        public static Vector256<byte>   Shuffle(Vector256<byte>   vector, Vector256<byte>   indices);
        public static Vector256<sbyte>  Shuffle(Vector256<sbyte>  vector, Vector256<sbyte>  indices);

        public static Vector256<short>  Shuffle(Vector256<short>  vector, Vector256<short>  indices);
        public static Vector256<ushort> Shuffle(Vector256<ushort> vector, Vector256<ushort> indices);

        public static Vector256<int>    Shuffle(Vector256<int>    vector, Vector256<int>    indices);
        public static Vector256<uint>   Shuffle(Vector256<uint>   vector, Vector256<uint>   indices);
        public static Vector256<float>  Shuffle(Vector256<float>  vector, Vector256<int>    indices);

        public static Vector256<long>   Shuffle(Vector256<long>   vector, Vector256<long>   indices);
        public static Vector256<ulong>  Shuffle(Vector256<ulong>  vector, Vector256<ulong>  indices);
        public static Vector256<double> Shuffle(Vector256<double> vector, Vector256<long>   indices);

        public static Vector256<byte>   Shuffle(Vector256<byte>   lower, Vector256<byte>   upper, Vector256<byte>   indices);
        public static Vector256<sbyte>  Shuffle(Vector256<sbyte>  lower, Vector256<sbyte>  upper, Vector256<sbyte>  indices);

        public static Vector256<short>  Shuffle(Vector256<short>  lower, Vector256<short>  upper, Vector256<short>  indices);
        public static Vector256<ushort> Shuffle(Vector256<ushort> lower, Vector256<ushort> upper, Vector256<ushort> indices);

        public static Vector256<int>    Shuffle(Vector256<int>    lower, Vector256<int>    upper, Vector256<int>    indices);
        public static Vector256<uint>   Shuffle(Vector256<uint>   lower, Vector256<uint>   upper, Vector256<uint>   indices);
        public static Vector256<float>  Shuffle(Vector256<float>  lower, Vector256<float>  upper, Vector256<int>    indices);

        public static Vector256<long>   Shuffle(Vector256<long>   lower, Vector256<long>   upper, Vector256<long>   indices);
        public static Vector256<ulong>  Shuffle(Vector256<ulong>  lower, Vector256<ulong>  upper, Vector256<ulong>  indices);
        public static Vector256<double> Shuffle(Vector256<double> lower, Vector256<double> upper, Vector256<long>   indices);
    }
}
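To make the surface area above concrete, here is a minimal usage sketch of `ExtractMostSignificantBits` and the single-input `Shuffle`. The sample values and the `MaskDemo` wrapper are illustrative only, assuming a runtime where these APIs are available:

```csharp
using System;
using System.Runtime.Intrinsics;

class MaskDemo
{
    static void Main()
    {
        // Elements whose most significant (sign) bit is set contribute a 1
        // to the mask, least-significant bit first: here elements 0, 2, 15.
        Vector128<sbyte> v = Vector128.Create(
            (sbyte)-1, 0, -1, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, -1);

        uint mask = v.ExtractMostSignificantBits();
        Console.WriteLine($"0b{Convert.ToString(mask, 2)}"); // 0b1000000000000101

        // Shuffle selects source elements by index: result[i] = x[indices[i]],
        // so indices (3, 2, 1, 0) reverse the vector.
        Vector128<int> x = Vector128.Create(10, 20, 30, 40);
        Vector128<int> reversed = Vector128.Shuffle(x, Vector128.Create(3, 2, 1, 0));
        Console.WriteLine(reversed); // <40, 30, 20, 10>
    }
}
```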

@JulieLeeMSFT
Member

@tannergooding are there any APIs left to implement from this issue?

@tannergooding
Member Author

tannergooding commented Jun 13, 2022

The three operand shuffle APIs:

        public static Vector256<byte>   Shuffle(Vector256<byte>   lower, Vector256<byte>   upper, Vector256<byte>   indices);
        public static Vector256<sbyte>  Shuffle(Vector256<sbyte>  lower, Vector256<sbyte>  upper, Vector256<sbyte>  indices);

        public static Vector256<short>  Shuffle(Vector256<short>  lower, Vector256<short>  upper, Vector256<short>  indices);
        public static Vector256<ushort> Shuffle(Vector256<ushort> lower, Vector256<ushort> upper, Vector256<ushort> indices);

        public static Vector256<int>    Shuffle(Vector256<int>    lower, Vector256<int>    upper, Vector256<int>    indices);
        public static Vector256<uint>   Shuffle(Vector256<uint>   lower, Vector256<uint>   upper, Vector256<uint>   indices);
        public static Vector256<float>  Shuffle(Vector256<float>  lower, Vector256<float>  upper, Vector256<int>    indices);

        public static Vector256<long>   Shuffle(Vector256<long>   lower, Vector256<long>   upper, Vector256<long>   indices);
        public static Vector256<ulong>  Shuffle(Vector256<ulong>  lower, Vector256<ulong>  upper, Vector256<ulong>  indices);
        public static Vector256<double> Shuffle(Vector256<double> lower, Vector256<double> upper, Vector256<long>   indices);

I'm still working on them and expect them to be in by code complete. That being said, in the worst case these won't be available in .NET 7 due to time constraints and other work on my plate taking precedence. The two-operand shuffle APIs are already in and cover a large number of the scenarios, so this one being missing won't significantly hurt the feature.
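For reference, the intended semantics of the three-operand form can be sketched in terms of the existing two-operand `Shuffle`. `ShuffleFallback` below is a hypothetical software fallback, not the planned JIT lowering; it relies on the documented rule that out-of-range indices produce zeroed elements:

```csharp
using System.Runtime.Intrinsics;

static class ShuffleSketch
{
    // Hypothetical fallback: indices 0..15 select from 'lower' and
    // 16..31 select from 'upper'; anything else yields zero.
    public static Vector128<byte> ShuffleFallback(
        Vector128<byte> lower, Vector128<byte> upper, Vector128<byte> indices)
    {
        // Indices >= 16 are out of range for 'lower' and produce zero.
        Vector128<byte> fromLower = Vector128.Shuffle(lower, indices);

        // Rebase the upper-half indices into 0..15. Indices that were < 16
        // wrap around to >= 240 and are zeroed by the out-of-range rule.
        Vector128<byte> fromUpper = Vector128.Shuffle(
            upper, indices - Vector128.Create((byte)16));

        // Exactly one of the two lanes is nonzero per element, so OR merges them.
        return fromLower | fromUpper;
    }
}
```

On hardware with a native two-register table lookup (e.g. AdvSimd `TBL` or an AVX-512 full permute), the JIT would be expected to emit a single instruction instead of this two-shuffle sequence.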

@SamMonoRT
Member

cc @fanyang-mono for Mono implementations

@dakersnar
Contributor

Everything is done here except for the shuffle APIs that take two inputs and the index mask. Perf improvements are still needed for the other shuffle APIs.

@tannergooding
Member Author

The three operand APIs missed .NET 8 as well. We did manage to land the AVX-512 full permute instructions and the AdvSimd multi-input table lookup instructions, however. So we should be able to more easily land this support in the future.
