
Additional cross-platform hardware intrinsic APIs for loading/storing, reordering, and extracting a per-element "mask" #63331

Open
Tracked by #79005
tannergooding opened this issue Jan 4, 2022 · 8 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics
Milestone

Comments

@tannergooding
Member

tannergooding commented Jan 4, 2022

Summary

With #49397 we approved and exposed cross platform APIs on Vector64/128/256 to help developers more easily support multiple platforms.

This was done by mirroring the surface area exposed by Vector<T>. However, due to their fixed sizes, there are some additional APIs that would be beneficial to expose. Likewise, there are a few load/store APIs commonly used with hardware intrinsics that would benefit from cross-platform helpers.

The APIs exposed would include the following:

  • ExtractMostSignificantBits
    • On x86/x64 this would be emitted as MoveMask and performs exactly as expected
    • On ARM64, this would be emulated via and, element-wise shift-right, 64-bit pairwise add, and extract. The JIT could optionally detect when the input is the result of a Compare instruction and elide the shift-right.
    • On WASM, this is called bitmask and works identically to MoveMask
    • This API and its emulation are used throughout the BCL
  • Load/Store
    • These are the basic load/store operations already in use for x86, x64, and ARM64
  • LoadAligned/StoreAligned
    • These work exactly like the same-named APIs on x86/x64
    • When optimizations are disabled the alignment is verified
    • When optimizations are enabled, this alignment checking may be skipped due to the load being folded into an instruction on modern hardware
    • This enables efficient usage of the instruction on both older (pre-AVX) hardware as well as newer (post-AVX) or ARM64 hardware (where no load/store aligned instructions exist)
  • LoadAlignedNonTemporal/StoreAlignedNonTemporal
    • These behave like LoadAligned/StoreAligned but may optionally treat the memory access as non-temporal, avoiding cache pollution
  • LoadUnsafe/StoreUnsafe
    • These are "new" APIs; they cover a gap in the API surface that has been encountered and worked around in the BCL, and which is semi-regularly requested by the community
    • The API that just takes a ref T behaves exactly like the version that takes a pointer, just without requiring pinning
    • The API that additionally takes an nuint index behaves like ref Unsafe.Add(ref value, index) and avoids needing to further bloat IL and hinder readability
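
The semantics of ExtractMostSignificantBits can be sketched with a small reference model (Python here for illustration only; the function name, element width parameter, and list-based vector are all hypothetical, not part of the proposed C# surface):

```python
# Hypothetical reference model of ExtractMostSignificantBits over unsigned
# 8-bit elements: bit i of the result is the most significant bit of
# element i of the input vector.
def extract_most_significant_bits(elements, element_bits=8):
    mask = 0
    for i, e in enumerate(elements):
        msb = (e >> (element_bits - 1)) & 1  # take each element's top bit
        mask |= msb << i                     # pack it into result bit i
    return mask

# A compare result (all-ones vs. all-zeros elements) collapses to one bit each:
print(extract_most_significant_bits([0xFF, 0x00, 0xFF, 0x00]))  # 5 (0b0101)
```

This is why the ARM64 emulation can elide the shift-right when the input is a compare result: every bit of each element already equals the element's most significant bit.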

API Proposal

namespace System.Runtime.Intrinsics
{
    public static partial class Vector64
    {
        public static uint ExtractMostSignificantBits<T>(Vector64<T> vector);

        public static Vector64<T> Load<T>(T* address);
        public static Vector64<T> LoadAligned<T>(T* address);
        public static Vector64<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector64<T> LoadUnsafe<T>(ref T address);
        public static Vector64<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector64<T> source);
        public static void StoreAligned<T>(T* address, Vector64<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector64<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector64<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector64<T> source);
    }

    public static partial class Vector128
    {
        public static uint ExtractMostSignificantBits<T>(Vector128<T> vector);

        public static Vector128<T> Load<T>(T* address);
        public static Vector128<T> LoadAligned<T>(T* address);
        public static Vector128<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector128<T> LoadUnsafe<T>(ref T address);
        public static Vector128<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector128<T> source);
        public static void StoreAligned<T>(T* address, Vector128<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector128<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector128<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector128<T> source);
    }

    public static partial class Vector256
    {
        public static uint ExtractMostSignificantBits<T>(Vector256<T> vector);

        public static Vector256<T> Load<T>(T* address);
        public static Vector256<T> LoadAligned<T>(T* address);
        public static Vector256<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector256<T> LoadUnsafe<T>(ref T address);
        public static Vector256<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector256<T> source);
        public static void StoreAligned<T>(T* address, Vector256<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector256<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector256<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector256<T> source);
    }
}

Additional Notes

Ideally we would also expose "shuffle" APIs allowing the elements of a single or multiple vectors to be reordered:

  • On x86/x64 these are referred to as Shuffle or Permute (generally taking two inputs and one input, respectively, though that isn't always the case)
  • On ARM64, these are referred to as VectorTableLookup (only takes two elements)
  • On WASM, these are referred to as Shuffle (takes two elements) and Swizzle (takes one element).
  • On LLVM, these are referred to as VectorShuffle and only take two elements

Due to the complexities of these APIs, they can't trivially be exposed as a single generic API. Likewise, while the behavior for Vector128<T> is consistent on all platforms, Vector64<T> is ARM64-specific and Vector256<T> is x86/x64-specific. The former behaves like Vector128<T>, while the latter generally behaves like 2x Vector128<T> (outside a few APIs called Permute#x#). For consistency, the Vector256<T> APIs exposed here would behave identically to Vector128<T> and allow "cross lane" permutation.

For the single-vector reordering, the APIs are "trivial":

public static Vector128<byte>   Shuffle(Vector128<byte>   vector, Vector128<byte>   indices)
public static Vector128<sbyte>  Shuffle(Vector128<sbyte>  vector, Vector128<sbyte>  indices)

public static Vector128<short>  Shuffle(Vector128<short>  vector, Vector128<short>  indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices)

public static Vector128<int>    Shuffle(Vector128<int>    vector, Vector128<int>    indices)
public static Vector128<uint>   Shuffle(Vector128<uint>   vector, Vector128<uint>   indices)
public static Vector128<float>  Shuffle(Vector128<float>  vector, Vector128<int>    indices)

public static Vector128<long>   Shuffle(Vector128<long>   vector, Vector128<long>   indices)
public static Vector128<ulong>  Shuffle(Vector128<ulong>  vector, Vector128<ulong>  indices)
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long>   indices)
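
Assuming out-of-range indices produce zero (an assumption in this sketch, matching how hardware table lookups such as ARM64 TBL behave, rather than something stated in the proposal), the single-vector semantics can be modeled as:

```python
# Sketch of single-vector Shuffle semantics: result[i] = vector[indices[i]],
# with out-of-range indices assumed to yield a zero element.
def shuffle(vector, indices):
    n = len(vector)
    return [vector[i] if 0 <= i < n else 0 for i in indices]

print(shuffle([10, 20, 30, 40], [3, 2, 1, 0]))  # [40, 30, 20, 10] (reverse)
print(shuffle([10, 20, 30, 40], [0, 9, 1, 9]))  # [10, 0, 20, 0]
```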

For the two-vector reordering, the APIs are generally the same:

public static Vector128<byte>   Shuffle(Vector128<byte>  lower,  Vector128<byte>   upper, Vector128<byte>   indices)
public static Vector128<sbyte>  Shuffle(Vector128<sbyte> lower,  Vector128<sbyte>  upper, Vector128<sbyte>  indices)

public static Vector128<short>  Shuffle(Vector128<short>  lower, Vector128<short>  upper, Vector128<short>  indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices)

public static Vector128<int>    Shuffle(Vector128<int>    lower, Vector128<int>    upper, Vector128<int>    indices)
public static Vector128<uint>   Shuffle(Vector128<uint>   lower, Vector128<uint>   upper, Vector128<uint>   indices)
public static Vector128<float>  Shuffle(Vector128<float>  lower, Vector128<float>  upper, Vector128<int>    indices)

public static Vector128<long>   Shuffle(Vector128<long>   lower, Vector128<long>   upper, Vector128<long>   indices)
public static Vector128<ulong>  Shuffle(Vector128<ulong>  lower, Vector128<ulong>  upper, Vector128<ulong>  indices)
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long>   indices)
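
For the two-vector form, the natural reading is that indices address the concatenation of lower and upper, so index n (the element count of one vector) selects upper's first element. A sketch under that assumption (again with out-of-range indices zeroing, which is this sketch's convention rather than text from the proposal):

```python
# Sketch of two-vector Shuffle semantics: indices address the concatenation
# of lower and upper; out-of-range indices are assumed to produce zero.
def shuffle2(lower, upper, indices):
    combined = lower + upper  # lower occupies indices [0, n), upper [n, 2n)
    n = len(combined)
    return [combined[i] if 0 <= i < n else 0 for i in indices]

print(shuffle2([1, 2, 3, 4], [5, 6, 7, 8], [7, 0, 4, 9]))  # [8, 1, 5, 0]
```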

An upside of these APIs is that, for common scenarios involving constant indices, the generated code can be massively simplified.
A downside is that non-constant indices on older hardware, or certain Vector256<T> shuffles involving byte, sbyte, short, or ushort that cross the 128-bit lane boundary, can take a couple of instructions rather than a single instruction.

This is ultimately no worse than a few other scenarios on each platform where one platform may have slightly better instruction generation due to the instructions it provides.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels Jan 4, 2022
@ghost

ghost commented Jan 4, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.


@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation area-System.Runtime.Intrinsics and removed area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels Jan 4, 2022

@terrajobst
Member

terrajobst commented Jan 4, 2022

Video

  • Looks good as proposed but let's make ExtractMostSignificantBits and StoreXxx extension methods
  • We decided to keep the Shuffle APIs as non-extensions because the two-argument shuffle would be potentially confusing and we didn't want a mix of extensions/non-extensions.
namespace System.Runtime.Intrinsics
{
    public static partial class Vector64
    {
        public static uint ExtractMostSignificantBits<T>(this Vector64<T> vector);

        public static Vector64<T> Load<T>(T* source);
        public static Vector64<T> LoadAligned<T>(T* source);
        public static Vector64<T> LoadAlignedNonTemporal<T>(T* source);
        public static Vector64<T> LoadUnsafe<T>(ref T source);
        public static Vector64<T> LoadUnsafe<T>(ref T source, nuint index);

        public static void Store<T>(this Vector64<T> source, T* destination);
        public static void StoreAligned<T>(this Vector64<T> source, T* destination);
        public static void StoreAlignedNonTemporal<T>(this Vector64<T> source, T* destination);
        public static void StoreUnsafe<T>(this Vector64<T> source, ref T destination);
        public static void StoreUnsafe<T>(this Vector64<T> source, ref T destination, nuint index);

        public static Vector64<byte>   Shuffle(Vector64<byte>   vector, Vector64<byte>   indices);
        public static Vector64<sbyte>  Shuffle(Vector64<sbyte>  vector, Vector64<sbyte>  indices);

        public static Vector64<short>  Shuffle(Vector64<short>  vector, Vector64<short>  indices);
        public static Vector64<ushort> Shuffle(Vector64<ushort> vector, Vector64<ushort> indices);

        public static Vector64<int>    Shuffle(Vector64<int>    vector, Vector64<int>    indices);
        public static Vector64<uint>   Shuffle(Vector64<uint>   vector, Vector64<uint>   indices);
        public static Vector64<float>  Shuffle(Vector64<float>  vector, Vector64<int>    indices);

        public static Vector64<byte>   Shuffle(Vector64<byte>   lower, Vector64<byte>   upper, Vector64<byte>   indices);
        public static Vector64<sbyte>  Shuffle(Vector64<sbyte>  lower, Vector64<sbyte>  upper, Vector64<sbyte>  indices);

        public static Vector64<short>  Shuffle(Vector64<short>  lower, Vector64<short>  upper, Vector64<short>  indices);
        public static Vector64<ushort> Shuffle(Vector64<ushort> lower, Vector64<ushort> upper, Vector64<ushort> indices);

        public static Vector64<int>    Shuffle(Vector64<int>    lower, Vector64<int>    upper, Vector64<int>    indices);
        public static Vector64<uint>   Shuffle(Vector64<uint>   lower, Vector64<uint>   upper, Vector64<uint>   indices);
        public static Vector64<float>  Shuffle(Vector64<float>  lower, Vector64<float>  upper, Vector64<int>    indices);
    }

    public static partial class Vector128
    {
        public static uint ExtractMostSignificantBits<T>(this Vector128<T> vector);

        public static Vector128<T> Load<T>(T* source);
        public static Vector128<T> LoadAligned<T>(T* source);
        public static Vector128<T> LoadAlignedNonTemporal<T>(T* source);
        public static Vector128<T> LoadUnsafe<T>(ref T source);
        public static Vector128<T> LoadUnsafe<T>(ref T source, nuint index);

        public static void Store<T>(this Vector128<T> source, T* destination);
        public static void StoreAligned<T>(this Vector128<T> source, T* destination);
        public static void StoreAlignedNonTemporal<T>(this Vector128<T> source, T* destination);
        public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination);
        public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination, nuint index);

        public static Vector128<byte>   Shuffle(Vector128<byte>   vector, Vector128<byte>   indices);
        public static Vector128<sbyte>  Shuffle(Vector128<sbyte>  vector, Vector128<sbyte>  indices);

        public static Vector128<short>  Shuffle(Vector128<short>  vector, Vector128<short>  indices);
        public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices);

        public static Vector128<int>    Shuffle(Vector128<int>    vector, Vector128<int>    indices);
        public static Vector128<uint>   Shuffle(Vector128<uint>   vector, Vector128<uint>   indices);
        public static Vector128<float>  Shuffle(Vector128<float>  vector, Vector128<int>    indices);

        public static Vector128<long>   Shuffle(Vector128<long>   vector, Vector128<long>   indices);
        public static Vector128<ulong>  Shuffle(Vector128<ulong>  vector, Vector128<ulong>  indices);
        public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long>   indices);

        public static Vector128<byte>   Shuffle(Vector128<byte>   lower, Vector128<byte>   upper, Vector128<byte>   indices);
        public static Vector128<sbyte>  Shuffle(Vector128<sbyte>  lower, Vector128<sbyte>  upper, Vector128<sbyte>  indices);

        public static Vector128<short>  Shuffle(Vector128<short>  lower, Vector128<short>  upper, Vector128<short>  indices);
        public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices);

        public static Vector128<int>    Shuffle(Vector128<int>    lower, Vector128<int>    upper, Vector128<int>    indices);
        public static Vector128<uint>   Shuffle(Vector128<uint>   lower, Vector128<uint>   upper, Vector128<uint>   indices);
        public static Vector128<float>  Shuffle(Vector128<float>  lower, Vector128<float>  upper, Vector128<int>    indices);

        public static Vector128<long>   Shuffle(Vector128<long>   lower, Vector128<long>   upper, Vector128<long>   indices);
        public static Vector128<ulong>  Shuffle(Vector128<ulong>  lower, Vector128<ulong>  upper, Vector128<ulong>  indices);
        public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long>   indices);
    }

    public static partial class Vector256
    {
        public static uint ExtractMostSignificantBits<T>(this Vector256<T> vector);

        public static Vector256<T> Load<T>(T* source);
        public static Vector256<T> LoadAligned<T>(T* source);
        public static Vector256<T> LoadAlignedNonTemporal<T>(T* source);
        public static Vector256<T> LoadUnsafe<T>(ref T source);
        public static Vector256<T> LoadUnsafe<T>(ref T source, nuint index);

        public static void Store<T>(this Vector256<T> source, T* destination);
        public static void StoreAligned<T>(this Vector256<T> source, T* destination);
        public static void StoreAlignedNonTemporal<T>(this Vector256<T> source, T* destination);
        public static void StoreUnsafe<T>(this Vector256<T> source, ref T destination);
        public static void StoreUnsafe<T>(this Vector256<T> source, ref T destination, nuint index);

        public static Vector256<byte>   Shuffle(Vector256<byte>   vector, Vector256<byte>   indices);
        public static Vector256<sbyte>  Shuffle(Vector256<sbyte>  vector, Vector256<sbyte>  indices);

        public static Vector256<short>  Shuffle(Vector256<short>  vector, Vector256<short>  indices);
        public static Vector256<ushort> Shuffle(Vector256<ushort> vector, Vector256<ushort> indices);

        public static Vector256<int>    Shuffle(Vector256<int>    vector, Vector256<int>    indices);
        public static Vector256<uint>   Shuffle(Vector256<uint>   vector, Vector256<uint>   indices);
        public static Vector256<float>  Shuffle(Vector256<float>  vector, Vector256<int>    indices);

        public static Vector256<long>   Shuffle(Vector256<long>   vector, Vector256<long>   indices);
        public static Vector256<ulong>  Shuffle(Vector256<ulong>  vector, Vector256<ulong>  indices);
        public static Vector256<double> Shuffle(Vector256<double> vector, Vector256<long>   indices);

        public static Vector256<byte>   Shuffle(Vector256<byte>   lower, Vector256<byte>   upper, Vector256<byte>   indices);
        public static Vector256<sbyte>  Shuffle(Vector256<sbyte>  lower, Vector256<sbyte>  upper, Vector256<sbyte>  indices);

        public static Vector256<short>  Shuffle(Vector256<short>  lower, Vector256<short>  upper, Vector256<short>  indices);
        public static Vector256<ushort> Shuffle(Vector256<ushort> lower, Vector256<ushort> upper, Vector256<ushort> indices);

        public static Vector256<int>    Shuffle(Vector256<int>    lower, Vector256<int>    upper, Vector256<int>    indices);
        public static Vector256<uint>   Shuffle(Vector256<uint>   lower, Vector256<uint>   upper, Vector256<uint>   indices);
        public static Vector256<float>  Shuffle(Vector256<float>  lower, Vector256<float>  upper, Vector256<int>    indices);

        public static Vector256<long>   Shuffle(Vector256<long>   lower, Vector256<long>   upper, Vector256<long>   indices);
        public static Vector256<ulong>  Shuffle(Vector256<ulong>  lower, Vector256<ulong>  upper, Vector256<ulong>  indices);
        public static Vector256<double> Shuffle(Vector256<double> lower, Vector256<double> upper, Vector256<long>   indices);
    }
}
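To make the surface area above concrete, here is a minimal usage sketch of `ExtractMostSignificantBits` and the single-input `Shuffle`. The sample values and the `MaskDemo` wrapper are illustrative only, assuming a runtime where these APIs are available:

```csharp
using System;
using System.Runtime.Intrinsics;

class MaskDemo
{
    static void Main()
    {
        // Elements whose most significant (sign) bit is set contribute a 1
        // to the mask, least-significant bit first: here elements 0, 2, 15.
        Vector128<sbyte> v = Vector128.Create(
            (sbyte)-1, 0, -1, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, -1);

        uint mask = v.ExtractMostSignificantBits();
        Console.WriteLine($"0b{Convert.ToString(mask, 2)}"); // 0b1000000000000101

        // Shuffle selects source elements by index: result[i] = x[indices[i]],
        // so indices (3, 2, 1, 0) reverse the vector.
        Vector128<int> x = Vector128.Create(10, 20, 30, 40);
        Vector128<int> reversed = Vector128.Shuffle(x, Vector128.Create(3, 2, 1, 0));
        Console.WriteLine(reversed); // <40, 30, 20, 10>
    }
}
```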

@JulieLeeMSFT
Member

@tannergooding are there any APIs left to implement from this issue?

@tannergooding
Member Author

tannergooding commented Jun 13, 2022

The three operand shuffle APIs:

        public static Vector256<byte>   Shuffle(Vector256<byte>   lower, Vector256<byte>   upper, Vector256<byte>   indices);
        public static Vector256<sbyte>  Shuffle(Vector256<sbyte>  lower, Vector256<sbyte>  upper, Vector256<sbyte>  indices);

        public static Vector256<short>  Shuffle(Vector256<short>  lower, Vector256<short>  upper, Vector256<short>  indices);
        public static Vector256<ushort> Shuffle(Vector256<ushort> lower, Vector256<ushort> upper, Vector256<ushort> indices);

        public static Vector256<int>    Shuffle(Vector256<int>    lower, Vector256<int>    upper, Vector256<int>    indices);
        public static Vector256<uint>   Shuffle(Vector256<uint>   lower, Vector256<uint>   upper, Vector256<uint>   indices);
        public static Vector256<float>  Shuffle(Vector256<float>  lower, Vector256<float>  upper, Vector256<int>    indices);

        public static Vector256<long>   Shuffle(Vector256<long>   lower, Vector256<long>   upper, Vector256<long>   indices);
        public static Vector256<ulong>  Shuffle(Vector256<ulong>  lower, Vector256<ulong>  upper, Vector256<ulong>  indices);
        public static Vector256<double> Shuffle(Vector256<double> lower, Vector256<double> upper, Vector256<long>   indices);

I'm still working on them and expect them to be in by code complete. That being said, in the worst case these won't be available in .NET 7 due to time constraints and other work on my plate taking precedence. The two-operand shuffle APIs are already in and cover a large number of the scenarios, so this one being missing won't significantly hurt the feature.
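For reference, the intended semantics of the three-operand form can be sketched in terms of the existing two-operand `Shuffle`. `ShuffleFallback` below is a hypothetical software fallback, not the planned JIT lowering; it relies on the documented rule that out-of-range indices produce zeroed elements:

```csharp
using System.Runtime.Intrinsics;

static class ShuffleSketch
{
    // Hypothetical fallback: indices 0..15 select from 'lower' and
    // 16..31 select from 'upper'; anything else yields zero.
    public static Vector128<byte> ShuffleFallback(
        Vector128<byte> lower, Vector128<byte> upper, Vector128<byte> indices)
    {
        // Indices >= 16 are out of range for 'lower' and produce zero.
        Vector128<byte> fromLower = Vector128.Shuffle(lower, indices);

        // Rebase the upper-half indices into 0..15. Indices that were < 16
        // wrap around to >= 240 and are zeroed by the out-of-range rule.
        Vector128<byte> fromUpper = Vector128.Shuffle(
            upper, indices - Vector128.Create((byte)16));

        // Exactly one of the two lanes is nonzero per element, so OR merges them.
        return fromLower | fromUpper;
    }
}
```

On hardware with a native two-register table lookup (e.g. AdvSimd `TBL` or an AVX-512 full permute), the JIT would be expected to emit a single instruction instead of this two-shuffle sequence.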

@SamMonoRT
Member

cc @fanyang-mono for Mono implementations

@dakersnar
Contributor

Everything is done here except for the shuffle APIs that take two inputs and the index mask. Perf improvements are still needed for the other shuffle APIs.

@tannergooding
Member Author

The three operand APIs missed .NET 8 as well. We did manage to land the AVX-512 full permute instructions and the AdvSimd multi-input table lookup instructions, however. So we should be able to more easily land this support in the future.
