ARM64 GC: Use SVE when sorting the mark list #108473

a74nh · 2024-10-02T10:03:30Z

Background

AIUI, in the coreclr GC, the mark list is an optimisation for post-marking stages where rather than traversing the heap, object by object, it allows the plan phase to skip over objects that are not live (gaps) and just traverse live objects (plugs). To do this, it sorts the object addresses of the live objects.
For most targets, introsort::sort is used
For AMD64, vxsort is used. This is a modified copy of damageboy/vxsort-cpp
vxsort is Quicksort with a vectorised partition routine.
Traditionally, the partition is not vectorizable. However, AVX can use masks and the permute instruction to split a vector into two parts.
ARM64 SVE has similar concepts and can split a vector using predicates and the compact instruction.

Suggestion

Extend the vxsort in coreclr to also work using SVE. I'm not sure how easy this would be to implement.

Alternative Suggestion

Highway provides a highly optimised version of Quicksort, which does not rely on the special AVX/SVE instructions and instead can vectorise across all targets. I don't know if use of this would be suitable for coreclr

Reference

Here is an example partition routine implemented in C# SVE. All elements less than the first element get written to left, all other elements get written to right

    public static unsafe void splitArray(ref uint* input, ref uint* left, ref uint* right, int length)
    {
      long i = 0;

      // Position within the output arrays.
      long index_left = 0;
      long index_right = 0;

      // Create a vector filled with the first element from input.
      Vector<uint> firstelemvec = Sve.DuplicateSelectedScalarToVector(Sve.LoadVector(Sve.CreateTrueMaskUInt32(), input), 0);

      // Create a predicate for the loop.
      Vector<uint> ploop = Sve.CreateWhileLessThanMask32Bit(i, length);

      while(Sve.TestAnyTrue(Sve.CreateTrueMaskUInt32(), ploop))
      {
        // Load from the input array based on the loop predicate.
        Vector<uint> data = Sve.LoadVector(ploop, input + i);

        // Find all elements in input array less than the first element.
        Vector<uint> pinner = Sve.ConditionalSelect(ploop, Sve.CompareLessThan(data, firstelemvec), Vector<uint>.Zero);

        // Squash all found elements to the lower lanes of the vector
        Vector<uint> compacted = Sve.Compact(pinner, data);
        
        // Store the squashed elements to the first output array. (This uses the loop predicate, so some additional
        // zeros may be stored)
        Sve.StoreAndZip(ploop, left + index_left, compacted);

        // Increment the position in the first output array by the number of elements found.
        index_left = Sve.SaturatingIncrementByActiveElementCount(index_left, pinner);

        // Do the same for the elements greater than or equal, storing into the second output array.
        pinner = Sve.ConditionalSelect(ploop, Sve.CompareGreaterThanOrEqual(data, firstelemvec), Vector<uint>.Zero);
        compacted = Sve.Compact(pinner, data);
        Sve.StoreAndZip(ploop, right + index_right, compacted);
        index_right = Sve.SaturatingIncrementByActiveElementCount(index_right, pinner);

        // Handle the loop.
        i = Sve.SaturatingIncrementBy32BitElementCount(i, 1);
        ploop = Sve.CreateWhileLessThanMask32Bit(i, length);
      }
    }

The text was updated successfully, but these errors were encountered:

dotnet-policy-service · 2024-10-02T12:34:04Z

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

cshung · 2024-10-02T15:13:27Z

@a74nh I spent some time earlier to look into the possibility of speeding up sorting on ARM64. I've got mixed feedback on this one.

On one hand, looking at here, it appears to me that all we needed is to implement these methods for a machine, then we can get vxsort working for it. Optimistically, this is probably the easiest path forward.

On the other hand, speaking with @Maoni0, she told me that she talk with some other experts on ARM64 and told me that it was blocking on some factors (that I can't remember) so that we didn't do it earlier.

@kunalspathak told me that we can't have the 512 bits parallelization as in here, still some vectorization should be good.

I didn't go any further with this - but generally I think this is a good thing to do.

In any case, we can't use C# inside the GC.

Let us know if you are interested in contributing.

a74nh · 2024-10-02T17:12:27Z

On one hand, looking at here, it appears to me that all we needed is to implement these methods for a machine, then we can get vxsort working for it. Optimistically, this is probably the easiest path forward.

Agreed, that seems a sensible place to start. This assumes that the AVX and SVE algorithms are directly compatible.

On the other hand, speaking with @Maoni0, she told me that she talk with some other experts on ARM64 and told me that it was blocking on some factors (that I can't remember) so that we didn't do it earlier.

This could be the availability of SVE ? It's a fairly new technology, so would have to fall back to the other version on older hardware. Which means we need a check at runtime.

@kunalspathak told me that we can't have the 512 bits parallelization as in here, still some vectorization should be good.

Yes, most machines with SVE are 128bits. But as hardware improves and vector lengths get longer we'll get a speed up for free.

In any case, we can't use C# inside the GC.

Agreed. I only provided it in C# because I already had it available (I'm using it in a blog I'm writing about SVE in C#).

Let us know if you are interested in contributing.

It is possible someone in my team would be able to do this. We'll have to prioritise it against other work.

kunalspathak · 2024-10-09T16:44:49Z

Related: #64164

@cshung - Do we know the speed up we get on x64 when trying vxsort vs. introsort. Is there a benchmark that we can use to see if porting vxsort to NEON will benefit?

https://github.com/dotnet/runtime/blob/87273d64859e8bf5856d176915624e6f92ed3baf/src/coreclr/gc/gc.cpp#L10485-L10503

On the other hand, speaking with @Maoni0, she told me that she talk with some other experts on ARM64 and told me that it was blocking on some factors (that I can't remember) so that we didn't do it earlier.

Will be good to know that.

cshung · 2024-11-08T19:11:23Z

Related: #64164

@cshung - Do we know the speed up we get on x64 when trying vxsort vs. introsort. Is there a benchmark that we can use to see if porting vxsort to NEON will benefit?

My earlier PR #98712 to enable VxSort on Linux contained some data that could be useful.

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 2, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 2, 2024

a74nh added arch-arm64 arm-sve Work related to arm64 SVE/SVE2 support labels Oct 2, 2024

vcsjones added area-GC-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 2, 2024

mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 2, 2024

mangod9 added this to the 10.0.0 milestone Oct 2, 2024

kunalspathak mentioned this issue Nov 8, 2024

Improve Arm64 Performance in .NET 10 #109652

Open

17 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARM64 GC: Use SVE when sorting the mark list #108473

ARM64 GC: Use SVE when sorting the mark list #108473

a74nh commented Oct 2, 2024 •

edited

Loading

dotnet-policy-service bot commented Oct 2, 2024

cshung commented Oct 2, 2024

a74nh commented Oct 2, 2024

kunalspathak commented Oct 9, 2024

cshung commented Nov 8, 2024

ARM64 GC: Use SVE when sorting the mark list #108473

ARM64 GC: Use SVE when sorting the mark list #108473

Comments

a74nh commented Oct 2, 2024 • edited Loading

Background

Suggestion

Alternative Suggestion

Reference

dotnet-policy-service bot commented Oct 2, 2024

cshung commented Oct 2, 2024

a74nh commented Oct 2, 2024

kunalspathak commented Oct 9, 2024

cshung commented Nov 8, 2024

a74nh commented Oct 2, 2024 •

edited

Loading