Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64 GC: Use SVE when sorting the mark list #108473

Open
a74nh opened this issue Oct 2, 2024 · 5 comments
Open

ARM64 GC: Use SVE when sorting the mark list #108473

a74nh opened this issue Oct 2, 2024 · 5 comments
Labels
arch-arm64 area-GC-coreclr arm-sve Work related to arm64 SVE/SVE2 support
Milestone

Comments

@a74nh
Copy link
Contributor

a74nh commented Oct 2, 2024

Background

  • AIUI, in the coreclr GC, the mark list is an optimisation for post-marking stages where rather than traversing the heap, object by object, it allows the plan phase to skip over objects that are not live (gaps) and just traverse live objects (plugs). To do this, it sorts the object addresses of the live objects.
  • For most targets, introsort::sort is used
  • For AMD64, vxsort is used. This is a modified copy of damageboy/vxsort-cpp
  • vxsort is Quicksort with a vectorised partition routine.
  • Traditionally, the partition is not vectorizable. However, AVX can use masks and the permute instruction to split a vector into two parts.
  • ARM64 SVE has similar concepts and can split a vector using predicates and the compact instruction.

Suggestion

  • Extend the vxsort in coreclr to also work using SVE. I'm not sure how easy this would be to implement.

Alternative Suggestion

  • Highway provides a highly optimised version of Quicksort, which does not rely on the special AVX/SVE instructions and instead can vectorise across all targets. I don't know if use of this would be suitable for coreclr

Reference

Here is an example partition routine implemented in C# SVE. All elements less than the first element get written to left, all other elements get written to right

    public static unsafe void splitArray(ref uint* input, ref uint* left, ref uint* right, int length)
    {
      long i = 0;

      // Position within the output arrays.
      long index_left = 0;
      long index_right = 0;

      // Create a vector filled with the first element from input.
      Vector<uint> firstelemvec = Sve.DuplicateSelectedScalarToVector(Sve.LoadVector(Sve.CreateTrueMaskUInt32(), input), 0);

      // Create a predicate for the loop.
      Vector<uint> ploop = Sve.CreateWhileLessThanMask32Bit(i, length);

      while(Sve.TestAnyTrue(Sve.CreateTrueMaskUInt32(), ploop))
      {
        // Load from the input array based on the loop predicate.
        Vector<uint> data = Sve.LoadVector(ploop, input + i);

        // Find all elements in input array less than the first element.
        Vector<uint> pinner = Sve.ConditionalSelect(ploop, Sve.CompareLessThan(data, firstelemvec), Vector<uint>.Zero);

        // Squash all found elements to the lower lanes of the vector
        Vector<uint> compacted = Sve.Compact(pinner, data);
        
        // Store the squashed elements to the first output array. (This uses the loop predicate, so some additional
        // zeros may be stored)
        Sve.StoreAndZip(ploop, left + index_left, compacted);

        // Increment the position in the first output array by the number of elements found.
        index_left = Sve.SaturatingIncrementByActiveElementCount(index_left, pinner);

        // Do the same for the elements greater than or equal, storing into the second output array.
        pinner = Sve.ConditionalSelect(ploop, Sve.CompareGreaterThanOrEqual(data, firstelemvec), Vector<uint>.Zero);
        compacted = Sve.Compact(pinner, data);
        Sve.StoreAndZip(ploop, right + index_right, compacted);
        index_right = Sve.SaturatingIncrementByActiveElementCount(index_right, pinner);

        // Handle the loop.
        i = Sve.SaturatingIncrementBy32BitElementCount(i, 1);
        ploop = Sve.CreateWhileLessThanMask32Bit(i, length);
      }
    }
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 2, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 2, 2024
@a74nh a74nh added arch-arm64 arm-sve Work related to arm64 SVE/SVE2 support labels Oct 2, 2024
@vcsjones vcsjones added area-GC-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 2, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 2, 2024
@mangod9 mangod9 added this to the 10.0.0 milestone Oct 2, 2024
@cshung
Copy link
Member

cshung commented Oct 2, 2024

@a74nh I spent some time earlier to look into the possibility of speeding up sorting on ARM64. I've got mixed feedback on this one.

On one hand, looking at here, it appears to me that all we needed is to implement these methods for a machine, then we can get vxsort working for it. Optimistically, this is probably the easiest path forward.

On the other hand, speaking with @Maoni0, she told me that she talk with some other experts on ARM64 and told me that it was blocking on some factors (that I can't remember) so that we didn't do it earlier.

@kunalspathak told me that we can't have the 512 bits parallelization as in here, still some vectorization should be good.

I didn't go any further with this - but generally I think this is a good thing to do.

In any case, we can't use C# inside the GC.

Let us know if you are interested in contributing.

@a74nh
Copy link
Contributor Author

a74nh commented Oct 2, 2024

On one hand, looking at here, it appears to me that all we needed is to implement these methods for a machine, then we can get vxsort working for it. Optimistically, this is probably the easiest path forward.

Agreed, that seems a sensible place to start. This assumes that the AVX and SVE algorithms are directly compatible.

On the other hand, speaking with @Maoni0, she told me that she talk with some other experts on ARM64 and told me that it was blocking on some factors (that I can't remember) so that we didn't do it earlier.

This could be the availability of SVE ? It's a fairly new technology, so would have to fall back to the other version on older hardware. Which means we need a check at runtime.

@kunalspathak told me that we can't have the 512 bits parallelization as in here, still some vectorization should be good.

Yes, most machines with SVE are 128bits. But as hardware improves and vector lengths get longer we'll get a speed up for free.

In any case, we can't use C# inside the GC.

Agreed. I only provided it in C# because I already had it available (I'm using it in a blog I'm writing about SVE in C#).

Let us know if you are interested in contributing.

It is possible someone in my team would be able to do this. We'll have to prioritise it against other work.

@kunalspathak
Copy link
Member

Related: #64164

@cshung - Do we know the speed up we get on x64 when trying vxsort vs. introsort. Is there a benchmark that we can use to see if porting vxsort to NEON will benefit?

https://github.com/dotnet/runtime/blob/87273d64859e8bf5856d176915624e6f92ed3baf/src/coreclr/gc/gc.cpp#L10485-L10503

On the other hand, speaking with @Maoni0, she told me that she talk with some other experts on ARM64 and told me that it was blocking on some factors (that I can't remember) so that we didn't do it earlier.

Will be good to know that.

@cshung
Copy link
Member

cshung commented Nov 8, 2024

Related: #64164

@cshung - Do we know the speed up we get on x64 when trying vxsort vs. introsort. Is there a benchmark that we can use to see if porting vxsort to NEON will benefit?

My earlier PR #98712 to enable VxSort on Linux contained some data that could be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm64 area-GC-coreclr arm-sve Work related to arm64 SVE/SVE2 support
Projects
None yet
Development

No branches or pull requests

5 participants