
Expose new CUDA APIs for sharing the same memory between processes #1235

Open · wants to merge 10 commits into base: master

Conversation

Deneas (Author) commented May 27, 2024

Hello
This pull request is the result of my experiments with CUDA IPC and exposes the CUDA IPC API to allow sharing the same memory between different processes. Another part would be the Event API, but as far as I can tell, you already have that one exposed.

The biggest currently open task is memory management of the buffer obtained via IPC. My current plan is to create a class CudaIpcMemoryBuffer mirroring CudaMemoryBuffer. The main difference I have found so far is in disposal: the IPC buffer should use the new method CloseIpcMemHandle.

Also, the way I see it, the actual exchange of the IPC handle between different processes should be out of scope for this project.
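To make the planned disposal difference concrete, here is a rough sketch; the member names (DisposeAcceleratorObject, CurrentAPI, NativePtr) follow my reading of the existing CudaMemoryBuffer and may not match the final code:

// Rough sketch only; error handling omitted.
protected override void DisposeAcceleratorObject(bool disposing)
{
    // CudaMemoryBuffer frees its own allocation here (cuMemFree). An IPC
    // buffer does not own the allocation, so it only unmaps its view of the
    // exporting process' memory via the new CloseIpcMemHandle binding.
    CudaAPI.CurrentAPI.CloseIpcMemHandle(NativePtr);
    NativePtr = IntPtr.Zero;
}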

Feel free to give feedback and to ask questions.

@Deneas Deneas changed the title Expose new CUDA APIs for sharing the same memory between processes WIP: Expose new CUDA APIs for sharing the same memory between processes May 27, 2024
@Deneas Deneas marked this pull request as draft May 27, 2024 19:42
@Deneas Deneas marked this pull request as ready for review May 27, 2024 19:46
Deneas (Author) commented May 27, 2024

Sorry for switching the draft status back and forth; I'm still getting used to this.
As for the pull request: the actual functionality for exposing and using memory through IPC is already done.
What remains is how we expose it for end use.
For now I created a method MapRaw on the CudaAccelerator, based on the similar AllocateRaw method.
This method is not strictly necessary; instead, users could just create the CudaIpcMemoryBuffer themselves.
Lastly, I'm currently exploring how to write the tests for this new feature.
Unfortunately, the IPC handles explicitly do not work within the same process, so for testing we would actually need two processes communicating with each other.
An easy solution would be a separate new console project that then is started by the tests, but I'm looking whether there are some more "elegant" solutions possible.
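One possible shape for such a test, assuming a hypothetical helper console project that prints its IPC handle as base64 on stdout (xunit-style sketch, not final code):

using System;
using System.Diagnostics;
using Xunit;

public class CudaIpcTests
{
    [Fact]
    public void CanMapMemoryExportedByAnotherProcess()
    {
        // "ILGPU.Tests.Cuda.IpcExporter" is a hypothetical helper executable
        // that allocates a buffer and writes its IPC handle to stdout.
        using var exporter = Process.Start(new ProcessStartInfo
        {
            FileName = "ILGPU.Tests.Cuda.IpcExporter",
            RedirectStandardOutput = true,
        })!;

        var handleBytes = Convert.FromBase64String(
            exporter.StandardOutput.ReadLine()!);

        // ...map handleBytes via the new IPC API and assert on the contents...

        exporter.Kill();
    }
}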

@MoFtZ (Collaborator) left a comment

Thanks for this PR @Deneas.

I've made some comments inline already.

In addition, this PR handles consuming an IPC buffer from another process, but not the "Get IPC Mem Handle" functionality. That would mean ILGPU could not share its buffer with another process. Is that correct?

With regards to unit testing, I'm not sure if that is even possible. Perhaps adding two sample projects, for the Producer and Consumer?

Src/ILGPU/Runtime/Cuda/CudaAPI.xml (outdated; resolved)
/// <remarks>This represents the struct
/// <a href="https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaIpcMemHandle__t.html#structcudaIpcMemHandle__t">cudaIpcMemHandle__t</a>.
/// </remarks>
public struct CudaIpcMemHandle
Collaborator

QUESTION: Is there a reason for having a struct with another struct (i.e. Handle within CudaIpcMemHandle)?

Why not just declare a single struct?

Author

Originally it was that way to mirror cudaIpcMemHandle__t having the array as a field.
Having [InlineArray(64)] directly on CudaIpcMemHandle means it straight up is the array.
Also, honestly speaking, I didn't think to test it that way, as getting arrays to work with LibraryImport instead of DllImport was already a bit of a hassle.
That being said, I tested it that way and it actually worked. Since we don't have any other fields in the struct, we should actually be fine this way.
To ease the usage, since InlineArrays can't be used directly as arrays, I added explicit conversions to and from byte[].
I thought about making them implicit, but decided against it because they should not throw exceptions and generally carry the assumption of being allocation free.
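For reference, a minimal sketch of the shape discussed here, assuming C# 12's InlineArray; the field name and copy loops are only illustrative:

using System;
using System.Runtime.CompilerServices;

// The IPC handle is an opaque 64-byte blob, so the struct itself becomes the
// inline array; conversions to/from byte[] stay explicit because they copy.
[InlineArray(64)]
public struct CudaIpcMemHandle
{
    private byte element;

    public static explicit operator byte[](CudaIpcMemHandle handle)
    {
        var bytes = new byte[64];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = handle[i];
        return bytes;
    }

    public static explicit operator CudaIpcMemHandle(byte[] bytes)
    {
        var handle = new CudaIpcMemHandle();
        for (int i = 0; i < Math.Min(bytes.Length, 64); i++)
            handle[i] = bytes[i];
        return handle;
    }
}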

CudaIpcMemHandle ipcMemHandle,
long length,
int elementSize,
CudaIpcMemFlags flags = CudaIpcMemFlags.None)
Collaborator

I would suggest making the flags mandatory, so that the caller has to provide it. It will primarily be used internally anyway.

Author

Agreed, I made it mandatory. What I'm not sure about is whether we should even add them to the "convenience" methods. The only flag currently defined, LazyEnablePeerAccess, enables peer access if the underlying memory is allocated on a different device than the one used by the current context.
Obviously I'm quite new to this library, but my gut says this would be a rather niche case not worth complicating the API for.

Author

Well, I spoke too soon. Without LazyEnablePeerAccess, IPC won't work for me even with the same device. I strongly suspect this relates to my SLI setup with two GPUs, but I was unable to find a definitive answer.
Unsurprisingly, I'm now strongly in favor of exposing the flags. I also added my findings to the remarks.
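For context, the flags enum itself is tiny; it mirrors the driver API's CUipcMem_flags, which currently defines a single value (sketch of the shape used in this PR):

using System;

[Flags]
public enum CudaIpcMemFlags
{
    None = 0,

    // Mirrors CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS (0x1): enable peer access
    // lazily when the memory was allocated on a device other than the one
    // backing the importing context.
    LazyEnablePeerAccess = 1,
}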

@@ -0,0 +1,173 @@
// ---------------------------------------------------------------------------------------
// ILGPU
// Copyright (c) 2017-2023 ILGPU Project
Collaborator

ISSUE: Need to fix up the copyright header. Should be 2024 only.

Author

Done, I also assumed you probably wanted the same for all other files I touched.

namespace ILGPU.Runtime.Cuda
{
/// <summary>
/// Represents an unmanaged Cuda buffer.
Collaborator

ISSUE: Please tidy up the comments in this file. e.g. "This represents a Cuda IPC buffer"

Author

I made some changes, how does it look now? I left comments on code that is identical to CudaMemoryBuffer alone, as it was the original template for the class.

/// <param name="length">The number of elements to allocate.</param>
/// <param name="elementSize">The size of a single element in bytes.</param>
/// <returns>A mapped buffer of shared memory on this accelerator.</returns>
public CudaIpcMemoryBuffer MapRaw(CudaIpcMemHandle ipcMemHandle, long length, int elementSize)
Collaborator

I'm thinking this method signature should be something like:
public MemoryBuffer MapFromIpcMemHandle(CudaIpcMemHandle ipcMemHandle, long length, int elementSize).

We do not need to expose the underlying CudaIpcMemoryBuffer.

And we should also have another method like:
public MemoryBuffer1D<T, Stride1D.Dense> MapFromIpcMemHandle1D<T>(CudaIpcMemHandle ipcMemHandle, long length).

This is the convenience method that will make it easier to use. Not sure if we need the 2D or 3D variants.

I'm also wondering if we should be using the Map prefix, or if it should be an 'Allocate' prefix, so that it is discoverable with the rest of the memory allocation functions when using auto-complete. e.g. AllocateFromIpcMemHandle.

Author

Agree on returning MemoryBuffer and naming the method MapFromIpcMemHandle.
My original idea with MapRaw was to mirror Accelerator.AllocateRaw, but MapFrom works as well.
I also did think about using Allocate, but decided against it, because it is exactly what we're not doing. The memory was already allocated by a different process, what we're creating is more like a View. Instead I went with Map because that is also what they use in the documentation ("Maps memory exported from another process").

As for having an API similar to Allocate1D(), e.g. Map1D, that would be something I could do. As far as I understand the code, I most likely can just replace AllocateRaw with its IPC equivalent and thread the CudaIpcMemHandle through the methods. The methods should then live in CudaAccelerator as opposed to Accelerator.
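A rough sketch of how such a convenience method on CudaAccelerator could look; WrapAsMemoryBuffer1D is a hypothetical placeholder for whatever Allocate1D<T> uses internally to wrap a raw buffer:

// Sketch only; the wrapping helper is hypothetical.
public MemoryBuffer1D<T, Stride1D.Dense> MapFromIpcMemHandle1D<T>(
    CudaIpcMemHandle ipcMemHandle,
    long length,
    CudaIpcMemFlags flags)
    where T : unmanaged
{
    var rawBuffer = MapFromIpcMemHandle(
        ipcMemHandle, length, Interop.SizeOf<T>(), flags);
    return WrapAsMemoryBuffer1D<T>(rawBuffer, length);
}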

/// Enables peer access lazily.
/// </summary>
/// <remarks>
/// From <a href="https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g60f28a5142ee7ae0336dfa83fd54e006">
Collaborator

ISSUE: I'm guessing the GUID will change with each Cuda release, so I would prefer not to include the link.

Author

Understandable. I removed all of the remarks as well, since they seemed too vague without the links.

throw new ArgumentOutOfRangeException(nameof(elementSize));

Bind();
return new CudaIpcMemoryBuffer(this, ipcMemHandle, length, elementSize);
Collaborator

QUESTION: Should there be a flags parameter in the method signature, which then gets passed here?

Author

Sorry, I went from top to bottom and only saw this later. Feel free to reply in either thread.

What I'm not sure about is whether we should even add them to the "convenience" methods. The only flag currently defined, LazyEnablePeerAccess, enables peer access if the underlying memory is allocated on a different device than the one used by the current context.
Obviously I'm quite new to this library, but my gut says this would be a rather niche case not worth complicating the API for.

Deneas (Author) commented May 29, 2024

In addition, this PR handles consuming an IPC buffer from another process, but not the "Get IPC Mem Handle" functionality. That would mean ILGPU could not share it's buffer with another process. Is that correct?

No, I just missed creating a convenience method for it in CudaAccelerator. The functionality is mapped in the CudaAPI and available as GetIpcMemoryHandle.
I'm currently debating whether to call the convenience method simply GetIpcMemoryHandle or ExportMemoryForOtherProcesses; which sounds better to you?

With regards to unit testing, I'm not sure if that is even possible. Perhaps adding two sample projects, for the Producer and Consumer?

Yeah, while I think it would be doable as a test with a new project, adding a new sample seems the more idiomatic solution for ILGPU.
I saw that CUDA has an example called SimpleIPC. I'll take that as a starting point and see where it takes me.
If I managed to make it work between .NET and Python, then making it work between two .NET processes shouldn't be difficult 😄

Deneas (Author) commented Jun 1, 2024

Alright, with the added sample and changes, this PR is feature-complete. All that remains is to settle naming and the ergonomics of the helper methods.

A quick summary of my new work:

  • I added the sample CudaIPC with two projects showing both exporting memory and mapping said memory (roughly sketched below).
  • I added GetIpcEventHandle and OpenIpcEventHandle for completeness (note: I intentionally did not add any convenience methods, as regular events didn't seem to have them either).
  • I removed the usage of InlineArray for the IPC handles and switched to using raw arrays with byte*, because otherwise users might be forced to use C# 12.
  • I extended CudaDevice with HasIpcSupport, since there is a corresponding device attribute we can query.
  • I added the convenience methods GetIpcMemoryHandle and GetIpcMemoryHandle<TView> for MemoryBuffer and MemoryBuffer<TView>, respectively.
  • I renamed MapRaw to MapFromIpcMemHandle and exposed the flags as a mandatory parameter.

As before, feel free to ask questions and give feedback.
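Roughly, the two sample processes boil down to the following condensed sketch; the handle (de)serialization helpers and the stdout/stdin transport are just one possible choice, and the exact receivers of the new methods may differ from the final API:

using System;
using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

// Exporting process: allocate memory and publish its IPC handle.
using var exportContext = Context.CreateDefault();
using var exportAccelerator = exportContext.CreateCudaAccelerator(0);
using var buffer = exportAccelerator.Allocate1D<int>(1024);
CudaIpcMemHandle handle = buffer.GetIpcMemoryHandle();
Console.WriteLine(Convert.ToBase64String(handle.ToByteArray()));   // hypothetical helper
Console.ReadLine();   // keep the allocation alive until the consumer is done

// Mapping process: open the exported handle instead of allocating new memory.
using var mapContext = Context.CreateDefault();
using var mapAccelerator = mapContext.CreateCudaAccelerator(0);
var ipcHandle = new CudaIpcMemHandle(                               // hypothetical ctor shape
    Convert.FromBase64String(Console.ReadLine()!));
using var mapped = mapAccelerator.MapFromIpcMemHandle(
    ipcHandle, length: 1024, elementSize: sizeof(int),
    CudaIpcMemFlags.LazyEnablePeerAccess);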

@Deneas Deneas changed the title WIP: Expose new CUDA APIs for sharing the same memory between processes Expose new CUDA APIs for sharing the same memory between processes Jun 1, 2024
Instead of allocating and freeing memory, it uses OpenIpcMemHandle and CloseIpcMemHandle.
This mimics AllocateRaw for IPC shared memory with MapRaw.
The newly added constructor for CudaIpcMemHandle allows users to avoid dealing with the quirks of inline array.
- Use the newest version of cuIpcOpenMemHandle.
- Apply InlineArray to CudaIpcMemHandle directly, as we don't have any other fields or properties.
- Makes the flags parameter mandatory for CudaIpcMemoryBuffer.
- Adjust header format of all touched files to use the current year only.
- Adjusted doc comments for CudaIpcMemoryBuffer when behavior differs from CudaMemoryBuffer.
InlineArrays complicated array handling and forced users onto newer C# versions. Now we use regular arrays and pointers for interop.
m4rs-mt (Owner) commented Jul 18, 2024

Amazing work, thank you very much for your efforts. I'm pinging @MoFtZ to take a look and get this over the line.

@m4rs-mt (Owner) left a comment

Thank you for this PR, it looks mostly good to me! I was wondering about the class CudaIpcMemoryBuffer, which does not inherit from CudaMemoryBuffer. This can lead to confusion and potential issues for people testing for Cuda-enabled buffers via an is check.
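For illustration, the kind of user code this concerns:

// User code like this would silently skip IPC-backed buffers if
// CudaIpcMemoryBuffer does not derive from CudaMemoryBuffer.
if (buffer is CudaMemoryBuffer cudaBuffer)
{
    // Cuda-specific handling...
}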

This allows introspection, e.g. checking whether a memory handle was already used.
Deneas (Author) commented Oct 9, 2024

Yeah, I agree it could be a bit confusing, but I felt it would be wrong for CudaIpcMemoryBuffer to inherit from CudaMemoryBuffer, because both the constructor and the disposal logic differ completely.
I could unseal CudaMemoryBuffer and make Dispose virtual, but it would be rather unintuitive for people to inherit from it and call neither the base constructor nor the base Dispose.
Alternatively, we could create a CudaMemoryBufferBase that both buffers inherit from; ideally it would just be called CudaMemoryBuffer, but that would break existing code.
Or, as a final alternative, I could merge CudaIpcMemoryBuffer into CudaMemoryBuffer. There would be a new constructor, an extended Dispose, and probably a property to check whether the buffer is IPC-backed or not.
Which one would you prefer?
