
Expose new CUDA APIs for sharing the same memory between processes #1235

Open · wants to merge 10 commits into base: master

Conversation

Deneas (Author) commented May 27, 2024

Hello
This pull request is the result of my experiments with CUDA IPC and exposes the CUDA IPC API to allow sharing the same memory between different processes. Another part would be the Event API, but as far as I can tell, you already have that one exposed.

The biggest currently open task is memory management of the buffer obtained via IPC. My current plan is to create a class CudaIpcMemoryBuffer mirroring CudaMemoryBuffer. The main difference I have found so far is in disposal: the IPC buffer should use the new method CloseIpcMemHandle.

Also, the way I see it, the actual exchange of the IPC handle between different processes should be out of scope for this project.
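To make the planned disposal difference concrete, here is a rough sketch; the member names (DisposeAcceleratorObject, CurrentAPI, NativePtr) follow my reading of the existing CudaMemoryBuffer and may not match the final code:

// Rough sketch only; error handling omitted.
protected override void DisposeAcceleratorObject(bool disposing)
{
    // CudaMemoryBuffer frees its own allocation here (cuMemFree). An IPC
    // buffer does not own the allocation, so it only unmaps its view of the
    // exporting process' memory via the new CloseIpcMemHandle binding.
    CudaAPI.CurrentAPI.CloseIpcMemHandle(NativePtr);
    NativePtr = IntPtr.Zero;
}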

Feel free to give feedback and to ask questions.

@Deneas Deneas changed the title Expose new CUDA APIs for sharing the same memory between processes WIP: Expose new CUDA APIs for sharing the same memory between processes May 27, 2024
@Deneas Deneas marked this pull request as draft May 27, 2024 19:42
@Deneas Deneas marked this pull request as ready for review May 27, 2024 19:46
Deneas (Author) commented May 27, 2024

Sorry for switching the draft status back and forth; I'm still getting used to this.
As for the pull request: the actual functionality for exposing and using memory through IPC is already done.
What remains is how we expose it for end use.
For now I created a method MapRaw on the CudaAccelerator, based on the similar AllocateRaw method.
This method is not strictly necessary; instead, users could just create the CudaIpcMemoryBuffer themselves.
Lastly, I'm currently exploring how to write the tests for this new feature.
Unfortunately, the IPC handles explicitly do not work within the same process, so for testing we would actually need two processes communicating with each other.
An easy solution would be a separate new console project that then is started by the tests, but I'm looking whether there are some more "elegant" solutions possible.
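One possible shape for such a test, assuming a hypothetical helper console project that prints its IPC handle as base64 on stdout (xunit-style sketch, not final code):

using System;
using System.Diagnostics;
using Xunit;

public class CudaIpcTests
{
    [Fact]
    public void CanMapMemoryExportedByAnotherProcess()
    {
        // "ILGPU.Tests.Cuda.IpcExporter" is a hypothetical helper executable
        // that allocates a buffer and writes its IPC handle to stdout.
        using var exporter = Process.Start(new ProcessStartInfo
        {
            FileName = "ILGPU.Tests.Cuda.IpcExporter",
            RedirectStandardOutput = true,
        })!;

        var handleBytes = Convert.FromBase64String(
            exporter.StandardOutput.ReadLine()!);

        // ...map handleBytes via the new IPC API and assert on the contents...

        exporter.Kill();
    }
}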

@MoFtZ (Collaborator) left a comment

Thanks for this PR @Deneas.

I've made some comments inline already.

In addition, this PR handles consuming an IPC buffer from another process, but not the "Get IPC Mem Handle" functionality. That would mean ILGPU could not share its buffer with another process. Is that correct?

With regards to unit testing, I'm not sure if that is even possible. Perhaps adding two sample projects, for the Producer and Consumer?

Src/ILGPU/Runtime/Cuda/CudaAPI.xml (outdated; resolved)
/// <remarks>This represents the struct
/// <a href="https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaIpcMemHandle__t.html#structcudaIpcMemHandle__t">cudaIpcMemHandle__t</a>.
/// </remarks>
public struct CudaIpcMemHandle
Collaborator

QUESTION: Is there a reason for having a struct with another struct (i.e. Handle within CudaIpcMemHandle)?

Why not just declare a single struct?

Author

Originally it was that way to mirror cudaIpcMemHandle__t having the array as a field.
Having [InlineArray(64)] directly on CudaIpcMemHandle means it straight up is the array.
Also, honestly speaking, I didn't think to test it that way, as getting arrays to work with LibraryImport instead of DllImport was already a bit of a hassle.
That being said, I tested it that way and it actually worked. Since we don't have any other fields in the struct, we should actually be fine this way.
To ease the usage, since InlineArrays can't be used directly as arrays, I added explicit conversions to and from byte[].
I thought about making them implicit, but decided against it because they should not throw exceptions and generally carry the assumption of being allocation free.
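For reference, a minimal sketch of the shape discussed here, assuming C# 12's InlineArray; the field name and copy loops are only illustrative:

using System;
using System.Runtime.CompilerServices;

// The IPC handle is an opaque 64-byte blob, so the struct itself becomes the
// inline array; conversions to/from byte[] stay explicit because they copy.
[InlineArray(64)]
public struct CudaIpcMemHandle
{
    private byte element;

    public static explicit operator byte[](CudaIpcMemHandle handle)
    {
        var bytes = new byte[64];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = handle[i];
        return bytes;
    }

    public static explicit operator CudaIpcMemHandle(byte[] bytes)
    {
        var handle = new CudaIpcMemHandle();
        for (int i = 0; i < Math.Min(bytes.Length, 64); i++)
            handle[i] = bytes[i];
        return handle;
    }
}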

CudaIpcMemHandle ipcMemHandle,
long length,
int elementSize,
CudaIpcMemFlags flags = CudaIpcMemFlags.None)
Collaborator

I would suggest making the flags mandatory, so that the caller has to provide it. It will primarily be used internally anyway.

Author

Agreed, I made it mandatory. What I'm not sure about is whether we should even add them to the "convenience" methods. The only flag currently defined, LazyEnablePeerAccess, enables peer access if the underlying memory is allocated on a different device than the one used by the current context.
Obviously I'm quite new to this library, but my gut says this would be a rather niche case not worth complicating the API for.

Author

Well, I spoke too soon. Without LazyEnablePeerAccess, IPC won't work for me even with the same device. I strongly suspect this relates to my SLI setup with two GPUs, but I was unable to find a definitive answer.
Unsurprisingly, I'm now strongly in favor of exposing the flags. I also added my findings to the remarks.
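For context, the flags enum itself is tiny; it mirrors the driver API's CUipcMem_flags, which currently defines a single value (sketch of the shape used in this PR):

using System;

[Flags]
public enum CudaIpcMemFlags
{
    None = 0,

    // Mirrors CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS (0x1): enable peer access
    // lazily when the memory was allocated on a device other than the one
    // backing the importing context.
    LazyEnablePeerAccess = 1,
}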

@@ -0,0 +1,173 @@
// ---------------------------------------------------------------------------------------
// ILGPU
// Copyright (c) 2017-2023 ILGPU Project
Collaborator

ISSUE: Need to fix up the copyright header. Should be 2024 only.

Author

Done, I also assumed you probably wanted the same for all other files I touched.

namespace ILGPU.Runtime.Cuda
{
/// <summary>
/// Represents an unmanaged Cuda buffer.
Collaborator

ISSUE: Please tidy up the comments in this file. e.g. "This represents a Cuda IPC buffer"

Author

I made some changes, how does it look now? I left comments on code that is identical to CudaMemoryBuffer alone, as it was the original template for the class.

/// <param name="length">The number of elements to allocate.</param>
/// <param name="elementSize">The size of a single element in bytes.</param>
/// <returns>A mapped buffer of shared memory on this accelerator.</returns>
public CudaIpcMemoryBuffer MapRaw(CudaIpcMemHandle ipcMemHandle, long length, int elementSize)
Collaborator

I'm thinking this method signature should be something like:
public MemoryBuffer MapFromIpcMemHandle(CudaIpcMemHandle ipcMemHandle, long length, int elementSize).

We do not need to expose the underlying CudaIpcMemoryBuffer.

And we should also have another method like:
public MemoryBuffer1D<T, Stride1D.Dense> MapFromIpcMemHandle1D<T>(CudaIpcMemHandle ipcMemHandle, long length).

This is the convenience method that will make it easier to use. Not sure if we need the 2D or 3D variants.

I'm also wondering if we should be using the Map prefix, or if it should be an 'Allocate' prefix, so that it is discoverable with the rest of the memory allocation functions when using auto-complete. e.g. AllocateFromIpcMemHandle.

Author

Agree on returning MemoryBuffer and naming the method MapFromIpcMemHandle.
My original idea with MapRaw was to mirror Accelerator.AllocateRaw, but MapFrom works as well.
I also did think about using Allocate, but decided against it, because it is exactly what we're not doing. The memory was already allocated by a different process, what we're creating is more like a View. Instead I went with Map because that is also what they use in the documentation ("Maps memory exported from another process").

As for having an API similar to Allocate1D(), e.g. Map1D, that would be something I could do. As far as I understand the code, I most likely can just replace AllocateRaw with its IPC equivalent and thread the CudaIpcMemHandle through the methods. The methods should then live in CudaAccelerator as opposed to Accelerator.
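A rough sketch of how such a convenience method on CudaAccelerator could look; WrapAsMemoryBuffer1D is a hypothetical placeholder for whatever Allocate1D<T> uses internally to wrap a raw buffer:

// Sketch only; the wrapping helper is hypothetical.
public MemoryBuffer1D<T, Stride1D.Dense> MapFromIpcMemHandle1D<T>(
    CudaIpcMemHandle ipcMemHandle,
    long length,
    CudaIpcMemFlags flags)
    where T : unmanaged
{
    var rawBuffer = MapFromIpcMemHandle(
        ipcMemHandle, length, Interop.SizeOf<T>(), flags);
    return WrapAsMemoryBuffer1D<T>(rawBuffer, length);
}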

/// Enables peer access lazily.
/// </summary>
/// <remarks>
/// From <a href="https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g60f28a5142ee7ae0336dfa83fd54e006">
Collaborator

ISSUE: I'm guessing the GUID will change with each Cuda release, so I would prefer not to include the link.

Author

Understandable. I removed all of the remarks as well, since they seemed too vague without the links.

throw new ArgumentOutOfRangeException(nameof(elementSize));

Bind();
return new CudaIpcMemoryBuffer(this, ipcMemHandle, length, elementSize);
Collaborator

QUESTION: Should there be a flags parameter in the method signature, which then gets passed here?

Author

Sorry, I went from top to bottom and only saw this later. Feel free to reply in either thread.

What I'm not sure about is whether we should even add them to the "convenience" methods. The only flag currently defined, LazyEnablePeerAccess, enables peer access if the underlying memory is allocated on a different device than the one used by the current context.
Obviously I'm quite new to this library, but my gut says this would be a rather niche case not worth complicating the API for.

Deneas (Author) commented May 29, 2024

In addition, this PR handles consuming an IPC buffer from another process, but not the "Get IPC Mem Handle" functionality. That would mean ILGPU could not share it's buffer with another process. Is that correct?

No, I just missed creating a convenience method for it in CudaAccelerator. The functionality is mapped in the CudaAPI and available as GetIpcMemoryHandle.
I'm currently debating whether to call the convenience method simply GetIpcMemoryHandle or ExportMemoryForOtherProcesses; which sounds better to you?

With regards to unit testing, I'm not sure if that is even possible. Perhaps adding two sample projects, for the Producer and Consumer?

Yeah, while I think it would be doable as a test with a new project, adding a new sample seems the more idiomatic solution for ILGPU.
I saw that CUDA has an example called SimpleIPC. I'll take that as a starting point and see where it takes me.
If I managed to make it work between .NET and Python, then making it work between two .NET processes shouldn't be difficult 😄

Deneas (Author) commented Jun 1, 2024

Alright, with the added sample and changes, this PR is feature-complete. All that remains is to settle naming and the ergonomics of the helper methods.

A quick summary of my new work:

  • I added the sample CudaIPC with two projects showing both exporting memory and mapping said memory (roughly sketched below).
  • I added GetIpcEventHandle and OpenIpcEventHandle for completeness (note: I intentionally did not add any convenience methods, as regular events didn't seem to have them either).
  • I removed the usage of InlineArray for the IPC handles and switched to using raw arrays with byte*, because otherwise users might be forced to use C# 12.
  • I extended CudaDevice with HasIpcSupport, since there is a corresponding device attribute we can query.
  • I added the convenience methods GetIpcMemoryHandle and GetIpcMemoryHandle<TView> for MemoryBuffer and MemoryBuffer<TView>, respectively.
  • I renamed MapRaw to MapFromIpcMemHandle and exposed the flags as a mandatory parameter.

As before, feel free to ask questions and give feedback.
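Roughly, the two sample processes boil down to the following condensed sketch; the handle (de)serialization helpers and the stdout/stdin transport are just one possible choice, and the exact receivers of the new methods may differ from the final API:

using System;
using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

// Exporting process: allocate memory and publish its IPC handle.
using var exportContext = Context.CreateDefault();
using var exportAccelerator = exportContext.CreateCudaAccelerator(0);
using var buffer = exportAccelerator.Allocate1D<int>(1024);
CudaIpcMemHandle handle = buffer.GetIpcMemoryHandle();
Console.WriteLine(Convert.ToBase64String(handle.ToByteArray()));   // hypothetical helper
Console.ReadLine();   // keep the allocation alive until the consumer is done

// Mapping process: open the exported handle instead of allocating new memory.
using var mapContext = Context.CreateDefault();
using var mapAccelerator = mapContext.CreateCudaAccelerator(0);
var ipcHandle = new CudaIpcMemHandle(                               // hypothetical ctor shape
    Convert.FromBase64String(Console.ReadLine()!));
using var mapped = mapAccelerator.MapFromIpcMemHandle(
    ipcHandle, length: 1024, elementSize: sizeof(int),
    CudaIpcMemFlags.LazyEnablePeerAccess);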

@Deneas Deneas changed the title WIP: Expose new CUDA APIs for sharing the same memory between processes Expose new CUDA APIs for sharing the same memory between processes Jun 1, 2024
Instead of allocating and freeing memory, it uses OpenIpcMemHandle and CloseIpcMemHandle.
This mimics AllocateRaw for IPC shared memory with MapRaw.
The newly added constructor for CudaIpcMemHandle allows users to avoid dealing with the quirks of inline array.
- Use the newest version of cuIpcOpenMemHandle.
- Apply InlineArray to CudaIpcMemHandle directly, as we don't have any other fields or properties.
- Makes the flags parameter mandatory for CudaIpcMemoryBuffer.
- Adjust header format of all touched files to use the current year only.
- Adjusted doc comments for CudaIpcMemoryBuffer when behavior differs from CudaMemoryBuffer.
InlineArrays complicated array handling and forced users onto newer C# versions. Now we use regular arrays and pointers for interop.
m4rs-mt (Owner) commented Jul 18, 2024

Amazing work, thank you very much for your efforts. I'm pinging @MoFtZ to take a look and get this over the line.

@m4rs-mt (Owner) left a comment

Thank you for this PR, it looks mostly good to me! I was wondering about the class CudaIpcMemoryBuffer, which does not inherit from CudaMemoryBuffer. This can lead to confusion and potential issues for people testing for Cuda-enabled buffers via an is check.
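For illustration, the kind of user code this concerns:

// User code like this would silently skip IPC-backed buffers if
// CudaIpcMemoryBuffer does not derive from CudaMemoryBuffer.
if (buffer is CudaMemoryBuffer cudaBuffer)
{
    // Cuda-specific handling...
}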

This allows introspection, e.g. checking whether a memory handle was already used.
Deneas (Author) commented Oct 9, 2024

Yeah, I agree it could be a bit confusing, but I felt it would be wrong for CudaIpcMemoryBuffer to inherit from CudaMemoryBuffer, because both the constructor and the disposal logic differ completely.
I could unseal CudaMemoryBuffer and make Dispose virtual, but it would be rather unintuitive for people to inherit from it and call neither the base constructor nor the base Dispose.
Alternatively, we could create a CudaMemoryBufferBase that both buffers inherit from; ideally it would just be called CudaMemoryBuffer, but that would break existing code.
Or, as a final alternative, I could merge CudaIpcMemoryBuffer into CudaMemoryBuffer. There would be a new constructor, an extended Dispose, and probably a property to check whether the buffer is IPC-backed or not.
Which one would you prefer?
