
Import/export for AIEs via vmem API #25

Draft: wants to merge 20 commits into base: iree-aie
Conversation

ypapadop-amd (Collaborator)

This PR allows sharing GPU memory with AIE agents via the vmem API. Sharing via the other alternatives (hsa_amd_memory_lock, hsa_amd_interop_map_buffer, hsa_amd_agents_allow_access) is not allowed and will return an error.
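For context, the vmem path works by reserving a virtual address range, creating a memory handle, mapping the handle into the range, and then granting access per agent (roughly the hsa_amd_vmem_address_reserve / hsa_amd_vmem_handle_create / hsa_amd_vmem_map / hsa_amd_vmem_set_access sequence). A minimal POSIX analogy of that reserve-then-commit shape, not the HSA API itself:

```cpp
#include <cassert>
#include <cstring>
#include <sys/mman.h>

// Reserve a virtual address range with no access rights, analogous to a
// vmem address reserve: the range exists but cannot be touched yet.
void* ReserveVA(size_t size) {
  void* va = mmap(nullptr, size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return va == MAP_FAILED ? nullptr : va;
}

// Commit the range and grant read/write access in place, analogous to the
// vmem map + set-access steps that make the memory usable by an agent.
bool MapAndAllowAccess(void* va, size_t size) {
  return mprotect(va, size, PROT_READ | PROT_WRITE) == 0;
}
```

The HSA calls additionally take the source memory pool and per-agent access descriptors, which (as I understand it) is what lets an AIE agent be named explicitly as an accessor.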

@ypapadop-amd ypapadop-amd self-assigned this Sep 16, 2024
@@ -861,7 +847,6 @@ class Runtime {
MemoryHandle* mem_handle;
AddressHandle* address_handle;
uint64_t offset;
uint64_t mmap_offset;
Collaborator Author
This was not used anywhere; I typed it in accidentally. Removing it so people don't make the same mistake.

@@ -508,19 +507,22 @@ hsa_status_t MemoryRegion::AllowAccess(uint32_t num_agents,

bool cpu_in_list = false;

std::set<GpuAgentInt*> whitelist_gpus;
Collaborator Author

Never used.

@@ -584,25 +585,25 @@ hsa_status_t MemoryRegion::Lock(uint32_t num_agents, const hsa_agent_t* agents,
return HSA_STATUS_SUCCESS;
}

std::set<core::Agent*> whitelist_gpus;
Collaborator Author

Never used.

Collaborator

I think you may want to make a separate change for this and get David's feedback. He's mentioned this code to me in the past, although I have to admit I never bothered to understand how it works. Weird that it'd be unused though.

Collaborator Author

OK, I'll take it to the public ROCR.

Collaborator

David has mentioned to me in the past that we'll need to think about how to add AIEs to these lists, so they should definitely be used somehow. Yeah, get his feedback there.


Yes, it does look like whitelist_gpus are not used anymore and we forgot to remove them at some point. Feel free to remove them.

atgutier: I think you are talking about the RVD filters in amd_topology.c. A user can export an environment variable, e.g. ROCR_VISIBLE_DEVICES=0,2 on a system with, let's say, 3 GPUs, and this will effectively hide the second GPU from users of ROCr (i.e., the device will not be listed when someone uses the iterate_agents APIs).
It does look like whitelist_gpus was used to keep track of visible GPUs at some point and affected when initDma() was called, but this is not necessary anymore.
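As a sketch of the filtering atgutier describes, here is a hypothetical parser for a ROCR_VISIBLE_DEVICES-style value (not the actual amd_topology code; the helper name is made up):

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Parse a comma-separated device list such as "0,2" into the set of
// visible device indices. Any index absent from the set would be hidden
// from agent iteration, mirroring the ROCR_VISIBLE_DEVICES behavior.
std::set<int> ParseVisibleDevices(const std::string& value) {
  std::set<int> visible;
  std::stringstream ss(value);
  std::string token;
  while (std::getline(ss, token, ',')) {
    if (!token.empty()) visible.insert(std::stoi(token));
  }
  return visible;
}
```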

Comment on lines +3205 to +3214
// For now, this is only supported for KFD due to the call to
// GetAmdgpuDeviceArgs
if (agent->device_type() != core::Agent::DeviceType::kAmdGpuDevice)
return HSA_STATUS_ERROR_INVALID_AGENT;

// Create handle by exporting and importing the memory from the owning agent
hsa_status_t status =
agent->ExportDMABuf(memoryHandleIt->first, size, &dmabuf_fd, &offset);
if (status != HSA_STATUS_SUCCESS)
return status;
@ypapadop-amd (Collaborator Author) Sep 16, 2024

I think this is effectively a create-shareable-handle procedure.
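The export/import flow in this hunk behaves like ordinary fd-based handle sharing: once the importer holds its own reference, the exporter's descriptor can be dropped. A minimal illustration of that lifecycle using plain pipe fds in place of a dma-buf fd (ImportHandle is a made-up name, not the runtime's API):

```cpp
#include <cassert>
#include <unistd.h>

// The exporter hands out a file descriptor; the importer dups it into its
// own reference, after which the exporter's fd can be closed without
// invalidating the import. dma-buf fds follow the same ownership rules.
int ImportHandle(int exported_fd) {
  return dup(exported_fd);  // importer now holds an independent reference
}
```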

Comment on lines 3251 to 3254
// MAYBE? auto status = agentPermsIt->second.RemoveAccess();
hsa_status_t status = agentPermsIt->second.targetAgent->Unmap(
&(agentPermsIt->second.ldrm_bo), va, mappedHandleIt->second.offset,
size);
Collaborator Author

Checking if this is indeed RemoveAccess (it seems so).

reinterpret_cast<uint64_t>(va), drm_perm(perms), AMDGPU_VA_OP_MAP);
if (ret) return HSA_STATUS_ERROR;
// CPU agents use a different offset
offset = reinterpret_cast<uint64_t>(mappedHandle->drm_cpu_addr);
Collaborator Author

I'm not entirely sure about why this is different.

@atgutier (Collaborator) left a comment

Need to do a deeper review of the core changes, but wanted to request some changes first, particularly regarding the agent API.

runtime/hsa-runtime/core/driver/xdna/amd_xdna_driver.cpp (outdated; resolved)
runtime/hsa-runtime/core/inc/agent.h (outdated; resolved)
runtime/hsa-runtime/core/driver/xdna/amd_xdna_driver.cpp (outdated; resolved)
runtime/hsa-runtime/core/driver/xdna/amd_xdna_driver.cpp (outdated; resolved)
@@ -278,6 +278,35 @@ class Agent : public Checked<0xF6BC25EB17E6F917> {
// @brief Returns an array of regions owned by the agent.
virtual const std::vector<const core::MemoryRegion*>& regions() const = 0;

// @brief Maps the memory associated with the handle.
virtual hsa_status_t Map(void *handle, void *va, size_t offset, size_t size,
Collaborator

These should be members of the Driver as the other OS driver functions are, not the agent. It'd warrant much broader/deeper discussion if we want to expose these in the Agent API.

@ypapadop-amd (Collaborator Author) Sep 17, 2024

There's a tricky part. GpuAgent::ImportDMABuf uses GpuAgent::libDrmDev() which is set by hsaKmtGetAMDGPUDeviceHandle(node_id(), &device_handle);

If we were to move it into the driver, for every import we would be adding a lookup for the driver type (we should probably cache that one) and an ioctl.

The rest of the functions can be moved to the driver, but then we'll have one function in the agent and the rest in the driver. Maybe have the import / export in the agent and the rest in the driver?

We also don't have a CPU driver to put some of these functions.

@atgutier (Collaborator) Sep 17, 2024

Yeah, there are some things we need to think about wrt the driver interface. I think for now I'm more concerned about the interface than lookup overhead. I was actually thinking we should have a reference to the driver object inside the agent. Would that help?

For some of the areas where there already seem to be sidebands thru the Agent, we can rethink those.

@ypapadop-amd (Collaborator Author) Sep 17, 2024

> Yeah, there are some things we need to think about wrt the driver interface. I think for now I'm more concerned about the interface than lookup overhead.

I'd like to avoid unnecessary, obvious overhead though, especially if it's something that will require extensive refactoring.

> I was actually thinking we should have a reference to the driver object inside the agent. Would that help?

I was planning to do the driver lookup at AIEAgent construction and store it as a pointer, since the driver is guaranteed to exist and outlasts the lifetime of the agent.

> For some of the areas where there already seem to be sidebands thru the Agent, we can rethink those.

We can pass the agent in, but then we create a cyclic dependency between agent and driver. Maybe I can dig a little more and see if that libDrm can be expressed as another handle? But we are going past the scope of this work.
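A minimal sketch of the arrangement described here, with assumed class shapes (the real core::Driver and agent interfaces differ): the driver is looked up once at agent construction and cached as a non-owning pointer, so Map/Unmap-style calls forward through it with no per-call lookup.

```cpp
#include <cassert>
#include <string>

// Assumed stand-in for core::Driver; the real interface differs.
class Driver {
 public:
  virtual ~Driver() = default;
  virtual std::string name() const = 0;
};

class XdnaDriver : public Driver {
 public:
  std::string name() const override { return "xdna"; }
};

class AIEAgent {
 public:
  // Driver lookup happens once, at construction.
  explicit AIEAgent(Driver& driver) : driver_(&driver) {}
  // Memory operations would forward through the cached pointer.
  const Driver& driver() const { return *driver_; }

 private:
  Driver* driver_;  // non-owning: the runtime guarantees the driver
                    // outlives every agent bound to it
};
```

A plain pointer (rather than a reference member) keeps the agent assignable while still encoding non-ownership.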

Collaborator Author

> I was actually thinking we should have a reference to the driver object inside the agent. Would that help?

AIEAgents now have a pointer to their driver as of ff00230.

@ypapadop-amd (Collaborator Author)

Test will fail until 0d77583 is merged.

Is there any reason not to sync the repo with ROCR-Runtime:amd-staging and merge in amd-staging to the iree-aie branch?
