-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Import/export for AIEs via vmem API #25
base: iree-aie
Are you sure you want to change the base?
Conversation
@@ -861,7 +847,6 @@ class Runtime { | |||
MemoryHandle* mem_handle; | |||
AddressHandle* address_handle; | |||
uint64_t offset; | |||
uint64_t mmap_offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was not used anywhere and I accidentally typed it in. Removing it so people don't do the same mistake.
@@ -508,19 +507,22 @@ hsa_status_t MemoryRegion::AllowAccess(uint32_t num_agents, | |||
|
|||
bool cpu_in_list = false; | |||
|
|||
std::set<GpuAgentInt*> whitelist_gpus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never used.
@@ -584,25 +585,25 @@ hsa_status_t MemoryRegion::Lock(uint32_t num_agents, const hsa_agent_t* agents, | |||
return HSA_STATUS_SUCCESS; | |||
} | |||
|
|||
std::set<core::Agent*> whitelist_gpus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may want to make a separate change for this and get David's feedback. He's mentioned this code to me in the past, although I have to admit I never bothered to understand how it works. Weird that it'd be unused though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll take it to the public ROCR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
David has mentioned to me in the past that we'll need to think about how to add AIEs to these lists. So they should definitely be used somehow. So, yeah get his feedback there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does look like whitelist_gpus are not used anymore and we forgot to remove them at some point. Feel free to remove them.
atgutier: I think you are talking about the rvd filers in amd_topology.c. A user can export an environment variable: ROCR_VISIBLE_DEVICES=0,2 on a system with lets say 3 GPUs, and this will effectively hide the second GPU to the users of ROCr (i.e the device will not be listed when someone uses the iterate_agents APIs).
It does look like the whitelist_gpus was used to keep track of visible gpus at some point and affects when initDma() was called, but this is not necessary anymore.
// For now, this is only supported for KFD due to the call to | ||
// GetAmdgpuDeviceArgs | ||
if (agent->device_type() != core::Agent::DeviceType::kAmdGpuDevice) | ||
return HSA_STATUS_ERROR_INVALID_AGENT; | ||
|
||
// Create handle by exporting and importing the memory from the owning agent | ||
hsa_status_t status = | ||
agent->ExportDMABuf(memoryHandleIt->first, size, &dmabuf_fd, &offset); | ||
if (status != HSA_STATUS_SUCCESS) | ||
return status; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this is a create shareable handle procedure.
// MAYBE? auto status = agentPermsIt->second.RemoveAccess(); | ||
hsa_status_t status = agentPermsIt->second.targetAgent->Unmap( | ||
&(agentPermsIt->second.ldrm_bo), va, mappedHandleIt->second.offset, | ||
size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking if this is indeed RemoveAccess (it seems so).
reinterpret_cast<uint64_t>(va), drm_perm(perms), AMDGPU_VA_OP_MAP); | ||
if (ret) return HSA_STATUS_ERROR; | ||
// CPU agents use a different offset | ||
offset = reinterpret_cast<uint64_t>(mappedHandle->drm_cpu_addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure about why this is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to do a deeper review of the core changes, but wanted to request some changes, particular regarding the agent API first.
runtime/hsa-runtime/core/inc/agent.h
Outdated
@@ -278,6 +278,35 @@ class Agent : public Checked<0xF6BC25EB17E6F917> { | |||
// @brief Returns an array of regions owned by the agent. | |||
virtual const std::vector<const core::MemoryRegion*>& regions() const = 0; | |||
|
|||
// @brief Maps the memory associated with the handle. | |||
virtual hsa_status_t Map(void *handle, void *va, size_t offset, size_t size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be members of the Driver as the other OS driver functions are, not the agent. It'd warrant much broader/deeper discussion if we want to expose these in the Agent API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a tricky part. GpuAgent::ImportDMABuf uses GpuAgent::libDrmDev() which is set by hsaKmtGetAMDGPUDeviceHandle(node_id(), &device_handle);
If we were to move it into the driver, for every import we would be adding a lookup for the driver type (we should probably cache that one) and an ioctl.
The rest of the functions can be moved to the driver, but then we'll have one function in the agent and the rest in the driver. Maybe have the import / export in the agent and the rest in the driver?
We also don't have a CPU driver to put some of these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there are some things we need to think about wrt the driver interface. I think for now I'm more concerned about the interface than lookup overhead. I was actually thinking we should have a reference to the driver object inside the agent. Would that help?
For some of the areas where there already seem to be sidebands thru the Agent, we can rethink those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there are some things we need to think about wrt the driver interface. I think for now I'm more concerned about the interface that lookup overhead.
I'd like to avoid unnecessary obvious overhead though, esp. if it's something that will require extensive refactoring.
I was actually thinking we should have a reference to the driver object inside the agent. Would that help?
I was planning to do the driver lookup at AIEAgent construction and store it as a pointer since the driver is guaranteed to exist and outlasts the lifetime of the agent.
For some of the areas where there already seem to be sidebands thru the Agent, we can rethink those.
We can pass the agent in, but then we create a cyclic dependency between agent and driver. Maybe I can dig a little bit more and see if that libDrm can be expressed as another handle? But we are going past the scope of this work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually thinking we should have a reference to the driver object inside the agent. Would that help?
AIEAgents have now a pointer to their driver in ff00230
@@ -584,25 +585,25 @@ hsa_status_t MemoryRegion::Lock(uint32_t num_agents, const hsa_agent_t* agents, | |||
return HSA_STATUS_SUCCESS; | |||
} | |||
|
|||
std::set<core::Agent*> whitelist_gpus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may want to make a separate change for this and get David's feedback. He's mentioned this code to me in the past, although I have to admit I never bothered to understand how it works. Weird that it'd be unused though.
Test will fail until 0d77583 is merged. Is there any reason not to sync the repo with ROCR-Runtime:amd-staging and merge in amd-staging to the iree-aie branch? |
This PR allows sharing GPU memory with AIE agents via the vmem API. Sharing via other alternatives (hsa_amd_memory_lock, hsa_amd_interop_map_buffer, hsa_amd_agents_allow_access) is not allowed and will throw an error.