[cuda] Use command segments in graph command buffer impl #14524
Conversation
// Therefore, to implement IREE HAL command buffers using CUDA graphs, we
// perform two steps using a linked list of command segments. First we create
// segments (iree_hal_cuda2_command_buffer_prepare_*) to keep track of all IREE
// HAL commands and the associated data, and then, when finalizing the command
// buffer, we iterate through all the segments and record their contents
// (iree_hal_cuda2_command_segment_record_*) into a proper CUDA graph command
// buffer. A linked list gives us the flexibility to organize the command
// sequence with low overhead, and deferred recording gives us a complete
// picture of the command buffer once recording actually starts.
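For illustration, a minimal sketch of the two-step scheme this comment describes; all names, kinds, and payload details are hypothetical rather than the actual iree_hal_cuda2 types:

```c
// Hypothetical sketch of the segment list described above; names and
// payloads are illustrative, not the actual IREE implementation.
typedef enum {
  SEGMENT_DISPATCH,
  SEGMENT_COPY_BUFFER,
  SEGMENT_BARRIER,
} segment_kind_t;

typedef struct segment_t {
  struct segment_t* next;  // Intrusive singly linked list.
  segment_kind_t kind;
  void* payload;  // Per-kind data captured at prepare time.
} segment_t;

typedef struct {
  segment_t* head;
  segment_t* tail;
} segment_list_t;

// Step 1: each prepare_* call appends a segment in O(1) instead of
// emitting a CUDA graph node immediately.
static void segment_list_append(segment_list_t* list, segment_t* segment) {
  segment->next = NULL;
  if (list->tail) {
    list->tail->next = segment;
  } else {
    list->head = segment;
  }
  list->tail = segment;
}

// Step 2: finalization walks the whole list once, with every past and
// future command visible, and records concrete CUDA graph nodes.
static void segment_list_record(const segment_list_t* list) {
  for (const segment_t* s = list->head; s != NULL; s = s->next) {
    switch (s->kind) {
      case SEGMENT_DISPATCH:    /* e.g. cuGraphAddKernelNode(...) */ break;
      case SEGMENT_COPY_BUFFER: /* e.g. cuGraphAddMemcpyNode(...) */ break;
      case SEGMENT_BARRIER:     /* e.g. adjust node dependencies  */ break;
    }
  }
}
```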
How much overhead is this logic adding? Would it be possible to do some of this ahead of time in the compiler and pass extra data through to the executable flatbuffer?
> How much overhead is this logic adding?
It should be minor overhead at command buffer composition time due to the linked list: we need to create the list and then scan it once or twice later when emitting the concrete commands. That takes extra allocations and time, but both are O(N), where N is the number of commands. We need the flexibility here to handle barriers and the like, for example in #14526. And we can go even fancier later.
> Would it be possible to do some of this ahead of time in the compiler and pass extra data through to the executable flatbuffer?
Not really. Fundamentally this is due to the semantics mismatch between CUDA graphs and the IREE HAL, unless we want to change how the HAL is defined.
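To make the mismatch concrete, a minimal sketch assuming the CUDA 11-era driver graph API (not code from this PR): a CUDA graph states dependencies as explicit edges at node creation time, while the HAL's linear recording only implies them through barriers.

```c
#include <cuda.h>

// In a CUDA graph the a->b ordering is an explicit edge supplied up front.
static CUresult build_graph(CUgraph graph,
                            const CUDA_KERNEL_NODE_PARAMS* params_a,
                            const CUDA_KERNEL_NODE_PARAMS* params_b) {
  CUgraphNode a, b;
  CUresult result =
      cuGraphAddKernelNode(&a, graph, /*dependencies=*/NULL, 0, params_a);
  if (result != CUDA_SUCCESS) return result;
  return cuGraphAddKernelNode(&b, graph, /*dependencies=*/&a, 1, params_b);
}

// The Vulkan-style HAL records the same ordering linearly (pseudocode):
//
//   dispatch(a); execution_barrier(); dispatch(b);
//
// Recovering the a->b edge from the barrier means inspecting commands
// recorded before (and sometimes after) it, hence the deferred recording.
```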
Force-pushed from 28974cd to 171c8d0 (Compare)
I'm not too familiar with the APIs being bridged here, but the comments and code seem reasonable to me.
Hrm, I'm skeptical of the need for this implementation - we shouldn't need to stash the entire command buffer and walk over it, only the "open" nodes that can be chained (the last barrier or the last event signal). The design of Vulkan/Metal/D3D-style single-pass command buffers is such that this is the case, and it's what we use in the HAL for that reason. Needing to do everything in two passes adds non-trivial overhead to the most performance-sensitive host work we do (command buffer recording). In the common case (100% of what the compiler produces today) there's only ever 0 or 1 nodes that can be chained, and in the future it'll be some small handful (O(4)), not all possible operations in the command buffer (O(1000's)).
In other words, CUDA graphs are a strict superset of the HAL command buffers and should need no additional processing - that we're doing so much additional processing is not great as this is an extremely latency-sensitive part of execution. I strongly suspect (we can measure) that the execution benefits we get from this are outweighed by the overheads involved.
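For contrast, a minimal sketch of the single-pass approach described here, assuming the CUDA 11-era driver API; the recorder type and the fixed bound on open nodes are illustrative only:

```c
#include <cuda.h>
#include <string.h>

#define MAX_OPEN_NODES 4  // O(small handful), per the comment above.

typedef struct {
  CUgraph graph;
  CUgraphNode open_nodes[MAX_OPEN_NODES];  // Nodes new commands chain after.
  size_t open_node_count;                  // 0 or 1 in today's common case.
  CUgraphNode leaf_nodes[MAX_OPEN_NODES];  // Recorded since the last barrier.
  size_t leaf_node_count;
} single_pass_recorder_t;

// Dispatches are emitted immediately, depending only on the open set; no
// full command list is kept around.
static CUresult record_dispatch(single_pass_recorder_t* r,
                                const CUDA_KERNEL_NODE_PARAMS* params) {
  CUgraphNode node;
  CUresult result = cuGraphAddKernelNode(
      &node, r->graph, r->open_nodes, r->open_node_count, params);
  if (result == CUDA_SUCCESS && r->leaf_node_count < MAX_OPEN_NODES) {
    r->leaf_nodes[r->leaf_node_count++] = node;
  }
  return result;
}

// A full barrier promotes the leaves recorded since the previous barrier
// to the new open set; no walk over past commands is needed.
static void record_barrier(single_pass_recorder_t* r) {
  memcpy(r->open_nodes, r->leaf_nodes,
         r->leaf_node_count * sizeof(CUgraphNode));
  r->open_node_count = r->leaf_node_count;
  r->leaf_node_count = 0;
}
```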
iree_hal_resource_set_freeze(command_buffer->resource_set);

IREE_TRACE_ZONE_END(z0);
return iree_ok_status();
return status;
I think that the biggest performance problem will actually be the call to cuGraphInstantiate on every iree_hal_cuda2_graph_command_buffer_end. NVIDIA recommends keeping executable graphs around and updating them with new parameters instead of a destroy/instantiate cycle.
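A sketch of that recommendation, assuming the CUDA 11-era driver API signatures (CUDA 12 revised cuGraphInstantiate and cuGraphExecUpdate); the caching helper itself is hypothetical:

```c
#include <cuda.h>
#include <stddef.h>

// Reuse a cached CUgraphExec across submissions, updating its parameters
// in place and only re-instantiating when that fails.
static CUresult get_or_update_exec(CUgraphExec* cached_exec, CUgraph graph) {
  if (*cached_exec == NULL) {
    // First use: pay the full instantiation cost once.
    return cuGraphInstantiate(cached_exec, graph, /*phErrorNode=*/NULL,
                              /*logBuffer=*/NULL, /*bufferSize=*/0);
  }
  // Subsequent uses: try the cheap in-place parameter update first.
  CUgraphNode error_node = NULL;
  CUgraphExecUpdateResult update_result;
  if (cuGraphExecUpdate(*cached_exec, graph, &error_node, &update_result) ==
      CUDA_SUCCESS) {
    return CUDA_SUCCESS;
  }
  // Incompatible change (e.g. different topology): fall back to a full
  // destroy/instantiate cycle.
  cuGraphExecDestroy(*cached_exec);
  *cached_exec = NULL;
  return cuGraphInstantiate(cached_exec, graph, /*phErrorNode=*/NULL,
                            /*logBuffer=*/NULL, /*bufferSize=*/0);
}
```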
This commit switches the graph command buffer to use command segments to handle recording. This is a preliminary step for better dependency analysis and handling. The reasons are:

In a CUDA graph, buffer management and kernel launches are represented as graph nodes. Dependencies are represented by graph edges. IREE's HAL follows the Vulkan command buffer recording model, which "linearizes" the original graph. So we have a mismatch here. Implementing IREE's HAL using CUDA graphs requires rediscovering the graph node dependencies from the linear chain of command buffer commands; that sometimes means looking at both previous and next commands.

For these reasons, it's beneficial to have a complete view of the full command buffer and extra flexibility during recording, in order to fix up past commands or inspect past/future commands.

Note that this does not improve the descriptor and push constant handling yet, and it also does not really change the node serialization. Those are addressed in future commits.
Progress towards #13245