[cuda] Use command segments in graph command buffer impl #14524

Closed
antiagainst wants to merge 2 commits into main from cuda2-command-segments

Conversation

antiagainst (Contributor) commented Jul 31, 2023

This commit switches the graph command buffer to use command segments to handle recording. This is a preliminary step for better dependency analysis and handling. The reasons are:

In a CUDA graph, buffer management and kernel launches are represented as graph nodes, and dependencies are represented by graph edges. IREE's HAL follows the Vulkan command buffer recording model, which "linearizes" the original graph, so we have a mismatch here. Implementing IREE's HAL using CUDA graphs would require rediscovering the graph node dependencies from the linear chain of command buffer commands; sometimes that means looking at both previous and later commands.

For these reasons, it's beneficial to have a complete view of the full command buffer and extra flexibility during recording, in order to fix up past commands or inspect past/future commands.

Note that this does not yet improve descriptor and push constant handling, and it does not really change node serialization; those are addressed in future commits.
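To make the mismatch concrete, here is a minimal driver-API sketch (illustrative only, not code from this PR; error handling and kernel parameter setup omitted): a CUDA graph takes its dependencies as explicit edges at node-creation time, whereas a Vulkan-style command buffer records a flat sequence whose dependencies are implied by barriers between commands.

```c
#include <cuda.h>

// Illustrative sketch only: build a two-node graph with an explicit A -> B edge.
void build_tiny_graph(CUDA_KERNEL_NODE_PARAMS* params_a,
                      CUDA_KERNEL_NODE_PARAMS* params_b) {
  CUgraph graph;
  cuGraphCreate(&graph, /*flags=*/0);

  // In a CUDA graph, dependencies are explicit edges supplied when each node
  // is created.
  CUgraphNode node_a, node_b;
  cuGraphAddKernelNode(&node_a, graph, /*dependencies=*/NULL,
                       /*numDependencies=*/0, params_a);
  cuGraphAddKernelNode(&node_b, graph, /*dependencies=*/&node_a,
                       /*numDependencies=*/1, params_b);

  // A Vulkan-style HAL command buffer instead records a flat sequence:
  //   dispatch A; execution barrier; dispatch B;
  // so the A -> B edge has to be rediscovered from the barrier while (or
  // after) recording the linear command stream.
}
```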

Progress towards #13245

@antiagainst antiagainst added the hal/cuda Runtime CUDA HAL backend label Jul 31, 2023
@antiagainst antiagainst self-assigned this Jul 31, 2023
Base automatically changed from cuda2-fix-symbol to main July 31, 2023 16:20
Comment on lines +42 to +50
// Therefore, to implement IREE HAL command buffers using CUDA graphs, we
// perform two steps using a linked list of command segments. First we create
// segments (iree_hal_cuda2_command_buffer_prepare_*) to keep track of all IREE
// HAL commands and the associated data, and then, when finalizing the command
// buffer, we iterate through all the segments and record their contents
// (iree_hal_cuda2_command_segment_record_*) into a proper CUDA graph command
// buffer. A linked list gives us the flexibility to organize the command
// sequence with low overhead, and deferred recording gives us a complete
// picture of the command buffer when we actually start recording.
Member

How much overhead is this logic adding? Would it be possible to do some of this ahead of time in the compiler and pass extra data through to the executable flatbuffer?

Contributor Author

How much overhead is this logic adding?

It should be minor overhead at command buffer composition time due to the linked list: we need to create the list and scan it once or twice later when emitting the concrete commands. It takes extra allocations and time, but both are O(N), where N is the number of commands. We'd need the flexibility here to handle barriers and such, for example in #14526. And we can go even fancier later.
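As a rough illustration of the prepare/record flow described in the file comment above (the names here are made up, not the exact ones in this patch): the prepare step only appends to a singly linked list, and the record step walks that list once when the command buffer is finalized.

```c
#include <stddef.h>
#include <cuda.h>

// One node per recorded IREE HAL command; the per-command payload captured at
// prepare time is elided here.
typedef struct command_segment_t {
  struct command_segment_t* next;
  int action;  // e.g. DISPATCH, COPY_BUFFER, FILL_BUFFER, ...
} command_segment_t;

typedef struct {
  command_segment_t* head;
  command_segment_t* tail;
} command_segment_list_t;

// Prepare step: O(1) append per HAL command (arena-allocated in practice).
static void command_segment_list_push(command_segment_list_t* list,
                                      command_segment_t* segment) {
  segment->next = NULL;
  if (list->tail) {
    list->tail->next = segment;
  } else {
    list->head = segment;
  }
  list->tail = segment;
}

// Record step: one O(N) walk at finalization, now with a complete view of the
// command buffer, emitting the corresponding CUDA graph nodes.
static void command_segment_list_record(const command_segment_list_t* list,
                                        CUgraph graph) {
  for (command_segment_t* s = list->head; s; s = s->next) {
    (void)graph;
    // switch (s->action) { case DISPATCH: cuGraphAddKernelNode(...); ... }
  }
}
```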

Would it be possible to do some of this ahead of time in the compiler and pass extra data through to the executable flatbuffer?

Not really. Fundamentally this is due to the semantics mismatch between CUDA graphs and the IREE HAL, unless we want to change how the HAL is defined.

experimental/cuda2/graph_command_buffer.c (outdated; conversation resolved)
ScottTodd (Member) left a comment

I'm not too familiar with the APIs being bridged here, but the comments and code seem reasonable to me.

benvanik (Collaborator) left a comment

Hrm, I'm skeptical of the need for this implementation - we shouldn't need to stash the entire command buffer and walk over it but only the "open" nodes that can be chained (the last barrier or the last event signal). The design of the Vulkan/Metal/D3D-style single-pass command buffers is such that this is the case and it's what we use in the HAL for that reason. Needing to do everything in two passes adds non-trivial overhead to the most performance-sensitive host work we do (command buffer recording). In the common case (100% of what the compiler produces today) there's only ever 0 or 1 nodes that can be chained, and in the future it'll be some small handful (O(4)) and not all possible operations in the command buffer (O(1000's)).

In other words, CUDA graphs are a strict superset of the HAL command buffers and should need no additional processing - that we're doing so much additional processing is not great as this is an extremely latency-sensitive part of execution. I strongly suspect (we can measure) that the execution benefits we get from this are outweighed by the overheads involved.
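For contrast, a minimal sketch of the single-pass approach described here (hypothetical names, not existing IREE code): the recorder only carries the current set of "open" nodes, and a barrier collapses them into a single empty node that becomes the dependency of everything recorded afterwards.

```c
#include <stddef.h>
#include <cuda.h>

typedef struct {
  CUgraph graph;
  // In the common case this holds 0 or 1 nodes (the last barrier), never the
  // full command history.
  CUgraphNode open_nodes[8];
  size_t open_node_count;
} single_pass_recorder_t;

// Every newly recorded command depends only on the currently open nodes.
static CUresult recorder_add_kernel(single_pass_recorder_t* r,
                                    const CUDA_KERNEL_NODE_PARAMS* params,
                                    CUgraphNode* out_node) {
  return cuGraphAddKernelNode(out_node, r->graph, r->open_nodes,
                              r->open_node_count, params);
}

// An execution barrier joins the given nodes with an empty node, which then
// becomes the sole open node for subsequent commands.
static CUresult recorder_execution_barrier(single_pass_recorder_t* r,
                                           const CUgraphNode* joined_nodes,
                                           size_t joined_node_count) {
  CUgraphNode barrier;
  CUresult result =
      cuGraphAddEmptyNode(&barrier, r->graph, joined_nodes, joined_node_count);
  if (result != CUDA_SUCCESS) return result;
  r->open_nodes[0] = barrier;
  r->open_node_count = 1;
  return CUDA_SUCCESS;
}
```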

iree_hal_resource_set_freeze(command_buffer->resource_set);

IREE_TRACE_ZONE_END(z0);
return iree_ok_status();
Collaborator

return status;

Contributor

I think that the biggest performance problem will actually be the call to cuGraphInstantiate on every iree_hal_cuda2_graph_command_buffer_end; NVIDIA recommends keeping the executable graph around and updating it with new parameters instead of a destroy/instantiate cycle.
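A hedged sketch of the reuse pattern being suggested (the exact instantiate/update signatures differ between CUDA 11 and 12; the CUDA 12 form of cuGraphExecUpdate and cuGraphInstantiateWithFlags are assumed here, with simplified error handling):

```c
#include <cuda.h>

// Keep the executable graph across submissions; try an in-place parameter
// update first and only fall back to destroy + re-instantiate when the update
// is rejected (e.g. because the graph topology changed).
static CUresult refresh_graph_exec(CUgraph graph, CUgraphExec* exec /*inout*/) {
  if (*exec) {
    CUgraphExecUpdateResultInfo info;
    if (cuGraphExecUpdate(*exec, graph, &info) == CUDA_SUCCESS) {
      return CUDA_SUCCESS;  // Parameters updated in place.
    }
    cuGraphExecDestroy(*exec);
    *exec = NULL;
  }
  return cuGraphInstantiateWithFlags(exec, graph, /*flags=*/0);
}
```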

antiagainst (Contributor, Author)

Coming back to this: given the concerns, I'll drop this for now until I have rationale/data to support it. I'll rebase the patches that depend on this one (#14525, #14526) and land them first.

@antiagainst antiagainst deleted the cuda2-command-segments branch April 16, 2024 18:17