[cuda] Use command segments in graph command buffer impl #14524
Conversation
// Therefore, to implement IREE HAL command buffers using CUDA graphs, we
// perform two steps using a linked list of command segments. First we create
// segments (iree_hal_cuda2_command_buffer_prepare_*) to keep track of all IREE
// HAL commands and the associated data, and then, when finalizing the command
// buffer, we iterate through all the segments and record their contents
// (iree_hal_cuda2_command_segment_record_*) into a proper CUDA graph command
// buffer. A linked list gives us the flexibility to organize the command
// sequence with low overhead, and deferred recording gives us a complete
// picture of the command buffer once recording actually starts.
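For illustration, a minimal sketch of the two-step scheme this comment describes; all names, kinds, and payload details are hypothetical rather than the actual iree_hal_cuda2 types:

```c
// Hypothetical sketch of the segment list described above; names and
// payloads are illustrative, not the actual IREE implementation.
typedef enum {
  SEGMENT_DISPATCH,
  SEGMENT_COPY_BUFFER,
  SEGMENT_BARRIER,
} segment_kind_t;

typedef struct segment_t {
  struct segment_t* next;  // Intrusive singly linked list.
  segment_kind_t kind;
  void* payload;  // Per-kind data captured at prepare time.
} segment_t;

typedef struct {
  segment_t* head;
  segment_t* tail;
} segment_list_t;

// Step 1: each prepare_* call appends a segment in O(1) instead of
// emitting a CUDA graph node immediately.
static void segment_list_append(segment_list_t* list, segment_t* segment) {
  segment->next = NULL;
  if (list->tail) {
    list->tail->next = segment;
  } else {
    list->head = segment;
  }
  list->tail = segment;
}

// Step 2: finalization walks the whole list once, with every past and
// future command visible, and records concrete CUDA graph nodes.
static void segment_list_record(const segment_list_t* list) {
  for (const segment_t* s = list->head; s != NULL; s = s->next) {
    switch (s->kind) {
      case SEGMENT_DISPATCH:    /* e.g. cuGraphAddKernelNode(...) */ break;
      case SEGMENT_COPY_BUFFER: /* e.g. cuGraphAddMemcpyNode(...) */ break;
      case SEGMENT_BARRIER:     /* e.g. adjust node dependencies  */ break;
    }
  }
}
```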
How much overhead is this logic adding? Would it be possible to do some of this ahead of time in the compiler and pass extra data through to the executable flatbuffer?
> How much overhead is this logic adding?
It should be minor overhead at command buffer composition time due to the linked list: we need to create the list and then scan it once or twice later when emitting the concrete commands. That takes extra allocations and time, but both are O(N), where N is the number of commands. We need the flexibility here to handle barriers and the like, for example in #14526. And we can go even fancier later.
> Would it be possible to do some of this ahead of time in the compiler and pass extra data through to the executable flatbuffer?
Not really. Fundamentally this is due to the semantics mismatch between CUDA graphs and the IREE HAL, unless we want to change how the HAL is defined.
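To make the mismatch concrete, a minimal sketch assuming the CUDA 11-era driver graph API (not code from this PR): a CUDA graph states dependencies as explicit edges at node creation time, while the HAL's linear recording only implies them through barriers.

```c
#include <cuda.h>

// In a CUDA graph the a->b ordering is an explicit edge supplied up front.
static CUresult build_graph(CUgraph graph,
                            const CUDA_KERNEL_NODE_PARAMS* params_a,
                            const CUDA_KERNEL_NODE_PARAMS* params_b) {
  CUgraphNode a, b;
  CUresult result =
      cuGraphAddKernelNode(&a, graph, /*dependencies=*/NULL, 0, params_a);
  if (result != CUDA_SUCCESS) return result;
  return cuGraphAddKernelNode(&b, graph, /*dependencies=*/&a, 1, params_b);
}

// The Vulkan-style HAL records the same ordering linearly (pseudocode):
//
//   dispatch(a); execution_barrier(); dispatch(b);
//
// Recovering the a->b edge from the barrier means inspecting commands
// recorded before (and sometimes after) it, hence the deferred recording.
```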
Force-pushed from 28974cd to 171c8d0 (Compare)
I'm not too familiar with the APIs being bridged here, but the comments and code seem reasonable to me.
Hrm, I'm skeptical of the need for this implementation - we shouldn't need to stash the entire command buffer and walk over it, only the "open" nodes that can be chained (the last barrier or the last event signal). The design of Vulkan/Metal/D3D-style single-pass command buffers is such that this is the case, and it's what we use in the HAL for that reason. Needing to do everything in two passes adds non-trivial overhead to the most performance-sensitive host work we do (command buffer recording). In the common case (100% of what the compiler produces today) there's only ever 0 or 1 nodes that can be chained, and in the future it'll be some small handful (O(4)), not all possible operations in the command buffer (O(1000's)).
In other words, CUDA graphs are a strict superset of the HAL command buffers and should need no additional processing - that we're doing so much additional processing is not great as this is an extremely latency-sensitive part of execution. I strongly suspect (we can measure) that the execution benefits we get from this are outweighed by the overheads involved.
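For contrast, a minimal sketch of the single-pass approach described here, assuming the CUDA 11-era driver API; the recorder type and the fixed bound on open nodes are illustrative only:

```c
#include <cuda.h>
#include <string.h>

#define MAX_OPEN_NODES 4  // O(small handful), per the comment above.

typedef struct {
  CUgraph graph;
  CUgraphNode open_nodes[MAX_OPEN_NODES];  // Nodes new commands chain after.
  size_t open_node_count;                  // 0 or 1 in today's common case.
  CUgraphNode leaf_nodes[MAX_OPEN_NODES];  // Recorded since the last barrier.
  size_t leaf_node_count;
} single_pass_recorder_t;

// Dispatches are emitted immediately, depending only on the open set; no
// full command list is kept around.
static CUresult record_dispatch(single_pass_recorder_t* r,
                                const CUDA_KERNEL_NODE_PARAMS* params) {
  CUgraphNode node;
  CUresult result = cuGraphAddKernelNode(
      &node, r->graph, r->open_nodes, r->open_node_count, params);
  if (result == CUDA_SUCCESS && r->leaf_node_count < MAX_OPEN_NODES) {
    r->leaf_nodes[r->leaf_node_count++] = node;
  }
  return result;
}

// A full barrier promotes the leaves recorded since the previous barrier
// to the new open set; no walk over past commands is needed.
static void record_barrier(single_pass_recorder_t* r) {
  memcpy(r->open_nodes, r->leaf_nodes,
         r->leaf_node_count * sizeof(CUgraphNode));
  r->open_node_count = r->leaf_node_count;
  r->leaf_node_count = 0;
}
```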
iree_hal_resource_set_freeze(command_buffer->resource_set);

IREE_TRACE_ZONE_END(z0);
return iree_ok_status();
return status;
I think that the biggest performance problem will actually be the call to cuGraphInstantiate on every iree_hal_cuda2_graph_command_buffer_end. NVIDIA recommends keeping executable graphs around and updating them with new parameters instead of a destroy/instantiate cycle.
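A sketch of that recommendation, assuming the CUDA 11-era driver API signatures (CUDA 12 revised cuGraphInstantiate and cuGraphExecUpdate); the caching helper itself is hypothetical:

```c
#include <cuda.h>
#include <stddef.h>

// Reuse a cached CUgraphExec across submissions, updating its parameters
// in place and only re-instantiating when that fails.
static CUresult get_or_update_exec(CUgraphExec* cached_exec, CUgraph graph) {
  if (*cached_exec == NULL) {
    // First use: pay the full instantiation cost once.
    return cuGraphInstantiate(cached_exec, graph, /*phErrorNode=*/NULL,
                              /*logBuffer=*/NULL, /*bufferSize=*/0);
  }
  // Subsequent uses: try the cheap in-place parameter update first.
  CUgraphNode error_node = NULL;
  CUgraphExecUpdateResult update_result;
  if (cuGraphExecUpdate(*cached_exec, graph, &error_node, &update_result) ==
      CUDA_SUCCESS) {
    return CUDA_SUCCESS;
  }
  // Incompatible change (e.g. different topology): fall back to a full
  // destroy/instantiate cycle.
  cuGraphExecDestroy(*cached_exec);
  *cached_exec = NULL;
  return cuGraphInstantiate(cached_exec, graph, /*phErrorNode=*/NULL,
                            /*logBuffer=*/NULL, /*bufferSize=*/0);
}
```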
This commit switches the graph command buffer to use command segments to handle recording. This is a preliminary step for better dependency analysis and handling. The reasons are:

In a CUDA graph, buffer management and kernel launches are represented as graph nodes. Dependencies are represented by graph edges. IREE's HAL follows the Vulkan command buffer recording model, which "linearizes" the original graph. So we have a mismatch here. Implementing IREE's HAL using CUDA graphs requires rediscovering the graph node dependencies from the linear chain of command buffer commands; that sometimes means looking at both previous and next commands.

For these reasons, it's beneficial to have a complete view of the full command buffer and extra flexibility during recording, in order to fix up past commands or inspect past/future commands.

Note that this does not improve the descriptor and push constant handling yet, and it also does not really change the node serialization. Those are addressed in future commits.
Progress towards #13245