Introduce Subgroup Operations Extension #1459
Conversation
For context, there was a lot of discussion about this in #954 and IIRC there are some outstanding comments / concerns.
This comment by @litherum is a must-see in that thread:
The DirectX 12 Agility SDK makes it possible to use the subgroup-related improvements from HLSL Shader Model 6.6 on older Windows 10 versions (might update this comment later): https://devblogs.microsoft.com/directx/announcing-dx12agility/
Is it feasible to include subgroup shuffle operations in the proposal? Our use case is compute-shader-based point rendering. Subgroup operations allow us to reduce the number of atomicMin operations needed to draw the points: with subgroup-wide operations, we can reduce 32 atomics to 1 if all points in a subgroup are projected onto a single pixel. However, the chance of all 32 points falling on a single pixel is relatively low, so if a 32-to-1 reduce fails, we can instead attempt sixteen 2-to-1 reduces by letting pairs of threads communicate with each other through shuffles. The 2:1 reduce boosts performance quite a bit since it has a very good chance of reducing the number of atomic operations. EDIT: Just now noticing that shuffle is already being discussed here: #954 (comment)
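A minimal sketch of that pairwise idea, assuming a WGSL subgroups extension that provides a `subgroupShuffleXor` operation (the `enable` directive and all operation names here are illustrative, not part of this proposal):

```wgsl
// Hedged sketch; assumes an extension providing subgroupShuffleXor.
// Names are modeled on the Vulkan/HLSL equivalents.
enable subgroups;

@group(0) @binding(0) var<storage, read_write> depth : array<atomic<u32>>;

// Attempt a 2:1 reduce before falling back to per-invocation atomics.
fn write_point(lane : u32, pixel : u32, d : u32) {
  // XOR with 1 pairs lanes (0,1), (2,3), ...
  let partner_pixel = subgroupShuffleXor(pixel, 1u);
  let partner_d = subgroupShuffleXor(d, 1u);
  if (partner_pixel == pixel) {
    // Both invocations of the pair hit the same pixel: only the even
    // lane issues the atomic, carrying the min of both depths, so
    // 32 atomics become at most 16.
    if ((lane & 1u) == 0u) {
      atomicMin(&depth[pixel], min(d, partner_d));
    }
  } else {
    atomicMin(&depth[pixel], d);
  }
}
```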
@m-schuetz passing an index to shuffle operations was concerning; is it infeasible to replace shuffle with reduction operations in your use case?
Assuming 32 threads in a subgroup, anything that would allow me to let 16 pairs of threads communicate would help, so that I could combine 32 outputs to just 16. Which threads are paired doesn't really matter so I don't need to pass indices. |
@m-schuetz something like this would split the subgroup into active groups of 2 threads where reduction operations only communicate between those (but then reconvergence isn't guaranteed to coarser active masks):
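For reference, a hedged WGSL-style sketch of that idea (operation names are assumptions; it leans on reductions seeing only the currently active invocations, which is exactly the behavior the thread flags as untested):

```wgsl
// Hedged sketch; assumes subgroupMin reduces over only the currently
// active invocations. Reconvergence to coarser masks is not guaranteed.
fn pairwise_min(lane : u32, v : f32) -> f32 {
  var result = v;
  // Each iteration's branch leaves exactly one pair (2 lanes) active,
  // so the reduction would only communicate within that pair.
  for (var g = 0u; g < 16u; g++) {
    if ((lane >> 1u) == g) {
      result = subgroupMin(v);
    }
  }
  return result;
}
```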
But there is no well-tested behavior for such code; I think this is a great, practical case to note down if we develop a large suite of tests to check the behavior across vendors.
That sounds interesting. Once it's available, I'll give it a try and report the results! I've already got a WebGPU atomicMin-based renderer ready, and I'm looking forward to trying out subgroup extensions!
@m-schuetz thanks, we will post updates to this thread!
Upcoming
Sunsetting this PR, since a new take would be better built upon the changes that have taken place since the PR's inception. The list of operations here might still provide a helpful starting point.
Preview WebGPU Changes: https://mehmetoguzderin.github.io/webgpu/webgpu.html
Preview WGSL Changes: https://mehmetoguzderin.github.io/webgpu/wgsl.html
Preview Argdown: https://kvark.github.io/webgpu-debate/SubgroupOps.component.html
See: #954 (comment)
This pull request works towards #667 for the standard library. It introduces a first form of a subgroup operations extension to the host and device specifications. Host exposure is directly deducible for all host APIs since the extension is compute-only, and the set of device instructions is the greatest common factor across APIs, minus operations that take a mask or invocation index.
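As an illustration, the proposed operation set might be used like this in a compute shader. This is a sketch only: the `enable` directive and names such as `subgroupAdd`, `subgroupBroadcastFirst`, and `subgroupElect` are modeled on the Vulkan/HLSL equivalents and are not final:

```wgsl
// Illustrative sketch; directive and operation names are assumptions.
enable subgroups;

@group(0) @binding(0) var<storage, read_write> out : array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_index) i : u32) {
  // Arithmetic: reduce across all invocations of the subgroup.
  let sum = subgroupAdd(i);
  // Broadcast: make the reduced value uniform over the subgroup.
  let uniform_sum = subgroupBroadcastFirst(sum);
  // Vote/elect: exactly one invocation per subgroup performs the write
  // (last subgroup wins here; fine for a sketch).
  if (subgroupElect()) {
    out[0] = uniform_sum;
  }
}
```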
Motivation
Subgroup operations provide speed-ups proportional to the subgroup size. They are a great opportunity to optimize both global and local reduction operations, especially for algorithms that need to specialize general graphs. And hardware support for them is more common than ever.
Trade-offs
Lack of Exposed Hardware Banding
Although it would be possible to significantly increase the market penetration of a subgroup operations extension by banding it into separate permutation and reduction features, similar to Metal, such a direction increases the API surface, possibly adding crust for a very narrow use case. Moreover, indicators for next-generation mobile hardware show that it will almost ubiquitously support reduction operations.
Exclusion of Quad Operations
This proposal excludes quad operations from the definition of subgroup operations. Reports on new Adreno and PowerVR hardware show a lack of quad support. Excluding quad operations also makes it easier to avoid the more ambiguous operations, delegating them to a proper quad operations extension.
Exclusion of Indexed or Masked Operations
This proposal excludes indexed and masked operations to avoid undefined behavior around divergence, reconvergence, and possible out-of-bounds indexing. The current set of exposed operations is implicitly active on all APIs.
Presence of Extension for APIs
D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveOps
MTLDevice.supportsFamily(MTLGPUFamilyMac2) | MTLDevice.supportsFamily(MTLGPUFamilyApple7)
(VkPhysicalDeviceSubgroupProperties.supportedOperations & (VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT)) == (VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT) && (VkPhysicalDeviceSubgroupProperties.supportedStages & VK_SHADER_STAGE_COMPUTE_BIT) != 0
Related Issues