Introduce Subgroup Operations Extension #1459


mehmetoguzderin
Member

@mehmetoguzderin mehmetoguzderin commented Feb 23, 2021

Preview WebGPU Changes: https://mehmetoguzderin.github.io/webgpu/webgpu.html
Preview WGSL Changes: https://mehmetoguzderin.github.io/webgpu/wgsl.html
Preview Argdown: https://kvark.github.io/webgpu-debate/SubgroupOps.component.html
See: #954 (comment)

This pull request works towards #667 for the standard library. To that end, it introduces a first form of the subgroup operations extension to the host and device specifications. Host exposure is directly deducible for all host APIs because the extension is compute-only, and the set of device instructions is the greatest common factor of the target APIs, minus operations that take a mask or an invocation index.

Motivation

Subgroup operations can provide a speed-up proportional to the subgroup size. They offer a great opportunity to optimize both global and local reduction operations, especially for algorithms that need to specialize general graphs. Hardware support for them is also becoming more common than ever.

Trade-offs

Lack of Exposed Hardware Banding

Although banding the subgroup operations extension into permutation and reduction tiers, similar to Metal, could significantly increase its market penetration, that direction enlarges the API surface and risks leaving cruft behind for a very narrow use case. Moreover, indications from next-generation mobile hardware suggest that reduction operations will be supported almost ubiquitously.

Exclusion of Quad Operations

This proposal excludes quad operations from the definition of subgroup operations. Recent hardware reports for Adreno and PowerVR show a lack of quad support. Excluding quad operations also makes it easier to avoid the more ambiguous operations, delegating them to a dedicated quad operations extension.

Exclusion of Indexed or Masked Operations

This proposal excludes indexed or masked operations to avoid undefined behavior on divergence, reconvergence, and possible out-of-bounds indexing. The current set of exposed operations is implicitly active on all APIs.

Presence of Extension for APIs

DirectX 12: D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveOps
Metal: MTLDevice.supportsFamily(MTLGPUFamilyMac2) | MTLDevice.supportsFamily(MTLGPUFamilyApple7)
Vulkan: ((VkPhysicalDeviceSubgroupProperties.supportedOperations & required) == required) && ((VkPhysicalDeviceSubgroupProperties.supportedStages & VK_SHADER_STAGE_COMPUTE_BIT) != 0), where required = VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT
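
As a rough illustration of the Vulkan condition listed above, the check could be evaluated as follows. This is only a sketch: the constants copy the bit values from the Vulkan headers, the function name `supports_subgroup_extension` is hypothetical, and nothing here queries a real device. The required operation bits are OR-ed into one mask, and the device qualifies only if every bit of that mask is present and compute is among the supported stages.

```python
# Bit values mirror the Vulkan headers; no real Vulkan API is called here.
VK_SUBGROUP_FEATURE_BASIC_BIT = 0x00000001
VK_SUBGROUP_FEATURE_VOTE_BIT = 0x00000002
VK_SUBGROUP_FEATURE_ARITHMETIC_BIT = 0x00000004
VK_SUBGROUP_FEATURE_BALLOT_BIT = 0x00000008
VK_SHADER_STAGE_FRAGMENT_BIT = 0x00000010
VK_SHADER_STAGE_COMPUTE_BIT = 0x00000020

# All four operation classes must be supported, so combine them with OR.
REQUIRED_OPERATIONS = (
    VK_SUBGROUP_FEATURE_BASIC_BIT
    | VK_SUBGROUP_FEATURE_VOTE_BIT
    | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT
    | VK_SUBGROUP_FEATURE_BALLOT_BIT
)


def supports_subgroup_extension(supported_operations: int, supported_stages: int) -> bool:
    """Hypothetical helper: True if the reported bitfields satisfy the condition."""
    # Every required operation bit must be set in supportedOperations.
    has_operations = (supported_operations & REQUIRED_OPERATIONS) == REQUIRED_OPERATIONS
    # Subgroup operations must be available in the compute stage.
    has_compute = (supported_stages & VK_SHADER_STAGE_COMPUTE_BIT) != 0
    return has_operations and has_compute
```

A device that reports all four operation classes but only the fragment stage would be rejected, as would one that supports compute but lacks ballot.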

Related Issues



@mehmetoguzderin mehmetoguzderin added the wgsl WebGPU Shading Language Issues label Feb 23, 2021
@mehmetoguzderin mehmetoguzderin added this to the post-MVP milestone Feb 23, 2021
@github-actions
Contributor

Previews, as seen at the time of posting this comment:
WebGPU | IDL
WGSL
e0a1e1c

@Kangz
Contributor

Kangz commented Feb 23, 2021

For context, there was a lot of discussion about this in #954 and IIRC there are some outstanding comments / concerns.

@mehmetoguzderin
Member Author

This comment by @litherum is a must-see in that thread:
#954 (comment)

@mehmetoguzderin
Member Author

The DirectX 12 Agility SDK makes it possible to use the subgroup-related improvements in HLSL Shader Model 6.6 on older Windows 10 versions (might update this comment later): https://devblogs.microsoft.com/directx/announcing-dx12agility/

@m-schuetz

m-schuetz commented May 6, 2021

Is it feasible to include subgroup shuffle operations in the proposal?

Our use case is compute-shader-based point rendering. Subgroup operations allow us to reduce the number of atomicMin operations needed to draw the points, and with warp-wide operations, we can reduce 32 atomics to 1 if all points in a subgroup are projected into a single pixel. However, the chance of all 32 points falling in a single pixel is relatively low, so if a 32:1 reduce fails, we can instead attempt sixteen 2:1 reduces by letting pairs of threads communicate with each other through shuffles. The 2:1 reduce boosts performance quite a bit since it has a very good chance of reducing the number of atomic operations.

EDIT: Just now noticing that shuffle is already being discussed here: #954 (comment)
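
To make the counting argument in this comment concrete, here is a small CPU-side sketch that tallies how many atomicMin calls one subgroup would issue. It assumes a subgroup size of 32 and pairs adjacent lanes (0,1), (2,3), and so on; both the pairing and the function name `atomics_needed` are assumptions for illustration, not part of the proposal or of any renderer.

```python
SUBGROUP_SIZE = 32  # assumed subgroup size, as in the comment above


def atomics_needed(pixels):
    """Count atomicMin calls for one subgroup, given each lane's target pixel."""
    assert len(pixels) == SUBGROUP_SIZE
    # 32:1 case: every lane hits the same pixel, so one atomic suffices.
    if all(p == pixels[0] for p in pixels):
        return 1
    # Otherwise attempt sixteen 2:1 reduces over adjacent lane pairs.
    count = 0
    for lane in range(0, SUBGROUP_SIZE, 2):
        # A pair that shares a pixel issues one atomic, otherwise two.
        count += 1 if pixels[lane] == pixels[lane + 1] else 2
    return count
```

With this model, a subgroup whose lanes all target distinct pixels still issues 32 atomics, while one whose adjacent lanes share pixels issues 16.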

@mehmetoguzderin
Member Author

@m-schuetz passing an index to shuffle operations was a concern; would it be feasible to replace shuffle with reduction operations in your use case?

@m-schuetz

Assuming 32 threads in a subgroup, anything that would let 16 pairs of threads communicate would help, so that I could combine 32 outputs into just 16. Which threads are paired doesn't really matter, so I don't need to pass indices.

@mehmetoguzderin
Member Author

@m-schuetz something like this would split the subgroup into active groups of 2 threads, with reduction operations communicating only within each group (though reconvergence to coarser active masks isn't guaranteed afterwards):

if (sthIdx / 2 == sth) {
  ...subgroupReduce()...
}

But there is no well-tested behavior for such code; I think this is a great, practical case to note down if we develop a large suite of tests to check the situation across vendors.
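
The intended effect of that branch can be modeled on the CPU as below. This is a sketch under a strong assumption: that a reduction inside a divergent branch sees only the lanes active in that branch, which is exactly the behavior noted above as not well tested across vendors. The helper names `subgroup_reduce_min` and `pairwise_reduce_min` are hypothetical, and an even lane count is assumed.

```python
def subgroup_reduce_min(values, active):
    """Model: each active lane receives the min over active lanes; others get None."""
    active_values = [v for v, a in zip(values, active) if a]
    result = min(active_values) if active_values else None
    return [result if a else None for a in active]


def pairwise_reduce_min(values):
    """Run the modeled reduction once per lane pair, as the branch would."""
    n = len(values)
    out = [None] * n
    for pair in range(n // 2):
        # Models the branch condition: only the two lanes of this pair are active.
        active = [lane // 2 == pair for lane in range(n)]
        reduced = subgroup_reduce_min(values, active)
        for lane in range(n):
            if active[lane]:
                out[lane] = reduced[lane]
    return out
```

Under this model, each pair of lanes ends up holding the minimum of its two inputs, which is the 2:1 combination asked for earlier in the thread.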

@m-schuetz

That sounds interesting. Once it's available, I'll give it a try and report the results! I've already got a WebGPU atomicMin-based renderer ready, and I'm looking forward to trying out the subgroup extensions!

@mehmetoguzderin
Member Author

@m-schuetz thanks, we will register updates to this thread!

@mehmetoguzderin
Member Author

The upcoming VK_KHR_shader_subgroup_uniform_control_flow extension was pointed out by @alan-baker at the June 8, 2021 meeting: KhronosGroup/Vulkan-Guide#118

@mehmetoguzderin
Member Author

Sunsetting this PR since it would be better for a new take to build upon changes that took place since the PR's inception. The list of operations here might provide a helpful heuristic.
