Introduce Subgroup Operations Extension #1459
Conversation
For context, there was a lot of discussion about this in #954 and IIRC there are some outstanding comments / concerns.
This comment by @litherum is a must-see in that thread:
The DirectX 12 Agility SDK makes it possible to use the subgroup-related improvements from HLSL Shader Model 6.6 on older Windows 10 versions (might update this comment later): https://devblogs.microsoft.com/directx/announcing-dx12agility/
Is it feasible to include subgroup shuffle operations in the proposal? Our use case is compute-shader-based point rendering. Subgroup operations allow us to reduce the number of atomicMin operations needed to draw the points: with subgroup-wide operations, we can reduce 32 atomics to 1 if all points in a subgroup are projected onto a single pixel. However, the chance of all 32 points falling on a single pixel is relatively low, so if a 32-to-1 reduce fails, we can instead attempt sixteen 2-to-1 reduces by letting pairs of threads communicate with each other through shuffles. The 2:1 reduce boosts performance quite a bit since it has a very good chance of reducing the number of atomic operations. EDIT: Just now noticing that shuffle is already being discussed here: #954 (comment)
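A minimal sketch of that pairwise idea, assuming a WGSL subgroups extension that provides a `subgroupShuffleXor` operation (the `enable` directive and all operation names here are illustrative, not part of this proposal):

```wgsl
// Hedged sketch; assumes an extension providing subgroupShuffleXor.
// Names are modeled on the Vulkan/HLSL equivalents.
enable subgroups;

@group(0) @binding(0) var<storage, read_write> depth : array<atomic<u32>>;

// Attempt a 2:1 reduce before falling back to per-invocation atomics.
fn write_point(lane : u32, pixel : u32, d : u32) {
  // XOR with 1 pairs lanes (0,1), (2,3), ...
  let partner_pixel = subgroupShuffleXor(pixel, 1u);
  let partner_d = subgroupShuffleXor(d, 1u);
  if (partner_pixel == pixel) {
    // Both invocations of the pair hit the same pixel: only the even
    // lane issues the atomic, carrying the min of both depths, so
    // 32 atomics become at most 16.
    if ((lane & 1u) == 0u) {
      atomicMin(&depth[pixel], min(d, partner_d));
    }
  } else {
    atomicMin(&depth[pixel], d);
  }
}
```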
@m-schuetz passing an index to shuffle operations was concerning; is it infeasible to replace shuffle with reduction operations in your use case?
Assuming 32 threads in a subgroup, anything that would allow me to let 16 pairs of threads communicate would help, so that I could combine 32 outputs to just 16. Which threads are paired doesn't really matter so I don't need to pass indices. |
@m-schuetz something like this would split the subgroup into active groups of 2 threads where reduction operations only communicate between those (but then reconvergence isn't guaranteed to coarser active masks):
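For reference, a hedged WGSL-style sketch of that idea (operation names are assumptions; it leans on reductions seeing only the currently active invocations, which is exactly the behavior the thread flags as untested):

```wgsl
// Hedged sketch; assumes subgroupMin reduces over only the currently
// active invocations. Reconvergence to coarser masks is not guaranteed.
fn pairwise_min(lane : u32, v : f32) -> f32 {
  var result = v;
  // Each iteration's branch leaves exactly one pair (2 lanes) active,
  // so the reduction would only communicate within that pair.
  for (var g = 0u; g < 16u; g++) {
    if ((lane >> 1u) == g) {
      result = subgroupMin(v);
    }
  }
  return result;
}
```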
But there is no well-tested behavior for such code; I think this is a great, practical case to note down if we develop a large suite of tests to check the behavior across vendors.
That sounds interesting. Once it's available, I'll give it a try and report the results! I've already got a WebGPU atomicMin-based renderer ready, and I'm looking forward to trying out subgroup extensions!
@m-schuetz thanks, we will post updates to this thread!
Upcoming
Sunsetting this PR, since a new take would be better built upon the changes that have taken place since the PR's inception. The list of operations here might still provide a helpful starting point.
Preview WebGPU Changes: https://mehmetoguzderin.github.io/webgpu/webgpu.html
Preview WGSL Changes: https://mehmetoguzderin.github.io/webgpu/wgsl.html
Preview Argdown: https://kvark.github.io/webgpu-debate/SubgroupOps.component.html
See: #954 (comment)
This pull request works towards #667 for the standard library. It introduces a first form of a subgroup operations extension to the host and device specifications. Host exposure is directly deducible for all host APIs since the extension is compute-only, and the set of device instructions is the greatest common factor across APIs, minus operations that take a mask or invocation index.
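As an illustration, the proposed operation set might be used like this in a compute shader. This is a sketch only: the `enable` directive and names such as `subgroupAdd`, `subgroupBroadcastFirst`, and `subgroupElect` are modeled on the Vulkan/HLSL equivalents and are not final:

```wgsl
// Illustrative sketch; directive and operation names are assumptions.
enable subgroups;

@group(0) @binding(0) var<storage, read_write> out : array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_index) i : u32) {
  // Arithmetic: reduce across all invocations of the subgroup.
  let sum = subgroupAdd(i);
  // Broadcast: make the reduced value uniform over the subgroup.
  let uniform_sum = subgroupBroadcastFirst(sum);
  // Vote/elect: exactly one invocation per subgroup performs the write
  // (last subgroup wins here; fine for a sketch).
  if (subgroupElect()) {
    out[0] = uniform_sum;
  }
}
```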
Motivation
Subgroup operations provide speed-ups proportional to the subgroup size. They are a great opportunity to optimize both global and local reduction operations, especially for algorithms that need to specialize general graphs. And hardware support for them is more common than ever.
Trade-offs
Lack of Exposed Hardware Banding
Although it would be possible to significantly increase the market penetration of a subgroup operations extension by banding it into separate permutation and reduction features, similar to Metal, such a direction increases the API surface, possibly adding crust for a very narrow use case. Moreover, indicators for next-generation mobile hardware show that it will almost ubiquitously support reduction operations.
Exclusion of Quad Operations
This proposal excludes quad operations from the definition of subgroup operations. Reports on new Adreno and PowerVR hardware show a lack of quad support. Excluding quad operations also makes it easier to avoid the more ambiguous operations, delegating them to a proper quad operations extension.
Exclusion of Indexed or Masked Operations
This proposal excludes indexed and masked operations to avoid undefined behavior around divergence, reconvergence, and possible out-of-bounds indexing. The current set of exposed operations is implicitly active on all APIs.
Presence of Extension for APIs
D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveOps
MTLDevice.supportsFamily(MTLGPUFamilyMac2) | MTLDevice.supportsFamily(MTLGPUFamilyApple7)
(VkPhysicalDeviceSubgroupProperties.supportedOperations & (VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT)) == (VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT) && (VkPhysicalDeviceSubgroupProperties.supportedStages & VK_SHADER_STAGE_COMPUTE_BIT) != 0
Related Issues