
Fix fragment shader subgroup builtin io test #4024

Merged · 6 commits · Nov 4, 2024

Conversation

jzm-intel
Contributor

This PR fixes the expectation for the fragment shader subgroup_invocation_id built-in, which can be assigned to inactive invocations between active ones and can therefore be larger than the number of active subgroup invocations while still smaller than the subgroup size. This PR also fixes the draw call for the fragment subgroup tests. With this PR, the tests pass on Intel devices.

@jzm-intel
Contributor Author

A more detailed explanation:
At least on Intel devices, we found that fragments are likely scheduled as 2x2 blocks, with the invocations in each block taking adjacent subgroup invocation ids. If such a 2x2 block contains inactive invocations (i.e. for fragments that are not drawn), those invocations still take subgroup invocation ids, so the ids of active invocations can go higher than the number of active invocations.
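For illustration, a minimal TypeScript sketch of the relaxed expectation (the names here are hypothetical, not the actual CTS helpers):

// Hypothetical checker: `activeCount` is the number of active invocations
// observed in the fragment's subgroup.
function checkInvocationId(
  id: number,
  activeCount: number,
  subgroupSize: number
): string | undefined {
  // Too strict: inactive invocations inside a 2x2 block may consume ids,
  // so active ids are not guaranteed to be packed into [0, activeCount).
  //   if (id >= activeCount) { return 'id out of range'; }
  // The id must only stay below the subgroup size.
  if (id >= subgroupSize) {
    return `subgroup_invocation_id ${id} >= subgroup_size ${subgroupSize}`;
  }
  return undefined;
}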

I wrote a webpage to help illustrate subgroup dispatch in the fragment shader; a screenshot of drawing the lower-right triangle of a 17×5 framebuffer on an Intel UHD 770 is below. In this case the subgroup size is 16.

[Screenshot: subgroup dispatch for the lower-right triangle of a 17×5 framebuffer on Intel UHD 770; subgroup size 16]

The invocations of a subgroup are marked with the same background color, and for subgroup "43" (its representative id) I mark the inactive invocations together with the active ones. The 2x2 pattern is clear, and we can see how inactive invocations take some subgroup invocation ids.
The screenshot also shows an interesting case: subgroup "50" contains fragments from two disjoint fragment areas.

@alan-baker left a comment
Contributor

The changes to the checker code look sound, but they don't seem to match the code that is run. If we can resolve that, this definitely looks more robust.

@jzm-intel
Contributor Author

I tried to assert that all outputs are zero for fragments where repId === 0, and found it was not true. Investigating the output for framebuffer sizes 15x5 and 16x16, the repId of some active fragments could be duplicated, 0, or even 4294967295. We can also see these repId === 0 cases in the screenshot provided above, e.g. for fragment (12, 1) and its neighbors.

I believe this is because we currently get the repId via subgroupBroadcast from invocation 0, which can actually be inactive, in which case the broadcast result can be undefined.
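A sketch of the pattern being described (an assumed reconstruction, not the actual test shader; `uniqueId` stands in for whatever per-fragment value is broadcast as the repId):

const problematicWGSL = `
enable subgroups;

@fragment
fn fsMain(@builtin(position) pos : vec4f) -> @location(0) vec4u {
  let uniqueId = u32(pos.y) * 1024u + u32(pos.x); // assumed width bound
  // If invocation 0 of this subgroup is inactive, the broadcast result is
  // undefined: it may be duplicated, 0, or even 0xffffffff, as observed.
  let repId = subgroupBroadcast(uniqueId, 0);
  return vec4u(repId, 0u, 0u, 0u);
}`;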

I think I need to change the way we get repId for subgroups, and I will do it tomorrow.

@jzm-intel
Contributor Author

Things could be tricky since we don't know at compile time which invocations will be active, and subgroupBroadcast requires a constant subgroup invocation id to read from. A subgroup function that broadcasts from the lowest active invocation could solve the issue, but I don't think we have that for now?

@alan-baker
Contributor

Things could be tricky since we don't know at compile time which invocations will be active, and subgroupBroadcast requires a constant subgroup invocation id to read from. A subgroup function that broadcasts from the lowest active invocation could solve the issue, but I don't think we have that for now?

There is subgroupBroadcastFirst, but unfortunately APIs aren't consistent about whether helpers participate in subgroup operations. Some say yes, some say no. For the built-in function tests I tried making the fragment tests draw a single full-screen triangle and skip the last row and column as potential helpers. That might help here, but it may still assume the upper-left pixels are assigned lower ids. I'll think on this a bit more.
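For reference, a sketch of the subgroupBroadcastFirst shape (assumed code; the helper-participation caveat above still applies):

// subgroupBroadcastFirst reads from the lowest-numbered *active* invocation,
// so no constant lane index is needed.
const broadcastFirstWGSL = `
enable subgroups;

@fragment
fn fsMain(@builtin(position) pos : vec4f) -> @location(0) vec4u {
  let uniqueId = u32(pos.y) * 1024u + u32(pos.x); // assumed width bound
  let repId = subgroupBroadcastFirst(uniqueId);
  return vec4u(repId, 0u, 0u, 0u);
}`;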

@jzm-intel
Contributor Author

A terrible but (hopefully) workable solution:

  • The fragment shader directly stores its output to a storage buffer, using the built-in position input to get the fragment position p: u32 = rowOfFragment * framebufferColsPerRow + colOfFragment and compute the offset o = p * strideOfOutput into the output buffer. Such a position p is unique for each fragment.
  • To know which fragments are in the same subgroup, we make a large storage buffer of size [fragmentNumber] * [maximum subgroup size == 128] * sizeof(u32), i.e. we have a storage buffer broadcastedP: array<array<u32, maxSubgroupSize>, fragmentNumber>, and each fragment stores the subgroup-broadcast p from all subgroup invocations (oops, still subgroupBroadcast from inactive invocations... but we can use the ballot result to filter out the results from inactive ones later):
if (sgSize >= 4) {
    broadcastedP[p][0] = subgroupBroadcast(p, 0);
    broadcastedP[p][1] = subgroupBroadcast(p, 1);
    broadcastedP[p][2] = subgroupBroadcast(p, 2);
    broadcastedP[p][3] = subgroupBroadcast(p, 3);
}
if (sgSize >= 8) {
    broadcastedP[p][4] = subgroupBroadcast(p, 4);
    ...
}
...

And we also store the subgroup ballot result for each fragment, to indicate which results come from inactive invocations and should be ignored.

  • Then on the CPU side we can use the filtered broadcastedP to tell which fragments are in the same subgroup as fragment p, i.e. those in broadcastedP[p][0..sgSize-1] (see the sketch below).
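A rough TypeScript sketch of that CPU-side filtering (the buffer layout and names are assumptions taken from the description above):

// Assumed layout: per fragment p, 128 broadcast slots in `broadcastedP` and
// one vec4<u32> ballot in `ballots` (bit i set means lane i was active).
function subgroupMembersOf(
  p: number,
  sgSize: number,
  broadcastedP: Uint32Array, // fragmentCount * 128 entries
  ballots: Uint32Array       // fragmentCount * 4 entries
): number[] {
  const members: number[] = [];
  for (let lane = 0; lane < sgSize; lane++) {
    const word = ballots[p * 4 + (lane >> 5)];
    if (((word >>> (lane & 31)) & 1) !== 0) {
      // The lane was active, so its broadcast of p is well defined.
      members.push(broadcastedP[p * 128 + lane]);
    }
  }
  return members;
}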

@jzm-intel
Contributor Author

oops, still subgroupBroadcast from inactive invocations... but we can use the ballot result to filter out the results from inactive ones later

This still relies on the subgroup ballot to filter the results, and it would be wrong if helper lanes also take part in the ballot (will they?) but don't compute the correct p, so it seems no better than subgroupBroadcastFirst(repId)?

@alan-baker
Contributor

Here's an alternative, maybe:

enable subgroups;

@fragment
fn fsMain(
  @builtin(position) pos : vec4f,
  @builtin(subgroup_invocation_id) id : u32,
  @builtin(subgroup_size) sgSize : u32
) -> @location(0) vec4u {
  var error = 0u;
  for (var i = 0u; i < sgSize; i++) {
    let idBallot = subgroupBallot(id == i);
    let count = countOneBits(idBallot);
    let sum = count.x + count.y + count.z + count.w;
    error += select(1u, 0u, sum == 1u);
  }
  return vec4u(id, sgSize, error, 0u);
}

Then we can check that for all pixels, id < sgSize and error === 0. I think we could even hardcode the loop bound to 128 in case we're worried sgSize is implemented incorrectly.
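The CPU-side check per pixel could then look roughly like this (assuming an rgba32uint target read back as x = id, y = sgSize, z = error):

function checkTexel(texel: Uint32Array): string | undefined {
  const [id, sgSize, error] = texel;
  if (error !== 0) {
    return `in-shader ballot check failed (error = ${error})`;
  }
  if (id >= sgSize) {
    return `subgroup_invocation_id ${id} >= subgroup_size ${sgSize}`;
  }
  return undefined; // pixel is consistent
}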

What do you think?

@jzm-intel
Contributor Author

This would check that each id < sgSize is used once and only once across all active invocations (if only the active ones take part in the ballot), but some ids might be used by inactive invocations. Using error += select(1, 0, sum <= 1); would instead check that each id is used at most once (no duplication)?

And we can also check (sum == 0) || (i < sgSize) against the loop index i, to ensure no id beyond the subgroup size is in use, if we use 128 as the loop bound.
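Putting both amendments together, the loop body might become (a sketch only; i is the loop index, and the bound is hardcoded to 128):

const amendedLoopWGSL = `
  for (var i = 0u; i < 128u; i++) {
    let idBallot = subgroupBallot(id == i);
    let count = countOneBits(idBallot);
    let sum = count.x + count.y + count.z + count.w;
    // Each id may be claimed by at most one active invocation.
    error += select(1u, 0u, sum <= 1u);
    // No active invocation may claim an id outside [0, sgSize).
    error += select(1u, 0u, (sum == 0u) || (i < sgSize));
  }`;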

@alan-baker
Contributor

This would check that each id < sgSize is used once and only once across all active invocations (if only the active ones take part in the ballot), but some ids might be used by inactive invocations. Using error += select(1, 0, sum <= 1); would instead check that each id is used at most once (no duplication)?

And we can also check (sum == 0) || (i < sgSize) against the loop index i, to ensure no id beyond the subgroup size is in use, if we use 128 as the loop bound.

I suppose if the invocation is inactive it doesn't really matter which id it is assigned, but I agree the select condition is better formulated as an inequality. The secondary check makes sense too.

@jzm-intel
Contributor Author

Please take a look, thanks! repId is removed, and more validation is now done within the shader.

@alan-baker left a comment
Contributor

Thanks for this fix! I also tested it on my MacBook M1.

@jzm-intel
Contributor Author

Thanks for reviewing!

@jzm-intel jzm-intel merged commit f2e2ada into gpuweb:main Nov 4, 2024
1 check passed