
Fix fragment shader subgroup builtin io test #4024

Merged · 6 commits · Nov 4, 2024

Conversation

jzm-intel
Contributor

This PR fixes the expectation for the fragment shader subgroup_invocation_id built-in, which can be assigned to inactive invocations between active ones and can therefore be larger than the number of active subgroup invocations while still smaller than the subgroup size. This PR also fixes the draw call for the fragment subgroup tests. With this PR, the tests pass on Intel devices.

@jzm-intel
Contributor Author

A more detailed explanation:
At least on Intel devices, we found that fragments are likely scheduled as 2x2 blocks, with the invocations in each block taking adjacent subgroup invocation ids. If such a 2x2 block contains inactive invocations (i.e. for fragments that are not drawn), those invocations still take subgroup invocation ids, so the ids of active invocations can go higher than the number of active invocations.
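For illustration, a minimal TypeScript sketch of the relaxed expectation (the names here are hypothetical, not the actual CTS helpers):

// Hypothetical checker: `activeCount` is the number of active invocations
// observed in the fragment's subgroup.
function checkInvocationId(
  id: number,
  activeCount: number,
  subgroupSize: number
): string | undefined {
  // Too strict: inactive invocations inside a 2x2 block may consume ids,
  // so active ids are not guaranteed to be packed into [0, activeCount).
  //   if (id >= activeCount) { return 'id out of range'; }
  // The id must only stay below the subgroup size.
  if (id >= subgroupSize) {
    return `subgroup_invocation_id ${id} >= subgroup_size ${subgroupSize}`;
  }
  return undefined;
}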

I wrote a webpage to help illustrate subgroup dispatch in the fragment shader; a screenshot of drawing the lower-right triangle of a 17×5 framebuffer on an Intel UHD 770 is below. In this case the subgroup size is 16.

[Screenshot: subgroup dispatch for the lower-right triangle of a 17×5 framebuffer on Intel UHD 770; subgroup size 16]

The invocations of a subgroup are marked with the same background color, and for subgroup "43" (its representative id) I mark the inactive invocations together with the active ones. The 2x2 pattern is clear, and we can see how inactive invocations take some subgroup invocation ids.
The screenshot also shows an interesting case: subgroup "50" contains fragments from two disjoint fragment areas.

@alan-baker left a comment
Contributor

The changes to the checker code look sound, but they don't seem to match the code that is run. If we can resolve that, this definitely looks more robust.

@jzm-intel
Contributor Author

I tried to assert that all outputs are zero for fragments where repId === 0, and found it was not true. Investigating the output for framebuffer sizes 15x5 and 16x16, the repId of some active fragments could be duplicated, 0, or even 4294967295. We can also see these repId === 0 cases in the screenshot provided above, e.g. for fragment (12, 1) and its neighbors.

I believe this is because we currently get the repId via subgroupBroadcast from invocation 0, which can actually be inactive, in which case the broadcast result can be undefined.
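A sketch of the pattern being described (an assumed reconstruction, not the actual test shader; `uniqueId` stands in for whatever per-fragment value is broadcast as the repId):

const problematicWGSL = `
enable subgroups;

@fragment
fn fsMain(@builtin(position) pos : vec4f) -> @location(0) vec4u {
  let uniqueId = u32(pos.y) * 1024u + u32(pos.x); // assumed width bound
  // If invocation 0 of this subgroup is inactive, the broadcast result is
  // undefined: it may be duplicated, 0, or even 0xffffffff, as observed.
  let repId = subgroupBroadcast(uniqueId, 0);
  return vec4u(repId, 0u, 0u, 0u);
}`;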

I think I need to change the way we get repId for subgroups, and I will do it tomorrow.

@jzm-intel
Contributor Author

Things could be tricky since we don't know at compile time which invocations will be active, and subgroupBroadcast requires a constant subgroup invocation id to read from. A subgroup function that broadcasts from the lowest active invocation could solve the issue, but I don't think we have that for now?

@alan-baker
Contributor

Things could be tricky since we don't know at compile time which invocations will be active, and subgroupBroadcast requires a constant subgroup invocation id to read from. A subgroup function that broadcasts from the lowest active invocation could solve the issue, but I don't think we have that for now?

There is subgroupBroadcastFirst, but unfortunately APIs aren't consistent about whether helpers participate in subgroup operations. Some say yes, some say no. For the built-in function tests I tried making the fragment tests draw a single full-screen triangle and skip the last row and column as potential helpers. That might help here, but it may still assume the upper-left pixels are assigned lower ids. I'll think on this a bit more.
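For reference, a sketch of the subgroupBroadcastFirst shape (assumed code; the helper-participation caveat above still applies):

// subgroupBroadcastFirst reads from the lowest-numbered *active* invocation,
// so no constant lane index is needed.
const broadcastFirstWGSL = `
enable subgroups;

@fragment
fn fsMain(@builtin(position) pos : vec4f) -> @location(0) vec4u {
  let uniqueId = u32(pos.y) * 1024u + u32(pos.x); // assumed width bound
  let repId = subgroupBroadcastFirst(uniqueId);
  return vec4u(repId, 0u, 0u, 0u);
}`;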

@jzm-intel
Contributor Author

A terrible but (hopefully) workable solution:

  • The fragment shader directly stores its output to a storage buffer, using the built-in position input to get the fragment position p: u32 = rowOfFragment * framebufferColsPerRow + colOfFragment and compute the offset o = p * strideOfOutput into the output buffer. Such a position p is unique for each fragment.
  • To know which fragments are in the same subgroup, we make a large storage buffer of size [fragmentNumber] * [maximum subgroup size == 128] * sizeof(u32), i.e. we have a storage buffer broadcastedP: array<array<u32, maxSubgroupSize>, fragmentNumber>, and each fragment stores the subgroup-broadcast p from all subgroup invocations (oops, still subgroupBroadcast from inactive invocations... but we can use the ballot result to filter out the results from inactive ones later):
if (sgSize >= 4) {
    broadcastedP[p][0] = subgroupBroadcast(p, 0);
    broadcastedP[p][1] = subgroupBroadcast(p, 1);
    broadcastedP[p][2] = subgroupBroadcast(p, 2);
    broadcastedP[p][3] = subgroupBroadcast(p, 3);
}
if (sgSize >= 8) {
    broadcastedP[p][4] = subgroupBroadcast(p, 4);
    ...
}
...

And we also store the subgroup ballot result for each fragment, to indicate which results come from inactive invocations and should be ignored.

  • Then on the CPU side we can use the filtered broadcastedP to tell which fragments are in the same subgroup as fragment p, i.e. those in broadcastedP[p][0..sgSize-1] (see the sketch below).
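A rough TypeScript sketch of that CPU-side filtering (the buffer layout and names are assumptions taken from the description above):

// Assumed layout: per fragment p, 128 broadcast slots in `broadcastedP` and
// one vec4<u32> ballot in `ballots` (bit i set means lane i was active).
function subgroupMembersOf(
  p: number,
  sgSize: number,
  broadcastedP: Uint32Array, // fragmentCount * 128 entries
  ballots: Uint32Array       // fragmentCount * 4 entries
): number[] {
  const members: number[] = [];
  for (let lane = 0; lane < sgSize; lane++) {
    const word = ballots[p * 4 + (lane >> 5)];
    if (((word >>> (lane & 31)) & 1) !== 0) {
      // The lane was active, so its broadcast of p is well defined.
      members.push(broadcastedP[p * 128 + lane]);
    }
  }
  return members;
}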

@jzm-intel
Contributor Author

oops, still subgroupBroadcast from inactive invocations... but we can use the ballot result to filter out the results from inactive ones later

This still relies on the subgroup ballot to filter the results, and it would be wrong if helper lanes also take part in the ballot (will they?) but don't compute the correct p, so it seems no better than subgroupBroadcastFirst(repId)?

@alan-baker
Contributor

Here's an alternative, maybe:

enable subgroups;

@fragment
fn fsMain(
  @builtin(position) pos : vec4f,
  @builtin(subgroup_invocation_id) id : u32,
  @builtin(subgroup_size) sgSize : u32
) -> @location(0) vec4u {
  var error = 0u;
  for (var i = 0u; i < sgSize; i++) {
    let idBallot = subgroupBallot(id == i);
    let count = countOneBits(idBallot);
    let sum = count.x + count.y + count.z + count.w;
    error += select(1u, 0u, sum == 1u);
  }
  return vec4u(id, sgSize, error, 0u);
}

Then we can check that for all pixels, id < sgSize and error === 0. I think we could even hardcode the loop bound to 128 in case we're worried sgSize is implemented incorrectly.
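The CPU-side check per pixel could then look roughly like this (assuming an rgba32uint target read back as x = id, y = sgSize, z = error):

function checkTexel(texel: Uint32Array): string | undefined {
  const [id, sgSize, error] = texel;
  if (error !== 0) {
    return `in-shader ballot check failed (error = ${error})`;
  }
  if (id >= sgSize) {
    return `subgroup_invocation_id ${id} >= subgroup_size ${sgSize}`;
  }
  return undefined; // pixel is consistent
}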

What do you think?

@jzm-intel
Contributor Author

This would check that each id < sgSize is used once and only once across all active invocations (if only the active ones take part in the ballot), but some ids might be used by inactive invocations. Using error += select(1, 0, sum <= 1); would instead check that each id is used at most once (no duplication)?

And we can also check (sum == 0) || (i < sgSize) against the loop index i, to ensure no id beyond the subgroup size is in use, if we use 128 as the loop bound.
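Putting both amendments together, the loop body might become (a sketch only; i is the loop index, and the bound is hardcoded to 128):

const amendedLoopWGSL = `
  for (var i = 0u; i < 128u; i++) {
    let idBallot = subgroupBallot(id == i);
    let count = countOneBits(idBallot);
    let sum = count.x + count.y + count.z + count.w;
    // Each id may be claimed by at most one active invocation.
    error += select(1u, 0u, sum <= 1u);
    // No active invocation may claim an id outside [0, sgSize).
    error += select(1u, 0u, (sum == 0u) || (i < sgSize));
  }`;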

@alan-baker
Contributor

This would check that each id < sgSize is used once and only once across all active invocations (if only the active ones take part in the ballot), but some ids might be used by inactive invocations. Using error += select(1, 0, sum <= 1); would instead check that each id is used at most once (no duplication)?

And we can also check (sum == 0) || (i < sgSize) against the loop index i, to ensure no id beyond the subgroup size is in use, if we use 128 as the loop bound.

I suppose if the invocation is inactive it doesn't really matter which id it is assigned, but I agree the select condition is better formulated as an inequality. The secondary check makes sense too.

@jzm-intel
Contributor Author

Please take a look, thanks! repId is removed, and more validation is now done within the shader.

@alan-baker left a comment
Contributor

Thanks for this fix! I also tested it on my MacBook M1.

@jzm-intel
Contributor Author

Thanks for reviewing!

@jzm-intel jzm-intel merged commit f2e2ada into gpuweb:main Nov 4, 2024
1 check passed