Frustum-cull small draws #17808

hrydgard · 2023-07-30T10:15:19Z

This is an experiment. Some games do a poor job of culling, and some transparent sprites can be very expensive if they cause a framebuffer copy. Skipping them when they are outside the viewport makes sense in that case. We simply re-use the bbox code for now, though this could be optimized.

~~We are simply re-using the BBOX culling code, which is not very efficient.~~

One example is the flame sprites in #17797 .

!!!! Not for merging! It seems our culling isn't completely accurate and needs work. In GTA:LCS for example, some triangles intersecting the screen edges get culled by mistake. However it also manages to cull 600 draw calls per frame in that game (that inefficient water). Actually, my logic was just busted. It works!

Additionally, we should be able to cull through-mode draws easily, this one doesn't even try.

Anyway, there are a few ways we can go with this:

- Check draws for culling only if weird blend/logic modes are enabled
- Check small draws only for culling, but always do it

In many games, the number of draws culled is very small since games do a good job themselves. In other games, the draws culled can be pretty high. Avoiding draws entirely can save a lot of work doing things like texture binding etc, but this will not be beneficial for all games due to the extra work (plus, the culling code needs a lot of optimization if we're gonna apply it widely).

Also just realized that if applied widely, this is going to seriously mess with the vertex cache when we try to cache merged sequences of draw calls - if parts of one of those gets culled, there'll be a lot of combinations... The vertex cache is now gone, so not a concern.

Ended up doing culling in view space so we don't need to update the planes for every new world matrix, and additionally, SIMD-optimized the thing carefully. Now it's very fast and generally a win.

unknownbrackets · 2023-07-30T14:10:38Z

Hm, this seems like a good idea for us to skip loading textures - could have interesting impacts on performance in some scenes. I think it'd be good (like software skinning, but hopefully not for as long...) to start as an option. That way people could experiment with the pros and cons and give feedback on heuristics/etc. to remove the option.

Potentially we could use less-accurate culling checks that are faster and split the func. I'd like to keep the bbox jumps accurate but we could deviate for our own, since it doesn't have to be exact - it just has to skip enough to be profitable.

Through mode is probably easy to cull but I wonder if it even happens often.
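To illustrate why through-mode culling could be cheap, here is a minimal sketch. It assumes through-mode vertices are already in screen-space pixel coordinates, so rejection reduces to a 2D bounds check against the drawable rectangle. `ScreenVertex`, `Rect` and `ThroughDrawOutsideRect` are hypothetical names for illustration, not code from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical types: a through-mode vertex position in pixels and the
// rectangle (scissor or viewport) that rendering is limited to.
struct ScreenVertex { float x, y; };
struct Rect { float left, top, right, bottom; };

// Returns true when every vertex lies strictly beyond the same edge of the
// rectangle, so the draw cannot touch any pixel inside it.
// Draws that merely touch an edge are kept, to stay conservative.
bool ThroughDrawOutsideRect(const ScreenVertex *verts, size_t count, const Rect &area) {
    assert(verts != nullptr || count == 0);
    if (count == 0)
        return true;  // nothing to draw
    float minX = verts[0].x, maxX = verts[0].x;
    float minY = verts[0].y, maxY = verts[0].y;
    for (size_t i = 1; i < count; i++) {
        minX = std::min(minX, verts[i].x);
        maxX = std::max(maxX, verts[i].x);
        minY = std::min(minY, verts[i].y);
        maxY = std::max(maxY, verts[i].y);
    }
    return maxX < area.left || minX > area.right ||
           maxY < area.top || minY > area.bottom;
}
```

Whether this pays off depends on how often games actually submit off-screen through-mode draws, which is the open question raised above.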

-[Unknown]

hrydgard · 2023-07-30T15:05:50Z

After some testing, on PC it's hard to beat just performing the draws vs doing this extra work to cull, for complex scenes like in God of War. In GTA though, it's pretty much even or a slight boost. Indeed, an option to experiment with is probably best.

(And also, there's a lot of room for optimization - running the whole NormalizeVertex is total overkill).

As for less accurate options, one would be to compute a bounding box during vertex loading.
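A rough sketch of the bounding-box idea, under stated assumptions: the box is accumulated while vertices are copied (a stand-in for real vertex decoding), and the planes are expressed in the same space as the decoded positions. All names here (`Bounds`, `DecodeAndBound`, `BoxOutsidePlane`) are hypothetical and not taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cstddef>

struct Vec3 { float x, y, z; };
// A point is on the inside of the plane when nx*x + ny*y + nz*z + d >= 0.
struct Plane { float nx, ny, nz, d; };

// Axis-aligned bounds that grow as vertices are added.
struct Bounds {
    Vec3 min{ FLT_MAX, FLT_MAX, FLT_MAX };
    Vec3 max{ -FLT_MAX, -FLT_MAX, -FLT_MAX };
    void Add(const Vec3 &p) {
        min.x = std::min(min.x, p.x); max.x = std::max(max.x, p.x);
        min.y = std::min(min.y, p.y); max.y = std::max(max.y, p.y);
        min.z = std::min(min.z, p.z); max.z = std::max(max.z, p.z);
    }
};

// Copies vertices (placeholder for decoding) and tracks the bounds in the same pass.
Bounds DecodeAndBound(const Vec3 *src, Vec3 *dst, size_t count) {
    assert(count == 0 || (src != nullptr && dst != nullptr));
    Bounds b;
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
        b.Add(dst[i]);
    }
    return b;
}

// Tests only the box corner furthest along the plane normal. If even that
// corner is behind the plane, the whole box (and every vertex in it) is outside.
bool BoxOutsidePlane(const Bounds &b, const Plane &p) {
    const Vec3 corner{
        p.nx >= 0.0f ? b.max.x : b.min.x,
        p.ny >= 0.0f ? b.max.y : b.min.y,
        p.nz >= 0.0f ? b.max.z : b.min.z,
    };
    return p.nx * corner.x + p.ny * corner.y + p.nz * corner.z + p.d < 0.0f;
}
```

The appeal is that the per-draw test touches one corner per plane instead of every vertex. The cost is accuracy: a box can intersect the frustum even when the geometry inside it does not.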

Also, there might be an interesting tradeoff to do this at Flush-time instead of at Submit-time. Not sure.

hrydgard · 2023-07-30T17:37:05Z

Okay, I made it a bit more conservative - many games submit a lot of tiny drawcalls which we end up joining together, and I now assume that if one drawcall passes culling in the "inner fastloop", all of them will, which cuts down a lot on checking in some games.
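The "one passes, all pass" assumption can be sketched roughly as follows. This is an illustration only: `SmallDraw`, `RunResult` and `SubmitMergedRun` are hypothetical, and the visibility test is passed in as a callback rather than being the real culling code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for one small draw inside a run that gets merged.
struct SmallDraw {
    int id;
};

struct RunResult {
    size_t submitted = 0;  // draws passed on for rendering
    size_t cullTests = 0;  // how many times the culling test actually ran
};

// Tests draws until the first one is found visible, then stops testing and
// submits everything that follows in the same run.
RunResult SubmitMergedRun(const std::vector<SmallDraw> &run,
                          const std::function<bool(const SmallDraw &)> &isVisible) {
    RunResult result;
    bool onePassed = false;
    for (const SmallDraw &draw : run) {
        if (!onePassed) {
            result.cullTests++;
            if (!isVisible(draw))
                continue;  // culled: skip this draw entirely
            onePassed = true;
        }
        result.submitted++;
    }
    return result;
}
```

The shortcut never culls a visible draw. It only gives up on culling later draws in the run, which bounds the overhead in scenes where most geometry is on screen anyway.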

Additionally, I made a fast-path for non-skinned non-morph geometry, avoiding NormalizeVertices.

Now I can't really find any slowdowns even on PC, except my extreme GoW testcase which goes from 560 to 550 fps.

Though, we still need to solve or avoid the interaction with vertex cache before merging - it'll reduce its efficiency a bit. Actually maybe this is the time to delete the vertex cache ...

I'll do some Android testing.

~~Hm, also, MotoGP is glitching. Unclear why...~~ (typo)

hrydgard · 2023-10-04T10:46:10Z

Rebased it, with the new draw call merging my pathological GoW case doesn't slow down much anymore (since we skip the culling machinery entirely in that case, once one draw has been proven visible).

So apart from the icky interactions with the vertex cache which needs solving, might consider actually merging this.

hrydgard · 2023-10-10T15:20:33Z

Current status: This does work quite well, but is blocked on #18339 , and I also want to make sure that games don't end up in the "NormalizeVertices" path here, since it'll likely be expensive enough to eat up any wins.

hrydgard · 2023-11-13T23:49:00Z

Together with previous optimizations to drawing, this is already fast enough now. But will of course have varying benefit in different games, some like Wipeout end up net zero since they already cull very efficiently.

The previous vertex cache concern is now also gone, since it's, well, gone.

So starting to think about just merging it without even adding an option, just to solve #17797... One concern might be that without SIMD'd matrix multiplies in the update function, maybe it'll incur some slowdown? Not sure. There are also some more possible optimizations to implement.

hrydgard · 2023-12-05T09:46:15Z

Optimized it a bit. I'm still a little bit afraid of performance regressions from the large number of plane updates that are caused in some games like Burnout Dominator. We could make far fewer of those happen by not including the world matrix in the planes (since the world matrix is by far the most commonly updated one), but then the plane checks will be a bit more expensive. Tricky tradeoffs.

Though, in practice, I don't see much performance regression anywhere, but also where there are improvements they are not big. So still a bit in doubt here about the overall value, except for that one game in #17797 which will improve a lot :/

hrydgard · 2023-12-09T00:41:53Z

I moved it into view space (to avoid updating the planes on every world matrix change, at the cost of transforming each vertex instead) and SSE-optimized it. Not seeing any perf regressions anymore, only wins. So I'll just do NEON as well tomorrow and get it in.
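As a scalar illustration of that tradeoff (not the SSE code from this PR), the sketch below keeps the frustum planes fixed in view space and transforms each vertex by a combined world-view matrix before testing it. The six-plane layout, the `Mat3x4` row-major layout and all names are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Vec3 { float x, y, z; };
// Row-major 3x4 affine matrix: each row is dotted with (x, y, z, 1).
struct Mat3x4 { float r[3][4]; };
// A view-space point is inside the plane when nx*x + ny*y + nz*z + d >= 0.
struct Plane { float nx, ny, nz, d; };

inline Vec3 TransformPoint(const Mat3x4 &m, const Vec3 &p) {
    return Vec3{
        m.r[0][0] * p.x + m.r[0][1] * p.y + m.r[0][2] * p.z + m.r[0][3],
        m.r[1][0] * p.x + m.r[1][1] * p.y + m.r[1][2] * p.z + m.r[1][3],
        m.r[2][0] * p.x + m.r[2][1] * p.y + m.r[2][2] * p.z + m.r[2][3],
    };
}

// Returns true only if all vertices are outside the same plane, so a draw
// that straddles the frustum is never culled.
bool DrawOutsideFrustum(const Vec3 *verts, size_t count, const Mat3x4 &worldView,
                        const Plane (&planes)[6]) {
    assert(verts != nullptr || count == 0);
    uint32_t outsideAll = 0x3F;  // one bit per plane, all set initially
    for (size_t i = 0; i < count; i++) {
        const Vec3 v = TransformPoint(worldView, verts[i]);
        uint32_t outside = 0;
        for (int p = 0; p < 6; p++) {
            const float dist = planes[p].nx * v.x + planes[p].ny * v.y +
                               planes[p].nz * v.z + planes[p].d;
            if (dist < 0.0f)
                outside |= 1u << p;
        }
        outsideAll &= outside;
        if (outsideAll == 0)
            return false;  // no single plane rejects every vertex so far
    }
    return true;
}
```

Because the planes do not depend on the world matrix, changing it only swaps the matrix passed in here. Nothing needs recomputing per world-matrix update, at the price of one transform per tested vertex.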

hrydgard · 2023-12-09T15:53:35Z

There, I think this is finally done. It's actually a noticeable boost now, instead of a loss, even in God of War.

The amount of culling we get from this varies hugely between games. In LCS we cull 500 (tiny) draw calls per frame, in Wipeout around 10-20, in Tekken a bit more, in Virtua Tennis a lot.

hrydgard · 2023-12-09T17:26:28Z

~~Hm, this broke Outrun. Can't find anything else that's broken. Weird!~~

EDIT: Fixed in 904ce4f

fp64 · 2023-12-09T22:29:56Z

Instead of

```cpp
// Sign extension. Ugly without SSE4.
bits = _mm_srai_epi32(_mm_unpacklo_epi16(bits, bits), 16);
__m128 pos = _mm_mul_ps(_mm_cvtepi32_ps(bits), scaleFactor);
```

maybe

```cpp
bits = _mm_unpacklo_epi16(_mm_set1_epi32(0), bits);
__m128 pos = _mm_mul_ps(_mm_cvtepi32_ps(bits), scaleFactor2); // scaleFactor2=2^(-(15+16))
```

Zero probably would be computed outside of the loop.

hrydgard · 2023-12-10T01:29:42Z

@fp64 Doesn't work. _mm_srai_epi32 is there to compute the sign extension. The unpack is effectively a left-shift by 16 bits, then we right shift duplicating the sign bits to generate sign-extended 32-bit versions of the original 16-bit values.

Actually never mind, I misread. Your thing will probably work yes, since in that we incorporate the right shift in the scale factor. Clever!
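To make the equivalence concrete, here is a scalar model of a single 32-bit lane, written as a sketch rather than as the actual SIMD code. It assumes the original scale factor is 2^-15, as the `15+16` in the suggestion implies. The function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Variant 1: sign-extend the 16-bit value to 32 bits, then scale by 2^-15.
inline float ConvertSignExtend(int16_t v) {
    const float scaleFactor = 1.0f / 32768.0f;  // 2^-15
    return static_cast<float>(static_cast<int32_t>(v)) * scaleFactor;
}

// Variant 2: put the 16 bits in the upper half of the lane (low half zero),
// reinterpret the lane as a signed 32-bit integer, and fold the missing
// right shift by 16 into the scale factor: 2^-(15+16) = 2^-31.
inline float ConvertShiftedScale(int16_t v) {
    const float scaleFactor2 = std::ldexp(1.0f, -31);  // 2^-31
    const uint32_t lane = static_cast<uint32_t>(static_cast<uint16_t>(v)) << 16;
    int32_t asSigned;
    std::memcpy(&asSigned, &lane, sizeof(asSigned));
    return static_cast<float>(asSigned) * scaleFactor2;
}

// Checks every possible 16-bit input. Both paths are exact in float, since
// the shifted integer has at most 16 significant bits and the scale factors
// are powers of two.
inline bool ConversionsAgree() {
    for (int i = -32768; i <= 32767; i++) {
        const int16_t v = static_cast<int16_t>(i);
        if (ConvertSignExtend(v) != ConvertShiftedScale(v))
            return false;
    }
    return true;
}
```

The sign bit ends up in bit 31 of the lane either way, which is why the integer-to-float conversion sees the right sign without an explicit arithmetic shift.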

unknownbrackets · 2023-12-28T00:46:09Z

GPU/GPUCommonHW.cpp

```diff
+			bool passCulling = onePassed || PASSES_CULLING;
+			if (!passCulling) {
+				// Do software culling.
+				if (drawEngineCommon_->TestBoundingBox(verts, inds, count, vertexType)) {
```

I think you meant for this one to be TestBoundingBoxFast() too?

-[Unknown]

hrydgard · 2023-12-28

Oops, fixing

hrydgard added the GE emulation Backend-independent GPU issues label Jul 30, 2023

hrydgard added this to the v1.17.0 milestone Jul 30, 2023

hrydgard force-pushed the frustum-cull-small-draws branch from 576f65d to 3aa8422 Compare July 30, 2023 10:57

hrydgard added a commit that referenced this pull request Jul 30, 2023

Cache planes used for BBOX culling

061131e

This isn't a huge performance boost for the games that use BBOX (like Tekken), but it'll be more valuable if we start using soft culling more widely automatically, see #17808

hrydgard mentioned this pull request Jul 30, 2023

Cache computed planes used for BBOX culling #17810

Merged

hrydgard force-pushed the frustum-cull-small-draws branch from 07bf171 to eabdee7 Compare July 30, 2023 17:34

hrydgard force-pushed the frustum-cull-small-draws branch from eabdee7 to 6507890 Compare July 31, 2023 12:21

hrydgard force-pushed the frustum-cull-small-draws branch 2 times, most recently from 0e9aea0 to b09f120 Compare October 4, 2023 10:45

hrydgard force-pushed the frustum-cull-small-draws branch from b09f120 to 96a59cb Compare October 10, 2023 08:39

hrydgard force-pushed the frustum-cull-small-draws branch from 21c1d1b to 04f0885 Compare October 16, 2023 18:20

hrydgard force-pushed the frustum-cull-small-draws branch 3 times, most recently from 389aba0 to 1746c35 Compare November 13, 2023 22:09

hrydgard marked this pull request as ready for review November 13, 2023 23:44

hrydgard mentioned this pull request Nov 26, 2023

Minor bbox optimizations, assorted bugfixes #18446

Merged

hrydgard force-pushed the frustum-cull-small-draws branch from 1746c35 to 8894e03 Compare November 26, 2023 16:23

hrydgard force-pushed the frustum-cull-small-draws branch from 8894e03 to 746d320 Compare December 5, 2023 00:13

hrydgard force-pushed the frustum-cull-small-draws branch from 746d320 to 5db2bbe Compare December 9, 2023 00:39

hrydgard force-pushed the frustum-cull-small-draws branch from 08ce69d to 6d5a27f Compare December 9, 2023 13:30

hrydgard added 2 commits December 9, 2023 15:55

Use a less accurate but faster frustum cull for the general draws.

89d8ef8

hrydgard added 5 commits December 9, 2023 15:55

Fastcull: SSE/NEON-optimize 16-bit position conversion

dbf796b

World space planes

a043962

Flip the cull plane data around to avoid transforming each vertex multiple times.

62c936b

SSE-optimize the frustum culling

5b44e25

NEON-optimize the culling

6a7ef83

hrydgard force-pushed the frustum-cull-small-draws branch from 19d4772 to c5a94c3 Compare December 9, 2023 14:58

NEON culling: Use mla operations to shave off some more cycles. ARM32 compat.

99548be

hrydgard force-pushed the frustum-cull-small-draws branch from c5a94c3 to 440b832 Compare December 9, 2023 15:41

hrydgard added 2 commits December 9, 2023 16:48

NEON: vcvtq can scale directly, no need for a mul by const.

4e2a1bf

Disable the new culling on RISC-V for now.

7e85d3d

hrydgard force-pushed the frustum-cull-small-draws branch from 33dd7cd to 7e85d3d Compare December 9, 2023 15:49

hrydgard changed the title ~~Frustum-cull small draws (experiment)~~ Frustum-cull small draws Dec 9, 2023

hrydgard merged commit 27e47d9 into master Dec 9, 2023
18 checks passed

hrydgard deleted the frustum-cull-small-draws branch December 9, 2023 16:23

unknownbrackets reviewed Dec 28, 2023
