Frustum-cull small draws #17808

Merged
merged 10 commits into from
Dec 9, 2023


hrydgard
Owner

@hrydgard hrydgard commented Jul 30, 2023

This is an experiment. Some games do a poor job of culling, and some transparent sprites can be very expensive if they cause a framebuffer copy. Skipping them when they are outside the viewport makes sense in that case.

For now we simply re-use the BBOX culling code, which is not very efficient and could be optimized.

One example is the flame sprites in #17797 .

!!!! Not for merging! It seems our culling isn't completely accurate and needs work. In GTA:LCS for example, some triangles intersecting the screen edges get culled by mistake. However it also manages to cull 600 draw calls per frame in that game (that inefficient water). Actually, my logic was just busted. It works!

Additionally, we should be able to cull through-mode draws easily, this one doesn't even try.

Anyway, there are a few ways we can go with this:

  • Check draws for culling only if weird blend/logic modes are enabled
  • Check small draws only for culling, but always do it

In many games, the number of draws culled is very small since games do a good job themselves. In other games, the draws culled can be pretty high. Avoiding draws entirely can save a lot of work doing things like texture binding etc, but this will not be beneficial for all games due to the extra work (plus, the culling code needs a lot of optimization if we're gonna apply it widely).

Also just realized that if applied widely, this is going to seriously mess with the vertex cache when we try to cache merged sequences of draw calls - if parts of one of those gets culled, there'll be a lot of combinations... The vertex cache is now gone, so not a concern.

Ended up doing culling in view space so we don't need to update the planes for every new world matrix, and additionally, SIMD-optimized the thing carefully. Now it's very fast and generally a win.

@hrydgard hrydgard added the GE emulation Backend-independent GPU issues label Jul 30, 2023
@hrydgard hrydgard added this to the v1.17.0 milestone Jul 30, 2023
hrydgard added a commit that referenced this pull request Jul 30, 2023
This isn't a huge performance boost for the games that use BBOX (like
Tekken), but it'll be more valuable if we start using soft culling more
widely automatically, see #17808
@unknownbrackets
Collaborator

Hm, this seems like a good idea for us to skip loading textures - could have interesting impacts on performance in some scenes. I think it'd be good (like software skinning, but hopefully not for as long...) to start as an option. That way people could experiment with the pros and cons and give feedback on heuristics/etc. to remove the option.

Potentially we could use less-accurate culling checks that are faster and split the func. I'd like to keep the bbox jumps accurate but we could deviate for our own, since it doesn't have to be exact - it just has to skip enough to be profitable.

Through mode is probably easy to cull but I wonder if it even happens often.

-[Unknown]

@hrydgard
Owner Author

hrydgard commented Jul 30, 2023

After some testing, on PC it's hard to beat just performing the draws vs doing this extra work to cull, for complex scenes like in God of War. In GTA though, it's pretty much even or a slight boost. Indeed, an option to experiment with is probably best.

(And also, there's a lot of room for optimization - running the whole NormalizeVertex is total overkill).

As for less accurate options, one would be to compute a bounding box during vertex loading.
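The bounding-box idea could look roughly like the sketch below: the box is grown in the same loop that decodes positions, so no second pass over the vertices is needed. All names here (`Bounds`, `DecodeS16WithBounds`) are hypothetical and not PPSSPP's vertex decoder, and the 1/32768 scale for signed 16-bit positions is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper, not PPSSPP code: an axis-aligned box that grows as
// points are added.
struct Bounds {
    float minV[3] = { std::numeric_limits<float>::max(),
                      std::numeric_limits<float>::max(),
                      std::numeric_limits<float>::max() };
    float maxV[3] = { std::numeric_limits<float>::lowest(),
                      std::numeric_limits<float>::lowest(),
                      std::numeric_limits<float>::lowest() };

    void Add(const float p[3]) {
        for (int c = 0; c < 3; c++) {
            minV[c] = std::min(minV[c], p[c]);
            maxV[c] = std::max(maxV[c], p[c]);
        }
    }
};

// Decodes `count` signed 16-bit positions (x, y, z per vertex) into floats
// and accumulates their bounding box in the same pass.
// Assumption: positions are scaled by 1/32768.
inline Bounds DecodeS16WithBounds(const int16_t *src, size_t count, float *dst) {
    assert(src != nullptr && dst != nullptr);
    const float scale = 1.0f / 32768.0f;
    Bounds bounds;
    for (size_t i = 0; i < count; i++) {
        float p[3];
        for (int c = 0; c < 3; c++) {
            p[c] = static_cast<float>(src[i * 3 + c]) * scale;
            dst[i * 3 + c] = p[c];
        }
        bounds.Add(p);
    }
    return bounds;
}
```

Testing only the eight corners of such a box is cheaper than testing every vertex, at the cost of culling slightly less, since the box is larger than the geometry it encloses.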

Also, there might be an interesting tradeoff to do this at Flush-time instead of at Submit-time. Not sure.

@hrydgard
Owner Author

hrydgard commented Jul 30, 2023

Okay, I made it a bit more conservative - many games submit a lot of tiny drawcalls which we end up joining together, and I now assume that if one drawcall passes culling in the "inner fastloop", all of them will, which cuts down a lot on checking in some games.
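A minimal sketch of that "one passed, so all pass" shortcut, under the assumption that a run of merged draws is handled in a single loop. `SmallDraw`, `FilterMergedRun` and the `isVisible` callback are hypothetical stand-ins, not the actual fastloop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for one tiny draw call inside a merged run.
struct SmallDraw {
    int id;
};

// Returns the draws that should be submitted. `isVisible` stands in for the
// software culling test. Once one draw passes, later draws in the run are
// kept without being tested at all.
inline std::vector<SmallDraw> FilterMergedRun(
        const std::vector<SmallDraw> &run,
        const std::function<bool(const SmallDraw &)> &isVisible,
        size_t *testsRun) {
    assert(testsRun != nullptr);
    std::vector<SmallDraw> kept;
    bool onePassed = false;
    *testsRun = 0;
    for (const SmallDraw &draw : run) {
        if (!onePassed) {
            (*testsRun)++;
            if (!isVisible(draw)) {
                continue;  // Culled: skip this draw entirely.
            }
            onePassed = true;
        }
        kept.push_back(draw);
    }
    return kept;
}
```

The tradeoff is visible in the loop: a draw that is actually off-screen but comes after a visible one is still submitted, so some culling is lost in exchange for far fewer tests.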

Additionally, I made a fast-path for non-skinned non-morph geometry, avoiding NormalizeVertices.

Now I can't really find any slowdowns even on PC, except my extreme GoW testcase which goes from 560 to 550 fps.

Though, we still need to solve or avoid the interaction with vertex cache before merging - it'll reduce its efficiency a bit. Actually maybe this is the time to delete the vertex cache ...

I'll do some Android testing.

Hm, also, MotoGP is glitching. Unclear why... (typo)

@hrydgard hrydgard force-pushed the frustum-cull-small-draws branch 2 times, most recently from 0e9aea0 to b09f120 Compare October 4, 2023 10:45
@hrydgard
Owner Author

hrydgard commented Oct 4, 2023

Rebased it, with the new draw call merging my pathological GoW case doesn't slow down much anymore (since we skip the culling machinery entirely in that case, once one draw has been proven visible).

So apart from the icky interactions with the vertex cache which needs solving, might consider actually merging this.

@hrydgard
Owner Author

Current status: This does work quite well, but is blocked on #18339 , and I also want to make sure that games don't end up in the "NormalizeVertices" path here, since it'll likely be expensive enough to eat up any wins.

@hrydgard hrydgard force-pushed the frustum-cull-small-draws branch 3 times, most recently from 389aba0 to 1746c35 Compare November 13, 2023 22:09
@hrydgard hrydgard marked this pull request as ready for review November 13, 2023 23:44
@hrydgard
Owner Author

Together with previous optimizations to drawing, this is already fast enough now. But will of course have varying benefit in different games, some like Wipeout end up net zero since they already cull very efficiently.

The previous vertex cache concern is now also gone, since it's, well, gone.

So starting to think about just merging it without even adding an option, just to solve #17797... One concern might be that without SIMD'd matrix multiplies in the update function, maybe it'll incur some slowdown? Not sure. There are also some more possible optimizations to implement.

@hrydgard
Owner Author

hrydgard commented Dec 5, 2023

Optimized it a bit. I'm still a little afraid of performance regressions from the large number of plane updates triggered in some games, like Burnout Dominator. We could make far fewer of those happen by not including the world matrix in the planes (since the world matrix is by far the most commonly updated one), but then the plane checks would be a bit more expensive. Tricky tradeoffs.

Though, in practice, I don't see much performance regression anywhere, but also where there are improvements they are not big. So still a bit in doubt here about the overall value, except for that one game in #17797 which will improve a lot :/

@hrydgard
Owner Author

hrydgard commented Dec 9, 2023

I moved it into view space (to avoid updating the planes on every world matrix change, at the cost of transforming each vertex instead) and SSE-optimized it. Not seeing any perf regressions anymore, only wins. So I'll just do NEON as well tomorrow and get it in.
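To illustrate why view space avoids the plane updates: the six planes can be derived from the projection matrix alone, so they only change when the projection does, and each vertex is brought into view space before testing. The sketch below is not the merged code. It assumes a row-major matrix applied as clip = P * (x, y, z, 1), an OpenGL-style clip volume (-w <= x, y, z <= w), and hypothetical names throughout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Vec3 { float x, y, z; };
struct Plane { float a, b, c, d; };  // Inside when a*x + b*y + c*z + d >= 0.

// Row-major 4x4 projection matrix, applied as clip = P * (x, y, z, 1).
using Mat4 = std::array<float, 16>;

// Builds view-space planes from the projection matrix only.
// Assumption: OpenGL-style clip volume, -w <= x, y, z <= w.
inline std::array<Plane, 6> PlanesFromProjection(const Mat4 &p) {
    auto combine = [&p](int r, float sign) {
        const float *w = &p[3 * 4];
        const float *v = &p[r * 4];
        return Plane{ w[0] + sign * v[0], w[1] + sign * v[1],
                      w[2] + sign * v[2], w[3] + sign * v[3] };
    };
    std::array<Plane, 6> planes{};
    planes[0] = combine(0, 1.0f);   // left:   w + x >= 0
    planes[1] = combine(0, -1.0f);  // right:  w - x >= 0
    planes[2] = combine(1, 1.0f);   // bottom: w + y >= 0
    planes[3] = combine(1, -1.0f);  // top:    w - y >= 0
    planes[4] = combine(2, 1.0f);   // near:   w + z >= 0
    planes[5] = combine(2, -1.0f);  // far:    w - z >= 0
    return planes;
}

// Conservative test on view-space vertices: returns true only when every
// vertex is outside one and the same plane, so geometry that straddles a
// screen edge is never culled.
inline bool AllOutsideOnePlane(const std::array<Plane, 6> &planes,
                               const Vec3 *verts, size_t count) {
    assert(verts != nullptr || count == 0);
    for (const Plane &pl : planes) {
        bool allOutside = count > 0;
        for (size_t i = 0; i < count && allOutside; i++) {
            const Vec3 &v = verts[i];
            allOutside = pl.a * v.x + pl.b * v.y + pl.c * v.z + pl.d < 0.0f;
        }
        if (allOutside) {
            return true;
        }
    }
    return false;
}
```

With this layout a world matrix change costs nothing on the plane side; the price is one world-view transform per tested vertex, which is the part that benefits from SIMD.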

Some games do a poor job of culling stuff, and some transparent
sprites can be very expensive if they cause a copy.
Skipping them if outside the viewport makes sense in that case.

One example is the flame sprites in #17797 .

Additionally, we should be able to cull through-mode draws easily, this
one doesn't even try.
@hrydgard
Owner Author

hrydgard commented Dec 9, 2023

There, I think this is finally done. It's actually a noticeable boost now, instead of a loss, even in God of War.

The amount of culling we get from this varies hugely between games. In LCS we cull 500 (tiny) draw calls per frame, in Wipeout around 10-20, in Tekken a bit more, in Virtua Tennis a lot.

@hrydgard hrydgard changed the title Frustum-cull small draws (experiment) Frustum-cull small draws Dec 9, 2023
@hrydgard hrydgard merged commit 27e47d9 into master Dec 9, 2023
18 checks passed
@hrydgard hrydgard deleted the frustum-cull-small-draws branch December 9, 2023 16:23
@hrydgard
Owner Author

hrydgard commented Dec 9, 2023

Hm, this broke Outrun. Can't find anything else that's broken. Weird!

EDIT: Fixed in 904ce4f

@fp64
Contributor

fp64 commented Dec 9, 2023

Instead of

// Sign extension. Ugly without SSE4.
bits = _mm_srai_epi32(_mm_unpacklo_epi16(bits, bits), 16);
__m128 pos = _mm_mul_ps(_mm_cvtepi32_ps(bits), scaleFactor);

maybe

bits = _mm_unpacklo_epi16(_mm_set1_epi32(0), bits);
__m128 pos = _mm_mul_ps(_mm_cvtepi32_ps(bits), scaleFactor2); // scaleFactor2=2^(-(15+16))

Zero probably would be computed outside of the loop.

@hrydgard
Owner Author

hrydgard commented Dec 10, 2023

@fp64 Doesn't work. _mm_srai_epi32 is there to compute the sign extension. The unpack is effectively a left-shift by 16 bits, then we right shift duplicating the sign bits to generate sign-extended 32-bit versions of the original 16-bit values.

Actually never mind, I misread. Your thing will probably work yes, since in that we incorporate the right shift in the scale factor. Clever!
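For anyone following along, here is a scalar model of one 32-bit lane showing why the two variants agree. It is a sketch, not the SIMD code itself: `UnpackLane` mimics what `_mm_unpacklo_epi16` does to a single lane, and it assumes `scaleFactor` is 2^-15 (so `scaleFactor2` is 2^-31). It also relies on `>>` being an arithmetic shift for negative `int32_t`, which holds on mainstream compilers and is guaranteed from C++20.

```cpp
#include <cassert>
#include <cstdint>

// Models one 32-bit lane of _mm_unpacklo_epi16(first, second): the first
// operand supplies the low 16 bits, the second the high 16 bits.
inline int32_t UnpackLane(uint16_t lo, uint16_t hi) {
    return static_cast<int32_t>((static_cast<uint32_t>(hi) << 16) | lo);
}

// Original approach: duplicate the value into both halves, then shift right
// arithmetically by 16 to get a sign-extended 32-bit integer.
inline float DecodeWithShift(int16_t v) {
    const uint16_t bits = static_cast<uint16_t>(v);
    const int32_t extended = UnpackLane(bits, bits) >> 16;
    return static_cast<float>(extended) * (1.0f / 32768.0f);  // 2^-15
}

// Suggested approach: put the value in the high half (a left shift by 16)
// and fold the missing right shift into the scale factor.
inline float DecodeWithScale(int16_t v) {
    const int32_t shifted = UnpackLane(0, static_cast<uint16_t>(v));
    return static_cast<float>(shifted) * (1.0f / 2147483648.0f);  // 2^-31
}

// Exhaustively checks that both variants agree for every 16-bit input.
inline void CheckAgreement() {
    for (int i = -32768; i <= 32767; i++) {
        const int16_t v = static_cast<int16_t>(i);
        assert(DecodeWithShift(v) == DecodeWithScale(v));
    }
}
```

The results match exactly rather than approximately: a 16-bit value shifted left by 16 still has at most 16 significant bits, so the int-to-float conversion is exact, and multiplying by a power of two does not round.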

bool passCulling = onePassed || PASSES_CULLING;
if (!passCulling) {
    // Do software culling.
    if (drawEngineCommon_->TestBoundingBox(verts, inds, count, vertexType)) {
Collaborator

I think you meant for this one to be TestBoundingBoxFast() too?

-[Unknown]

Owner Author

Oops, fixing

Labels
GE emulation Backend-independent GPU issues
3 participants