-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reuse sampler when creating cached bind groups #10610
Conversation
0989e03
to
f960576
Compare
I rebased this onto main and did some benchmarking on the tonemapping example, the results are interesting. tracy shows 0.05ms reduction in mean time (1.56 before, 1.51 after), 0.02ms reduction in median time (1.42 before, 1.4 after) nsight however, often shows a 10x increase in frametime. it reports 11-15ms per frame with this PR, 1.4ms without, but every run is different. some runs stay at a solid 3.4ms, most stay at 11, some stay at 15ms. I also got the message "This application appears to be using a significant amount of coherently mapped memory. This may increase the time necessary to capture." when trying to get a frame capture |
f960576
to
1c76a09
Compare
So I tried looking at the frame times by attaching RenderDoc and Nsight, but in both cases I couldn't reproduce the performance impact. Whether on main (1c76a09) or on this PR I got around 2.2ms across runs with RenderDoc and 2.7ms with Nsight, with some slight variations but consistently close to those values for both. Worst case I got was some runs running around 0.3-4ms slower than those averages, and I reproduced that both for main and this PR. Edit: Tested on Arch with a 3060 mobile, latest drivers. |
I tested on windows 11 on a 2070 RTX, i'll try a few other computers and with latest drivers and see if i can repro but it was consistent for me. |
This is consistent across machines, with latest drivers. Happens on windows 10, windows 11, and with 2070 RTX and 2080 Ti RTX. Not sure how to further debug this. It always just looks like its making gargantuan amounts of buffer memory barriers interleaved with coherent buffer copies, but the exact number of them changes per run. Sometimes its way more than others, but always regular per-frame for any given run. When it happens to get fewer of them it outperforms main at least, but its rare. Usually more than half of the frame is spent doing these pointless barriers and copies. |
This should be good to merge once #11515 merges. I have confirmed pretty substantial CPU perf wins with this PR plus that fix, around 0.2-0.3ms savings |
As @DGriffin91 pointed out, we should verify that this doesn't cause large performance regressions on more realistic scenes involving animated materials. We don't currently have a good example to test that though: lemme make an issue. |
#11524 shows that this is good for merge as-is |
# Objective - Fixes #11516 ## Solution - Add Animated Material example (colors are hue-cycling smoothly per-mesh) ![image](https://github.com/bevyengine/bevy/assets/11307157/c75b9e66-0019-41b8-85ec-647559c6ba01) Note: this example reproduces the perf issue found in #10610 pretty consistently, with and without the changes from that PR included. Frame time is sometimes around 4.3ms, other times around 12-14ms. Its pretty random per run. I think this clears #10610 for merge.
# Objective - Some passes recreate a sampler when creating a bind group to be cached, even if the sampler is always the same. ## Solution - Store the sampler in the corresponding pipeline resource.
# Objective - Fixes bevyengine#11516 ## Solution - Add Animated Material example (colors are hue-cycling smoothly per-mesh) ![image](https://github.com/bevyengine/bevy/assets/11307157/c75b9e66-0019-41b8-85ec-647559c6ba01) Note: this example reproduces the perf issue found in bevyengine#10610 pretty consistently, with and without the changes from that PR included. Frame time is sometimes around 4.3ms, other times around 12-14ms. Its pretty random per run. I think this clears bevyengine#10610 for merge.
Objective
Solution