-
-
Notifications
You must be signed in to change notification settings - Fork 21.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[3.4] Split Vertex Buffer Stream in Positions and Attributes #46574
Conversation
97974b2
to
dee692c
Compare
Is there a godot master version I can review? |
At the moment I only have this implemented in 3.2 since 4.0 is still undergoing work on the rendering side from what I understand I think discussion is still ongoing as to what/if 4.0 will need something like this implemented (see the proposal linked for discussion) |
dee692c
to
dccfcbf
Compare
What do you think about making this a ProjectSetting instead of an import setting. I don't see a reason to import some meshes using the split vertex stream and some without. To me it seems like Desktop should use non-split and mobile should use split. In which case there can be a check in the rasterizer. I think making this a project setting would result in a simpler PR and would be more user-friendly. Please let me know if there is a good reason to have this in the importer. |
Also have you got any timing figures for the performance with and without the change on different hardware? It would be kind of nice to have some figures to see the difference. |
Ah that's a good thought and not something that I had considered before, looking at it I think that a project setting would make more sense also, I'll look into what that would take! Thank you for pointing me towards that! For performance, I know that for mobile it helps primarily with memory bandwidth metrics (and was the primary motivation for this change, 50% improvement) rather than gpu frame times, and I did do some cursory testing on desktop (nvidia 2070 super Max-Q laptop) and did not find much in the way of performance difference, but I will try to get more concrete numbers for that as well to share |
Implementation question - is there any way to trigger a reimport of all mesh assets when a project setting is changed? At the moment someone could change the project setting, but all previously imported assets would still have the old mesh data format and they would have to reimport the assets manually |
As far as I know, there is no way to do that yet (aside of removing the |
dccfcbf
to
d44a1c4
Compare
Gotcha, thanks for letting me know! Maybe we can make that an option in the future haha :) For now I was able to update the implementation to use project settings rather than import settings for splitting the vertex buffer streams, and will have similar functionality to the way that the VRAM texture compression settings will behave then I also removed the commit that added the mesh compression options since it is much less relevant now to this PR, and created a new PR with that feature addition here #48625 |
attribs[i].type = _GL_HALF_FLOAT_OES; | ||
stride += attribs[i].size * 2; | ||
uses_half_float = true; | ||
attribs[i].type = GL_HALF_FLOAT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this is a copy-paste error from GLES3. This line is causing the CI to fail because it should be _GL_HALF_FLOAT_OES
Implementation looks good on a preliminary review! Have you had any luck getting performance numbers on mobile? I would be really interested to see if there is a measurable difference. The TPS demo may be a good test project if you don't already have one. |
d44a1c4
to
77b5792
Compare
I have figured out the clang-format issue so hopefully static checker will be happy from now on :) Oh yes I definitely have the numbers on mobile to show this haha, I'm working on getting the screenshots right now :) Also working on getting the numbers for desktop using nvidia nsights, will also report that as well when I get them! |
I also collected data for desktop and at the moment it looks like performance is basically identical - This was in a static scene with 90,000,000 verts that were offscreen, so I would say having it on desktop will at least keep the same performance In Nvidia Nsight, the bottleneck here is in VRAM (both have it pinned at 97% SOL) and the L2 cache behavior is also basically the same (46% hit rate in this scene). There was a reduction in frame time (average 43ms to 41ms) but I would chalk that up to occasional power management/throttling I also did a test with the same scene but an additional dynamic light/shadow (but reduced down to 60,000,000 to make it a more reasonable framerate) and I got the same exact results between split and unsplit - although when I was checking things in renderdoc - I did notice that the shadow map generating draw calls had the vertex normal enabled - and I was wondering what they were used for in this scenario? As that could be a good area for improvement on both desktop and mobile This is also more anecdotal evidence, but I would also note from performance profiling other games on desktop (such as Apex Legends, or the default vertex buffer format in Unreal Engine) they all usually have some sort of vertex buffer stream splitting in place where positions are their own stream and attributes are distributed amongst other separate buffers, grouped by what data is most commonly used together in a shader/renderpass |
Ah sorry I think I misunderstood before about the static vs dynamic mesh thing - yes this is applied to static meshes only I see definitely see where you're coming from in terms of seeing real-world improvements, but I think the TPS scene in particular isn't the best way to show the improvements one would get from something like this because it's unreasonably pixel bound haha, at least on mobile :P I think in all scenarios though it will come out with same or improved performance We do have the data for static mesh scenes (single or not shouldn't make a difference) on mobile since that was what the first scenario I presented was testing (to be fair it wasn't a "single" static mesh but 30 copies of the same high poly single mesh, which I would argue end up with similar effect) And for tile based renderers (which are all mobile GPUs) - rendering to a single tile vs offscreen would show the same performance improvements that the first test scenario I presented showed, just with the added pixel shader work for fragments that end up on screen, regardless of the tile - the binning process occurs for ALL vertices before shading fragments (or the rest of the vertex calculations) ever begins For the power argument, I don't think I have a good way to measure the power improvement without running an example load/app on a device and running it until it dies haha - but we know that's it's a recommendation that both ARM[1][2] and Qualcomm make in their best practices guides |
If performance is within a gnat's todger with the split vertex format for static buffers on desktop, then to me that does sound a fair argument for just using split as the path for all static buffers, and not bothering with the split / non-split option. Unless there is a need for backward compatibility. What do you think @clayjohn ? 😄 |
@lawnjelly I agree that we need to do more performance analysis before we commit to anything. In theory this PR should result in performance benefits in some cases on mobile. And in theory it should degrade performance on desktop. The reduction in performance comes from the cache pressure resulting from reading the vertex position from one end of the vertex array while reading the other vertex attributes from another position in the array. Again, the performance consideration is only theoretical and may pan out to be negligible. |
@The-O-King Can you upload the vertex-heavy test project you used for testing (not the TPS demo)? This way, we can check the performance impact on desktop platforms. Thanks in advance 🙂 |
Oh yes sure thing - here's a google drive link to a zip of the project I've been using to test things - there's a sphere_scene_desktop.tscn which is what I used to measure the performance - currently there's a moving point light with shadows as well for when I was profiling that scenario https://drive.google.com/file/d/1X_Dc9NJfN6p8xYf4-g64SVuOEJCxJzrW/view?usp=sharing Let me know if this drive link works, or if there's a better/preferred way to share the project :) |
The drive link works, thanks 🙂 |
Nice little discussion of this on twitter today too, and on when the situation changed on this (looks like around 2015/2016?): |
Just been trying the PR on my desktop (Linux Mint 20, Intel HD Graphics 630) and seeing if there's any difference. With this quick contrived test I can get 129fps without splitting and 126fps with splitting, but I doubt there is much in it in practice. My test was procedurally generating a mesh with 1000000 triangles (3000000 unique verts) with uv, uv2, color, normal and tangent, and very small screen size and identical small triangle size. I don't even know if it uses that many verts or limits to 65535 in fact .. it could do in GLES3 though. Be interesting to try on more hardware these kinds of tests, especially mobile. |
I got a chance today to run your test project (really elegant btw I love it haha) and on my laptop which has a RTX 2070 MAX-Q I got identical performance when using split vs unsplit vertex buffers (avg 708.66fps) I was able to also capture some traces on my Pixel 4 and while average framerate was identical (avg 13.66fps) I did see a drop in vertex memory bandwidth in a frame (this benefit coming directly from bandwidth improvements during the binning phase I mentioned earlier) where for unsplit I saw peak and average bandwidth of 13.28GB/s and 3.7GB/s respectively, while on the split buffer test I saw peak and average bandwidth of 4.9GB/s and 2.8GB/s respectively. I've included links for you to download the Android GPU Inspector trace files I've taken in case you were interested at looking at the data yourself! You just need to have Android GPU Inspector downloaded to take a peek :) |
That's quite interesting in itself. I deliberately used identical position triangles and scaled small so that hopefully they would end up in the same tile. The idea being that this would stress the tile based renderer the most as it didn't get the opportunity to spread the geometry over several tiles. But perhaps there is something we are missing and we could run a better test - I'm no expert on tiled renderers. I'm sure a hardware guy could tell us more. |
Hello! Just wanted to bump this and see what the status is right now? Sorry it's been a while I got pulled into other work that's taken up my time So what are the current concerns about this change? Are we still unsure about it's benefits? My current understanding is that on mobile we see the memory bandwidth reductions we were expecting (and implicitly that is battery life improvement as well) and on desktop performance was largely the same (at least on the hardware we've tested so far) |
@clayjohn hello! Just wanted to bump this again, and see if there was anything I could do to help :) Thanks! |
There's a couple new project settings added by this PR which should be included in the docs, see the diff here: https://github.com/godotengine/godot/pull/46574/checks?check_run_id=2559453240 You can generate the template with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, subject to addition of docs for the project settings.
Update the documentation with doctool so hopefully things are ready after this! Thanks for taking a look everyone! |
Could you rebase to squash the commits? See PR workflow for instructions. |
af8c0b6
to
7bc9174
Compare
Oh thank you for letting me know about that I hadn't seen that before, was able to squash everything, let me know if everything is now in order |
doc/classes/ProjectSettings.xml
Outdated
@@ -1187,6 +1187,9 @@ | |||
<member name="rendering/lossless_compression/webp_compression_level" type="int" setter="" getter="" default="2"> | |||
The default compression level for lossless WebP. Higher levels result in smaller files at the cost of compression speed. Decompression speed is mostly unaffected by the compression level. Supported values are 0 to 9. Note that compression levels above 6 are very slow and offer very little savings. | |||
</member> | |||
<member name="rendering/mesh_storage/split_stream" type="bool" setter="" getter="" default="false"> | |||
On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enable if targeting mobile devices. Requires manual reimport of meshes after toggling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enable if targeting mobile devices. Requires manual reimport of meshes after toggling | |
On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enabled if targeting mobile devices. Requires manual reimport of meshes after toggling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the word here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You missed it in my diff but there's also a dot that should be added at the end of the sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh shoot you're so right sorry about that I hope that takes care of it now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might have missed pushing that last amend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you're right I noticed in the logs that I was getting unresolved host errors in my repo for some reason
7bc9174
to
1d02f4d
Compare
Implemented splitting of vertex positions and attributes in the vertex buffer Positions are sequential at the start of the buffer, followed by the additional attributes which are interleaved Made a project setting which enables/disabled the buffer formatting throughout the project Implemented in both GLES2 and GLES3 This improves performance particularly on tile-based GPUs as well as cache performance for something like shadow mapping which only needs position data Updated Docs and Project Setting
1d02f4d
to
7f8487a
Compare
Thanks! And congrats for your first merged Godot contribution 🎉 |
Provide an import option for enabling vertex stream splitting - where the vertex
buffer is organized into a contiguous block of position data and a contiguous
block of interleaved attribute data
This is an optimization particularly for mobile, tile-based renderers which can
use less memory bandwidth during their binning phase
Implementation of godotengine/godot-proposals#2350