Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GLES2 2d Batch rendering #35957

Closed
wants to merge 1 commit into from
Closed

Conversation

lawnjelly
Copy link
Member

@lawnjelly lawnjelly commented Feb 6, 2020

2d rendering is currently bottlenecked by drawing rects one at a time, limiting OpenGL efficiency. This PR batches rects and renders in fewer drawcalls, resulting in significant performance improvements. This also speeds up text rendering.

The code dynamically chooses between a vertex format with and without color, depending on the input data for a frame, in order to optimize throughput and maximize batch size.

Notes

  • Although it only batches rects to start with, the same framework can be used to batch the other primitives.

  • Can only batch within _canvas_item_render_commands. Notably this covers tilemaps, text, the _draw API but NOT separate sprites (we evaluated this and it would require far more complex changes).

  • There is a high speed simple growable pod dynamic array template included. For now it is included in the drivers/gles2 directory.

I hesitate to put it in core as it is really only intended for these rasterizers and is not general purpose. It will be very easy to replace when we have a high speed, non-COW vector in the future.

  • I'm using indexed primitives so that only 4 verts are required per quad. Indices in GLES2 however can only be 16 bit, so they are limited to addressing 65535 verts (from the origin). For this reason there is a maximum size for a vertex buffer, 1 quad less than would fit into 65535. When this occurs, the routine ends the batch, draws the current batches, resets and starts filling batches again where it left off.

  • Swapping between colored / non-colored vertex format is currently done based on a simple heuristic at runtime. It is also possible to use hysteresis, or to allow the user to switch manually, however this has not been implemented yet, it may not be required.

  • Now tested on Linux, Windows 10, Android, WebGL. The more testing the better, and should anyone be able to try it on their devices / more platforms that would be welcome.

  • Because there is still the possibility for bugs, I have added (temporarily) a Project setting : rendering/quality/2d/use_batching which defaults to true, but can be turned off if there are problems. The legacy non-batched method is still included.

  • Just in case it affects people being able to even start up the editor in 2d, I have disabled the batching in the IDE until we have some feedback. If all goes well it could be enabled in the IDE by default with another PR.

What kind of speed increases can I expect?

It depends on a number of factors, including what you are rendering and what the hardware is. Benefit roughly scales with the number of quads. There will be less benefit in games that are fill rate limited already. In my tests with 2d games / tests, frame rate increases of 2-10x were typical. I was getting larger speed increases on desktop than on my tablet.

Note there are some cases (notably bunnymark v2) which can't be batched currently, see below. Any more figures doing the comparison on different hardware would be welcome.

Increases will be higher in release build. With some more tweaks I have now clocked 58x increase in frame rate in vertex throughput limited scene ( #19917 ).

@lawnjelly lawnjelly requested a review from reduz as a code owner February 6, 2020 16:46
@clayjohn clayjohn added enhancement topic:rendering cherrypick:3.x Considered for cherry-picking into a future 3.x release labels Feb 6, 2020
@clayjohn clayjohn added this to the 4.0 milestone Feb 6, 2020
@Geequlim
Copy link
Contributor

Geequlim commented Feb 6, 2020

Does it increase the bunnymark score?

@lawnjelly
Copy link
Member Author

lawnjelly commented Feb 6, 2020

Does it increase the bunnymark score?

Bunnymark V2 GDscript (uses Sprite)

Old 1870, Batched 1870

No it isn't faster, you may have found a rare case it can't accelerate. 😢

At a guess I'd think each bunny is causing a separate call to _canvas_item_render_commands. I can only batch within that function, if it is called with e.g. 1 bunny each time, there is no batching. Also in the single case, it is currently using the same codepath as nvidia workaround, which is slower. I can easily put in a special case for this kind of thing, so it is never slower than the old version (edit DONE).

I did have a look initially at batching over multiple _canvas_item_render_commands calls, however because of the historical design of the engine, it looked quite a bit more complex. I think the best I can do in that case is to just fall back to the legacy method, it will probably be quicker for a single quad (because the batching uses index buffers etc).

Bunnymark V1 DrawTex GDScript (uses draw_texture)

Old 2060, Batched 3238

I suspect the low increase is because something else is bottlenecking it, I haven't looked in more detail. Each bunny is a separate node running gdscript so that's going to be pretty slow compared to things like tilemaps.

Bunnymark V1Sprites
Wouldn't converge to a solution at all in old mode? I think there's a problem with the benchmark, it was hovering around 56/57 fps and not giving any figures.

Overall, bunnymark might be useful for comparing languages but it's not a very useful real world test for rendering. Usually it can be a good idea to test each aspect in isolation. And within rendering there are different aspects, you can be limited by vertex throughput, or fill rate etc.

@clayjohn
Copy link
Member

clayjohn commented Feb 6, 2020

On a heavy tilemap (cell size 8) on my low-end hardware (Celeron N2840). I get about a 10X speed increase. Frame times from 5-33 ms are reduced to 0.5-2.5.

The speed increase is obvious. I will be reviewing the code this evening in more detail. But I think we should also try to get lots of testers on different hardware. The other batching PR ended up having problems with specific hardware. So testing on a wider base is likely needed here as well.

@lawnjelly
Copy link
Member Author

Great! 😄 It seems I have a bunch of warnings as errors to sort out for some platforms, that may have to wait until after the weekend. Still it can be tested on desktop in the meantime. 👍

Also note to any reviewers, don't get carried away yet on the details as I'll still be doing some more work on it after the weekend (and there are still various variable names to change to godot standard).

@clayjohn
Copy link
Member

clayjohn commented Feb 6, 2020

Regarding defining _use_2d_batching. I think it should be moved to VisualServer.cpp There is no need to have it defined in Engine here is an example of a typical rendering quality setting

Defined in VisualServer.cpp

GLOBAL_DEF("rendering/quality/reflections/high_quality_ggx", true);

Used in rasterizer:

bool ggx_hq = GLOBAL_GET("rendering/quality/reflections/high_quality_ggx");

@fire
Copy link
Member

fire commented Feb 6, 2020

@lawnjelly
Copy link
Member Author

lawnjelly commented Feb 6, 2020

There are some cicd errors. https://travis-ci.org/godotengine/godot/builds/646965929?utm_source=github_status&utm_medium=notification

Yup I will have to deal with them after the weekend unless I get some time tomorrow (am too tired tonight, is bedtime here! 😃 . I think they are warnings as errors about using godot things like Color in pod structures, nothing serious. It's actually a little more complex than you might think, because the vertex format has to be 32 bit and Vector2 etc might be using real, so could change to double in future, so I'd prefer to make anything future proof.

@novhack
Copy link

novhack commented Feb 8, 2020

This one is a game changer. My project jumped from less than 200 to 850 FPS. It's a powerful desktop so I guess a mobile experience would not be nice without this.

@lawnjelly lawnjelly force-pushed the batch-2d branch 6 times, most recently from 1d1ce2d to 2d66e3f Compare February 8, 2020 18:42
Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a start. Mostly just naming convention nit picks. I will do a more in depth review when I have time tonight.

Looking great so far. :)

drivers/gles2/rasterizer_array.h Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.h Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.h Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.h Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.h Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. This finishes the first pass. Mostly naming convention stuff. Some change to the flow in trivial areas and some questions about using `CRASH_COND``. Overall looks very good.

We should discuss later how this will fit in with other primitives. Right now it is set up really well for just RECTS. But I think the biggest benefit would come if we could batch all types of primitives together.

drivers/gles2/rasterizer_array.h Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved

// this should always succeed after growing
batch = bdata.batches.request();
CRASH_COND(!batch);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference here would be to try to fail gracefully. Although, given that this should never occur. I think it is probably best to at least use CRASH_COND_MSG

#define CRASH_COND_MSG(m_cond, m_msg) \

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually now you mention it, I will see if I can remove some of these, I really put some of them in as asserts to check the code path during development, and they will slow things down at runtime. Or at least wrap them to only occur in debug build. And as you say some might be able to be replaced by a graceful fail macro.

drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
bdata.verts_colored.reset();
bdata.batches_temp.reset();

CRASH_COND(bdata.verts_colored.max_size() != bdata.verts.max_size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to be CRASH_CONDs Is there any way we could gracefully recover from this?

We try to avoid any and all crashes. Our preference is to ERR_PRINT when possible.

Copy link
Member Author

@lawnjelly lawnjelly Feb 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They actually should be debug only, I've wrapped them in #ifdef DEBUG_ENABLED. It is a check on program flow, if this assertion is violated the rest of the function won't work. There is no sensible way to recover I don't think. Good news is users can't cause this, only a core programmer changing the allocation sizes so they are unequal, and it would be best to catch this immediately.

The CRASH_COND checks within this function could actually be removed entirely, since they can't occur at the moment, but I am just hoping that having them available will make it easier to maintain. I've also added a comment to that purpose.

drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
drivers/gles2/rasterizer_canvas_gles2.cpp Outdated Show resolved Hide resolved
indices.set(i_pos + 5, q_pos + 3);

// we can only use 16 bit indices in GLES2!
CRASH_COND((q_pos + 3) > 65535);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to ERR_FAIL_COND_MSG

Copy link
Member Author

@lawnjelly lawnjelly Feb 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make it debug only, but crashing rather than just an error message might be a good idea to quickly catch logic errors if this allocation code is changed in future (i.e. if someone later changes max_quads without realising they can overrun the max index size, this is a really common gotcha with indices, i.e. I've done it myself too many times to count lol 😄 ).

If you overrun the max index size, it will 'appear' to work, however the whole vertex buffer won't get used in drawing and it will wrap around to the front causing hard to diagnose visual bugs. Best case. 😬 . Let me know either way, I can change.

@lawnjelly lawnjelly force-pushed the batch-2d branch 2 times, most recently from ddb8aac to a780346 Compare February 9, 2020 19:53
@clayjohn
Copy link
Member

clayjohn commented Feb 9, 2020

Tested on: Windows 10, android, and WebGL All work fine. All see substantial performance benefits with heavy text/tilemap rendering.

For posterity I will summarize our discussion from IRC earlier today:
We discussed how this approach to batching only batches ItemCommands as apposed to batching Items themselves. This means drawing 1000 of the same sprite will not be batched, but using the _draw() API to draw a textured rect 1000 would. Because internally TileMap and Label use a single Item and issue many ItemCommands, they benefit from this PR.

This approach to batching is very simple to implement and easy to understand. To implement batching on an Item-level would be significantly more complex and may require redesigning the GLES canvas rasterizer. Further, it would create a significant overhead cost as sorting the Items and identifying when batching is appropriate can and will be expensive.

We decided the best approach going forward was to focus on batching ItemCommands and then communicate this implementation to users. (e.g. if you need extra speed, take advantage of batching by using the custom drawing API). However, we can also consider adding a second batching implementation that batches Items by combining them into ItemCommands. This would be toggleable separate from regular batching as it would incur the above-mentioned overhead cost.

@lawnjelly lawnjelly changed the title [WIP] GLES2 2d Batch rendering GLES2 2d Batch rendering Feb 10, 2020
@reduz
Copy link
Member

reduz commented Feb 10, 2020

Sorting the items is kind of out of the question, that would just break 2D entirely.
But I think what can be done is just using the same logic for the canvas items, as in keep the data persistent.. so you have basically two things:

  • A buffer you fill
  • A flush function

Between canvas items, if there is light, then you flush on the light pass. If the mateial changes, then you flush. Finally when you are done rendering, you flush again.

This will make it possible to batch as much as realistically possible, but it is to be expected that some things like lights or material changes will avoid you from going further, but it's still fine.

@lawnjelly
Copy link
Member Author

For @clayjohn and reference here we discussed extending it to handle batching between the calls to get it working with sprites etc too:

<lawnjelly> reduz: I agree with that idea, just didn't know if it would be practically possible
<lawnjelly> however, do sprites have a different transform?
<lawnjelly> i.e. is the transform currently done on the CPU or the GPU
<lawnjelly> because for that to work we'd have to do the transform on the CPU (which is very doable)
<reduz> lawnjelly: well, if you are using a shader that access transforms, you would need to flush and let the shader use the transform uniform, otherwise, just transform manually in CPU
<reduz> lawnjelly: but thanks to how Godot works, its pretty easy to just check if a shader uses a transform uniform
<reduz> lawnjelly: it can be checked and cached when the shader is compiled with one or two lines of code
<lawnjelly> i'll have a look with clayjohn
<reduz> lawnjelly: but if you have a logic that pretty much just does fill/flush, it will be fine
<lawnjelly> agree with that being a more practical approach than sorting the items

@Duroxxigar
Copy link
Contributor

@clayjohn I just want to make sure I understand this correctly - the only way to get the benefits of this PR is to use _draw()? Looking at an end user standpoint.

@dragmz
Copy link
Contributor

dragmz commented Feb 11, 2020

I've tried a similar approach (i.e. ItemCommands batching) in #20077. This has been rejected at that time and resulted with another batching try #20965 however that caused performance issues on mobile patforms.

I keep my fingers crossed for this pull request so we've eventually got the proper GLES2 batching in Godot. :)

@lawnjelly
Copy link
Member Author

I've tried a similar approach (i.e. ItemCommands batching) in #20077. This has been rejected at that time and resulted with another batching try #20965 however that caused performance issues on mobile patforms.

I keep my fingers crossed for this pull request so we've eventually got the proper GLES2 batching in Godot. :)

Yup, I saw your PR (or one of them)! It is absolutely not easy to retrofit fully so I totally understand the frustration you had (on top of actually getting anything working at all, it's really easy to get stalls from the dynamic VBs). I took what I would consider the easy approach at first, but batching across canvas_render_items becomes considerably more involved.

Hopefully I am just starting to get somewhere with the extended version today (it is rendering mostly ok now, just still not accelerating the bunnymark v2, but I probably need to batch over multiple calls to canvas_render_items, as each bunny is a node). Getting only 4 hours sleep didn't help either!
😁

Part of the problem is that it has not been designed with batching in mind (it is really written as an immediate mode renderer), so there is a lot of CPU sapping house keeping to do, which would be nice to push upstream in the call stack. Still, getting anything working at this higher level will be better than the current situation.

@dragmz
Copy link
Contributor

dragmz commented Feb 11, 2020

@lawnjelly as far as I remember I've been able to batch buffer update calls (but not draw calls) across canvas_render_items in #20077 by introducing two commands, i.e. begin list and end list that were used to split canvas_render_items into the prepare and flush phases.

@dragmz
Copy link
Contributor

dragmz commented Feb 11, 2020

@lawnjelly I really (really really) hope your PR is going to hit the master because I can't sleep due to the missing batching. 😊

@clayjohn
Copy link
Member

I'm thinking of open a fresh PR for this extended version with a new branch once I'm further along (I've copied the existing changes across) just to make it easier to follow on github, unless there are any objections, and I can link to it on from here.

I think it would be best to just add a second commit on here. Then you still have the snapshot of how things are now and we can track everything in one PR.

@AniMerrill
Copy link

AniMerrill commented Feb 29, 2020

Hi, I've been anxious to try this feature out since I heard about it a couple weeks ago and today I decided to build it. I can confirm that the performance increase is impressive, with a tilemap based game on mobile jumping from 16 FPS to 40 FPS on Firefox and a full 60 FPS in Chrome. I would like to say that I'm completely satisfied with it, but it seems like either the master branch of Godot (most likely) or this PR causes an issue with touch screen button functionality. Assuming I did everything right, which I'm not even 100% about.

I built the lawnjelly:batch-2d branch from source on a xenial (Ubuntu 16.04) 64-bit chroot on my system (which is the latest version of Ubuntu) as well as the JavaScript export template. Everything works for the most part (for example, when testing and playing on desktop the buttons work fine) but on mobile I can no longer press them. Keep in mind that this was a project I was developing using 3.2-stable just yesterday and they worked fine there. So I can only assume there's some weird issue. The touch screen buttons neither make the pressed animation or send out any kind of signal, which I'm guessing means click/touch events might be messed up in general for the HTML version on mobile.

I'm not sure if this is the proper place to bring this up or if it is directly related to the PR, but since it happened while building the branch associated with it I figured I should mention it somewhere. I am really excited for this feature and I'm really hoping it makes it in to 3.2.1.

Edit: I went ahead and took the liberty of uploading both versions of my current game project in its current state as it was built with both 3.2-stable and this PR repo. The virtual controller will only display on mobile, but I did test for debugging purposes and the virtual controller does in fact work on both desktop native and desktop HTML versions of the game built with the batch2d version. Only on the mobile version, where they are actually needed, does this problem occur. You can also see on the top left hand corner an FPS counter though which shows the stark difference between the rendering techniques.

https://animerrill.com/test/3.2-stable-test/
https://animerrill.com/test/batch2d-test/

@lawnjelly
Copy link
Member Author

lawnjelly commented Feb 29, 2020

@AniMerrill : The touchscreen issue sounds unrelated. I can't guarantee it 100%, but I struggle to think of a mechanism where the rendering could interfere with input.

It is more likely it is due to whatever version of master I happened to have at the time I was developing the PR (it was not 3.2 stable, most likely a bit afterwards or even slightly before, where some bugs may have been introduced by other commits). I'm actually not 100% sure what happens when you grab my PR branch (not being a git expert), whether it mixes it with 3.2 stable or whatever version I happened to have during development, I'm assuming the latter.

If the above is the case, and you really want a build with 3.2 stable, you could do it by grabbing 3.2 stable source. Then grab only the files or bits I changed (there's probably only 2 or 3, it is mainly in drivers/gles2), copy those over the 3.2 stable versions, then compile. This may do as a temporary solution, or if you were intending to release a game with it and you couldn't afford to wait.

If you search in issues, you may find something has been worked on recently regarding touchscreen input that might be responsible for the bug you saw.

Good to hear it is working on web though. 👍

@AniMerrill
Copy link

@lawnjelly : Alright, so I did a bunch of tests and attempts today. Final verdict is that I wasn't able to get a 3.2 stable version working with these changes. I'm sure this is probably down to a mistake on my part during the build process, or at least I have to assume so.

It does look like whenever you created the branch for batch-2d it was after the conversion of master from the 3.2 stuff to the 4.0 development which is already well known it was buggy. So to follow your suggestion, I downloaded the 3.2-stable release and then transplanted the drivers/gles2 folder and ran an export template build (and later a full tools build). Since it's the only version I'm really concerned with at the moment, I just did the javascript platform export. This is the result of that build:

https://animerrill.com/test/batch2d-w-stable2/

Unfortunately it does still look like the touch screen buttons are still unfunctional, and even worse it says:

**WARNING**: Property not found: rendering/quality/2d/use_batching  Squid%20Quest.js:7:37895
   At: core/project_settings.cpp:209:_get() - Property not found: rendering/quality/2d/use_batching  Squid%20Quest.js:7:37895

Which means that this build also wasn't able to even use the batching, so the framerate problems are exactly the same + there's no way to use the virtual controls.

The other message that popped up on both this build and the one I posted about last time was the following:

**ERROR**:   Squid%20Quest.js:7:35600
   At: platform/javascript/os_javascript.cpp:1004:initialize() - 

Which I have no way of proving this is related, and I don't have any way of checking the browser console on my phone, but it is the only other difference between this build and the vanilla 3.2-stable build where the controls work. And this also happens on the version where the only change should have been to the drivers/gles2 directory. I have a feeling this is still probably unrelated to the work you were doing, and I feel like it's more likely that your version requires something in the current master branch to work which breaks when trying to transplant it into the other code base or something. I'm not sure, I haven't really ever attempted engine builds from source before this.

Again I am excited for this feature and I hope to see it in an upcoming RC if at all possible, or at least in 3.2.1. The performance difference for a simple tilemap that this batching routine provides over the current stable branch is phenomenal and extremely necessary for the limited hardware of mobile play. I just wish I could get it working without these bugs myself, but I'm hoping I don't have to wait too long regardless.

@clayjohn
Copy link
Member

clayjohn commented Mar 1, 2020

@AniMerrill you can't just copy over the GLES2 folder. You have to copy over every file that has been changed in this PR. So that includes servers/visual_server.cpp

@lawnjelly
Copy link
Member Author

If I have time I'll try and make a version against 3.2 stable, however, I'm pretty much a git newbie myself, and this kind of version mismatching can be quite difficult to deal with. Normally all this is handled for us contributors during merging.

Bear in mind that if you grab 3.2 branch now and build from source, it is not the same as the released 3.2 stable. Commits have been cherry picked and added since then, which may have fixed or introduced new bugs:

https://github.com/godotengine/godot/commits/3.2

So if you are going to compare with and without the PR, you have to compare both versions built from source. The chances are that something to do with input is unrelated, you may have to do some detective work yourself to work out what is causing that bug.

I can't really spend too long on this at the moment because I'm busy working on the next version. The sooner we finish that the sooner it will be available officially.

@AniMerrill
Copy link

@clayjohn : Alright, that did seem to be my issue for that last build then. Once I updated that file and recompiled, everything seems to be in working order now.

https://animerrill.com/test/batch2d-w-stable5/

Unfortunate finding perhaps though is that the performance gain on this version seems pretty negligible, seems to be <5 FPS gain versus the 10-20 FPS gain that the first attempt I made. For full clarification here are the other two versions I am comparing against.

https://animerrill.com/test/3.2-stable-test/ (Made with vanilla 3.2-stable as can be downloaded from the main page)
https://animerrill.com/test/batch2d-test/ (Made with compiling this PR's full code, which seems to include some state of the 4.0.dev master)

These are some general FPS measurements I got by just sort of eyeballing it, so take it with a grain of salt I guess:

  • Firefox on Galaxy Note 5
    • 3.2-stable = 16-19 FPS
    • 3.2-stable w batch2d (5) = 15-24 FPS
    • PR = 30-40 FPS
  • Firefox on Desktop
    • 3.2-stable = 19-29 FPS
    • 3.2-stable w batch2d (5) = 21-31 FPS
    • PR = 25-34 FPS
  • Chrome on Galaxy Note 5
    • 3.2-stable = 49-59 FPS
    • 3.2-stable w batch2d (5) = 52-60 FPS
    • PR = 59-60 FPS
  • Chrome on Desktop
    • 3.2-stable = 48-56 FPS
    • 3.2-stable w batch2d (5) = 42-56 FPS
    • PR = 53-58 FPS

I can only assume that there must have already been a lot of other massive improvements on the backend by the time this branch was created, because besides the loss of touch screen button functionality the performance boost created through the PR build is pretty substantial.

@lawnjelly : Also yes, I did mind to grab the actual release version of 3.2-stable so these comparisons are between the proper download provided at godotengine.org and the preserved code base that should have been used for that specific build. So the only difference between my stable build and the stable with batch2d build should be the files altered in this PR. So that likely has some amount to do with the performance differences etc.

In any case, at least I got it working and modest performance boost is better than no performance boost at all. Looking forward to the final version of this, hope it gets merged soon and made official.

@clayjohn
Copy link
Member

clayjohn commented Mar 1, 2020

@AniMerrill the stable release from the website has optimizations available. Lawnjelly was saying you need to build from the 3.2 branch yourself to test.

Alternatively, for a fair comparison, you need to build the 3.2 merged branch using release instead of debug target.

@AniMerrill
Copy link

@clayjohn : Okay for clarification, I am using the version of 3.2-stable that I downloaded like the day of it's release. Again, unless I am mistaken, this should be precisely the build made from the tagged release code above which wouldn't have the updates that differentiate it from the vanilla version I am using. I have also used release version builds for all my tests with this PR.

Unless you're suggesting that the post release commits would be necessary to get the full advantage of this PR, in which case I could probably try that.

@clayjohn
Copy link
Member

clayjohn commented Mar 1, 2020

@AniMerrill this has nothing to do with the post-release commits. The official editor binaries are built using a series of compiler optimizations that you likely haven't enabled when you built it yourself.

The best way to compare is to build both yourself. That means you need to checkout the 3.2 release branch and build it the same way you built the 3.2-batch2d branch and then compare those two versions.

@AniMerrill
Copy link

@clayjohn : Are you talking about enabling things like link-time optimization? Because I did enable that. I'm not sure what other compiler optimizations are available though besides the things suggested here to reduce the executable size.

Also I would assume and hope that if there are optimizations made to the officially provided releases that those specifications are public so that I could try to replicate them myself, either way if I can't get an improvement over the official release then it really doesn't matter.

@clayjohn
Copy link
Member

clayjohn commented Mar 1, 2020

@AniMerrill. Ah, good. So you are aware. Between LTO and target=release you should get pretty similar performance to the stable binary. I believe they also use a specific compiler that has further benefits.

Also, the speed improvements depend on your test scene. This POC PR only benefits tilemaps and text. So if you are testing on a scene that doesn't make heavy use of either you are unlikely to see much benefit.

Lawnjelly is working on a much larger PR that will provide speed benefits across the board. But it differs substantially in implementation from this one.

@lawnjelly
Copy link
Member Author

Just to point out you can actually directly compare both for a single build with this PR, because there is a project setting which can enable / disable it: rendering/quality/2d/use_batching. So if you just make 2 exports, 1 with batching switched on and 1 with it off, that should be easier to compare.

As clayjohn says, the benefits you get will depend on the bottlenecks in your particular game (and the bottlenecks in the webversion, which may be different to regular builds).

@AniMerrill
Copy link

AniMerrill commented Mar 1, 2020

@clayjohn and @lawnjelly : Just to make absolute sure, I rebuilt again making sure that both the editor and export template were built using LTO and had their most optimal target selected (release_debug and release respectively). I could have sworn I did this before, but these builds all took me about an hour each so it's hard to say if I remembered to set the flags correctly every time. Here are the commands used:

scons -j4 p=x11 target=release_debug use_lto=yes
scons -j4 p=javascript target=release tools=no use_lto=yes

I also went ahead and tried out what @lawnjelly suggested and made two of the same build just with one that has the batching disabled.

https://animerrill.com/test/batch2d-w-stable6-disabled/
https://animerrill.com/test/batch2d-w-stable6/

Disabled

  • Firefox Mobile: 14-20 FPS
  • Firefox Desktop: 17-26 FPS
    • Hard stutter to sub 10
  • Chrome Mobile: 49-60 FPS
  • Chrome Desktop: 43-57 FPS

Enabled

  • Firefox Mobile: 17-20 FPS
  • Firefox Desktop: 23-26 FPS
    • Hard stutter to sub 10
  • Chrome Mobile: 55-60 FPS
  • Chrome Desktop: 46-60 FPS
    • Occasional dip to 30s

I think the biggest difference I noticed was that the enabled version tended to be more stable towards the higher end of it's range and would have a higher low end to the range... but that's basically it. For full disclosure, I can launch both versions from the editor natively on my desktop machine and it pulls 59/60 FPS basically flawlessly so I can only assume this basically comes down to the limitations of WebASM and mobile hardware more than anything. Doing all these tests also elucidated to me how poor Firefox is for mobile game performance, at least for Godot.

Also as far as the specific bottlenecks this PR attempts to solve, I am aware that it basically only works for tilemaps, text, or any object where _draw() functions are explicitly used (if I understand correctly). I'd rather not share the code for this publicly any time soon and I doubt it's urgent enough to have me send the code privately somehow, but I can basically assure you that the performance issues are 100% being caused by the tilemaps in the scene. I have two tilemap layers, one for foreground and one for background detail, and just removing the foreground map increases performance significantly. That's sort of what's frustrating about the current stable implementation, because what is going on in my game scene is about on the level of what the NES could do over 30 years ago. The common wisdom is that tilemaps and tilesets reduce both memory usage and render processing because it reduces a complicated scene to reusable chunks, but the current implementation in Godot would work better if I had 320x240 pngs uniquely drawn for each bit of the level. It's also my understanding that Godot doesn't seem to cull things outside the viewport which is also somewhat silly. I might be wrong about that.

I am definitely excited for both the future of this PR and of this upcoming PR which will work for all scene objects generally. Hopefully between these and a bunch of bug fixes from the main 3.2 branch we will see a lot of general performance boosts.

In any case, thank you for your time. I feel if nothing else this was a learning experience for me and hopefully my one unique use case has maybe given you some further ideas as well. I appreciate the work you're doing and I wish you good luck going forward.

@Schroedi
Copy link
Contributor

Schroedi commented Apr 1, 2020

Awesome that you work on this!

I noticed two bugs in 1fa753852edfe6f5393096d5d75c6ad8f6613d35 from your kessel_32_1 branch:

  • self_modulate on sprites does not work anymore with this PR
  • modulate effects all sprites in the scene.

Sample project (ignore the name):
RenderDocTest.zip

@lawnjelly
Copy link
Member Author

lawnjelly commented Apr 1, 2020

Awesome that you work on this!

I noticed two bugs:

  • self_modulate on sprites does not work anymore with this PR
  • modulate effects all sprites in the scene.

Sample project (ignore the name):
RenderDocTest.zip

This PR has been superseded by this one: #37349
Not sure if you just posted this to the wrong PR (as the titles are similar). I'll try your example project with the new PR. It may well cause a regression in the new one too, these are the types of things we are looking for so we can fix them, so thank you for testing! 👍

Edit - Yes it is a regression, I'll see if I can fix it. Many thanks! 👍

@Schroedi
Copy link
Contributor

Schroedi commented Apr 1, 2020

This PR has been superseded [...]

Thanks for your kind reply :) Yes that's the one I tested. Maybe you can close this PR then? It's hard to keep track on what is the newest PR.

@lawnjelly
Copy link
Member Author

lawnjelly commented Apr 1, 2020

This PR has been superseded [...]
Thanks for your kind reply :) Yes that's the one I tested. Maybe you can close this PR then? It's hard to keep track on what is the newest PR.

Will do good idea. I think these won't be too tricky to fix, I've possibly fixed them already in another version I was working on (hard to keep track of lol). 👍

Yes, it's in the item joining, I'll add an extra check for this. For now if you set max join commands to 0 it will fix it.

@lawnjelly lawnjelly closed this Apr 1, 2020
@akien-mga akien-mga added archived and removed cherrypick:3.x Considered for cherry-picking into a future 3.x release labels Apr 1, 2020
@akien-mga
Copy link
Member

Putting a comment for those who might wonder why this was closed:

Superseded by #37349.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.