C#: Improve godot_variant conversion for godot_variant-marshable C# types #89807
base: master
Conversation
Force-pushed from 3f5ba95 to c4fa8f1
I had to redo the benchmarks after discovering that my test project was affected by dead code elimination. I changed the main post and also added numbers for the export templates, which don't gain as much of an uplift as the editor runs (which is expected).
{
    private static readonly ConvertToDelegate<T> _converter = DetermineConvertToDelegate<T>();

    [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
The usage of AggressiveOptimization is actually a deoptimization most of the time, as the method skips tiering and can't participate in further optimizations that rely on it. The usage of AggressiveInlining hides this fact, as the caller can still participate in tiering and thus take advantage of it. So in the rare cases when this method is not inlined, it ends up generating suboptimal code. Here is a simple example demonstrating it:
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

[DisassemblyDiagnoser]
public class InlineBenchmark
{
    [Benchmark]
    public int WithAggressiveOptimization() => Test.WithAggressiveOptimization();

    [Benchmark]
    public int WithoutAggressiveOptimization() => Test.WithoutAggressiveOptimization();
}

public static class Test
{
    private static readonly int value = 0x123456;

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public static int WithAggressiveOptimization() => Test.value;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int WithoutAggressiveOptimization() => Test.value;
}
.NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
; InlineBenchmark.WithAggressiveOptimization()
jmp qword ptr [7FF831905ED8]; Test.WithAggressiveOptimization()
; Total bytes of code 6
; Test.WithAggressiveOptimization()
sub rsp,28
test byte ptr [7FF8316D1AD6],1
je short M01_L01
M01_L00:
mov eax,[7FF8316D1B08]
add rsp,28
ret
M01_L01:
mov rcx,7FF8316D1AA0
mov edx,6
call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
jmp short M01_L00
; Total bytes of code 46
.NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
; InlineBenchmark.WithoutAggressiveOptimization()
jmp qword ptr [7FF831915EF0]; Test.WithoutAggressiveOptimization()
; Total bytes of code 6
; Test.WithoutAggressiveOptimization()
mov eax,123456
ret
; Total bytes of code 6
I just copy-pasted the original method's attributes to my newly added functions; it's interesting to see how these can affect the generated code in such a bad way. I wonder what impact removing these all over the C# integration would yield.
I have little experience with the C# JIT, especially on this level. I can remove the attributes but would have to re-benchmark then.
Thank you for the explanation, it's always appreciated!
Generally you want to avoid using the AO flag as much as possible, as the end result is almost always the opposite of what you want. Where it can yield an improvement is on a cold start, where the JIT compiles methods without optimizations - but it also means that the method is never recompiled with better optimizations after the application reaches a steady state.
The attribute's documentation mentions that applying the flag without measuring can lead to negative results.
I locally removed AggressiveOptimization and gave it a spin: the performance seems to be on par for me, a bit faster (14.99 ms min time) in one run, but that could just be noise. I would have to dig into the generated assembly to see the difference, but I don't know how to do this with Godot and C#.
EDIT: Some other runs have yielded better results (14.3 ms min time); the whole setup seems to have a lot of variance though.
It's generally very hard to benchmark these correctly, and the possible difference is just a few assembly instructions, so it's very likely to turn into noise. The expensive part is even inlined, so the only time it jumps to CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE is on the first execution.
Also, as I mentioned above, AggressiveInlining is most likely hiding the negative side effects of AggressiveOptimization in most cases where the method does get inlined, as the AO flag is then simply ignored. However, AI never guarantees that the method is inlined in all cases, so when the method does not get inlined we get yet another slap in the face from the negative effects of the AO flag.
I was working on another C# binding optimization when I noticed that VariantUtils.ConvertTo<T>(in godot_variant variant) did a suspicious amount of GetTypeFromHandle calls (aka typeof(T)) for a simple Node3D that only overrides _Process(double delta):

(Also note the accompanying op_Equality calls for the type comparisons.)

The commit message for 3f645f9 (which introduced this method) noted the following:

"While the JIT is generally good at optimizing such typeof(T) guarded function calls, various limitations apply - e.g. the function must be inlined to enable the JIT to remove the unreachable branches for the specific callsite (and specific T)."

This function has grown over time and it seems like the JIT does not inline ConvertTo (and the accompanying CreateFrom) as often as we'd like, which makes its performance degrade; this hurts for cheap conversions like e.g. double.

I have run into this issue in the past and found a workaround which makes C# behave similarly to C++ templates when it comes to method specialization: generic static classes with a statically evaluated delegate. The delegate is resolved at class load time and is then "resolved and cached" for us at all call sites by the runtime. This leads to small (code size) delegates which are easily inlineable, with guaranteed "zero-overhead" runtime characteristics. The new implementation is also stable in the sense that we are not praying for JIT optimizations which might vary over time - especially because the performance of ConvertTo / CreateFrom no longer depends on inlining.

I created a test project which spawns 100,000 Nodes with an attached C# script. The attached script overrides _Ready and _Process and increments a global variable which is printed on the screen to avoid dead code elimination (which pretty much nullified my previous benchmark results). Additionally, the minimum and average process time is printed, which is used in the table below. If you do your own testing with this project, please do multiple runs, as the run-to-run variance (especially in release builds) is quite high; 5-10 runs should suffice. As the project is a CPU-intensive synthetic scenario, you should aim to reduce the noise on your machine (e.g. IDEs indexing stuff, streaming 8K videos of Godot showcases).

HighNodeCountNoDeadCode.zip
The timings on my local machine (Win11, .NET 7.0.2, CPU: 5800X3D, 120 Hz display - might matter because of vsync) are as follows:

Editor "Run Project" timings, compared against master @ Godot 4.3 dev 5: [table omitted]

Windows release template timings: [table omitted]
We gain around 9% in the editor debug builds and around 3-9% in release builds. It's not much, but this PR is quite a small change as it stands.
The run-to-run variance for the release template is pretty high; the best runs were taken for all numbers (sample size: ~10) after letting the project run for ~10-15 seconds.
Let me know if the code style or the comments feel off (or you have any other question)!