
C#: Improve godot_variant conversion for godot_variant-marshable C# types #89807

Open · wants to merge 1 commit into master

Conversation

roookeee commented Mar 23, 2024

I was working on another C# binding optimization when I noticed that VariantUtils.ConvertTo<T>(in godot_variant variant) made a suspicious number of GetTypeFromHandle calls (i.e. typeof(T)) for a simple Node3D that only overrides _Process(double delta):
[flame chart screenshot] (Also note the accompanying op_Equality calls for the type comparisons.)

The commit message for 3f645f9 (which introduced this method) noted the following:

C# generics don't support specialization the way C++ templates do.
I knew NativeAOT could optimize away many type checks when the types
are known at compile time, but I didn't trust the JIT would do as good
a job, so I initially went with cached function pointers.

Well, it turns out the JIT is also very good at optimizing in this
scenario, so I'm changing the methods to do the conversion directly,
rather than returning a function pointer for the conversion.

While the JIT is generally good at optimizing such typeof(T)-guarded function calls, various limitations apply: for example, the function must be inlined for the JIT to remove the unreachable branches at a specific call site (for a specific T).
This function has grown over time, and it seems the JIT does not inline ConvertTo (and the accompanying CreateFrom) as often as we would like, which degrades its performance; this especially hurts cheap conversions such as double.
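For illustration, here is a minimal, hypothetical sketch of such a typeof(T)-guarded conversion. The names and the use of long as a stand-in for the native godot_variant union are mine, not Godot's actual code; the point is only that each guard folds to a constant for a concrete T, so the dead branches disappear only when the JIT inlines the call:

```csharp
using System;

public static class VariantUtilsSketch
{
    // Hypothetical stand-in: a real godot_variant is a native union; a long
    // is used here only so the sketch is self-contained.
    public static T ConvertTo<T>(in long variant)
    {
        // Each typeof(T) guard is a compile-time constant for a concrete T,
        // so when this method is inlined the JIT drops the dead branches
        // (and the boxing casts below along with them).
        if (typeof(T) == typeof(double))
            return (T)(object)(double)variant;
        if (typeof(T) == typeof(int))
            return (T)(object)(int)variant;
        if (typeof(T) == typeof(bool))
            return (T)(object)(variant != 0);
        throw new InvalidOperationException($"Unsupported type {typeof(T)}");
    }
}
```

When the method is *not* inlined, all branches and the type checks survive into the generated code, which is the degradation described above.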

I have run into this issue in the past and found a workaround that makes C# behave similarly to C++ templates when it comes to method specialization: generic static classes with a statically evaluated delegate.
The delegate is resolved at class load time and is then resolved and cached for us at every call site by the runtime. This leads to small (code-size) delegates which are easily inlineable, with guaranteed "zero-overhead" runtime characteristics. The new implementation is also stable in the sense that we are no longer praying for JIT optimizations that might vary over time, especially because the performance of ConvertTo / CreateFrom no longer depends on inlining.
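The pattern described above can be sketched as follows. The identifiers and the long stand-in for godot_variant are illustrative, not the PR's actual code; the key idea is that the static readonly field is initialized once per closed generic type when the class is loaded:

```csharp
using System;

// Sketch: a generic static class whose static readonly delegate is resolved
// once per closed generic type (Converter<int>, Converter<double>, ...) at
// class load time. Every later call just invokes a small cached delegate.
public static class Converter<T>
{
    private static readonly Func<long, T> _convert = Resolve();

    private static Func<long, T> Resolve()
    {
        // Runs once per T; the cast through object is legal because when
        // e.g. T == double, Func<long, double> and Func<long, T> are the
        // same runtime type.
        if (typeof(T) == typeof(double))
            return (Func<long, T>)(object)(Func<long, double>)(v => (double)v);
        if (typeof(T) == typeof(int))
            return (Func<long, T>)(object)(Func<long, int>)(v => (int)v);
        throw new NotSupportedException($"No conversion for {typeof(T)}");
    }

    public static T Convert(long variant) => _convert(variant);
}
```

Because the per-T resolution happens in the static initializer, the hot path is a tiny, branch-free delegate invocation regardless of whether the JIT inlines anything.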

I created a test project which spawns 100,000 nodes with an attached C# script. The script overrides _Ready and _Process and increments a global variable which is printed on screen to avoid dead code elimination (which had pretty much nullified my previous benchmark results). Additionally, the minimum and average process time is printed, which is used in the tables below. If you do your own testing with this project, please do multiple runs, as the run-to-run variance (especially in release builds) is quite high; 5-10 runs should suffice. As the project is a CPU-intensive synthetic scenario, you should aim to reduce noise on your machine (e.g. IDEs indexing, streaming 8K videos of Godot showcases).
HighNodeCountNoDeadCode.zip
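The dead-code-elimination guard used by the test project can be sketched outside Godot like this (a standalone, hypothetical version; the real project does the equivalent inside _Process on each node): the per-iteration work must feed a value that is observably used, or the JIT is free to remove it entirely.

```csharp
using System;
using System.Diagnostics;

public static class DceGuardSketch
{
    // Global accumulator; printing it at the end forces the JIT to keep
    // the per-iteration work alive instead of eliminating it.
    public static long Counter;

    public static void Run(int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Counter += i % 3;          // stand-in for per-node _Process work
        sw.Stop();
        Console.WriteLine($"counter={Counter} elapsed={sw.ElapsedMilliseconds}ms");
    }
}
```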

The timings on my local machine (Win11, .NET 7.0.2, CPU: 5800X3D, 120 Hz display, which might matter because of vsync) are as follows:

Editor "Run Project" timings:

| Implementation | Timings |
| --- | --- |
| master @ Godot 4.3 dev 5 | min: 234,20 ms / avg: 239,01 ms |
| This PR | min: 213,31 ms / avg: 217,78 ms |

Windows release template timings:

| Implementation | Timings |
| --- | --- |
| Godot 4.3 dev 5 official release template | min: 16,58 ms / avg: 17,67 ms |
| This PR with LTO=full | min: 15,23 ms / avg: 17,17 ms |

We gain around 9% in editor debug builds and around 3-9% in release builds. It's not much, but this PR is quite a small change as it stands.
The run-to-run variance for the release template is pretty high; the best runs were taken for all numbers (sample size: ~10) after letting the project run for ~10-15 seconds.

Let me know if the code style or the comments feel off (or if you have any other questions)!

roookeee commented Mar 28, 2024

I had to redo the benchmarks after discovering that my test project was affected by dead code elimination. I changed the main post and also added numbers for the export templates, which don't gain as much of an uplift as the editor runs (which is expected).

The review comments below refer to this excerpt from the new code:

```csharp
{
    private static readonly ConvertToDelegate<T> _converter = DetermineConvertToDelegate<T>();

    [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
```
The usage of AggressiveOptimization is actually a deoptimization most of the time, as the method skips tiering and can't participate in optimizations that rely on it. The usage of AggressiveInlining hides this, as the caller can still participate in tiering and thus take advantage of it. So in the rare cases where this method is not inlined, it ends up generating suboptimal code. Here is a simple example demonstrating it:

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

[DisassemblyDiagnoser]
public class InlineBenchmark
{
    [Benchmark]
    public int WithAggressiveOptimization() => Test.WithAggressiveOptimization();

    [Benchmark]
    public int WithoutAggressiveOptimization() => Test.WithoutAggressiveOptimization();
}

public static class Test
{
    private static readonly int value = 0x123456;

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public static int WithAggressiveOptimization() => Test.value;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int WithoutAggressiveOptimization() => Test.value;
}
```

.NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2

```asm
; InlineBenchmark.WithAggressiveOptimization()
       jmp       qword ptr [7FF831905ED8]; Test.WithAggressiveOptimization()
; Total bytes of code 6
; Test.WithAggressiveOptimization()
       sub       rsp,28
       test      byte ptr [7FF8316D1AD6],1
       je        short M01_L01
M01_L00:
       mov       eax,[7FF8316D1B08]
       add       rsp,28
       ret
M01_L01:
       mov       rcx,7FF8316D1AA0
       mov       edx,6
       call      CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       jmp       short M01_L00
; Total bytes of code 46
```

.NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2

```asm
; InlineBenchmark.WithoutAggressiveOptimization()
       jmp       qword ptr [7FF831915EF0]; Test.WithoutAggressiveOptimization()
; Total bytes of code 6
; Test.WithoutAggressiveOptimization()
       mov       eax,123456
       ret
; Total bytes of code 6
```

roookeee (Author) replied:
I just copy-pasted the original method's attributes onto my newly added functions; it's interesting to see how these can affect the generated code in such a bad way. I wonder what impact removing them throughout the C# integration would yield.

I have little experience with the C# JIT, especially on this level. I can remove the attributes but would have to rebenchmark then.

Thank you for the explanation, it's always appreciated!

Reviewer reply:

Generally you want to avoid using the AO flag as much as possible, as the end result is almost always the opposite of what you want. Where it can yield an improvement is on cold start, where the JIT compiles methods without optimizations; but it also means the method is never recompiled with better optimizations once the application reaches a steady state.

The attribute's documentation mentions that applying the flag without measuring can lead to negative results.

roookeee (Author) commented Apr 9, 2024:
I locally removed AggressiveOptimization and gave it a spin: the performance seems to be on par for me, a bit faster (14,99 ms min time) in one run, but that could just be noise. I would have to dig into the generated assembly to see the difference, but I don't know how to do that with Godot and C#.

EDIT: Some other runs have yielded better results (14,3 ms min time); the whole setup seems to have a lot of variance though.

Reviewer reply:

It's generally very hard to benchmark these correctly, and the possible difference is just a few assembly instructions, so it's very likely to be lost in the noise. The expensive part is inlined anyway, so the only time it jumps to CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE is on the first execution.

Also, as I mentioned above, AggressiveInlining most likely hides the negative side effects of AggressiveOptimization in the cases where the method does get inlined, as the AO flag is then simply ignored. However, AI never guarantees that the method is inlined in all cases, so when the method does not get inlined we get yet another slap in the face from the negative effects of the AO flag.
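Following this advice, dropping AO while keeping AI would look like this (an illustrative sketch, not the PR's actual method):

```csharp
using System.Runtime.CompilerServices;

public static class AttributeSketch
{
    // Keep AggressiveInlining so small conversion helpers can be inlined,
    // but omit AggressiveOptimization so the method still participates in
    // tiered compilation in the cases where it is *not* inlined.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double ConvertToDouble(long variant) => (double)variant;
}
```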
