C#: Improve godot_variant conversion for godot_variant-marshable C# types #89807
base: master
Conversation
Force-pushed from 3f5ba95 to c4fa8f1
I had to redo the benchmarks after discovering that my test project was affected by dead code elimination. I changed the main post and also added numbers for the export templates, which don't gain as much of an uplift as the editor runs (which is expected).
{
    private static readonly ConvertToDelegate<T> _converter = DetermineConvertToDelegate<T>();

    [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
The usage of AggressiveOptimization is actually a deoptimization most of the time, as the method skips tiering and can't participate in further optimizations that rely on it. The usage of AggressiveInlining hides this fact, as the caller can still participate in tiering and thus take advantage of it. So in the rare cases when this method is not inlined, it ends up generating suboptimal code. Here is a simple example demonstrating it:
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

[DisassemblyDiagnoser]
public class InlineBenchmark
{
    [Benchmark]
    public int WithAggressiveOptimization() => Test.WithAggressiveOptimization();

    [Benchmark]
    public int WithoutAggressiveOptimization() => Test.WithoutAggressiveOptimization();
}

public static class Test
{
    private static readonly int value = 0x123456;

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public static int WithAggressiveOptimization() => Test.value;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int WithoutAggressiveOptimization() => Test.value;
}
.NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
; InlineBenchmark.WithAggressiveOptimization()
jmp qword ptr [7FF831905ED8]; Test.WithAggressiveOptimization()
; Total bytes of code 6
; Test.WithAggressiveOptimization()
sub rsp,28
test byte ptr [7FF8316D1AD6],1
je short M01_L01
M01_L00:
mov eax,[7FF8316D1B08]
add rsp,28
ret
M01_L01:
mov rcx,7FF8316D1AA0
mov edx,6
call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
jmp short M01_L00
; Total bytes of code 46
.NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
; InlineBenchmark.WithoutAggressiveOptimization()
jmp qword ptr [7FF831915EF0]; Test.WithoutAggressiveOptimization()
; Total bytes of code 6
; Test.WithoutAggressiveOptimization()
mov eax,123456
ret
; Total bytes of code 6
I just copy-pasted the original method's attributes to my newly added functions; it's interesting to see how these can affect the generated code in such a bad way. I wonder what impact removing these all over the C# integration would yield.
I have little experience with the C# JIT, especially on this level. I can remove the attributes but would have to re-benchmark then.
Thank you for the explanation, it's always appreciated!
Generally you want to avoid using the AO flag as much as possible, as the end result is almost always the opposite of what you want. Where it can yield an improvement is on a cold start, where the JIT compiles methods without optimizations - but it also means that the method is never recompiled with better optimizations after the application reaches a steady state.
The attribute's documentation mentions that applying the flag without measuring can lead to negative results.
I locally removed AggressiveOptimization and gave it a spin: the performance seems to be on par for me, a bit faster (14.99 ms min time) in one run, but that could just be noise. I would have to dig into the generated assembly to see the difference, but I don't know how to do this with Godot and C#.
EDIT: Some other runs have yielded better results (14.3 ms min time); the whole setup seems to have a lot of variance though.
It's generally very hard to benchmark these correctly, and the possible difference is just a few assembly instructions, so it's very likely to turn into noise. The expensive part is even inlined, so the only time it jumps to CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE is on the first execution.
Also, as I mentioned above, AggressiveInlining is most likely hiding the negative side effects of AggressiveOptimization in most cases where the method does get inlined, as the AO flag is then simply ignored. However, AI never guarantees that the method is inlined in all cases, so when the method does not get inlined we get yet another slap in the face from the negative effects of the AO flag.
I was working on another C# binding optimization when I noticed that VariantUtils.ConvertTo<T>(in godot_variant variant) did a suspicious amount of GetTypeFromHandle calls (aka typeof(T)) for a simple Node3D that only overrides _Process(double delta):

(Also note the accompanying op_Equality calls for the type comparisons.)

The commit message for 3f645f9 (which introduced this method) noted the following:

"While the JIT is generally good at optimizing such typeof(T) guarded function calls, various limitations apply - e.g. the function must be inlined to enable the JIT to remove the unreachable branches for the specific callsite (and specific T)."

This function has grown over time and it seems like the JIT does not inline ConvertTo (and the accompanying CreateFrom) as often as we'd like, which makes its performance degrade; this hurts for cheap conversions like e.g. double.

I have run into this issue in the past and found a workaround which makes C# behave similarly to C++ templates when it comes to method specialization: generic static classes with a statically evaluated delegate. The delegate is resolved at class load time and is then "resolved and cached" for us at all call sites by the runtime. This leads to small (code size) delegates which are easily inlineable, with guaranteed "zero-overhead" runtime characteristics. The new implementation is also stable in the sense that we are not praying for JIT optimizations which might vary over time - especially because the performance of ConvertTo / CreateFrom no longer depends on inlining.

I created a test project which spawns 100,000 Nodes with an attached C# script. The attached script overrides _Ready and _Process and increments a global variable which is printed on the screen to avoid dead code elimination (which pretty much nullified my previous benchmark results). Additionally, the minimum and average process time is printed, which is used in the table below. If you do your own testing with this project, please do multiple runs, as the run-to-run variance (especially in release builds) is quite high; 5-10 runs should suffice. As the project is a CPU-intensive synthetic scenario, you should aim to reduce the noise on your machine (e.g. IDEs indexing stuff, streaming 8K videos of Godot showcases).

HighNodeCountNoDeadCode.zip
The timings on my local machine (Win11, .NET 7.0.2, CPU: 5800X3D, 120 Hz display - might matter because of vsync) are as follows:

Editor "Run Project" timings, compared against master @ Godot 4.3 dev 5: [table omitted]

Windows release template timings: [table omitted]
We gain around 9% in the editor debug builds and around 3-9% in release builds. It's not much, but this PR is quite a small change as it stands.
The run-to-run variance for the release template is pretty high; the best runs were taken for all numbers (sample size: ~10) after letting the project run for ~10-15 seconds.
Let me know if the code style or the comments feel off (or you have any other question)!