
Rewrite math jit helpers to managed code #98858

Closed
wants to merge 36 commits

Conversation

MichalPetryka
Contributor

@MichalPetryka MichalPetryka commented Feb 23, 2024

Rewrites math helpers that have multiple implementations (C++ for CoreCLR, Windows x86 asm for CoreCLR, and C#/C++ for NativeAOT) into a single, shared C# implementation.
I've put the new helpers in Math and MathF but some other place like RuntimeHelpers could perhaps be better.

On 64-bit platforms this should be mostly the same; some direct helper calls might be replaced with indirect ones when the helpers haven't reached Tier 1 yet.

Might result in slight perf regressions on x86 and ARM32 due to the removal of the asm paths and the managed helpers needing to make another call in some cases.

Regressions could be made up for by converting casts to calls earlier and making them inlineable; that's not handled here to keep the changes simple.

This removes most of the implementation of 4 unused helpers. I think I forwarded them to temporary managed variants for legacy R2R images correctly, but verification would be appreciated.

@huoyaoyuan
Member

I've put the new helpers in Math and MathF but some other place like RuntimeHelpers could perhaps be better.

They may have a separate helper type, like CastHelpers.

@MichalPetryka
Contributor Author

@MihuBot

@jkotas
Member

jkotas commented Feb 23, 2024

This removes most of the implementation of 4 unused helpers, I think I forwarded them to temporary managed variants for legacy R2R images correctly but verification would be appreciated.

It would be better to leave these no longer used helpers alone, and delete them next time we rev min R2R version.

@teo-tsirpanis teo-tsirpanis added the community-contribution Indicates that the PR has been added by a community member label Feb 26, 2024
@khushal1996
Contributor

@MichalPetryka
Just wanted to ask if there is a plan to merge this PR soon, since #97529 will depend on it. Any idea when the current PR will be merged?

@MichalPetryka
Contributor Author

@MichalPetryka Just wanted to ask if there is a plan to merge this PR soon, since #97529 will depend on it. Any idea when the current PR will be merged?

I've cleaned up the changes now and added some comments. I've also verified that .NET 8 x86 R2R binaries still seem to work, so I think this is only waiting for review now?
One question would be whether the helpers should be moved to some other file.
cc @tannergooding @jkotas

@MichalPetryka
Contributor Author

The tests are failing:

Assert failure(PID 9608 [0x00002588], Thread: 9612 [0x258c]): !"Precode::GetPrecodeFromEntryPoint: Unexpected code in precode"

CORECLR! Precode::GetPrecodeFromEntryPoint + 0x4A (0x718b1743)
CORECLR! CEEJitInfo::getHelperFtn + 0x1EB (0x719160db)

Ah, I might know what this is: it seems not to like the Dbl2UInt helper being unmanaged in CoreCLR after I ported it to managed in NativeAOT again.

@filipnavara
Member

filipnavara commented Mar 15, 2024

We are missing this in win-x86 NativeAOT:

--- a/src/coreclr/nativeaot/Runtime/MathHelpers.cpp
+++ b/src/coreclr/nativeaot/Runtime/MathHelpers.cpp
@@ -271,6 +271,14 @@ FCIMPL1_F(float, tanhf, float x)
     return std::tanhf(x);
 FCIMPLEND

+FCIMPL2_DD(double, fmod, double x, double y)
+    return std::fmod(x, y);
+FCIMPLEND
+
+FCIMPL2_FF(float, fmodf, float x, float y)
+    return std::fmodf(x, y);
+FCIMPLEND
+
 FCIMPL3_DDD(double, fma, double x, double y, double z)
     return std::fma(x, y, z);
 FCIMPLEND

Also, the throw helpers are not properly rooted in the scanning phase:

    *** Methods compiled but not scanned:
    int64.MultiplyChecked(int64,int64)
    [S.P.CoreLib]System.ThrowHelper.ThrowDivideByZeroException()
    uint64.MultiplyChecked(uint64,uint64)
    [S.P.CoreLib]System.Text.StringBuilder+AppendInterpolatedStringHandler.AppendFormatted<uint64>(uint64)
    [S.P.CoreLib]System.Text.StringBuilder+AppendInterpolatedStringHandler.AppendFormattedWithTempSpace<uint64>(uint64,int32,string)
    [S.P.CoreLib]System.Text.StringBuilder+AppendInterpolatedStringHandler.AppendCustomFormatter<uint64>(uint64,string)
    Unhandled exception: System.Exception: Scanning failure
       at ILCompiler.Program.Run() in D:\runtime\src\coreclr\tools\aot\ILCompiler\Program.cs:line 651
       at ILCompiler.ILCompilerRootCommand.<>c__DisplayClass236_0.<.ctor>b__0(ParseResult result) in D:\runtime\src\coreclr\tools\aot\ILCompiler\ILCompilerRootCommand.cs:line 292
       at System.CommandLine.Invocation.InvocationPipeline.Invoke(ParseResult parseResult)

JITHELPER(CORINFO_HELP_DBL2INT_OVF, JIT_Dbl2IntOvf, CORINFO_HELP_SIG_8_STACK)
DYNAMICJITHELPER(CORINFO_HELP_LNG2DBL, NULL, CORINFO_HELP_SIG_8_STACK)
DYNAMICJITHELPER(CORINFO_HELP_ULNG2DBL, NULL, CORINFO_HELP_SIG_8_STACK)
JITHELPER(CORINFO_HELP_DBL2INT, JIT_Dbl2Int, CORINFO_HELP_SIG_8_STACK)
Member

What platform is this needed for?

Arm, Arm64, x86, and x64 all provide a direct instruction for double to int32 as part of the baseline ISAs we require.

Contributor Author

It's unused but can't be removed without an R2R bump. Also, Intel PRs might need it again.

//

[StackTraceHidden]
private static ulong MultiplyChecked(ulong left, ulong right)
Member

This seems like a very expensive way to implement MultiplyChecked.

For UInt128 we simply do BigMul and check if upper is 0: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/UInt128.cs,1379

But more practically speaking, we can determine overflow just by doing if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64). An x-bit * y-bit multiply produces at most an (x + y)-bit product; anything smaller must fit into the given 64 bits.
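For illustration only (this is not the PR's code, and the function names here are hypothetical), the UInt128-style approach of computing the full product and checking that the upper half is zero can be sketched in C from 32-bit partial products, which is also roughly what a 32-bit target without a 128-bit type has to do:

```c
#include <stdint.h>

/* Hypothetical sketch: compute the full 64x64->128 product from 32-bit
   halves and treat a nonzero upper 64 bits as overflow. On 64-bit targets
   a compiler with native 128-bit support does this far more directly. */
static int checked_mul_u64(uint64_t a, uint64_t b, uint64_t *product)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;  /* bits   0..63 contribution */
    uint64_t hi_lo = a_hi * b_lo;  /* bits  32..95 contribution */
    uint64_t lo_hi = a_lo * b_hi;  /* bits  32..95 contribution */
    uint64_t hi_hi = a_hi * b_hi;  /* bits 64..127 contribution */

    /* Accumulate the cross terms into the middle 64 bits, tracking carries. */
    uint64_t mid   = hi_lo + (lo_lo >> 32);        /* cannot overflow */
    uint64_t mid2  = (uint32_t)mid + lo_hi;        /* cannot overflow */
    uint64_t upper = hi_hi + (mid >> 32) + (mid2 >> 32);

    *product = (mid2 << 32) | (uint32_t)lo_lo;
    return upper == 0;  /* 1 = no overflow, 0 = overflow */
}
```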

Contributor Author

I'd prefer to change the logic here in a separate PR. Keep in mind that this is for 32-bit platforms, so things like BigMul are slow there.

Member

I'd prefer to change logic here in a separate PR

That's fine.

Keep in mind that this is for 32-bit platforms, so things like BigMul are slow there.

That's why I suggested the extremely cheap check of if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64). That, on a 32-bit platform, ends up being an almost free check to determine overflow without doing any expensive multiplication (it simply uses the lzcnt equivalent that every platform we target, including WASM, natively supports).

Then, you only have to do a regular 64x64-bit multiplication since you know it can't overflow. You don't have to do the extra work (and many branches) to check overflow at each step in the algorithm.

Member

That, on a 32-bit platform, ends up being an almost free check to determine overflow without doing any expensive multiplication

It looks like an approximate check to determine overflow. If it fails, you still need to take the more expensive multiplication to determine whether it is actually going to overflow - correct?

The current implementation has a variant of this idea where it checks whether the high 32-bits of both numbers are zero. If they are, it knows that there is not going to be an overflow and so it multiplies the low 32-bit portions and returns. If they are not, it does the more expensive multiplication.

I think the difference between what's there currently and your suggestion is in the speed of the fast path and the cutoff for the fast path. The current implementation has a lower cutoff and a faster fast path; your suggestion has a higher cutoff and a slower fast path.
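The fast path described in the comment above can be sketched in C (hypothetical names; this is not the runtime's actual implementation):

```c
#include <stdint.h>

/* Sketch of the described fast path: if the high 32 bits of both inputs are
   zero, a single 32x32->64 multiply can never overflow, so no overflow check
   is needed at all. Otherwise, the caller falls back to the slow path. */
static int try_multiply_fast(uint64_t left, uint64_t right, uint64_t *product)
{
    if (((left | right) >> 32) == 0) {
        /* (2^32 - 1)^2 = 2^64 - 2^33 + 1, which still fits in 64 bits. */
        *product = (uint64_t)(uint32_t)left * (uint32_t)right;
        return 1;  /* fast path taken */
    }
    return 0;      /* fall back to the full checked multiplication */
}
```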

Member

@tannergooding tannergooding Mar 16, 2024

It looks like an approximate check to determine overflow.

Yes.

you still need to take the more expensive multiplication to determine whether it is actually going to overflow

Not quite.


In general, for an x-bit * y-bit value (excepting where one input is 0), you need a minimum of max(x, y) bits to represent the product and a maximum of x + y bits.

The minimum bit count comes about because of the special case 1, which returns the original value. Multiplication then adds one additional bit to the output for each power of 2 (multiplying by 2 is equivalent to shifting left by 1, multiplying by 4 is equivalent to shifting left by 2, thus multiplying by 3 must be between these two values and takes less than 2 additional bits).

Thus the number of bits needed is actually ceil(log2(x) + log2(y)) bits.

Finding the input with the smaller length and computing its effective log2 are both incredibly cheap operations that all our targets have direct acceleration to support. So it should always be significantly cheaper than what we're doing here already.

The approach I detailed of giving if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64) is an even cheaper approximation and there is some subset of valid inputs that would trigger it and for which we'd then need to compute the actual bit count.

So it basically just comes down to handling the 99% of inputs that can't possibly overflow as cheaply as possible. This is always fine because lzcnt is a native instruction for Arm, Arm64, and WASM. It is part of our default R2R target for x86/x64 and if you happen to be on x86/x64 that is pre-lzcnt (more than 11 years old at this point), you end up getting (sizeof(T) * 8 - 1) ^ BSR instead. For Arm32 and x86, you at worst get an extra branch and addition in here to account for the upper vs lower halves of the 64-bit input.

For the subset of cases that hit the "it could actually be under 64-bits", you then need to do log2 which just uses the already computed lzcnt as input. So we remain with an incredibly efficient fast path that minimizes the total amount of work and doesn't need to do the extra multiplications required to check if bits 64-127 are non-zero.
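A minimal C sketch of the conservative pre-check being proposed, with a portable bit_width loop standing in for the hardware lzcnt (all names hypothetical):

```c
#include <stdint.h>

/* Portable stand-in for 64 - lzcnt(v); real code would use the hardware
   leading-zero-count instruction discussed above. */
static int bit_width(uint64_t v)
{
    int n = 0;
    while (v != 0) { n++; v >>= 1; }
    return n;
}

/* Conservative overflow pre-check, equivalent to
   128 - lzcnt(left) - lzcnt(right) > 64. If it returns 0, a plain 64x64
   multiply is guaranteed not to overflow; if it returns 1, the product
   *may* overflow and a precise check is still required. */
static int may_overflow(uint64_t left, uint64_t right)
{
    return bit_width(left) + bit_width(right) > 64;
}
```

For example, 0xFFFFFFFF * 0xFFFFFFFF (32 + 32 = 64 bits) passes the pre-check and its product indeed fits, while 2^32 * (2^32 - 1) (33 + 32 = 65 bits) is flagged even though its product, 2^64 - 2^32, still fits in 64 bits; that gap is the approximation discussed above.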

Contributor Author

@MichalPetryka MichalPetryka Mar 20, 2024

The existing code (both the C/C++ version in regular CoreCLR and the managed native AOT version) does not use the BigMul helper, so it does not have this perf bug.

Putting AggressiveInlining on BigMul like I did in #100027 makes the regression about 2.9% on CoreCLR on my machine; I'd argue that's good enough in exchange for sharing the implementation between runtimes. It will also likely improve with future optimizations to the JIT, so I'd assume it's going to get better than native in the future (today it emits stuff like or edx, 0 and add eax, 0).

It may be a good idea to split this PR into multiple smaller ones so that each of the changes gets the right level of scrutiny.

While I'd have agreed with this before, I feel like splitting the PR at this point after most review was done would only complicate things here.

Member

I feel like splitting the PR at this point after most review was done would only complicate things here.

It would allow me or other maintainers to get these changes merged with higher confidence and less work. In general, multiple smaller PRs are less total work for maintainers to get through than one big PR.

Member

I have peeled off first chunk into #100222

Member

#100222 is merged with a number of adjustments. I think the next good-sized chunk to peel off into a separate PR is everything to do with the long multiplication. @MichalPetryka Are you interested in doing that?

Contributor Author

I'll look into splitting this up tomorrow.

@jkotas
Member

jkotas commented Mar 16, 2024

/azp run runtime-extra-platforms, runtime-nativeaot outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@jkotas
Member

jkotas commented Mar 16, 2024

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@MichalPetryka
Contributor Author

The Build linux-arm64 Release NativeAOT_Checked_Libs failure seems like it could be related; not sure why the checked cast helper would no longer throw for float -> int, though, or why it'd fail only there.

@jkotas
Member

jkotas commented Mar 16, 2024

I expected that it would fail in some other combinations too. We have too many possible combinations in the test matrix and we are not able to test all of them.

@jkotas
Member

jkotas commented May 12, 2024

I have extracted the remaining good changes from this PR to #102129

@jkotas jkotas closed this May 12, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jun 11, 2024
Labels
area-VM-coreclr community-contribution Indicates that the PR has been added by a community member
10 participants