JIT Hardware Intrinsics low compiler throughput due to inefficient intrinsic identification algorithm during import phase #13617
Comments
Just a note: in Mono we use binary search for S.R.I. intrinsics (but we have a small subset); an O(1) lookup would be better, I guess.
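For illustration only, here is a minimal sketch of what a binary search over a name table sorted at build time could look like; this is not the actual Mono or CoreCLR code, and the `IntrinsicEntry` type and table contents are hypothetical:

```cpp
#include <cstring>

// Hypothetical entry: one row per intrinsic, kept sorted by name at build time.
struct IntrinsicEntry
{
    const char* name;
    int         id;
};

// Illustrative contents only; the real tables are much larger.
static const IntrinsicEntry g_intrinsics[] = {
    { "Add",      1 },
    { "Multiply", 2 },
    { "Subtract", 3 },
};

// O(log n) lookup by method name; returns -1 when the name is not found.
int LookupIntrinsicId(const char* methodName)
{
    int lo = 0;
    int hi = static_cast<int>(sizeof(g_intrinsics) / sizeof(g_intrinsics[0])) - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = std::strcmp(methodName, g_intrinsics[mid].name);
        if (cmp == 0)
        {
            return g_intrinsics[mid].id;
        }
        else if (cmp < 0)
        {
            hi = mid - 1;
        }
        else
        {
            lo = mid + 1;
        }
    }
    return -1;
}
```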
Do you think there is a chance to have it implemented and shipped with v3.1 and eventually backported to v3.0? The current performance of the JIT with HW intrinsics is ... a bit disappointing - one can feel it when using cloud functions for image processing. The alternative would be to have crossgen support intrinsics for R2R compilation, with preset CPU architecture targets, shipped with v3.1.
So the next steps would be:
It is definitely worth improving the lookup for the intrinsics. That said, I'm not sure how that prioritizes against the other work that's being done in the JIT. Supporting intrinsics in crossgen is also a reasonable thing to do (and has already been done to a limited extent for SPC.dll in dotnet/coreclr#24689). #11689 captures the remaining issue(s).
That's where the data about the impact would come in handy.
Looks like we still don't have data measuring the actual impact of this lookup on JIT throughput. I'll see if I can come up with something.
I would suspect the worst case is ... We could simplify a good bit of this by removing the ... We could then probably optimize the string checks a bit, but that would be more complex.
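As a rough illustration only (not the actual JIT code), the kind of cheap string-check optimization alluded to here could compare the cached length and first character before paying for a full comparison; the helper name and the assumption that each table entry caches its name length are hypothetical:

```cpp
#include <cstring>

// Hypothetical helper: reject most candidates without a full string compare,
// assuming the table entry's name length has been cached alongside the name.
bool FastNameMatch(const char* candidate, std::size_t candidateLen,
                   const char* entryName, std::size_t entryLen)
{
    // Cheap rejections first: length and first character.
    if ((candidateLen != entryLen) || (candidate[0] != entryName[0]))
    {
        return false;
    }
    // Only now do the full comparison.
    return std::memcmp(candidate, entryName, entryLen) == 0;
}
```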
Probably not a great test, but on Avx2_ro, we spend 2232 ms jitting, and somewhere around 2 ms in ...
We also have the src/coreclr/tests/src/JIT/Performance/CodeQuality/HWIntrinsic/X86/PacketTracer benchmark if we want to test a slightly more "real world" example of intrinsic usage. However, it sounds like there isn't a huge penalty for it overall and, while we could speed it up, it isn't likely to make a noticeable impact. Was there anything else from the HWIntrinsic-specific code paths that was taking a large amount of time (I think ...
I'm going to move this to future as it doesn't seem urgent to address now. |
During the design phase of the JIT hardware intrinsics compiler work, an inefficient, temporary algorithm was allowed in for looking up intrinsic function names/IDs. The naive search currently used runs in O(n) time for each imported intrinsic, even though the lookup could be done in O(1) with very low constant overhead. The problem hits particularly hard in functions that use many intrinsic instructions, since the total search time is roughly
t = number_of_intrinsics_used * O(n).
This is further exacerbated in applications that require fast startup: cloud lambda functions and command-line tools, e.g. PowerShell Core. The function implementing the algorithm that slipped into the release is the following:
https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L186
and the algorithm used:
https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L210
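For context, a simplified sketch of the O(n) pattern being described - a flat descriptor array scanned front to back for every imported intrinsic call. This is not the actual coreclr source (see the links above for that); the descriptor type, table contents, and function name below are made up:

```cpp
#include <cstring>

// Hypothetical descriptor, one entry per hardware intrinsic.
struct HWIntrinsicDesc
{
    const char* className;
    const char* methodName;
    int         id;
};

static const HWIntrinsicDesc g_hwIntrinsics[] = {
    { "Sse2", "Add",     1 },
    { "Avx2", "Shuffle", 2 },
    // ... hundreds of entries in the real table
};

// O(n) lookup: every imported intrinsic call may walk the whole table,
// so importing k intrinsic calls costs on the order of k * n string compares.
int LookupIdLinear(const char* className, const char* methodName)
{
    const int count = static_cast<int>(sizeof(g_hwIntrinsics) / sizeof(g_hwIntrinsics[0]));
    for (int i = 0; i < count; i++)
    {
        if ((std::strcmp(className,  g_hwIntrinsics[i].className)  == 0) &&
            (std::strcmp(methodName, g_hwIntrinsics[i].methodName) == 0))
        {
            return g_hwIntrinsics[i].id;
        }
    }
    return -1; // not a hardware intrinsic
}
```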
Previously discussed solutions included using a hash table and/or a fast binary preselection, which is possible because the set of search terms is fixed.
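A minimal sketch of the hash-table direction, assuming the lookup is keyed on "ClassName.MethodName" and the map is built once at startup. The standard-library containers are used here purely for illustration; the actual JIT would more likely generate a perfect or precomputed hash from its descriptor table, since the key set is fixed at build time:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical map: "ClassName.MethodName" -> intrinsic id, built once.
static const std::unordered_map<std::string, int>& IntrinsicIdMap()
{
    static const std::unordered_map<std::string, int> map = {
        { "Sse2.Add",     1 },
        { "Avx2.Shuffle", 2 },
        // ... one entry per intrinsic, generated from the descriptor table
    };
    return map;
}

// Amortized O(1) lookup per imported intrinsic call.
int LookupIdHashed(const char* className, const char* methodName)
{
    const std::string key = std::string(className) + "." + methodName;
    const auto        it  = IntrinsicIdMap().find(key);
    return (it != IntrinsicIdMap().end()) ? it->second : -1;
}
```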
@AndyAyersMS @CarolEidt @fiigii @tannergooding
category:implementation
theme:vector-codegen
skill-level:beginner
cost:small
impact:medium