
JIT Hardware Intrinsics low compiler throughput due to inefficient intrinsic identification algorithm during import phase #13617

Open
4creators opened this issue Oct 21, 2019 · 11 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Milestone

Comments

@4creators
Contributor

4creators commented Oct 21, 2019

During the design phase of the JIT hardware intrinsics compiler work, an inefficient, temporary algorithm was allowed in for looking up intrinsic function names/ids. The naive search currently in use takes O(n) time for each intrinsic imported, although the lookup could be done in O(1) with a very low constant overhead. The problem hits particularly hard in functions that use many intrinsic instructions, since the total search time is t = number_of_intrinsics_used * O(n). It is especially noticeable in applications that require a fast startup time: cloud lambda functions and command-line tools, e.g. PowerShell Core.

The function implementing the algorithm that slipped into the release is the following:

https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L186

and the algorithm used:

https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L210

Previously discussed solutions included using a hashtable and/or a fast binary preselection, which is possible because the set of search terms is fixed.
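
For illustration, a minimal sketch of the hashtable idea: build a name-to-id map once, then resolve each intrinsic with an O(1) average-case lookup. The struct and table names loosely mirror the JIT's existing tables, but the map, initNameMap, and lookupIdHashed below are hypothetical, not the actual coreclr code.

```cpp
#include <string>
#include <unordered_map>

// Stand-ins for the JIT's generated tables; the real ones come from
// hwintrinsiclistxarch.h and are much larger.
enum NamedIntrinsic { NI_Illegal = 0, NI_SSE_Add, NI_AVX2_Xor };

struct HWIntrinsicInfo
{
    NamedIntrinsic id;
    const char*    name;
    int            isa;
};

static const HWIntrinsicInfo hwIntrinsicInfoArray[] = {
    { NI_SSE_Add,  "Add", 1 /* InstructionSet_SSE stand-in  */ },
    { NI_AVX2_Xor, "Xor", 2 /* InstructionSet_AVX2 stand-in */ },
};

// Key on (isa, method name) so identically named methods on different ISA
// classes (e.g. Sse.Add vs Avx2.Add) do not collide.
static std::unordered_map<std::string, NamedIntrinsic> nameMap;

// Populate the map once, e.g. at JIT startup.
static void initNameMap()
{
    for (const HWIntrinsicInfo& info : hwIntrinsicInfoArray)
    {
        nameMap.emplace(std::to_string(info.isa) + ":" + info.name, info.id);
    }
}

// O(1) average-case lookup, replacing the linear scan over the whole table.
static NamedIntrinsic lookupIdHashed(int isa, const char* methodName)
{
    auto it = nameMap.find(std::to_string(isa) + ":" + methodName);
    return (it != nameMap.end()) ? it->second : NI_Illegal;
}
```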

@AndyAyersMS @CarolEidt @fiigii @tannergooding

category:implementation
theme:vector-codegen
skill-level:beginner
cost:small
impact:medium

@EgorBo
Member

EgorBo commented Oct 21, 2019

Just a note: in Mono we use binary search for S.R.I. intrinsics (but we have a small subset); an O(1) lookup would be better, I guess.
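
(For reference, a minimal sketch of the binary-search approach, assuming the intrinsic table is kept sorted by name; the table contents and helper names below are illustrative and not Mono's actual code.)

```cpp
#include <cstdlib>
#include <cstring>

struct IntrinsicEntry
{
    const char* name; // entries must stay sorted ascending by name
    int         id;
};

static const IntrinsicEntry intrinsics[] = {
    { "Add",      1 },
    { "Multiply", 2 },
    { "Xor",      3 },
};

static int compareEntry(const void* key, const void* elem)
{
    return std::strcmp(static_cast<const char*>(key),
                       static_cast<const IntrinsicEntry*>(elem)->name);
}

// O(log n) lookup by method name; returns -1 if the method is not an intrinsic.
static int lookupIntrinsic(const char* name)
{
    const void* found = std::bsearch(name, intrinsics,
                                     sizeof(intrinsics) / sizeof(intrinsics[0]),
                                     sizeof(IntrinsicEntry), compareEntry);
    return (found != nullptr) ? static_cast<const IntrinsicEntry*>(found)->id : -1;
}
```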

@4creators
Contributor Author

@jkotas @CarolEidt

Do you think there is a chance to have this implemented and shipped with v3.1, and eventually backported to v3.0? The current performance of the JIT with HW intrinsics is ... a bit disappointing - one can feel it when using cloud functions for image processing. The alternative would be for crossgen to support intrinsics for R2R compilation with preset CPU architecture targets, shipped with v3.1.

@jkotas
Member

jkotas commented Oct 22, 2019

  • 3.1 is "done". Everything going into it at this point has to go through the approval process.
  • 3.0 is a very short-lived release (it is not LTS). We are doing minimal changes there.

So the next steps would be:

  • Implement the fix and get it to master.
  • Once it is in master and there is data that demonstrates a large impact, we may be able to get approval to back-port it to 3.1 servicing (I am not promising anything).

@CarolEidt
Contributor

It is definitely worth improving the lookup for the intrinsics. That said, I'm not sure how that prioritizes against the other work that's being done in the JIT.

Supporting intrinsics in crossgen is also a reasonable thing to do (and has already been done to a limited extent for SPC.dll in dotnet/coreclr#24689). #11689 captures the remaining issue(s).

@jkotas
Member

jkotas commented Oct 22, 2019

I'm not sure how that prioritizes against the other work that's being done in the JIT.

That's where the data about the impact would come in handy.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@AndyAyersMS
Member

Looks like we still don't have data measuring the actual impact of this lookup on jit throughput. I'll see if I can come up with something.

@tannergooding
Member

I would suspect the worst case is AVX2.Xor, which will end up doing 371 checks of whether isa == hwIntrinsicInfoArray[i].isa: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.cpp#L285-L335
It will then do 66 more checks where the ISA matches but the name doesn't, each of which involves a strcmp call.

We could simplify a good bit of this by removing the isa checks, using a small table that returns the first/last NamedIntrinsic for a given ISA.
That would take it down to just the strcmp calls at least.

We could then probably optimize the string checks a bit, but that would be more complex.
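
(A rough, self-contained sketch of the first/last-per-ISA idea; the range table, stub table contents, and function name are made up for illustration, and the real change would live in lookupId.)

```cpp
#include <cstring>

enum NamedIntrinsic { NI_Illegal = 0, NI_AVX2_Add, NI_AVX2_Xor };

struct HWIntrinsicInfo
{
    NamedIntrinsic id;
    const char*    name;
    int            isa;
};

// Stand-in for the JIT's generated table, which is already grouped by ISA.
static const HWIntrinsicInfo hwIntrinsicInfoArray[] = {
    { NI_AVX2_Add, "Add", 1 /* InstructionSet_AVX2 stand-in */ },
    { NI_AVX2_Xor, "Xor", 1 },
};

// Hypothetical per-ISA range table: first/last index into hwIntrinsicInfoArray.
struct IsaRange
{
    int first;
    int last;
};

static const IsaRange isaRanges[] = {
    { 0, -1 }, // ISA 0: no entries
    { 0, 1 },  // ISA 1: entries [0, 1]
};

static NamedIntrinsic lookupIdInIsaRange(int isa, const char* methodName)
{
    const IsaRange& range = isaRanges[isa];

    // No isa == ... comparisons remain; only strcmp over this ISA's own names.
    for (int i = range.first; i <= range.last; i++)
    {
        if (std::strcmp(methodName, hwIntrinsicInfoArray[i].name) == 0)
        {
            return hwIntrinsicInfoArray[i].id;
        }
    }
    return NI_Illegal;
}
```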

@AndyAyersMS
Member

Probably not a great test, but on Avx2_ro, we spend 2232 ms jitting, and somewhere around 2 ms in lookupId.

@tannergooding
Member

We also have the src/coreclr/tests/src/JIT/Performance/CodeQuality/HWIntrinsic/X86/PacketTracer benchmark if we want to test a slightly more "real world" example of intrinsic usage.

However, it sounds like there isn't a huge penalty for it overall, and while we could speed it up, it isn't likely to make a noticeable impact.

Was there anything else in the HWIntrinsic-specific code paths that was taking a large amount of time? (I think import, codegen, and emit are the biggest; everything else is largely shared.)

@AndyAyersMS
Member

Nothing really jumps out. By exclusive profile (under jitNativeCode)
[exclusive profile screenshot]
Inclusive: codegen is 17%, importer 14%, morph 10%, linear scan 9%, ...

@AndyAyersMS
Member

I'm going to move this to future as it doesn't seem urgent to address now.

@AndyAyersMS AndyAyersMS modified the milestones: 5.0, Future Apr 29, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@TIHan TIHan removed the JitUntriaged CLR JIT issues needing additional triage label Oct 31, 2022
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Jun 20, 2024