JIT Hardware Intrinsics low compiler throughput due to inefficient intrinsic identification algorithm during import phase #13617
Comments
Just a note: in Mono we use binary search for S.R.I. intrinsics (but we have a small subset); an O(1) lookup would be better, I guess.
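For illustration only, here is a minimal sketch of what a binary search over a name table sorted at build time could look like; this is not the actual Mono or CoreCLR code, and the `IntrinsicEntry` type and table contents are hypothetical:

```cpp
#include <cstring>

// Hypothetical entry: one row per intrinsic, kept sorted by name at build time.
struct IntrinsicEntry
{
    const char* name;
    int         id;
};

// Illustrative contents only; the real tables are much larger.
static const IntrinsicEntry g_intrinsics[] = {
    { "Add",      1 },
    { "Multiply", 2 },
    { "Subtract", 3 },
};

// O(log n) lookup by method name; returns -1 when the name is not found.
int LookupIntrinsicId(const char* methodName)
{
    int lo = 0;
    int hi = static_cast<int>(sizeof(g_intrinsics) / sizeof(g_intrinsics[0])) - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = std::strcmp(methodName, g_intrinsics[mid].name);
        if (cmp == 0)
        {
            return g_intrinsics[mid].id;
        }
        else if (cmp < 0)
        {
            hi = mid - 1;
        }
        else
        {
            lo = mid + 1;
        }
    }
    return -1;
}
```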
Do you think there is a chance to have it implemented and shipped with v3.1 and eventually backported to v3.0? The current performance of the JIT with HW intrinsics is ... a bit disappointing - one can feel it when using cloud functions for image processing. The alternative would be to have crossgen support intrinsics for R2R compilation, with preset CPU architecture targets, shipped with v3.1.
So the next steps would be:
It is definitely worth improving the lookup for the intrinsics. That said, I'm not sure how that prioritizes against the other work that's being done in the JIT. Supporting intrinsics in crossgen is also a reasonable thing to do (and has already been done to a limited extent for SPC.dll in dotnet/coreclr#24689). #11689 captures the remaining issue(s).
That's where the data about the impact would come in handy.
Looks like we still don't have data measuring the actual impact of this lookup on JIT throughput. I'll see if I can come up with something.
I would suspect the worst case is ... We could simplify a good bit of this by removing the ... We could then probably optimize the string checks a bit, but that would be more complex.
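As a rough illustration only (not the actual JIT code), the kind of cheap string-check optimization alluded to here could compare the cached length and first character before paying for a full comparison; the helper name and the assumption that each table entry caches its name length are hypothetical:

```cpp
#include <cstring>

// Hypothetical helper: reject most candidates without a full string compare,
// assuming the table entry's name length has been cached alongside the name.
bool FastNameMatch(const char* candidate, std::size_t candidateLen,
                   const char* entryName, std::size_t entryLen)
{
    // Cheap rejections first: length and first character.
    if ((candidateLen != entryLen) || (candidate[0] != entryName[0]))
    {
        return false;
    }
    // Only now do the full comparison.
    return std::memcmp(candidate, entryName, entryLen) == 0;
}
```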
Probably not a great test, but on Avx2_ro, we spend 2232 ms jitting, and somewhere around 2 ms in ...
We also have the src/coreclr/tests/src/JIT/Performance/CodeQuality/HWIntrinsic/X86/PacketTracer benchmark if we want to test a slightly more "real world" example of intrinsic usage. However, it sounds like there isn't a huge penalty for it overall and, while we could speed it up, it isn't likely to make a noticeable impact. Was there anything else from the HWIntrinsic-specific code paths that was taking a large amount of time (I think ...
I'm going to move this to future as it doesn't seem urgent to address now. |
During the design phase of the JIT hardware intrinsics compiler work, an inefficient, temporary algorithm was allowed in for looking up intrinsic function names/IDs. The naive search currently used runs in O(n) time for each imported intrinsic, even though the lookup could be done in O(1) with very low constant overhead. The problem hits particularly hard in functions that use many intrinsic instructions, since the total search time is roughly
t = number_of_intrinsics_used * O(n).
This is further exacerbated in applications that require fast startup: cloud lambda functions and command-line tools, e.g. PowerShell Core. The function implementing the algorithm that slipped into the release is the following:
https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L186
and the algorithm used:
https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L210
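For context, a simplified sketch of the O(n) pattern being described - a flat descriptor array scanned front to back for every imported intrinsic call. This is not the actual coreclr source (see the links above for that); the descriptor type, table contents, and function name below are made up:

```cpp
#include <cstring>

// Hypothetical descriptor, one entry per hardware intrinsic.
struct HWIntrinsicDesc
{
    const char* className;
    const char* methodName;
    int         id;
};

static const HWIntrinsicDesc g_hwIntrinsics[] = {
    { "Sse2", "Add",     1 },
    { "Avx2", "Shuffle", 2 },
    // ... hundreds of entries in the real table
};

// O(n) lookup: every imported intrinsic call may walk the whole table,
// so importing k intrinsic calls costs on the order of k * n string compares.
int LookupIdLinear(const char* className, const char* methodName)
{
    const int count = static_cast<int>(sizeof(g_hwIntrinsics) / sizeof(g_hwIntrinsics[0]));
    for (int i = 0; i < count; i++)
    {
        if ((std::strcmp(className,  g_hwIntrinsics[i].className)  == 0) &&
            (std::strcmp(methodName, g_hwIntrinsics[i].methodName) == 0))
        {
            return g_hwIntrinsics[i].id;
        }
    }
    return -1; // not a hardware intrinsic
}
```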
Previously discussed solutions included using a hash table and/or a fast binary preselection, which is possible because the set of search terms is fixed.
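A minimal sketch of the hash-table direction, assuming the lookup is keyed on "ClassName.MethodName" and the map is built once at startup. The standard-library containers are used here purely for illustration; the actual JIT would more likely generate a perfect or precomputed hash from its descriptor table, since the key set is fixed at build time:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical map: "ClassName.MethodName" -> intrinsic id, built once.
static const std::unordered_map<std::string, int>& IntrinsicIdMap()
{
    static const std::unordered_map<std::string, int> map = {
        { "Sse2.Add",     1 },
        { "Avx2.Shuffle", 2 },
        // ... one entry per intrinsic, generated from the descriptor table
    };
    return map;
}

// Amortized O(1) lookup per imported intrinsic call.
int LookupIdHashed(const char* className, const char* methodName)
{
    const std::string key = std::string(className) + "." + methodName;
    const auto        it  = IntrinsicIdMap().find(key);
    return (it != IntrinsicIdMap().end()) ? it->second : -1;
}
```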
@AndyAyersMS @CarolEidt @fiigii @tannergooding
category:implementation
theme:vector-codegen
skill-level:beginner
cost:small
impact:medium