-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Allow generating HW intrinsics in crossgen #24689
Conversation
We currently don't precompile methods that use hardware intrinsics because we don't know the CPU that the generated code will run on. Jitting these methods slows down startup and accounts for 3% of startup time in PowerShell. With this change, we're going to lift this restriction for CoreLib (the thing that matters for startup) and support generating HW intrinsics for our minimum supported target ISA (SSE/SSE2). This means that code that uses intrinsics higher than this will be compiled with IsSupported = false. We will rely on tiered JITting to eventually replace the pregenerated method body that targets the minimum supported ISA with the most efficient code supported by the processor.
src/zap/zapper.cpp
Outdated
@@ -1201,6 +1201,8 @@ void Zapper::InitializeCompilerFlags(CORCOMPILE_VERSION_INFO * pVersionInfo) | |||
|
|||
#endif // _TARGET_X86_ | |||
|
|||
m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compiler::compSupportsHWIntrinsic
in RyuJIT would refuse generating SSE/SSE2 without this flag.
We still block loading Vector<T>
on the runtime side, so vectors won't light up with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC. @CarolEidt.
Pretty sure this is safe, but she may be able to better answer if there are any other fixups needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to set the hwintrinsic feature and EnableSSE
/EnableSSE2
?
Do we have a test showing that EnableSSE3+
and others like EnableLZCNT
or EnableBMI1
(which aren't part of the SSE hierarchy) don't get generated for crossgen/r2r?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were some wrinkles with enabling the recognition of SIMD types in crossgen (which was needed to correctly support the Vector ABI on Arm64), but I think that we should be OK now that those have been ironed out.
Do we also need to set the hwintrinsic feature and EnableSSE/EnableSSE2?
The latter two should always be enabled by default. I'm not sure about the hw intrinsic feature.
Do we have a test showing that EnableSSE3+ and others like EnableLZCNT or EnableBMI1 (which aren't part of the SSE hierarchy) don't get generated for crossgen/r2r?
I don't think that we currently have a good way to detect this. It would certainly worth making an investment in figuring out how we can do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter two should always be enabled by default. I'm not sure about the hw intrinsic feature.
I guess the better question is there anything we need to do here to ensure that the CPUID checks and the enabled ISAs are correct?
For the normal JIT codegen we ultimately have:
The CPUID checks/caching that happens in the VM: https://github.com/dotnet/coreclr/blob/master/src/vm/codeman.cpp#L1374
-and -
Various JitConfig checks that tell the HWIntrinsic code which ISAs are "supported": https://github.com/dotnet/coreclr/blob/master/src/jit/compiler.cpp#L2254 (since users can disable HWIntrinsic codegen for SSE/SSE2 even if they are "baseline").
I don't think that we currently have a good way to detect this. It would certainly worth making an investment in figuring out how we can do this.
Could we just have a set of tests that are always run through crossgen/r2r? We could then just hardcode them to validate Sse.IsSupported
/Sse2.IsSupported
are true and all other ISAs are false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we know CoreLib won't do something dumb, like caching Avx.IsSupported into a field and using that thorough
Even if some user did this, I don't think there is a concern. It just means they will get worse codegen at runtime (e.g. they won't hit any Avx codepaths using their cached check).
We should also be actively discouraging people from doing that because it will produce worse codegen (since it may not be treated as a JIT time constant).
I think the perf benefits of enabling this everywhere probably outweighs the chance that someone does something dumb, especially once we start using the HWIntrinsics more broadly (and we already are for some CoreFX code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I imagine we will also ultimately have an AOT story where SSE/SSE2 are enabled by default and where you can explicitly opt-into additional ISAs if you want to be more restrictive, in which case we probably just want to do the right thing by default here as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see us address #17568 and #21603 more broadly (post 3.0, presumably) - i.e.
- recognize SIMD types during crossgen so that, even if we're not recognizing intrinsics, we don't generate really awful code to reference the structs.
- support intrinsics more broadly during AOT, as @tannergooding suggests above
- I think that if we allow the developer to opt into additional ISAs, we'll need to come up with a plan for both how to best expose it, as well as how to ensure adequate test coverage.
For now, this seems like a very good initial step to address the startup issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change would there be more work in the runtime to allow crossgen to specify a minimum ISA? I'm assuming testing, but beyond that nothing fundamental, correct?
Also for those who opt out of tiered jitting, does this make things worse in any way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If one opts out of tiered jitting, then any pessimization caused by SSE/AVX transition penalties would remain. It is also the case that other pessimizations due to R2R constraints would remain.
src/vm/methodtablebuilder.cpp
Outdated
if (IsCompilationProcess()) | ||
#ifdef CROSSGEN_COMPILE | ||
if (!IsNgenPDBCompilationProcess() && | ||
GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this check mean that we only do this for corelib?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(That is, we only support HWIntrinsics for corelib and we say "disallowed" for everything else).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment below needs editing - I think it's actually incorrect now. Also, it would be good to clarify exactly what the above condition is checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable, though I think the comments need updating (and expanding).
src/vm/methodtablebuilder.cpp
Outdated
if (IsCompilationProcess()) | ||
#ifdef CROSSGEN_COMPILE | ||
if (!IsNgenPDBCompilationProcess() && | ||
GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment below needs editing - I think it's actually incorrect now. Also, it would be good to clarify exactly what the above condition is checking.
There are two test failures with this: One is a bit annoying - it's a test that attempts to reflection-execute the HW intrinsics and since we pregenerated the method body for the intrinsics in Crossgen, we get different behaviors than what's expected. I can just make it so that we don't compile method bodies on the HW intrinsic classes themselves and the problem will go away. The other failure is a RyuJIT assert:
This is because I forgot to restrict the addition |
I think it's probably best not to pregenerate method bodies for the intrinsics, since we don't generally expect them to be called. That said, I'm unclear how the behavior would change in an observable way - can you explain and/or point me to the failing test? For the second one, I don't understand why you would hit an assert when compiling something other than CoreLib. Is this because, while FeatureSIMD is enabled we are somehow not actually recognizing the types? A jitdump for this would be useful. |
It was this test: coreclr/tests/src/JIT/HardwareIntrinsics/X86/General/IsSupported.cs Lines 33 to 52 in b795318
We end up calling a pregenerated method body by reflection and the In light of all this, I think it's the best to scope this down to SSE and SSE2. These are guaranteed to match what we would see at runtime (modulo the opt out COMPlus envionment variable that lets one completely disable HW intrinsics at runtime).
Yes, I was setting |
#if defined(_TARGET_X86_) || defined(_TARGET_AMD64_) | ||
if ((!IsNgenPDBCompilationProcess() | ||
&& GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule()) | ||
|| (strcmp(className, "Sse") != 0 && strcmp(className, "Sse2") != 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We is this needed? Can we just not report them as unsupported in the compiler or vm layer and have things work "naturally"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to avoid AOT compiling anything that touches the other intrinsics that could have different support level at runtime. It solves two problems:
- The reflection test that is trying to reflection-invoke e.g.
Avx.IsSupported
and finds out that the method body forIsSupported
was crossgenned as returningfalse
, but the reality is that Avx.IsSupported returns true. This could also be solved by not crossgenning any of the methods on the Avx type, but there's also problem number 2: - If there's an intraprocedural dependency, we'll hit bugs once tiering kicks in. Imagine that the linked method was calling
Avx.IsSupported
instead ofSse2.IsSupported
- we crossgen it withAvx.IsSupported == false
. At some pointGetIndexOfFirstNonAsciiByte
gets tiered up andAvx.IsSupported == true
. At that point we start calling a method that doesn't have any guards forAvx.IsSupported
and directly calls into intrinsics, but we pregenerated them as throwingPlatformNotSupportedException
.
// of the hardware intrinsics. | ||
if (m_pEECompileInfo->GetAssemblyModule(m_hAssembly) == m_pEECompileInfo->GetLoaderModuleForMscorlib()) | ||
{ | ||
m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag is only defined for targets that support SIMD (_TARGET_X86_
, _TARGET_AMD64_
and _TARGET_ARM64_
). While we don't have any uses of ARM64 intrinsics currently in SPC.dll, I'd probably include that in this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like that's causing the Arm failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it was. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@MichalStrehovsky, could you update with some perf numbers as well, showing how this improves startup times for things like Powershell? Also, CC. @adamsitnik who may be interested in trying to get some updated numbers. |
We get 5+ ms startup improvement. This lets us AOT compile following methods that we saw JITting on the PowerShell startup path:
What is still left on the table in the HW intrinsics space are these methods that use intrinsics higher than SSE2:
|
This is unfortunate and I presume to be fallout from this: #24689 (comment)... The methods in question above really just have optional paths that support higher level intrinsics and would still be fully aot compilable when limited to just SSE/SSE2. So in an ideal world, crossgen/r2r would AOT these for the SSE/SSE2 or software fallback paths and tiered jitting would later rejit them using the higher ISAs (such as BMI1/BMI2)... |
Would this change to the runtime also allow generation of an AVX2 SPC? Assuming you use a build of crossgen that will set the ISA. |
I added some comments to the broader issue here: https://github.com/dotnet/coreclr/issues/21603#issuecomment-497394620 |
This is a follow up to dotnet#24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensures the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap. There is cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code could technically not guard intrinsic use in `IsSupported` checks and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience.
This is a follow up to dotnet#24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensures the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap. There is cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code could technically not guard intrinsic use in `IsSupported` checks and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience.
* Allow pregenerating all HW intrinsics in CoreLib This is a follow up to #24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensures the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap. There is cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code could technically not guard intrinsic use in `IsSupported` checks and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience.
We currently don't precompile methods that use hardware intrinsics because we don't know the CPU that the generated code will run on. Jitting these methods slows down startup and accounts for 3% of startup time in PowerShell. With this change, we're going to lift this restriction for CoreLib (the thing that matters for startup) and support generating HW intrinsics for our minimum supported target ISA (SSE/SSE2). Commit migrated from dotnet/coreclr@d4fadf0
* Allow pregenerating all HW intrinsics in CoreLib This is a follow up to dotnet/coreclr#24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensures the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap. There is cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code could technically not guard intrinsic use in `IsSupported` checks and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience. Commit migrated from dotnet/coreclr@e73c8e6
We currently don't precompile methods that use hardware intrinsics because we don't know the CPU that the generated code will run on. Jitting these methods slows down startup and accounts for 3% of startup time in PowerShell.
With this change, we're going to lift this restriction for CoreLib (the thing that matters for startup) and support generating HW intrinsics for our minimum supported target ISA (SSE/SSE2).