Allow generating HW intrinsics in crossgen #24689

MichalStrehovsky · 2019-05-21T16:26:52Z

We currently don't precompile methods that use hardware intrinsics because we don't know the CPU that the generated code will run on. Jitting these methods slows down startup and accounts for 3% of startup time in PowerShell.

With this change, we're going to lift this restriction for CoreLib (the thing that matters for startup) and support generating HW intrinsics for our minimum supported target ISA (SSE/SSE2).

This means that code that uses intrinsics higher than this will be compiled with IsSupported = false. We will rely on tiered JITting to eventually replace the pregenerated method body that targets the minimum supported ISA with the most efficient code supported by the processor.
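The policy in this description can be modeled in a few lines. The following is a hypothetical C++ sketch, not crossgen or RyuJIT code: `Isa`, `CodeKind`, `IsSupported`, and `ChosenPath` are illustrative names, and the model assumes SSE2 is always present on the target CPU and ignores any runtime configuration knobs.

```cpp
#include <string>

// Hypothetical model of the policy described above; none of these names
// exist in crossgen or RyuJIT.
enum class Isa { Sse, Sse2, Avx2 };
enum class CodeKind { Precompiled, Jitted };

// What X.IsSupported evaluates to inside a method body of the given kind.
// Precompiled bodies only assume the minimum ISA (SSE/SSE2); jitted bodies
// see what the actual CPU reports.
bool IsSupported(Isa isa, CodeKind kind, bool cpuHasIsa) {
    if (kind == CodeKind::Precompiled) {
        return isa == Isa::Sse || isa == Isa::Sse2;
    }
    return cpuHasIsa;
}

// A method with an optional AVX2 fast path and an SSE2 path.
// Assumption: SSE2 is always available on the CPU, so true is passed for it.
std::string ChosenPath(CodeKind kind, bool cpuHasAvx2) {
    if (IsSupported(Isa::Avx2, kind, cpuHasAvx2)) return "avx2";
    if (IsSupported(Isa::Sse2, kind, /*cpuHasIsa=*/true)) return "sse2";
    return "software";
}
```

In this model, an AVX2-capable machine takes the SSE2 path while the precompiled body is in use (`ChosenPath(CodeKind::Precompiled, true)`), and the AVX2 path once the method has been rejitted (`ChosenPath(CodeKind::Jitted, true)`).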

MichalStrehovsky · 2019-05-21T16:28:32Z

src/zap/zapper.cpp

@@ -1201,6 +1201,8 @@ void Zapper::InitializeCompilerFlags(CORCOMPILE_VERSION_INFO * pVersionInfo)

 #endif // _TARGET_X86_

+    m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD);


Compiler::compSupportsHWIntrinsic in RyuJIT would refuse to generate SSE/SSE2 without this flag.

We still block loading Vector<T> on the runtime side, so vectors won't light up with this.

CC. @CarolEidt.

Pretty sure this is safe, but she may be able to better answer if there are any other fixups needed.

Do we also need to set the hwintrinsic feature and EnableSSE/EnableSSE2?

Do we have a test showing that EnableSSE3+ and others like EnableLZCNT or EnableBMI1 (which aren't part of the SSE hierarchy) don't get generated for crossgen/r2r?

There were some wrinkles with enabling the recognition of SIMD types in crossgen (which was needed to correctly support the Vector ABI on Arm64), but I think that we should be OK now that those have been ironed out.

Do we also need to set the hwintrinsic feature and EnableSSE/EnableSSE2?

The latter two should always be enabled by default. I'm not sure about the hw intrinsic feature.

Do we have a test showing that EnableSSE3+ and others like EnableLZCNT or EnableBMI1 (which aren't part of the SSE hierarchy) don't get generated for crossgen/r2r?

I don't think that we currently have a good way to detect this. It would certainly be worth making an investment in figuring out how we can do this.

The latter two should always be enabled by default. I'm not sure about the hw intrinsic feature.

I guess the better question is: is there anything we need to do here to ensure that the CPUID checks and the enabled ISAs are correct?

For the normal JIT codegen we ultimately have:

* The CPUID checks/caching that happens in the VM: https://github.com/dotnet/coreclr/blob/master/src/vm/codeman.cpp#L1374
* Various JitConfig checks that tell the HWIntrinsic code which ISAs are "supported": https://github.com/dotnet/coreclr/blob/master/src/jit/compiler.cpp#L2254 (since users can disable HWIntrinsic codegen for SSE/SSE2 even if they are "baseline").

I don't think that we currently have a good way to detect this. It would certainly be worth making an investment in figuring out how we can do this.

Could we just have a set of tests that are always run through crossgen/r2r? We could then just hardcode them to validate Sse.IsSupported/Sse2.IsSupported are true and all other ISAs are false.

because we know CoreLib won't do something dumb, like caching Avx.IsSupported into a field and using that thorough

Even if some user did this, I don't think there is a concern. It just means they will get worse codegen at runtime (e.g. they won't hit any Avx codepaths using their cached check).

We should also be actively discouraging people from doing that because it will produce worse codegen (since it may not be treated as a JIT time constant).

I think the perf benefits of enabling this everywhere probably outweigh the chance that someone does something dumb, especially once we start using the HWIntrinsics more broadly (and we already are for some CoreFX code).

(I imagine we will also ultimately have an AOT story where SSE/SSE2 are enabled by default and where you can explicitly opt-into additional ISAs if you want to be more restrictive, in which case we probably just want to do the right thing by default here as well)

I'd like to see us address #17568 and #21603 more broadly (post 3.0, presumably), i.e.:

* recognize SIMD types during crossgen so that, even if we're not recognizing intrinsics, we don't generate really awful code to reference the structs.
* support intrinsics more broadly during AOT, as @tannergooding suggests above

I think that if we allow the developer to opt into additional ISAs, we'll need to come up with a plan for both how to best expose it, as well as how to ensure adequate test coverage.

For now, this seems like a very good initial step to address the startup issues.

After this change would there be more work in the runtime to allow crossgen to specify a minimum ISA? I'm assuming testing, but beyond that nothing fundamental, correct?

Also for those who opt out of tiered jitting, does this make things worse in any way?

If one opts out of tiered jitting, then any pessimization caused by SSE/AVX transition penalties would remain. It is also the case that other pessimizations due to R2R constraints would remain.

tannergooding · 2019-05-21T18:25:48Z

src/vm/methodtablebuilder.cpp

-            if (IsCompilationProcess())
+#ifdef CROSSGEN_COMPILE
+            if (!IsNgenPDBCompilationProcess() &&
+                GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule())


Does this check mean that we only do this for corelib?

(That is, we only support HWIntrinsics for corelib and we say "disallowed" for everything else).

CarolEidt

This seems reasonable, though I think the comments need updating (and expanding).

CarolEidt · 2019-05-21T22:38:13Z

src/vm/methodtablebuilder.cpp

-            if (IsCompilationProcess())
+#ifdef CROSSGEN_COMPILE
+            if (!IsNgenPDBCompilationProcess() &&
+                GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule())


The comment below needs editing - I think it's actually incorrect now. Also, it would be good to clarify exactly what the above condition is checking.

MichalStrehovsky · 2019-05-23T19:25:26Z

There are two test failures with this:

One is a bit annoying - it's a test that attempts to reflection-execute the HW intrinsics and since we pregenerated the method body for the intrinsics in Crossgen, we get different behaviors than what's expected. I can just make it so that we don't compile method bodies on the HW intrinsic classes themselves and the problem will go away.

The other failure is a RyuJIT assert:

Assert failure(PID 68868 [0x00010d04], Thread: 775384 [0xbd4d8]): Assertion failed 'varDsc->lvDoNotEnregister || tree->OperIs(GT_LCL_VAR, GT_STORE_LCL_VAR)' in 'RayTracer:GetNaturalColor(ref,struct,struct,struct,ref):struct:this' (IL size 369)

    File: /Users/vsts/agent/2.150.3/work/1/s/src/jit/liveness.cpp Line: 57
    Image: /Users/dotnet-bot/dotnetbuild/work/7e99e9ac-194c-4033-8962-94894aae9a20/Payload/crossgen

This is because I forgot to restrict the addition of the CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD flag to CoreLib only. It should go away if I add the restriction. @CarolEidt is this a concern?

CarolEidt · 2019-05-23T20:17:01Z

a test that attempts to reflection-execute the HW intrinsics and since we pregenerated the method body for the intrinsics in Crossgen, we get different behaviors than what's expected. I can just make it so that we don't compile method bodies on the HW intrinsic classes themselves and the problem will go away.

I think it's probably best not to pregenerate method bodies for the intrinsics, since we don't generally expect them to be called. That said, I'm unclear how the behavior would change in an observable way - can you explain and/or point me to the failing test?

For the second one, I don't understand why you would hit an assert when compiling something other than CoreLib. Is this because, while FeatureSIMD is enabled we are somehow not actually recognizing the types? A jitdump for this would be useful.

MichalStrehovsky · 2019-05-29T13:32:38Z

That said, I'm unclear how the behavior would change in an observable way - can you explain and/or point me to the failing test?

It was this test:

coreclr/tests/src/JIT/HardwareIntrinsics/X86/General/IsSupported.cs

Lines 33 to 52 in b795318

    
    if (Convert.ToBoolean(typeof(Sse).GetMethod(issupported).Invoke(null, null)) != Sse.IsSupported ||
        Convert.ToBoolean(typeof(Sse2).GetMethod(issupported).Invoke(null, null)) != Sse2.IsSupported ||
        Convert.ToBoolean(typeof(Sse3).GetMethod(issupported).Invoke(null, null)) != Sse3.IsSupported ||
        Convert.ToBoolean(typeof(Ssse3).GetMethod(issupported).Invoke(null, null)) != Ssse3.IsSupported ||
        Convert.ToBoolean(typeof(Sse41).GetMethod(issupported).Invoke(null, null)) != Sse41.IsSupported ||
        Convert.ToBoolean(typeof(Sse42).GetMethod(issupported).Invoke(null, null)) != Sse42.IsSupported ||
        Convert.ToBoolean(typeof(Avx).GetMethod(issupported).Invoke(null, null)) != Avx.IsSupported ||
        Convert.ToBoolean(typeof(Avx2).GetMethod(issupported).Invoke(null, null)) != Avx2.IsSupported ||
        Convert.ToBoolean(typeof(Lzcnt).GetMethod(issupported).Invoke(null, null)) != Lzcnt.IsSupported ||
        Convert.ToBoolean(typeof(Popcnt).GetMethod(issupported).Invoke(null, null)) != Popcnt.IsSupported ||
        Convert.ToBoolean(typeof(Bmi1).GetMethod(issupported).Invoke(null, null)) != Bmi1.IsSupported ||
        Convert.ToBoolean(typeof(Bmi2).GetMethod(issupported).Invoke(null, null)) != Bmi2.IsSupported ||
        Convert.ToBoolean(typeof(Sse.X64).GetMethod(issupported).Invoke(null, null)) != Sse.X64.IsSupported ||
        Convert.ToBoolean(typeof(Sse2.X64).GetMethod(issupported).Invoke(null, null)) != Sse2.X64.IsSupported ||
        Convert.ToBoolean(typeof(Sse41.X64).GetMethod(issupported).Invoke(null, null)) != Sse41.X64.IsSupported ||
        Convert.ToBoolean(typeof(Sse42.X64).GetMethod(issupported).Invoke(null, null)) != Sse42.X64.IsSupported ||
        Convert.ToBoolean(typeof(Lzcnt.X64).GetMethod(issupported).Invoke(null, null)) != Lzcnt.X64.IsSupported ||
        Convert.ToBoolean(typeof(Popcnt.X64).GetMethod(issupported).Invoke(null, null)) != Popcnt.X64.IsSupported ||
        Convert.ToBoolean(typeof(Bmi1.X64).GetMethod(issupported).Invoke(null, null)) != Bmi1.X64.IsSupported ||
        Convert.ToBoolean(typeof(Bmi2.X64).GetMethod(issupported).Invoke(null, null)) != Bmi2.X64.IsSupported

We end up calling a pregenerated method body by reflection and the IsSupported in the pregenerated body doesn't match what the actual intrinsic expansion does.

In light of all this, I think it's best to scope this down to SSE and SSE2. These are guaranteed to match what we would see at runtime (modulo the opt-out COMPlus environment variable that lets one completely disable HW intrinsics at runtime).

For the second one, I don't understand why you would hit an assert when compiling something other than CoreLib. Is this because, while FeatureSIMD is enabled we are somehow not actually recognizing the types?

Yes, I was setting CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD in Crossgen unconditionally. I now scoped it back to only be set in CoreLib. The problem should be gone now.

tannergooding · 2019-05-29T15:33:10Z

src/vm/methodtablebuilder.cpp

+#if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
+            if ((!IsNgenPDBCompilationProcess()
+                && GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule())
+                || (strcmp(className, "Sse") != 0 && strcmp(className, "Sse2") != 0))


Why is this needed? Can we not just report them as unsupported in the compiler or VM layer and have things work "naturally"?

We need to avoid AOT compiling anything that touches the other intrinsics that could have a different support level at runtime. It solves two problems:

1. The reflection test tries to reflection-invoke e.g. Avx.IsSupported and finds that the method body for IsSupported was crossgenned as returning false, while in reality Avx.IsSupported returns true. This could also be solved by not crossgenning any of the methods on the Avx type, but there's also problem number 2.

2. If there's an interprocedural dependency, we'll hit bugs once tiering kicks in. Imagine that the linked method was calling Avx.IsSupported instead of Sse2.IsSupported: we crossgen it with Avx.IsSupported == false. At some point GetIndexOfFirstNonAsciiByte gets tiered up, now with Avx.IsSupported == true. At that point we start calling a method that doesn't have any guards for Avx.IsSupported and directly calls into intrinsics, but we pregenerated those as throwing PlatformNotSupportedException.
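The second problem can be illustrated with a small model. This is a hypothetical C++ sketch, not runtime code: `HelperBody`, `CallerBody`, and `Run` are made-up names, and each `bool` stands for the value of Avx.IsSupported that was baked into a method body when it was compiled.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PlatformNotSupportedException.
struct PlatformNotSupported : std::runtime_error {
    PlatformNotSupported() : std::runtime_error("PlatformNotSupportedException") {}
};

// Callee that uses AVX intrinsics without its own IsSupported guard.
// calleeAvx is the value baked in when this body was compiled.
std::string HelperBody(bool calleeAvx) {
    if (!calleeAvx) {
        // Pregenerated body: the intrinsic was compiled as "unsupported".
        throw PlatformNotSupported();
    }
    return "avx";
}

// Caller that guards the call with an IsSupported check.
// callerAvx is the value baked into the caller's current body;
// calleeAvx is the value baked into the callee body that is installed.
std::string CallerBody(bool callerAvx, bool calleeAvx) {
    if (callerAvx) {
        return HelperBody(calleeAvx);
    }
    return "fallback";
}

// Returns the path taken, or "throws" if the callee raised.
std::string Run(bool callerAvx, bool calleeAvx) {
    try {
        return CallerBody(callerAvx, calleeAvx);
    } catch (const PlatformNotSupported&) {
        return "throws";
    }
}
```

With both bodies pregenerated (`Run(false, false)`) the fallback is taken, and with both rejitted (`Run(true, true)`) the AVX path is taken. The hazard is the mixed state `Run(true, false)`: the caller has been tiered up but the callee is still the pregenerated body, so the call throws.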

CarolEidt · 2019-05-29T16:52:51Z

src/zap/zapper.cpp

+    // of the hardware intrinsics.
+    if (m_pEECompileInfo->GetAssemblyModule(m_hAssembly) == m_pEECompileInfo->GetLoaderModuleForMscorlib())
+    {
+        m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD);


This flag is only defined for targets that support SIMD (_TARGET_X86_, _TARGET_AMD64_ and _TARGET_ARM64_). While we don't have any uses of ARM64 intrinsics currently in SPC.dll, I'd probably include that in this.

It looks like that's causing the Arm failure.

Yes it was. Thanks!
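A minimal sketch of the kind of guard discussed in this thread, assuming a simplified flag set. The `_TARGET_*_` macros are the ones named above; `FLAG_FEATURE_SIMD`, `TARGET_SUPPORTS_SIMD`, and `InitializeFlags` are hypothetical stand-ins and do not reproduce the actual zapper.cpp change.

```cpp
#include <bitset>

// Hypothetical stand-in for the JIT flag set.
enum CompilerFlag { FLAG_FEATURE_SIMD = 0, FLAG_COUNT };

// The SIMD flag only exists on targets that support SIMD.
#if defined(_TARGET_X86_) || defined(_TARGET_AMD64_) || defined(_TARGET_ARM64_)
#define TARGET_SUPPORTS_SIMD 1
#else
#define TARGET_SUPPORTS_SIMD 0
#endif

// Sets the SIMD flag only when compiling CoreLib on a SIMD-capable target.
std::bitset<FLAG_COUNT> InitializeFlags(bool compilingCoreLib) {
    std::bitset<FLAG_COUNT> flags;
#if TARGET_SUPPORTS_SIMD
    if (compilingCoreLib) {
        flags.set(FLAG_FEATURE_SIMD);
    }
#else
    (void)compilingCoreLib; // flag is not defined for this target
#endif
    return flags;
}
```

On a target where none of the macros is defined, the code that touches the flag is compiled out entirely, which is what keeps the build working there.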

CarolEidt

LGTM

tannergooding · 2019-05-30T13:47:10Z

@MichalStrehovsky, could you update with some perf numbers as well, showing how this improves startup times for things like PowerShell?

Also, CC. @adamsitnik who may be interested in trying to get some updated numbers.

MichalStrehovsky · 2019-05-30T14:07:23Z

@MichalStrehovsky, could you update with some perf numbers as well, showing how this improves startup times for things like PowerShell?

We get a 5+ ms startup improvement. This lets us AOT compile the following methods that we saw JITting on the PowerShell startup path:

| JIT time (ms) | Method |
| --- | --- |
| 3.303 | `System.Text.Unicode.Utf8Utility.TranscodeToUtf16(unsigned int8* int32 wchar* int32 unsigned int8& wchar&)` |
| 0.606 | `System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiChar_Sse2(wchar* unsigned int64)` |
| 0.605 | `System.Text.ASCIIUtility.NarrowUtf16ToAscii_Sse2(wchar* unsigned int8* unsigned int64)` |
| 0.586 | `System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiByte_Sse2(unsigned int8* unsigned int64)` |
| 0.257 | `System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(unsigned int8* wchar* unsigned int64)` |

What is still left on the table in the HW intrinsics space are these methods that use intrinsics higher than SSE2:

| JIT time (ms) | Method |
| --- | --- |
| 2.45 | `System.Text.Unicode.Utf8Utility.TranscodeToUtf8(wchar* int32 unsigned int8* int32 wchar& unsigned int8&)` |
| 1.9 | `System.Text.Unicode.Utf8Utility.GetPointerToFirstInvalidByte(unsigned int8* int32 int32& int32&)` |
| 1.708 | `System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar(wchar* int32 int64& int32&)` |
| 0.635 | `System.Text.ASCIIUtility.NarrowUtf16ToAscii(wchar* unsigned int8* unsigned int64)` |
| 0.492 | `System.Text.ASCIIUtility.WidenAsciiToUtf16(unsigned int8* wchar* unsigned int64)` |

tannergooding · 2019-05-30T14:49:11Z

What is still left on the table in the HW intrinsics space are these methods that use intrinsics higher than SSE2.

This is unfortunate, and I presume it is fallout from this: #24689 (comment)...

The methods in question above really just have optional paths that support higher-level intrinsics and would still be fully AOT-compilable when limited to just SSE/SSE2. So in an ideal world, crossgen/r2r would AOT these for the SSE/SSE2 or software fallback paths, and tiered jitting would later rejit them using the higher ISAs (such as BMI1/BMI2)...

mjsabby · 2019-05-30T15:45:51Z

Would this change to the runtime also allow generation of an AVX2 SPC? Assuming you use a build of crossgen that will set the ISA.

CarolEidt · 2019-05-30T16:35:03Z

I added some comments to the broader issue here: https://github.com/dotnet/coreclr/issues/21603#issuecomment-497394620

This is a follow up to dotnet#24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensure the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap. There is a cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code might not guard intrinsic use with `IsSupported` checks, and pregenerating such code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws), which is a bad user experience.

* Allow generating HW intrinsics in crossgen: commit migrated from dotnet/coreclr@d4fadf0
* Allow pregenerating all HW intrinsics in CoreLib (follow up to dotnet/coreclr#24689): commit migrated from dotnet/coreclr@e73c8e6

MichalStrehovsky requested review from tannergooding and CarolEidt May 21, 2019 16:29

tannergooding reviewed May 21, 2019

CarolEidt reviewed May 21, 2019

MichalStrehovsky added 2 commits May 29, 2019 15:23

a (9f102d9)

Merge branch 'master' into isa (1e46f36)

b (9a4119d)

tannergooding reviewed May 29, 2019

CarolEidt reviewed May 29, 2019

Put SIMD under ifdef (6e8fb5f)

CarolEidt approved these changes May 29, 2019

MichalStrehovsky merged commit d4fadf0 into dotnet:master May 30, 2019

MichalStrehovsky deleted the isa branch May 30, 2019 07:48

MichalStrehovsky mentioned this pull request Jun 3, 2019: Allow pregenerating most HW intrinsics in CoreLib #24917 (Merged)

MichalStrehovsky mentioned this pull request Aug 30, 2019: ImproveComments - making sure the comment is accurately describing the code #26442 (Merged)

MichalStrehovsky mentioned this pull request Sep 26, 2019: [CPAOT] Generating code for hardware intrinsic #26772 (Merged)
