Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Allow generating HW intrinsics in crossgen #24689

Merged
merged 5 commits into from
May 30, 2019

Conversation

MichalStrehovsky
Copy link
Member

@MichalStrehovsky MichalStrehovsky commented May 21, 2019

We currently don't precompile methods that use hardware intrinsics because we don't know the CPU that the generated code will run on. Jitting these methods slows down startup and accounts for 3% of startup time in PowerShell.

With this change, we're going to lift this restriction for CoreLib (the thing that matters for startup) and support generating HW intrinsics for our minimum supported target ISA (SSE/SSE2).

We currently don't precompile methods that use hardware intrinsics because we don't know the CPU that the generated code will run on. Jitting these methods slows down startup and accounts for 3% of startup time in PowerShell.

With this change, we're going to lift this restriction for CoreLib (the thing that matters for startup) and support generating HW intrinsics for our minimum supported target ISA (SSE/SSE2). This means that code that uses intrinsics higher than this will be compiled with IsSupported = false.

We will rely on tiered JITting to eventually replace the pregenerated method body that targets the minimum supported ISA with the most efficient code supported by the processor.
@@ -1201,6 +1201,8 @@ void Zapper::InitializeCompilerFlags(CORCOMPILE_VERSION_INFO * pVersionInfo)

#endif // _TARGET_X86_

m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiler::compSupportsHWIntrinsic in RyuJIT would refuse generating SSE/SSE2 without this flag.

We still block loading Vector<T> on the runtime side, so vectors won't light up with this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC. @CarolEidt.

Pretty sure this is safe, but she may be able to better answer if there are any other fixups needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to set the hwintrinsic feature and EnableSSE/EnableSSE2?

Do we have a test showing that EnableSSE3+ and others like EnableLZCNT or EnableBMI1 (which aren't part of the SSE hierarchy) don't get generated for crossgen/r2r?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were some wrinkles with enabling the recognition of SIMD types in crossgen (which was needed to correctly support the Vector ABI on Arm64), but I think that we should be OK now that those have been ironed out.

Do we also need to set the hwintrinsic feature and EnableSSE/EnableSSE2?

The latter two should always be enabled by default. I'm not sure about the hw intrinsic feature.

Do we have a test showing that EnableSSE3+ and others like EnableLZCNT or EnableBMI1 (which aren't part of the SSE hierarchy) don't get generated for crossgen/r2r?

I don't think that we currently have a good way to detect this. It would certainly worth making an investment in figuring out how we can do this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter two should always be enabled by default. I'm not sure about the hw intrinsic feature.

I guess the better question is there anything we need to do here to ensure that the CPUID checks and the enabled ISAs are correct?

For the normal JIT codegen we ultimately have:

The CPUID checks/caching that happens in the VM: https://github.com/dotnet/coreclr/blob/master/src/vm/codeman.cpp#L1374
-and -
Various JitConfig checks that tell the HWIntrinsic code which ISAs are "supported": https://github.com/dotnet/coreclr/blob/master/src/jit/compiler.cpp#L2254 (since users can disable HWIntrinsic codegen for SSE/SSE2 even if they are "baseline").

I don't think that we currently have a good way to detect this. It would certainly worth making an investment in figuring out how we can do this.

Could we just have a set of tests that are always run through crossgen/r2r? We could then just hardcode them to validate Sse.IsSupported/Sse2.IsSupported are true and all other ISAs are false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we know CoreLib won't do something dumb, like caching Avx.IsSupported into a field and using that thorough

Even if some user did this, I don't think there is a concern. It just means they will get worse codegen at runtime (e.g. they won't hit any Avx codepaths using their cached check).

We should also be actively discouraging people from doing that because it will produce worse codegen (since it may not be treated as a JIT time constant).

I think the perf benefits of enabling this everywhere probably outweighs the chance that someone does something dumb, especially once we start using the HWIntrinsics more broadly (and we already are for some CoreFX code).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I imagine we will also ultimately have an AOT story where SSE/SSE2 are enabled by default and where you can explicitly opt-into additional ISAs if you want to be more restrictive, in which case we probably just want to do the right thing by default here as well)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see us address #17568 and #21603 more broadly (post 3.0, presumably) - i.e.

  • recognize SIMD types during crossgen so that, even if we're not recognizing intrinsics, we don't generate really awful code to reference the structs.
  • support intrinsics more broadly during AOT, as @tannergooding suggests above
    • I think that if we allow the developer to opt into additional ISAs, we'll need to come up with a plan for both how to best expose it, as well as how to ensure adequate test coverage.

For now, this seems like a very good initial step to address the startup issues.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change would there be more work in the runtime to allow crossgen to specify a minimum ISA? I'm assuming testing, but beyond that nothing fundamental, correct?

Also for those who opt out of tiered jitting, does this make things worse in any way?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one opts out of tiered jitting, then any pessimization caused by SSE/AVX transition penalties would remain. It is also the case that other pessimizations due to R2R constraints would remain.

if (IsCompilationProcess())
#ifdef CROSSGEN_COMPILE
if (!IsNgenPDBCompilationProcess() &&
GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this check mean that we only do this for corelib?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(That is, we only support HWIntrinsics for corelib and we say "disallowed" for everything else).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment below needs editing - I think it's actually incorrect now. Also, it would be good to clarify exactly what the above condition is checking.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable, though I think the comments need updating (and expanding).

if (IsCompilationProcess())
#ifdef CROSSGEN_COMPILE
if (!IsNgenPDBCompilationProcess() &&
GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment below needs editing - I think it's actually incorrect now. Also, it would be good to clarify exactly what the above condition is checking.

@MichalStrehovsky
Copy link
Member Author

There are two test failures with this:

One is a bit annoying - it's a test that attempts to reflection-execute the HW intrinsics and since we pregenerated the method body for the intrinsics in Crossgen, we get different behaviors than what's expected. I can just make it so that we don't compile method bodies on the HW intrinsic classes themselves and the problem will go away.

The other failure is a RyuJIT assert:

Assert failure(PID 68868 [0x00010d04], Thread: 775384 [0xbd4d8]): Assertion failed 'varDsc->lvDoNotEnregister || tree->OperIs(GT_LCL_VAR, GT_STORE_LCL_VAR)' in 'RayTracer:GetNaturalColor(ref,struct,struct,struct,ref):struct:this' (IL size 369)

    File: /Users/vsts/agent/2.150.3/work/1/s/src/jit/liveness.cpp Line: 57
    Image: /Users/dotnet-bot/dotnetbuild/work/7e99e9ac-194c-4033-8962-94894aae9a20/Payload/crossgen

This is because I forgot to restrict the addition CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD flag to CoreLib only. It should go away if I add the restriction. @CarolEidt is this a concern?

@CarolEidt
Copy link

a test that attempts to reflection-execute the HW intrinsics and since we pregenerated the method body for the intrinsics in Crossgen, we get different behaviors than what's expected. I can just make it so that we don't compile method bodies on the HW intrinsic classes themselves and the problem will go away.

I think it's probably best not to pregenerate method bodies for the intrinsics, since we don't generally expect them to be called. That said, I'm unclear how the behavior would change in an observable way - can you explain and/or point me to the failing test?

For the second one, I don't understand why you would hit an assert when compiling something other than CoreLib. Is this because, while FeatureSIMD is enabled we are somehow not actually recognizing the types? A jitdump for this would be useful.

@MichalStrehovsky
Copy link
Member Author

That said, I'm unclear how the behavior would change in an observable way - can you explain and/or point me to the failing test?

It was this test:

if (Convert.ToBoolean(typeof(Sse).GetMethod(issupported).Invoke(null, null)) != Sse.IsSupported ||
Convert.ToBoolean(typeof(Sse2).GetMethod(issupported).Invoke(null, null)) != Sse2.IsSupported ||
Convert.ToBoolean(typeof(Sse3).GetMethod(issupported).Invoke(null, null)) != Sse3.IsSupported ||
Convert.ToBoolean(typeof(Ssse3).GetMethod(issupported).Invoke(null, null)) != Ssse3.IsSupported ||
Convert.ToBoolean(typeof(Sse41).GetMethod(issupported).Invoke(null, null)) != Sse41.IsSupported ||
Convert.ToBoolean(typeof(Sse42).GetMethod(issupported).Invoke(null, null)) != Sse42.IsSupported ||
Convert.ToBoolean(typeof(Avx).GetMethod(issupported).Invoke(null, null)) != Avx.IsSupported ||
Convert.ToBoolean(typeof(Avx2).GetMethod(issupported).Invoke(null, null)) != Avx2.IsSupported ||
Convert.ToBoolean(typeof(Lzcnt).GetMethod(issupported).Invoke(null, null)) != Lzcnt.IsSupported ||
Convert.ToBoolean(typeof(Popcnt).GetMethod(issupported).Invoke(null, null)) != Popcnt.IsSupported ||
Convert.ToBoolean(typeof(Bmi1).GetMethod(issupported).Invoke(null, null)) != Bmi1.IsSupported ||
Convert.ToBoolean(typeof(Bmi2).GetMethod(issupported).Invoke(null, null)) != Bmi2.IsSupported ||
Convert.ToBoolean(typeof(Sse.X64).GetMethod(issupported).Invoke(null, null)) != Sse.X64.IsSupported ||
Convert.ToBoolean(typeof(Sse2.X64).GetMethod(issupported).Invoke(null, null)) != Sse2.X64.IsSupported ||
Convert.ToBoolean(typeof(Sse41.X64).GetMethod(issupported).Invoke(null, null)) != Sse41.X64.IsSupported ||
Convert.ToBoolean(typeof(Sse42.X64).GetMethod(issupported).Invoke(null, null)) != Sse42.X64.IsSupported ||
Convert.ToBoolean(typeof(Lzcnt.X64).GetMethod(issupported).Invoke(null, null)) != Lzcnt.X64.IsSupported ||
Convert.ToBoolean(typeof(Popcnt.X64).GetMethod(issupported).Invoke(null, null)) != Popcnt.X64.IsSupported ||
Convert.ToBoolean(typeof(Bmi1.X64).GetMethod(issupported).Invoke(null, null)) != Bmi1.X64.IsSupported ||
Convert.ToBoolean(typeof(Bmi2.X64).GetMethod(issupported).Invoke(null, null)) != Bmi2.X64.IsSupported

We end up calling a pregenerated method body by reflection and the IsSupported in the pregenerated body doesn't match what the actual intrisic expansion does.

In light of all this, I think it's the best to scope this down to SSE and SSE2. These are guaranteed to match what we would see at runtime (modulo the opt out COMPlus envionment variable that lets one completely disable HW intrinsics at runtime).

For the second one, I don't understand why you would hit an assert when compiling something other than CoreLib. Is this because, while FeatureSIMD is enabled we are somehow not actually recognizing the types?

Yes, I was setting CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD in Crossgen unconditionally. I now scoped it back to only be set in CoreLib. The problem should be gone now.

#if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
if ((!IsNgenPDBCompilationProcess()
&& GetAppDomain()->ToCompilationDomain()->GetTargetModule() != g_pObjectClass->GetModule())
|| (strcmp(className, "Sse") != 0 && strcmp(className, "Sse2") != 0))
Copy link
Member

@tannergooding tannergooding May 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We is this needed? Can we just not report them as unsupported in the compiler or vm layer and have things work "naturally"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to avoid AOT compiling anything that touches the other intrinsics that could have different support level at runtime. It solves two problems:

  1. The reflection test that is trying to reflection-invoke e.g. Avx.IsSupported and finds out that the method body for IsSupported was crossgenned as returning false, but the reality is that Avx.IsSupported returns true. This could also be solved by not crossgenning any of the methods on the Avx type, but there's also problem number 2:
  2. If there's an intraprocedural dependency, we'll hit bugs once tiering kicks in. Imagine that the linked method was calling Avx.IsSupported instead of Sse2.IsSupported - we crossgen it with Avx.IsSupported == false. At some point GetIndexOfFirstNonAsciiByte gets tiered up and Avx.IsSupported == true. At that point we start calling a method that doesn't have any guards for Avx.IsSupported and directly calls into intrinsics, but we pregenerated them as throwing PlatformNotSupportedException.

// of the hardware intrinsics.
if (m_pEECompileInfo->GetAssemblyModule(m_hAssembly) == m_pEECompileInfo->GetLoaderModuleForMscorlib())
{
m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag is only defined for targets that support SIMD (_TARGET_X86_, _TARGET_AMD64_ and _TARGET_ARM64_). While we don't have any uses of ARM64 intrinsics currently in SPC.dll, I'd probably include that in this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that's causing the Arm failure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it was. Thanks!

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MichalStrehovsky MichalStrehovsky merged commit d4fadf0 into dotnet:master May 30, 2019
@MichalStrehovsky MichalStrehovsky deleted the isa branch May 30, 2019 07:48
@tannergooding
Copy link
Member

@MichalStrehovsky, could you update with some perf numbers as well, showing how this improves startup times for things like Powershell?

Also, CC. @adamsitnik who may be interested in trying to get some updated numbers.

@MichalStrehovsky
Copy link
Member Author

@MichalStrehovsky, could you update with some perf numbers as well, showing how this improves startup times for things like Powershell?

We get 5+ ms startup improvement. This lets us AOT compile following methods that we saw JITting on the PowerShell startup path:

JIT time Method
3.303 System.Text.Unicode.Utf8Utility.TranscodeToUtf16(unsigned int8* int32 wchar* int32 unsigned int8*& wchar*&)
0.606 System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiChar_Sse2(wchar* unsigned int64)
0.605 System.Text.ASCIIUtility.NarrowUtf16ToAscii_Sse2(wchar* unsigned int8* unsigned int64)
0.586 System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiByte_Sse2(unsigned int8* unsigned int64)
0.257 System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(unsigned int8* wchar* unsigned int64)

What is still left on the table in the HW intrinsics space are these methods that use intrinsics higher than SSE2:

JIT time Method
2.45 System.Text.Unicode.Utf8Utility.TranscodeToUtf8(wchar* int32 unsigned int8* int32 wchar*& unsigned int8*&)
1.9 System.Text.Unicode.Utf8Utility.GetPointerToFirstInvalidByte(unsigned int8* int32 int32& int32&)
1.708 System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar(wchar* int32 int64& int32&)
0.635 System.Text.ASCIIUtility.NarrowUtf16ToAscii(wchar* unsigned int8* unsigned int64)
0.492 System.Text.ASCIIUtility.WidenAsciiToUtf16(unsigned int8* wchar* unsigned int64)

@tannergooding
Copy link
Member

What is still left on the table in the HW intrinsics space are these methods that use intrinsics higher than SSE2.

This is unfortunate and I presume to be fallout from this: #24689 (comment)...

The methods in question above really just have optional paths that support higher level intrinsics and would still be fully aot compilable when limited to just SSE/SSE2. So in an ideal world, crossgen/r2r would AOT these for the SSE/SSE2 or software fallback paths and tiered jitting would later rejit them using the higher ISAs (such as BMI1/BMI2)...

@mjsabby
Copy link

mjsabby commented May 30, 2019

Would this change to the runtime also allow generation of an AVX2 SPC? Assuming you use a build of crossgen that will set the ISA.

@CarolEidt
Copy link

I added some comments to the broader issue here: https://github.com/dotnet/coreclr/issues/21603#issuecomment-497394620

MichalStrehovsky added a commit to MichalStrehovsky/coreclr that referenced this pull request Jun 3, 2019
This is a follow up to dotnet#24689 that lets us pregenerate all hardware intrinsics in CoreLib.

We ensures the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap.

There is cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up.

We only do this in CoreLib because user code could technically not guard intrinsic use in `IsSupported` checks and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience.
MichalStrehovsky added a commit to MichalStrehovsky/coreclr that referenced this pull request Jun 3, 2019
This is a follow up to dotnet#24689 that lets us pregenerate all hardware intrinsics in CoreLib.

We ensures the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap.

There is cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up.

We only do this in CoreLib because user code could technically not guard intrinsic use in `IsSupported` checks and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience.
sergiy-k pushed a commit that referenced this pull request Jun 13, 2019
* Allow pregenerating all HW intrinsics in CoreLib

This is a follow up to #24689 that lets us pregenerate all hardware intrinsics in CoreLib.

We ensures the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap.

There is cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up.

We only do this in CoreLib because user code could technically not guard intrinsic use in `IsSupported` checks and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience.
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
We currently don't precompile methods that use hardware intrinsics because we don't know the CPU that the generated code will run on. Jitting these methods slows down startup and accounts for 3% of startup time in PowerShell.
With this change, we're going to lift this restriction for CoreLib (the thing that matters for startup) and support generating HW intrinsics for our minimum supported target ISA (SSE/SSE2).

Commit migrated from dotnet/coreclr@d4fadf0
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Allow pregenerating all HW intrinsics in CoreLib

This is a follow up to dotnet/coreclr#24689 that lets us pregenerate all hardware intrinsics in CoreLib.

We ensures the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap.

There is cost in the form of an extra call and failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up.

We only do this in CoreLib because user code could technically not guard intrinsic use in `IsSupported` checks and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience.



Commit migrated from dotnet/coreclr@e73c8e6
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants