Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[clang] Add intrin0.h header to mimic intrin0.h used by MSVC STL for clang-cl #75711

Merged
merged 14 commits into from
Mar 19, 2024

Conversation

MaxEW707
Copy link
Contributor

@MaxEW707 MaxEW707 commented Dec 16, 2023

Fixes #53520.

Description

Provide intrin0.h to be the minimal set of intrinsics that the MSVC STL requires.
The intrin0.h header matches the latest header provided by MSVC 1939 which does include some extra intrinsics that the MSVC STL does not use.

Inside BuiltinHeaders.def I kept the header description as intrin.h. If you want me to change those to intrin0.h for the moved intrinsics let me know.

This should now allow immintrin.h to be used with function targets for runtime cpu detection of simd instruction sets without worrying about the compile-time overhead from MSVC STL including intrin.h on clang.

I still need to figure out how to best update MSVC STL to detect for the presence of intrin0.h from clang and to use this header over intrin.h.

Testing

Built clang locally and ran the test suite. I still need to do a pass over the existing unit tests for the ms intrinsics to make sure there aren't any gaps. Wanted to get this PR up for discussion first.

Modified latest MSVC STL from github to point to intrin0.h for clang.

Wrote some test files that included MSVC STL headers that rely on intrinsics such as atomic, bit and vector. Built the unit tests against x86, arm, aarch64, and x64.

Benchmarks

The following include times are based on the x64 target with the modified headers in this PR.
These timings were done by using clang-cl.exe -ftime-trace and taking the wall time for parsing intrin.h and intrin0.h.

intrin.h takes ~897ms to parse.
intrin0.h takes ~1ms to parse.

If there is anything required or a different approach is preferred let me know. I would very much like to move this over the finish line so we can use function targets with clang-cl.

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added clang Clang issues not falling into any other category backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics labels Dec 16, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Dec 16, 2023

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-clang

Author: None (MaxEW707)

Changes

Fixes #53520.

Description

Provide intrin0.h to be the minimal set of intrinsics that the MSVC STL requires.
The intrin0.h header matches the latest header provided by MSVC 1939 which does include some extra intrinsics that the MSVC STL does not use.

Inside BuiltinHeaders.def I kept the header description as intrin.h. If you want me to change those to intrin0.h for the moved intrinsics let me know.

This should now allow immintrin.h to be used with function targets for runtime cpu detection of simd instruction sets without worrying about the compile-time overhead from MSVC STL including intrin.h on clang.

I still need to figure out how to best update MSVC STL to detect for the presence on intrin0.h from clang and to use this header over intrin.h.

Testing

Built clang locally and ran the test suite. I still need to do a pass over the existing unit tests for the ms intrinsics to make sure there aren't any gaps. Wanted to get this PR up for discussion first.

Modified latest MSVC STL from github to point to intrin0.h for clang.

Wrote some test files that included MSVC STL headers that rely on intrinsics such as atomic, bit and vector. Built the unit tests against x86, arm, aarch64, and x64.

Benchmarks

The following include times are based on the x64 target with the modified headers in this PR.
These timings were done by using clang-cl.exe -ftime-trace and taking the wall time for parsing intrin.h and intrin0.h.

intrin.h takes ~897ms to parse.
intrin0.h takes ~1ms to parse.

If there is anything required or a different approach is preferred let me know. I would very much like to move this over the finish line so we can use function targets with clang-cl.


Patch is 46.59 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75711.diff

4 Files Affected:

  • (modified) clang/lib/Headers/CMakeLists.txt (+1)
  • (modified) clang/lib/Headers/immintrin.h (+83-83)
  • (modified) clang/lib/Headers/intrin.h (+2-214)
  • (added) clang/lib/Headers/intrin0.h (+233)
diff --git a/clang/lib/Headers/CMakeLists.txt b/clang/lib/Headers/CMakeLists.txt
index f8fdd402777e48..e5ce039d5789ad 100644
--- a/clang/lib/Headers/CMakeLists.txt
+++ b/clang/lib/Headers/CMakeLists.txt
@@ -252,6 +252,7 @@ set(x86_files
   )
 
 set(windows_only_files
+  intrin0.h
   intrin.h
   vadefs.h
 )
diff --git a/clang/lib/Headers/immintrin.h b/clang/lib/Headers/immintrin.h
index 9bfe2fcdabdb3a..f57cd385455c22 100644
--- a/clang/lib/Headers/immintrin.h
+++ b/clang/lib/Headers/immintrin.h
@@ -16,62 +16,62 @@
 
 #include <x86gprintrin.h>
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__MMX__)
 #include <mmintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SSE__)
 #include <xmmintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SSE2__)
 #include <emmintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SSE3__)
 #include <pmmintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SSSE3__)
 #include <tmmintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__SSE4_2__) || defined(__SSE4_1__))
 #include <smmintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AES__) || defined(__PCLMUL__))
 #include <wmmintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__CLFLUSHOPT__)
 #include <clflushoptintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__CLWB__)
 #include <clwbintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX__)
 #include <avxintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX2__)
 #include <avx2intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__F16C__)
 #include <f16cintrin.h>
 #endif
@@ -79,217 +79,217 @@
 /* No feature check desired due to internal checks */
 #include <bmiintrin.h>
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__BMI2__)
 #include <bmi2intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__LZCNT__)
 #include <lzcntintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__POPCNT__)
 #include <popcntintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__FMA__)
 #include <fmaintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512F__)
 #include <avx512fintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512VL__)
 #include <avx512vlintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512BW__)
 #include <avx512bwintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512BITALG__)
 #include <avx512bitalgintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512CD__)
 #include <avx512cdintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512VPOPCNTDQ__)
 #include <avx512vpopcntdqintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512VPOPCNTDQ__))
 #include <avx512vpopcntdqvlintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512VNNI__)
 #include <avx512vnniintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512VNNI__))
 #include <avx512vlvnniintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVXVNNI__)
 #include <avxvnniintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512DQ__)
 #include <avx512dqintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512BITALG__))
 #include <avx512vlbitalgintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512BW__))
 #include <avx512vlbwintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512CD__))
 #include <avx512vlcdintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512DQ__))
 #include <avx512vldqintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512ER__)
 #include <avx512erintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512IFMA__)
 #include <avx512ifmaintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512IFMA__) && defined(__AVX512VL__))
 #include <avx512ifmavlintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVXIFMA__)
 #include <avxifmaintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512VBMI__)
 #include <avx512vbmiintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VBMI__) && defined(__AVX512VL__))
 #include <avx512vbmivlintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512VBMI2__)
 #include <avx512vbmi2intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VBMI2__) && defined(__AVX512VL__))
 #include <avx512vlvbmi2intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512PF__)
 #include <avx512pfintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512FP16__)
 #include <avx512fp16intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512FP16__))
 #include <avx512vlfp16intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512BF16__)
 #include <avx512bf16intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512BF16__))
 #include <avx512vlbf16intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__PKU__)
 #include <pkuintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__VPCLMULQDQ__)
 #include <vpclmulqdqintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__VAES__)
 #include <vaesintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__GFNI__)
 #include <gfniintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVXVNNIINT8__)
 #include <avxvnniint8intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVXNECONVERT__)
 #include <avxneconvertintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SHA512__)
 #include <sha512intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SM3__)
 #include <sm3intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SM4__)
 #include <sm4intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVXVNNIINT16__)
 #include <avxvnniint16intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__RDPID__)
 /// Reads the value of the IA32_TSC_AUX MSR (0xc0000103).
 ///
@@ -304,7 +304,7 @@ _rdpid_u32(void) {
 }
 #endif // __RDPID__
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__RDRND__)
 /// Returns a 16-bit hardware-generated random value.
 ///
@@ -367,7 +367,7 @@ _rdrand64_step(unsigned long long *__p)
 }
 #endif /* __RDRND__ */
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__FSGSBASE__)
 #ifdef __x86_64__
 /// Reads the FS base register.
@@ -481,7 +481,7 @@ _writegsbase_u64(unsigned long long __V)
 #endif
 #endif /* __FSGSBASE__ */
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__MOVBE__)
 
 /* The structs used below are to force the load/store to be unaligned. This
@@ -541,18 +541,18 @@ _storebe_i64(void * __P, long long __D) {
 #endif
 #endif /* __MOVBE */
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__RTM__)
 #include <rtmintrin.h>
 #include <xtestintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SHA__)
 #include <shaintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__FXSR__)
 #include <fxsrintrin.h>
 #endif
@@ -560,22 +560,22 @@ _storebe_i64(void * __P, long long __D) {
 /* No feature check desired due to internal MSC_VER checks */
 #include <xsaveintrin.h>
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__XSAVEOPT__)
 #include <xsaveoptintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__XSAVEC__)
 #include <xsavecintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__XSAVES__)
 #include <xsavesintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SHSTK__)
 #include <cetintrin.h>
 #endif
@@ -584,91 +584,91 @@ _storebe_i64(void * __P, long long __D) {
  * whereas others are also available at all times. */
 #include <adxintrin.h>
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__RDSEED__)
 #include <rdseedintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__WBNOINVD__)
 #include <wbnoinvdintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__CLDEMOTE__)
 #include <cldemoteintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__WAITPKG__)
 #include <waitpkgintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__MOVDIRI__) || defined(__MOVDIR64B__)
 #include <movdirintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__PCONFIG__)
 #include <pconfigintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SGX__)
 #include <sgxintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__PTWRITE__)
 #include <ptwriteintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__INVPCID__)
 #include <invpcidintrin.h>
 #endif
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AMX_FP16__)
 #include <amxfp16intrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__KL__) || defined(__WIDEKL__)
 #include <keylockerintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AMX_TILE__) || defined(__AMX_INT8__) || defined(__AMX_BF16__)
 #include <amxintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AMX_COMPLEX__)
 #include <amxcomplexintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__AVX512VP2INTERSECT__)
 #include <avx512vp2intersectintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     (defined(__AVX512VL__) && defined(__AVX512VP2INTERSECT__))
 #include <avx512vlvp2intersectintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__ENQCMD__)
 #include <enqcmdintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__SERIALIZE__)
 #include <serializeintrin.h>
 #endif
 
-#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
+#if !defined(__SCE__) || __has_feature(modules) ||      \
     defined(__TSXLDTRK__)
 #include <tsxldtrkintrin.h>
 #endif
diff --git a/clang/lib/Headers/intrin.h b/clang/lib/Headers/intrin.h
index 9ebaea9fee9421..2d9ebe2be9c01a 100644
--- a/clang/lib/Headers/intrin.h
+++ b/clang/lib/Headers/intrin.h
@@ -15,6 +15,8 @@
 #ifndef __INTRIN_H
 #define __INTRIN_H
 
+#include <intrin0.h>
+
 /* First include the standard intrinsics. */
 #if defined(__i386__) || defined(__x86_64__)
 #include <x86intrin.h>
@@ -131,9 +133,6 @@ void __writefsqword(unsigned long, unsigned __int64);
 void __writefsword(unsigned long, unsigned short);
 void __writemsr(unsigned long, unsigned __int64);
 void *_AddressOfReturnAddress(void);
-unsigned char _BitScanForward(unsigned long *_Index, unsigned long _Mask);
-unsigned char _BitScanReverse(unsigned long *_Index, unsigned long _Mask);
-unsigned char _bittest(long const *, long);
 unsigned char _bittestandcomplement(long *, long);
 unsigned char _bittestandreset(long *, long);
 unsigned char _bittestandset(long *, long);
@@ -151,7 +150,6 @@ long _InterlockedExchangeAdd_HLERelease(long volatile *, long);
 __int64 _InterlockedExchangeAdd64_HLEAcquire(__int64 volatile *, __int64);
 __int64 _InterlockedExchangeAdd64_HLERelease(__int64 volatile *, __int64);
 void _ReadBarrier(void);
-void _ReadWriteBarrier(void);
 unsigned int _rorx_u32(unsigned int, const unsigned int);
 int _sarx_i32(int, unsigned int);
 #if __STDC_HOSTED__
@@ -182,12 +180,6 @@ unsigned char __readgsbyte(unsigned long);
 unsigned long __readgsdword(unsigned long);
 unsigned __int64 __readgsqword(unsi...
[truncated]

Copy link

github-actions bot commented Dec 16, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@nico
Copy link
Contributor

nico commented Dec 18, 2023

Is intrin0.h a header that ships with MSVC's compiler, or with MS's STL? If the latter, shouldn't everything just work already?

@nico
Copy link
Contributor

nico commented Dec 18, 2023

Looks like it belongs to MSVC: https://github.com/microsoft/STL/blob/a8888806c6960f1687590ffd4244794c753aa819/stl/inc/yvals_core.h#L2040

So I think the right order here is:

  1. Add intrin0.h header
  2. Wait for a release
  3. Change MSSTL to include that in clang builds as well
  4. Wait for a release of that
  5. Make intrin.h the chonky header

If we land this as-is, it'll tank build time on Windows.

@nico
Copy link
Contributor

nico commented Dec 18, 2023

In other words, please undo the changes to clang/lib/Headers/immintrin.h for now.

@rnk
Copy link
Collaborator

rnk commented Dec 18, 2023

If we land this as-is, it'll tank build time on Windows.

While this is true, I don't think it's the right tradeoff for us to leave Intel intrinsics inaccessible for users who don't want to enable new microarchitectural features globally with command line flags. You may recall there are ODR issues, where enabling AVX globally results in AVX instructions being generated in inline functions, which then prevail at link time, and crash at runtime on chips where AVX is not available. We are currently prioritizing compile time above those needs, and that doesn't seem like the right tradeoff.

What I'd like to see is a pull request sent to https://github.com/microsoft/stl with some agreement about how to structure the ifdefs so we can use intrin0.h when it is available. We probably can't use __has_include(<intrin0.h>) because MSVC ships its own copy of intrin0.h, and they can't be distinguished. Perhaps we should come up with our own name for intrin0.h, so we can feature-detect it. I suggest intrin_msstl.h or intrin_minimal.h or something.

Then, when we land this PR, the net effect is that compile time with MSVC slows down until the next release of the Microsoft STL, but users have access to Intel intrinsics as soon as possible. Does that seem reasonable?

@MaxEW707
Copy link
Contributor Author

MaxEW707 commented Dec 18, 2023

Is intrin0.h a header that ships with MSVC's compiler, or with MS's STL? If the latter, shouldn't everything just work already?

It ships with MSVC.

intrin0.h from MSVC doesn't work currently due to some declarations not matching declarations in clang's own intrin.h.
For example _addcary_u64. Which I now realize I am missing since that is inside adxintrin.h which I mistakenly assumed only included adx instriniscs and not adc intrinsics as well.

What I'd like to see is a pull request sent to https://github.com/microsoft/stl with some agreement about how to structure the ifdefs so we can use intrin0.h when it is available.

Sounds good I'll do that.

We probably can't use __has_include(<intrin0.h>) because MSVC ships its own copy of intrin0.h, and they can't be distinguished.

I was thinking I could use __has_include_next. If __has_include_next(<intrin0.h>) is true then we know clang is providing its own intrin0.h.

Another option I had was to preemptively get a PR up to MSVC STL with a __clang_major__ check for clang 18 assuming this PR gets in before the release window.

A custom name like intrin_msstl.h works for me as well.

@rnk
Copy link
Collaborator

rnk commented Dec 18, 2023

What I'd like to see is a pull request sent to https://github.com/microsoft/stl with some agreement about how to structure the ifdefs so we can use intrin0.h when it is available.

Sounds good I'll do that.

Thanks!

I was thinking I could use __has_include_next. If __has_include_next(<intrin0.h>) is true then we know clang is providing its own intrin0.h.

Another option I had was to preemptively get a PR up to MSVC STL with a __clang_major__ check for clang 18 assuming this PR gets in before the release window.

A custom name like intrin_msstl.h works for me as well.

I think we can be flexible, and the important thing is that our solution should work for them.

@MaxEW707
Copy link
Contributor Author

MaxEW707 commented Dec 19, 2023

MSVC STL requires the x64 adc intel intrinsics. I moved those to a separate file adcintrin.h that can be included from immintrin.h and intrin0.h for x64.

I also made a table here of all the intrinsics available with MSVC's intrin0.h, which intrinsics are used by MSVC STL, which are used by MSVC STL when compiled under clang and which intrinsics are provided as builtins from clang.

The last table at the end shows the intrinsics that aren't implemented in clang but MSVC STL has workarounds using the clang intrinsics instead. We don't have to worry about supporting those at the moment.

_CountLeadingZeros[64] is special in that we don't support it for arm but the MSVC STL uses the clang intrinsics instead for 32-bit and 64-bit arm. Even though clang provides _CountLeadingZeros[64] on 64-bit arm MSVC STL doesn't use it.

_ReadWriteBarrier isn't supported on arm by clang. This godbolt appears to confirm that.
On arm this is only used inside atomic_signal_fence in MSVC STL. I am not too worried about this since I believe Microsoft is dropping 32-bit ARM support in a future Windows 11 update if my memory is correct.

@RKSimon
Copy link
Collaborator

RKSimon commented Dec 19, 2023

@MaxEW707 If you pull out the adcintrin.h change into a separate PR we can fast track it as it should be done anyhow

@MaxEW707
Copy link
Contributor Author

@MaxEW707 If you pull out the adcintrin.h change into a separate PR we can fast track it as it should be done anyhow

Done :). Here is the link since I can't add reviewers #75992.

phoebewang pushed a commit that referenced this pull request Dec 21, 2023
See #75711 for discussion.

As a summary from the PR above, `<adxintrin.h>` includes adc intrinsics
and adx intrinsics.
To support MSVC STL we need to expose the adc intrinsics inside the
currently proposed `<intrin0.h>` header.
Move the processor agnostic adc intrinsics into a separate file that can
be included from `<immintrin.h>` and the currently proposed
`<intrin0.h>`.
@nico
Copy link
Contributor

nico commented Feb 22, 2024

Wow, that clang/lib/Headers/yvals_core.h hacks is gross, I love it :D

lgtm, and good to land provided you do a quick re-check to confirm that this doesn't tank compile times now with that part included. Thanks!

Copy link
Collaborator

@rnk rnk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are risks to shadowing yvals_core.h, but @CaseyCarter is at least aware of it here.

Once Clang ships its own intrin0.h header, the MSVC STL can adjust their yvals_core.h code to use it, and then we can delete this shadow header. This gives us the best outcome: Fast compiles today, target attribute support in MSVC-mode, and minimal tech debt in the long run. Sounds good to me, thanks for finding a solution!

*===-----------------------------------------------------------------------===
*/

/* Only include this if we are aiming for MSVC compatibility. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newer lib/Headers changes prefer //-style comments. Do we need /* */?

Copy link
Contributor Author

@MaxEW707 MaxEW707 Feb 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need /* */.

I was just copying from the other files in this folder for the copyright header and clang-format didn't complain.
I'll give the style guide a read and get this fixed :).

This file is inside msvc stl so we don't need to worry about C89 compat on microsoft platforms.

@MaskRay
Copy link
Member

MaskRay commented Feb 24, 2024

I am not familiar with the MSVC ecosystem, but the change looks reasonable. Thanks!

@MaxEW707
Copy link
Contributor Author

I will need someone to commit on my behalf since I do not have write access.

lgtm, and good to land provided you do a quick re-check to confirm that this doesn't tank compile times now with that part included. Thanks!

The include times below are done with the following source file using a locally release build of clang, -ftime-trace, and msvc stl shipped with MSVC 1939.

#include <map>
#include <vector>
#include <memory>

clang-cl.exe without yvals_core.h shadowing takes ~1,344 ms in the frontend. intrin.h took ~955 ms to parse.

clang-cl.exe with yvals_core.h shadowing takes ~395 ms in the frontend. intrin0.h took ~1.4 ms to parse.

Copy link
Collaborator

@AaronBallman AaronBallman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the timing information! The yvals_core.h hack is actually a really clever way to handle this. I'm a bit worried that we don't have any test coverage for this file to begin with, so it's a bit hard to validate that the changes are correct, but we're early enough in a release cycle that I think we'll get feedback with plenty of time to address unintentional fallout.

The changes should come with a release note in clang/docs/ReleaseNotes.rst so that users know about the compile time performance improvements. Otherwise, this LGTM! I can land the changes on your behalf once there's a release note.

@MaxEW707
Copy link
Contributor Author

MaxEW707 commented Mar 4, 2024

Just want to say thanks for everyone taking the time to review the PR and providing feedback :).
Looking forward to using this in the next release of clang-cl.

I'm a bit worried that we don't have any test coverage for this file to begin with, so it's a bit hard to validate that the changes are correct, but we're early enough in a release cycle that I think we'll get feedback with plenty of time to address unintentional fallout.

I'll look into adding unit tests next week in a separate PR. If not it will be closer towards the end of March due to personal things.
If any issues pop up feel free to @ me. I don't intend to commit and dash :).

The changes should come with a release note in clang/docs/ReleaseNotes.rst so that users know about the compile time performance improvements.

Done. Feel free to reword the release notes as necessary. I believe I landed on something that is informative but succinct.

Copy link
Collaborator

@AaronBallman AaronBallman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

Copy link
Member

@CaseyCarter CaseyCarter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for devising this clever way to improve MSVCSTL throughput!

@MaxEW707
Copy link
Contributor Author

MaxEW707 commented Mar 7, 2024

Friendly reminder that I require someone to commit on my behalf since I do not have write access :).

@bebuch
Copy link

bebuch commented Mar 19, 2024

@AaronBallman Can you merge this please? We are eagerly waiting for this to end up in a release! ;-)

@AaronBallman AaronBallman merged commit afec08e into llvm:main Mar 19, 2024
5 checks passed
@AaronBallman
Copy link
Collaborator

@AaronBallman Can you merge this please? We are eagerly waiting for this to end up in a release! ;-)

Thank you for the ping, this fell off my radar!

chencha3 pushed a commit to chencha3/llvm-project that referenced this pull request Mar 23, 2024
…for clang-cl (llvm#75711)

Fixes llvm#53520.

#### Description ####

Provide `intrin0.h` to be the minimal set of intrinsics that the MSVC
STL requires.
The `intrin0.h` header matches the latest header provided by MSVC 1939
which does include some extra intrinsics that the MSVC STL does not use.

Inside `BuiltinHeaders.def` I kept the header description as `intrin.h`.
If you want me to change those to `intrin0.h` for the moved intrinsics
let me know.

This should now allow `immintrin.h` to be used with function targets for
runtime cpu detection of simd instruction sets without worrying about
the compile-time overhead from MSVC STL including `intrin.h` on clang.

I still need to figure out how to best update MSVC STL to detect for the
presence of `intrin0.h` from clang and to use this header over
`intrin.h`.

#### Testing ####

Built clang locally and ran the test suite. I still need to do a pass
over the existing unit tests for the ms intrinsics to make sure there
aren't any gaps. Wanted to get this PR up for discussion first.

Modified latest MSVC STL from github to point to `intrin0.h` for clang.

Wrote some test files that included MSVC STL headers that rely on
intrinsics such as `atomic`, `bit` and `vector`. Built the unit tests
against x86, arm, aarch64, and x64.

#### Benchmarks ####

The following include times are based on the x64 target with the
modified headers in this PR.
These timings were done by using `clang-cl.exe -ftime-trace` and taking
the wall time for parsing `intrin.h` and `intrin0.h`.

`intrin.h` takes ~897ms to parse.
`intrin0.h` takes ~1ms to parse.

If there is anything required or a different approach is preferred let
me know. I would very much like to move this over the finish line so we
can use function targets with clang-cl.
@MaxEW707 MaxEW707 deleted the mew/intrin0-clangcl branch April 2, 2024 02:35
qihangkong pushed a commit to rvgpu/rvgpu-llvm that referenced this pull request Apr 23, 2024
See llvm/llvm-project#75711 for discussion.

As a summary from the PR above, `<adxintrin.h>` includes adc intrinsics
and adx intrinsics.
To support MSVC STL we need to expose the adc intrinsics inside the
currently proposed `<intrin0.h>` header.
Move the processor agnostic adc intrinsics into a separate file that can
be included from `<immintrin.h>` and the currently proposed
`<intrin0.h>`.
SquallATF pushed a commit to SquallATF/llvm-project that referenced this pull request Jun 30, 2024
…for clang-cl (llvm#75711)

Fixes llvm#53520.

#### Description ####

Provide `intrin0.h` to be the minimal set of intrinsics that the MSVC
STL requires.
The `intrin0.h` header matches the latest header provided by MSVC 1939
which does include some extra intrinsics that the MSVC STL does not use.

Inside `BuiltinHeaders.def` I kept the header description as `intrin.h`.
If you want me to change those to `intrin0.h` for the moved intrinsics
let me know.

This should now allow `immintrin.h` to be used with function targets for
runtime cpu detection of simd instruction sets without worrying about
the compile-time overhead from MSVC STL including `intrin.h` on clang.

I still need to figure out how to best update MSVC STL to detect for the
presence of `intrin0.h` from clang and to use this header over
`intrin.h`.

#### Testing ####

Built clang locally and ran the test suite. I still need to do a pass
over the existing unit tests for the ms intrinsics to make sure there
aren't any gaps. Wanted to get this PR up for discussion first.

Modified latest MSVC STL from github to point to `intrin0.h` for clang.

Wrote some test files that included MSVC STL headers that rely on
intrinsics such as `atomic`, `bit` and `vector`. Built the unit tests
against x86, arm, aarch64, and x64.

#### Benchmarks ####

The following include times are based on the x64 target with the
modified headers in this PR.
These timings were done by using `clang-cl.exe -ftime-trace` and taking
the wall time for parsing `intrin.h` and `intrin0.h`.

`intrin.h` takes ~897ms to parse.
`intrin0.h` takes ~1ms to parse.

If there is anything required or a different approach is preferred let
me know. I would very much like to move this over the finish line so we
can use function targets with clang-cl.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clang-cl can not support function targets
10 participants