Provide detection for SIMD features in autoconf and at runtime #125022
Comments
My general opinion about managing SIMD logic on the CPython side: #124951 (comment)
I do have concerns, and that's why I'd like to hear from people who 1) know about weird architectures and 2) deal with real-life scenarios. What I have in mind:
So, yes, I definitely have concerns about the differences. Using SIMD instructions could probably make local builds faster, or builds managed by distributions themselves, though we should be really careful. This is also the reason why I want to keep runtime detection, to avoid issues. The idea was to open a wider discussion on SIMD support itself. If you want we can move to Discourse, though I'm not sure whether it's better to keep it internal for now (the PR is just a PoC and it probably won't cover those cases we're worried about). I don't think we should add SIMD to every possible part of the library, only those that are critical enough IMO. And they should be carefully worked out. However, in order to investigate them (and test them using the CI), I think having an internal detection framework would at least be the first step (or maybe I'm wrong here?).
I've hardened the detection of AVX instructions. I've also learned that macOS may not like AVX-512 at all (or at least some register state won't be restored correctly upon context switching). So there are real-life issues that we should address. What I'll maybe do is first try to make a PoC for
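As a concrete illustration of the kind of runtime check being discussed, here is a minimal sketch using the GCC/Clang builtin `__builtin_cpu_supports` (assumption: GCC >= 4.8 or a recent Clang; on non-x86 targets everything reports unsupported). Recent toolchains also gate the AVX family on the OS having enabled the extended register state (via XGETBV), which is exactly the macOS/AVX-512 pitfall mentioned above:

```c
/* Hypothetical sketch of runtime x86 SIMD feature detection.
   Assumes a GCC- or Clang-compatible compiler; not CPython's actual API. */

/* Returns 1 if the running CPU (and OS) support AVX2, else 0. */
int simd_has_avx2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;   /* not an x86 build: no x86 feature detection */
#endif
}

/* Same idea for AVX-512 Foundation. */
int simd_has_avx512f(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return 0;
#endif
}
```

A hand-rolled CPUID implementation would need the OSXSAVE/XGETBV dance itself; the builtin hides that detail, which is one argument for leaning on compiler support where available.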
Hello, I'm one of the maintainers of pygame-ce, a Python C extension library that uses SIMD extensively to speed up pixel processing operations. We've had various bits of SIMD for a long time and use runtime checks to manage it. I'd like to share some information about our approach, in the hope it is helpful.

We SIMD-accelerate at the SSE2 and AVX2 levels. SSE2 is part of the baseline of x86_64, but we've also had no problems with it on our 32-bit builds. AVX2 is where isolation and runtime checking are much more important. Each SIMD level of a module has its own file and is compiled into its own object. See https://github.com/pygame-community/pygame-ce/blob/6e0e0c67c799c7cc1fa9c96a71598a7751ae2fba/src_c/simd_transform_avx2.c for an example. Our build config for this looks like so: https://github.com/pygame-community/pygame-ce/blob/6e0e0c67c799c7cc1fa9c96a71598a7751ae2fba/src_c/meson.build#L215-L254. In this example, our transform module is not compiled with any special flags, but it is linked with objects that expose functions that can be called to get SIMD acceleration. An example of how the dispatch looks: https://github.com/pygame-community/pygame-ce/blob/6e0e0c67c799c7cc1fa9c96a71598a7751ae2fba/src_c/transform.c#L2158-L2181. The SIMD compilation itself is very conservative: it will only compile the backend if the computer doing the build supports that backend, using compile-time macros to check that. I'm not sure if this is actually necessary.

About our SIMD code itself: we use intrinsics rather than hardcoded assembly or frameworks like https://github.com/google/highway. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html is a great reference on this. Intrinsics are better for us than hardcoded assembly because they are more portable, both between compilers and even between architectures. For example, we compile all of our "SSE2" code to NEON for ARM support using https://github.com/DLTcollab/sse2neon.
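The "separate object per SIMD level, plus a runtime dispatcher" layout described above can be sketched as follows (hypothetical names, not pygame-ce's actual API; the AVX2 stand-in here is plain C so the sketch stays self-contained, whereas in the real layout it would live in a file compiled with `-mavx2`):

```c
/* Sketch of runtime dispatch between a scalar fallback and a SIMD backend.
   Names and the runtime check are assumptions for illustration. */
#include <stddef.h>
#include <stdint.h>

/* Scalar fallback: always available. */
static void invert_pixels_scalar(uint8_t *px, size_t n)
{
    for (size_t i = 0; i < n; i++)
        px[i] = (uint8_t)(255 - px[i]);
}

/* Stand-in for the AVX2 backend; really it would be defined in a separate
   translation unit built with AVX2 flags and only declared here. */
static void invert_pixels_avx2(uint8_t *px, size_t n)
{
    invert_pixels_scalar(px, n);
}

/* Assumption: a GCC/Clang-style runtime check. */
static int have_avx2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_cpu_supports("avx2");
#else
    return 0;
#endif
}

/* Public entry point: callers never see the backend split. */
void invert_pixels(uint8_t *px, size_t n)
{
    if (have_avx2())
        invert_pixels_avx2(px, n);
    else
        invert_pixels_scalar(px, n);
}
```

Keeping the dispatch in an unflagged translation unit is what makes this safe: no AVX2 instruction can be emitted into code paths that run on machines without AVX2.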
Emscripten also allows compile-time translation of SIMD intrinsics to WebAssembly SIMD (https://emscripten.org/docs/porting/simd.html), although we do not take advantage of this currently.

For runtime detection, we rely on https://github.com/libsdl-org/SDL, which is very easy for us because our entire library is built on top of the functionality provided by SDL. If you'd like to check your PR against their implementation of runtime checks, the source seems to be here: https://github.com/libsdl-org/SDL/tree/main/src/cpuinfo

I think there could be value in exposing runtime SIMD support checking in the public C API for extension authors who aren't lucky enough to have an existing dependency to rely on for this. I've followed issues about Pillow and Pillow-SIMD where the authors say these are impossible to merge because they don't have the resources to figure out runtime SIMD checks. I don't think it would get a ton of usage, but it would be extremely valuable functionality for anyone who needs it.

In terms of CPython and SIMD, I'm not sure how much potential there is, but there may be cool things that could be done. Could 4 or 8 or 16 PyObjects get their reference counts changed at once? Could unboxed integers do arithmetic in parallel? Could the JIT decide to use more efficient templates because it knows AVX2 is around? Knowing the runtime SIMD level is an advantage for a JIT over an AOT compiler. But these are just my musings.
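To make the "several counters at once" musing concrete, here is a toy sketch with SSE2 intrinsics that bumps four 32-bit counters in a single vector add (purely illustrative: CPython refcounts are `Py_ssize_t` and live at scattered addresses, so they could not be batched this simply; a scalar fallback keeps the sketch portable off x86):

```c
/* Toy illustration of batching increments with SSE2 intrinsics.
   Not CPython code; just demonstrates the vector-add idea. */
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Add 1 to each of four contiguous 32-bit counters. */
void bump4(int32_t counts[4])
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)counts);
    v = _mm_add_epi32(v, _mm_set1_epi32(1));   /* four adds, one instruction */
    _mm_storeu_si128((__m128i *)counts, v);
#else
    for (int i = 0; i < 4; i++)                /* scalar fallback */
        counts[i] += 1;
#endif
}
```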
@Starbuck5 Thank you very much for all these insights!
It was definitely helpful.
Yup, that's what the blake2 authors did, so we'll probably do something similar. One thing is that it could lead to blowing up the code if we have many different levels... or if we decide to split up files according to architecture itself (like it is done in https://github.com/aklomp/base64/tree/22a3e9d421ee25b25bc6af7a02d4076c49dd323f/lib/arch for instance). I personally think it's nicer to split them by architecture and folders, but this leads to too many similar files (which is not very nice for maintaining the whole thing).
I think it's always better to be safe than sorry, unless we're absolutely sure that we won't cause
I also think it's better to use intrinsics for the same reasons as you cited but also because you don't need to know about ASM :') (and portability is key). The projects for translating intrinsics will definitely be helpful in the future if we were to eventually use SIMD instructions.
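The intrinsics-translation trick mentioned above usually boils down to a conditional include, so the same "SSE2" source builds on both x86 and ARM. A hedged sketch (assumes sse2neon.h is vendored next to the source; this is not CPython's current setup):

```c
/* Sketch: one SSE2-intrinsics source file, two architectures. */
#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>     /* native SSE2 intrinsics */
#elif defined(__ARM_NEON)
#include "sse2neon.h"      /* maps SSE2 intrinsics onto NEON */
#endif
/* ...code using _mm_* intrinsics works unchanged in both cases... */
```

The appeal is maintenance: one vectorized implementation instead of one per ISA, at the cost of vendoring the translation header.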
Thanks for this. I'll probably borrow some of their ideas but I don't think we can vendor this specific part of their library in CPython :( However it will definitely help in improving the detection algorithm (for now the algorithm is quite crude).
That was my original intent, though limited to the CPython internals. Python is great, but it is sometimes slow in some areas, and it'd be great if we could make it faster. We can always make Python faster by changing algorithms, but if we have the possibility of making it faster using CPU features, then we should probably try to benefit from them, at least in the important areas.
We could do it in some situations, but this will probably need to be synchronized with the ongoing work on deferred reference counting (or so I think).
@skirpichev do we have places where the arithmetic could be sped up using SIMD instructions? I think we either rely on mpdecimal or glibc directly for "advanced" arithmetic and I don't know whether we have a lot of places where we have additions in batches for instance.
I'm not a JIT expert :') so let's ask someone who knows about it: @brandtbucher
AFAIK, GMP doesn't utilize this too much so far.
For now, we precompile the JIT code from templates, so it depends on the machine we use to build the official binaries (but the JIT is not a default feature yet). As far as I know, we don't have any instruction detection in the JIT build script; in other words, whether SIMD is used depends on the compiler's decision.
I think arithmetic is not a common use case: people have to take care of their data layout to fit the parallelism requirements, otherwise it may be slower than the normal operation. IMHO, some string operations are more suitable for SIMD, like JSON or pickle processing. FYI https://github.com/simdjson/simdjson
On x86 the default compilation (at least on MSVC) goes up to SSE2, so there could be auto-vectorization opportunities. I actually investigated this a while ago and got the JIT templates to compile with an explicit AVX2 flag, and it barely changed the templates at all. I think it would have to be set up more intentionally: if unboxed integer arithmetic becomes a thing, there could be a uop that does 2/4/8 operations at once, and then the compiler would be able to do some auto-vectorization there. But I'm fully aware all my ideas about this and the refcount thing are just ideas, not anywhere close to a concrete proposal.
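For reference, the kind of loop that auto-vectorization rewards looks like this (a hedged sketch of hypothetical unboxed integer arithmetic, not an existing CPython uop; with `-O2 -mavx2`, GCC and Clang typically vectorize it because there is no loop-carried dependency):

```c
/* Sketch: elementwise addition over arrays of unboxed 64-bit integers.
   The `restrict` qualifiers tell the compiler the buffers don't alias,
   which is what lets it emit vector instructions for the loop body. */
#include <stddef.h>
#include <stdint.h>

void add_arrays(int64_t *restrict dst,
                const int64_t *restrict a,
                const int64_t *restrict b,
                size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   /* independent iterations: vectorizable */
}
```

A single boxed-integer add can't benefit from this; only a uop that operates on a whole batch gives the compiler a loop to vectorize, which is the "more intentional setup" referred to above.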
I think the popularity of Numba showcases demand for higher-performance number crunching. In terms of actionable SIMD items, an internal API for runtime detection seems like a great step to take. An external API could also be helpful to certain projects. I haven't looked in detail at the blake2 SIMD implementations, but if they only support x86 right now, it would be possible to bring those speedups to ARM using sse2neon. Personally I've never done any string operations with SIMD, but I agree with @Zheaoli that there is certainly potential to speed things up with it!
Hello, thanks for raising this. I think there is definitely some room for vector instructions in CPython. In the coming weeks I'll spend some time investigating and I'll be watching this space as well. |
Feature or enhancement
Proposal:
In #124951, there has been some initial discussion on improving the performance of base64 and possibly {bytearray,bytes,str}.translate using SIMD instructions.

More generally, if we want to use specific SIMD instructions, it'd be good to at least know whether the processor supports them. Note that we already support SIMD in blake2 when possible. As such, I suggest an internal framework for detecting SIMD features for other parts of the library, as well as detection of compiler flag support.

A single part of the code could benefit from some SIMD calls without having to link the entire library against the entire SIMD-128 or SIMD-256 instruction sets. Having a way to detect SIMD support should probably be independent of whether we actually use it (apart from the blake2 module), because it could only benefit the standard library if we were to include it.
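The compiler-flag half of the proposal could look something like the fragment below (a hedged sketch, assuming the autoconf-archive macro AX_CHECK_COMPILE_FLAG is available to configure.ac; the define name is hypothetical). It only answers "does the compiler accept -mavx2?", which is deliberately separate from the runtime question "does this CPU support AVX2?":

```m4
# Sketch of a configure-time probe for AVX2 compiler support.
# HAVE_MAVX2_FLAG is a hypothetical name for illustration.
AX_CHECK_COMPILE_FLAG([-mavx2],
  [AC_DEFINE([HAVE_MAVX2_FLAG], [1],
             [Define if the C compiler supports the -mavx2 flag])],
  [], [-Werror])
```

Build-time detection then gates whether the SIMD backends get compiled at all, while the runtime check decides whether they get called.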
The blake2 module SIMD support is fairly... complicated due to the wide variety of platforms that need to be supported and due to the mixture of many SIMD instructions. So I don't think I want to touch that part and make it work under the new interface (at least, not for now). While I can say that I'm confident in detecting features on "widely used" systems, there are definitely systems that I don't know so I'd appreciate any help on this topic.
Has this already been discussed elsewhere?
I don't want to open a Discourse thread for now since it's mainly something that will be used internally and not to be exposed to the world.
Links to previous discussion of this feature:
There has been some discussion on Discourse already about SIMD in general and whether to include it (e.g., https://discuss.python.org/t/standard-library-support-for-simd/35138), but the number of results containing "SIMD" or "AVX" is very small. Either this is because the topic is too advanced (detecting CPU features is NOT fun and there is a lack of documentation, the best resource being the Wikipedia page) or the feature request is too broad.
Linked PRs