-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
S/390: Use getauxval for detecting VXE2 to fix #560 #561
Conversation
This looks like a brilliant alternative. However, to be able to accept this we would need to have the confirmation that it is applicable to all other supported target architectures (x86, aarch64/32, ppc64, risc-v, ...) and all supported platforms (here I'm more concerned the availability of @shibatch Is there anything fundamental preventing us from using this approach for feature detection? This looks ideal since it does not require compiling code snippets, am I missing something? |
This would be fine, but the new code depends on the ELF binary loader and
glibc, right?
From a portability point of view, adding a mutex with pthreads to the
original code may be another option.
…On Wed, Jul 10, 2024 at 12:02 AM Pierre Blanchard ***@***.***> wrote:
This looks like a brilliant alternative. However, to be able to accept
this we would need to have the confirmation that it is applicable to all
other supported target architectures (x86, aarch64/32, ppc64, risc-v, ...)
and all supported platforms (here I'm more concerned the availability of
sys/auxv.h on Windows).
@shibatch <https://github.com/shibatch> Is there anything fundamental
preventing us from using this approach for feature detection? This looks
ideal since it does not require compiling code snippets, am I missing
something?
—
Reply to this email directly, view it on GitHub
<#561 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AD2A7F2RUA2ZPY4ZGQUIKTTZLP3QHAVCNFSM6AAAAABKLSEZ46VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJXHE3DCMJSGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Yes. The HWCAP bits are placed on the stack by the linux kernel before giving control to the binary. getauxval is a glibc function available since 2.16. musl libc seems to have it as well, but newlib doesn't.
I'm not sure whether this can be done easily. Right now sigjmp is written to in the normal path of execution, but it is used from a signal handler which might get executed on a different thread even. Perhaps an atomic variable set in the signal handler might be a solution here?! If the variable hasn't been updated by the signal handler cpuSupportsExt could update it to a positive result. Nevertheless, I think it would be good to avoid signal handlers in the lib as much as possible. This could also interact badly with signal handlers installed in applications using the library. x86 uses the cpuid instruction to check for the availability of features. Targets which do not have unprivileged instructions like that, usually use HWCAPs to get this information carried over from the kernel. So I think this is a pretty common mechanism. |
If, as @blapie says, we are going to make similar changes to the code for all architectures, then we will have to think about how to do it for windows and macos. For me, I think it would be easier to add mutex to the existing code. I don't have a particular preference as long as you modify it so that it works properly. I also don't think that using signal and mutex will cause much of a problem. Here is a PoC code with mutex.
|
I think in general it would be good to avoid using signal handlers as much as possible here and only use it as last resort. There are several situations where this could break:
Adding the mutex makes the situation a bit better, but I think it would be good to revisit for every supported architecture whether an alternate (non-signal) solution could be implemented. I'll try to help as much as I can, but I think the s390x specific PR to switch to getauxval could be merged independently. For s390x we will never need to fall back to the signal handler approach. |
As for me, I am not particularly against the idea, so I think it is fine as it is. |
Hello, Let's try to merge this if it's indeed solving a problem for you, but we need to extend to all other architectures quickly, as I'm worried about the cost of maintaining different detection mechanism for different architectures. I'm a little less concerned about the fact that we have to use separate approaches for different OS-es (with or without glibc), there are fewer of them and it makes more sense to use different mechanisms in that case. |
Thanks. I've also tried to integrate the mutex approach and it seems to work as well, although as discussed above, probably is not as robust as checking HWCAPs. But I think also the mutex would need some special handling on Windows. On Windows the static initializer of the mutex does not seem to work. There are alternatives though, but I don't know what is best here.
|
Ok, thanks for trying that! Could you please upload a separate PR with the mutex in place for macos at least and guidance for windows maybe? Other architectures than x86 and aarch64 imply that we are under Linux so the current PR is probably fine as is. For x86 and aarch64 though, we should probably stick to mutex or we will need to find a way to switch detection mechanism. |
I've opened a draft PR for that now. The static mutex initializer probably only works on Linux right now. Didn't test Windows and MacOS, but some quick search indicates that this needs to be done differently there. Don't know about MingGW. |
Unfortunately, cannot really test this on an actual machine, so will have to trust this is actually fixing the issue. |
Change the detection mechanism for the VXE2 feature from signal handling to getauxval. This is a bit simpler and also helps in multi-threaded uses of the library as in PyTorch:
pytorch/pytorch#128503
In particular it prevents the sigjmp file scope variable from being accessed from multiple threads. This used to cause crashes with the way PyTorch uses Sleef in combination with OpenMP. In that case cpuSupportsExt is executed simultaneously by multiple threads what led to parallel writers to sigjmp. This used to be no problem running on IBM z15 and newer, where there is actually VXE2 available. However, running on IBM z14 the signal handler gets executed what actually makes use of the corrupted sigjmp data structure to get back into cpuSupportsExt.