LAPACK testing failures on Sapphire Rapids (55 other COMPLEX) #4282

Closed
Flamefire opened this issue Nov 2, 2023 · 2 comments · Fixed by #4287

@Flamefire
Contributor

In 0.3.24 (the first release to actually support Sapphire Rapids, after #4002) I consistently see 55 "other" failures in the COMPLEX part of the LAPACK tests:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    1328283         0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        1325997         11      (0.001%)        0       (0.000%)
COMPLEX                 760371          160     (0.021%)        55      (0.007%)
COMPLEX16               771518          48      (0.006%)        0       (0.000%)

--> ALL PRECISIONS      4186169         219     (0.005%)        55      (0.001%)

This happens with GCC 11.2, 11.3, 12.2, and 12.3, and also (with backports of #4002) in versions since at least 0.3.20.
The same happens with TARGET=SAPPHIRERAPIDS without the patch, so 0.3.19 is likely affected as well.

A workaround seems to be building with TARGET=SKYLAKEX.

From the output I only found:

Testing COMPLEX Nonsymmetric-Eigenvalue-Problem-EIG/xeigtstc < nep.in > cnep.out
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
failing to pass the threshold: 110
Info Error: 55

@martin-frbg
Collaborator

martin-frbg commented Nov 2, 2023

The INFO value returned here appears to be the number of eigenvectors that failed to converge. As far as the BLAS kernels are concerned, SapphireRapids is currently treated as CooperLake, which is SkylakeX with added BFLOAT16 functions. The most likely source of numerical differences is therefore target-specific code generation based on the -march= flag passed to gcc in Makefile.x86_64.

@martin-frbg
Collaborator

Turns out the actual cause of the difference was that the SkylakeX CASUM "microkernel" was never enabled for CooperLake and SapphireRapids when support for either was added. (This also implies a loss of precision in the plain C fallback function of the CASUM kernel that I have not looked into yet, but no target configuration is supposed to use it.)

bartoldeman added a commit to bartoldeman/OpenBLAS that referenced this issue Nov 17, 2023
This kernel is only used on Skylake+ if the kernel with AVX512
intrinsics can't be used, but it used the variable x1 incorrectly
in the tail end of the loop: there, x1 still holds its initial
value instead of the position x points to.

This caused 55 "other error"s in the LAPACK tests
(OpenMathLib#4282)

This change makes casum.c as similar as possible to zasum.c,
because zasum.c does this correctly.
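
The class of bug described in this commit message can be illustrated with a minimal sketch. This is not the OpenBLAS `casum.c` source: the function names `asum_stale_tail` and `asum_fixed_tail`, the helper `absf`, and the unroll factor of 4 are all assumptions made for illustration. It only shows how a tail loop that reads through a never-advanced alias (`x1`) sums the wrong elements after the main loop has advanced `x`.

```c
#include <stddef.h>

/* Hypothetical illustration of the bug class; NOT the OpenBLAS casum.c source.
   x holds n complex values as interleaved (re, im) floats. */

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Buggy variant: the tail loop indexes from x1, which was never advanced. */
float asum_stale_tail(size_t n, const float *x)
{
    const float *x1 = x;      /* alias to the start of the array */
    size_t blocks = n / 4;    /* main loop: 4 complex values per pass (assumed) */
    size_t tail = n % 4;      /* leftover complex values */
    float sum = 0.0f;

    for (size_t b = 0; b < blocks; b++) {
        for (size_t j = 0; j < 8; j++)
            sum += absf(x[j]);
        x += 8;               /* x advances, x1 does not */
    }
    for (size_t k = 0; k < 2 * tail; k++)
        sum += absf(x1[k]);   /* BUG: re-reads the first elements of the array */
    return sum;
}

/* Fixed variant: the tail loop continues from where x now points. */
float asum_fixed_tail(size_t n, const float *x)
{
    size_t blocks = n / 4;
    size_t tail = n % 4;
    float sum = 0.0f;

    for (size_t b = 0; b < blocks; b++) {
        for (size_t j = 0; j < 8; j++)
            sum += absf(x[j]);
        x += 8;
    }
    for (size_t k = 0; k < 2 * tail; k++)
        sum += absf(x[k]);
    return sum;
}
```

In this sketch the two variants agree whenever `n` is a multiple of the unroll factor, because the tail loop never runs. They differ only when a remainder exists, so a bug of this kind would surface for some vector lengths and not others.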