LAPACK testing failures on Sapphire Rapids (55 other COMPLEX) #4282

Flamefire · 2023-11-02T14:09:47Z

In 0.3.24 (the first release to actually support Sapphire Rapids, after #4002) I consistently see 55 "other" failures in the COMPLEX part of the LAPACK tests:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    1328283         0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        1325997         11      (0.001%)        0       (0.000%)
COMPLEX                 760371          160     (0.021%)        55      (0.007%)
COMPLEX16               771518          48      (0.006%)        0       (0.000%)

--> ALL PRECISIONS      4186169         219     (0.005%)        55      (0.001%)

This happens with GCC 11.2, 11.3, 12.2 and 12.3, and (with backports of #4002) since at least 0.3.20.
The same happens for TARGET=SAPPHIRERAPIDS without the patch, so it is likely also an issue in 0.3.19.

A workaround seems to be building with TARGET=SKYLAKEX.

From the output I only found:

Testing COMPLEX Nonsymmetric-Eigenvalue-Problem-EIG/xeigtstc < nep.in > cnep.out
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
CCHKHS: CHSEIN(R) returned INFO= 10.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 2.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CCHKHS: CHSEIN(R) returned INFO= 7.
CHS: 22 out of 1764 tests failed to pass the threshold
failing to pass the threshold: 110
Info Error: 55


martin-frbg · 2023-11-02T14:48:36Z

The INFO value returned here appears to be the number of eigenvectors that failed to converge. As far as the BLAS kernels are concerned, SapphireRapids is currently treated as CooperLake, which is SkylakeX with added BFLOAT16 functions. So the most likely source of numerical differences is target-specific code generation based on the -march= flag supplied to gcc in Makefile.x86_64.

martin-frbg · 2023-11-04T22:25:47Z

Turns out the actual cause of the difference was the failure to enable the SkylakeX CASUM "microkernel" for CooperLake and SapphireRapids when support for either was added. (This also implies a loss of precision in the plain C fallback function of the CASUM kernel that I have not looked into yet, but no target configuration is supposed to use it.)

This kernel is only used on Skylake and newer targets if the kernel with AVX512 intrinsics can't be used. It used the variable x1 incorrectly in the tail end of the loop: at that point x1 is still at its initial value instead of where x points to. This caused the 55 "other error"s in the LAPACK tests (OpenMathLib#4282). This change makes casum.c as similar as possible to zasum.c, because zasum.c does this correctly.
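To make the failure mode concrete, here is a minimal sketch of the pointer pattern described above. It is not the actual OpenBLAS casum.c: the function names, the unroll factor of 4 complex elements, and the scalar inner loop are all assumptions made for illustration. It only shows how a tail loop that reads through a stale pointer (x1) re-sums the first elements of the vector instead of the remaining ones.

```c
#include <math.h>
#include <stddef.h>

/* Reference: sum of |re| + |im| over n interleaved single-precision
 * complex values (2*n floats). */
static float casum_ref(size_t n, const float *x) {
    float sum = 0.0f;
    for (size_t i = 0; i < 2 * n; i++) sum += fabsf(x[i]);
    return sum;
}

/* Hypothetical buggy variant: x1 is captured before the main loop and
 * never advanced, so the tail loop reads from the start of the vector. */
static float casum_buggy(size_t n, const float *x) {
    const float *x1 = x;          /* stale copy of the initial pointer */
    size_t blocks = n / 4;        /* assumed unroll: 4 complex values */
    size_t tail = n % 4;
    float sum = 0.0f;
    for (size_t b = 0; b < blocks; b++) {
        for (int k = 0; k < 8; k++) sum += fabsf(x[k]);
        x += 8;                   /* only x moves forward */
    }
    for (size_t i = 0; i < 2 * tail; i++) sum += fabsf(x1[i]);  /* wrong */
    return sum;
}

/* Fixed variant: the tail loop continues from where x points to. */
static float casum_fixed(size_t n, const float *x) {
    size_t blocks = n / 4;
    size_t tail = n % 4;
    float sum = 0.0f;
    for (size_t b = 0; b < blocks; b++) {
        for (int k = 0; k < 8; k++) sum += fabsf(x[k]);
        x += 8;
    }
    for (size_t i = 0; i < 2 * tail; i++) sum += fabsf(x[i]);
    return sum;
}
```

In this sketch the two variants agree whenever n is a multiple of the unroll factor, or smaller than it, and differ otherwise. That kind of length dependence would be consistent with only a small fraction of the LAPACK tests being affected.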

Flamefire mentioned this issue Nov 2, 2023

OpenBLAS build fails on Sapphire Rapids due to "Too many LAPACK tests failed" easybuilders/easybuild-easyconfigs#19021

Closed

martin-frbg mentioned this issue Nov 4, 2023

Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well #4287

Merged

martin-frbg closed this as completed in #4287 Nov 4, 2023

Flamefire mentioned this issue Nov 6, 2023

fix OpenBLAS 0.3.20+ on newer Intel CPUs easybuilders/easybuild-easyconfigs#19159

Merged

bartoldeman mentioned this issue Nov 17, 2023

Fix casum fallback kernel for x86_64 #4326

Merged
