
Add CI on GitHub Actions for x86_64, aarch64, arm, ppc64 and s390x #476

Merged — 12 commits merged into shibatch:master from upstream-ci on Nov 10, 2023

Conversation

luhenry
Contributor

@luhenry luhenry commented Nov 3, 2023

As discussed offline, this PR adds some CI on GitHub Actions. This is expected to facilitate contributions by smoke-testing any change before it is integrated, and to ease the maintainers' work by making sure a change will not completely break any of the supported architectures (aarch64, arm, ppc64, s390x, and x86_64).

@luhenry
Contributor Author

luhenry commented Nov 3, 2023

Some known limitations:

  1. Build instability, with bin/testervecabi sometimes failing to build. That's worked around by hitting "Re-run failed jobs" enough times.
  2. No tests are run for arm, s390x, and ppc64. qemu_cpu needs to be specified to enable the correct architecture-specific extensions (e.g., the right vector extensions for s390x). I suggest we do that in a follow-up PR so that I can focus on integrating the RISC-V support as discussed offline.
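A follow-up along those lines might look something like this workflow fragment. The runner image, CPU model names, and build-directory layout here are assumptions for illustration, not tested values:

```yaml
# Hypothetical sketch: run cross-built testers under qemu-user with an explicit
# CPU model so the required vector extensions are enabled.
jobs:
  test-qemu:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - arch: ppc64le
            qemu_cpu: power9   # enables VSX
          - arch: s390x
            qemu_cpu: max      # enables the vector facility
    steps:
      - name: Run tests under QEMU
        run: QEMU_CPU=${{ matrix.qemu_cpu }} ctest --test-dir _build-${{ matrix.arch }}
```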

@luhenry
Contributor Author

luhenry commented Nov 3, 2023

The GitHub Actions run is available at https://github.com/rivosinc/sleef/actions/runs/6745836169

@blapie
Collaborator

blapie commented Nov 3, 2023

Thank you very much for this PR, this is very helpful!

build instability with bin/testervecabi sometimes failing to build. That's fixed by hitting "Re-run failed jobs" enough times

I cannot wrap my head around these random failures. I cannot reproduce them locally, but my environment differs slightly.
What jobs/steps were failing? Was it always the native build?
There are quite a number of options that were passed to cmake in the original Jenkins jobs; see the Jenkinsfile for x86 with gcc or clang.
These probably were hiding the bug you are seeing. Do you get similar failures when matching the options in the Jenkinsfile:

-DSLEEF_SHOW_CONFIG=1 -DENFORCE_TESTER3=TRUE -DBUILD_SCALAR_LIB=TRUE \
-DBUILD_QUAD=TRUE -DBUILD_DFT=TRUE -DBUILD_INLINE_HEADERS=TRUE -DBUILD_SHARED_LIBS=FALSE \
-DDISABLE_FMA4=TRUE -DENFORCE_SSE2=TRUE -DENFORCE_SSE4=TRUE \
-DENFORCE_AVX=TRUE -DENFORCE_AVX2=TRUE -DENFORCE_AVX512F=TRUE 

for gcc Jenkins also has -DENABLE_CXX=TRUE .

Have you tried setting the number of threads to 1 in cmake --build? Although I'd be surprised if ninja failed to build in parallel.
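For reference, a standalone command combining those Jenkins options with a single-threaded build might look like the following; the source/build paths and the Ninja generator are assumptions:

```sh
cmake -S sleef -B build -GNinja \
  -DSLEEF_SHOW_CONFIG=1 -DENFORCE_TESTER3=TRUE -DBUILD_SCALAR_LIB=TRUE \
  -DBUILD_QUAD=TRUE -DBUILD_DFT=TRUE -DBUILD_INLINE_HEADERS=TRUE \
  -DBUILD_SHARED_LIBS=FALSE -DDISABLE_FMA4=TRUE -DENFORCE_SSE2=TRUE \
  -DENFORCE_SSE4=TRUE -DENFORCE_AVX=TRUE -DENFORCE_AVX2=TRUE \
  -DENFORCE_AVX512F=TRUE
cmake --build build -j 1   # single-threaded, to rule out a parallel-build race
```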

I suggest we do that in a follow-up PR so that I can focus on integrating the RISC-V support as discussed off-line.

Absolutely, this is still very minimalistic in terms of cmake configuration as well. But it is a great start to unblock pending PRs.

@luhenry
Contributor Author

luhenry commented Nov 4, 2023

The issue with linking seems to be a variation in the symbol names. Take Sleef_tanhf_u10 for example: in a functioning build, the symbol name is _ZGVdN8v_Sleef_tanhf_u10, while in the failing build it is _ZGVeN8v_Sleef_tanhf_u10 (notice the _ZGVd vs _ZGVe). The compilation commands and compiler versions are exactly the same, so I don't know where the variation comes from.
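For what it's worth, those prefixes follow the x86 vector function ABI mangling (`_ZGV`, an ISA letter, a mask flag, the SIMD length, then parameter kinds), where `d` corresponds to AVX2 and `e` to AVX-512. A tiny illustrative parser, not part of sleef and not exhaustive:

```python
# Minimal, illustrative parser for x86 vector-ABI mangled names of the form
# _ZGV<isa><mask><simdlen><params>_<base-name>.
ISA = {"b": "SSE", "c": "AVX", "d": "AVX2", "e": "AVX-512"}

def parse_vector_abi(symbol):
    assert symbol.startswith("_ZGV"), "not a vector-ABI symbol"
    isa = ISA[symbol[4]]            # ISA letter: 'd' = AVX2, 'e' = AVX-512
    masked = symbol[5] == "M"       # 'M' = masked variant, 'N' = unmasked
    rest = symbol[6:]
    i = 0
    while rest[i].isdigit():        # SIMD length (number of lanes)
        i += 1
    simdlen = int(rest[:i])
    params, _, base = rest[i:].partition("_")
    return isa, masked, simdlen, params, base

print(parse_vector_abi("_ZGVdN8v_Sleef_tanhf_u10"))
# ('AVX2', False, 8, 'v', 'Sleef_tanhf_u10')
```

This makes the failure mode visible at a glance: the two builds produced the same base name but different ISA letters, so the link fails.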

@luhenry
Contributor Author

luhenry commented Nov 4, 2023

The issue seems to be with how testervecabi.c.o is compiled across various generations of GitHub runners. Given that testervecabi.c.o is compiled with -march=native, different generations of GitHub runners will expose different CPU features. That leads to different mangled names for the symbols, while the libsleef.a object files have not been compiled with the same -march=native.

https://github.com/rivosinc/sleef/actions/runs/6753813009 is a good example. All jobs that run on Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz which supports AVX-512 fail, but all jobs that run on AMD EPYC 7763 64-Core Processor which doesn't support AVX-512 succeed.

Is it then a requirement to compile testervecabi with -march=native? I'm not too familiar with the feature-detection mechanism. The easiest fix would be to replace -march=native with something more deterministic, or even remove it (which I'm trying in this PR; let me know).
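One deterministic alternative could be sketched as a CMake fragment like the one below; the property call and file name usage are illustrative, not sleef's actual build code:

```cmake
# Hypothetical: pin the tester's ISA to a fixed baseline instead of
# -march=native, so the mangled vector-ABI names no longer depend on
# which runner generation the job lands on.
set_source_files_properties(testervecabi.c PROPERTIES
  COMPILE_OPTIONS "-mavx2;-mfma")
```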

@luhenry
Contributor Author

luhenry commented Nov 4, 2023

That's the latest run without -march=native for testervecabi https://github.com/rivosinc/sleef/actions/runs/6754112260, everything passes on first attempt.

@luhenry luhenry mentioned this pull request Nov 4, 2023
@luhenry
Contributor Author

luhenry commented Nov 5, 2023

I've enabled testing on QEMU for ppc64 and arm. I couldn't figure out the magic combination for s390x just yet. I'll need some more domain-specific expertise on s390x to enable the right options; I don't understand what they are.
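The kind of invocation involved looks roughly like the following; the binary names and CPU models here are illustrative, not the exact values used:

```sh
# arm and ppc64 work once an explicit CPU model is given, e.g.:
QEMU_CPU=max     qemu-arm     ./bin/tester2sp
QEMU_CPU=power9  qemu-ppc64le ./bin/tester2dp
# s390x still needs the right model/feature combination to be identified.
```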

@blapie
Collaborator

blapie commented Nov 6, 2023

Hi! Very good point. I would like to make sure we are not losing the point of testervecabi by doing this; -march=native was introduced at the same time as #404, which makes me think it is a feature.
Is there a way to make sure the library and tests are built with the same generation of runners (possibly prescribing which generation we intend to test on)? Alternatively, could this be done in a single job?

We'll take over for s390x, this already provides great coverage, thank you!

@luhenry
Contributor Author

luhenry commented Nov 6, 2023

Is there a way to make sure the library and tests are built with the same generation of runners (possibly prescribing which generation we intend to test on)? Alternatively, could this be done in a single job?

It is all already built on the same runner, in the same job. The issue is that one object is compiled without -mavx512f while the other is compiled with -march=native. So it does indeed seem to be an ABI issue that this PR uncovered but is otherwise unrelated to.

@blapie
Collaborator

blapie commented Nov 7, 2023

Right, got you! Thanks, we can look at this outside this PR.
It is a fairly recent tester; I wonder if it was ever tested extensively.
Just opened issue #478 for this.

@luhenry
Contributor Author

luhenry commented Nov 7, 2023

@blapie are you happy to merge this PR as-is (with -march=native removed) and add it back once the issue is fixed? Would there be anything else you want fixed or changed before we merge? Thank you!

Collaborator

@blapie blapie left a comment


Looking good! Just a few comments on config options.

.github/workflows/build_and_test.yml (review comments, resolved)
@blapie
Collaborator

blapie commented Nov 7, 2023

are you happy to merge this PR as-is (with -march=native removed) and add it back once the issue is fixed?

Yes, I think that can be merged back later. I left a few comments on your patch.

@luhenry
Contributor Author

luhenry commented Nov 7, 2023

@blapie I've updated the PR per your review, and I've split off the native build to reuse for the cross builds. You can find a run at https://github.com/rivosinc/sleef/actions/runs/6790080413

However, I'm running into an issue on armhf specifically, as it fails to find the vdouble_neon32_sleef type:

/workspace/sleef/_build-armhf/include/sleefinline_neon32.h:1141:3: error: unknown type name ‘vdouble_neon32_sleef’
 1141 |   vdouble_neon32_sleef x, y;

The issue seems to be that there is no definition of vdouble in src/arch/helperneon32.h, which I expect is because float64x2_t is not defined on 32-bit platforms.

I don't know how to fix it, but it's making the PR fail as it is.

@blapie
Collaborator

blapie commented Nov 9, 2023

@luhenry Thanks for the update and further investigation! It looks like aarch32 might require a different environment or a fix. Do tests pass when disabling inline headers for armhf? If so, I'll merge as-is.

@luhenry
Contributor Author

luhenry commented Nov 9, 2023

It looks like aarch32 might require a different environment or a fix. Do tests pass when disabling inline headers for armhf? If so, I'll merge as-is.

I disabled inline headers on armhf with https://github.com/shibatch/sleef/pull/476/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R139-R140
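The linked change amounts to something like the following matrix tweak; the key names here are a sketch, not the exact workflow contents:

```yaml
# Hypothetical: skip inline headers on armhf, where float64x2_t (and hence
# vdouble) is unavailable in 32-bit NEON.
- name: armhf
  cmake_flags: -DBUILD_INLINE_HEADERS=FALSE
```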

.github/workflows/build_and_test.yml (review comment, resolved)
@blapie
Collaborator

blapie commented Nov 9, 2023

Created issue #480 to track this failure.
@luhenry If you're happy with your current version (minus the 1024- and 2048-bit vector lengths) I'm happy to merge.

@luhenry
Contributor Author

luhenry commented Nov 9, 2023

@blapie I've removed the 1024- and 2048-bit vector lengths. Last run at https://github.com/rivosinc/sleef/actions/runs/6813866461. Happy to merge as-is! Thank you again.

@blapie blapie merged commit 69cfa71 into shibatch:master Nov 10, 2023
1 check failed
@luhenry luhenry deleted the upstream-ci branch November 10, 2023 15:01
@blapie blapie mentioned this pull request Nov 10, 2023