
Enable Arm server support in CI, and create corresponding Python package #12684

Closed
freddan80 opened this issue Mar 20, 2023 · 32 comments
Labels: enhancement (New feature or request), infrastructure (Relating to build systems, CI, or testing)

@freddan80 (Contributor)

Request description

It would be great to have a CI pipeline for Arm servers running on a regular basis, as well as being able to simply "pip install" IREE tooling on Arm-based platforms. For these two things to happen, we need to:

  1. Enable CI tests on Arm servers.
  2. Publish the IREE Python package for aarch64 on PyPI (https://pypi.org/project/iree-compiler/).

In that order.
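As a concrete target, the end state on an aarch64 host would simply be the following (hypothetical session; package names taken from the existing x86_64 releases, with iree-runtime assumed to ship alongside iree-compiler):

```shell
# Hypothetical end state on an aarch64 machine once the packages are on PyPI.
python -m pip install iree-compiler iree-runtime
python -c "import iree.compiler, iree.runtime"
```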

For 1) to happen, we need a) physical resources and b) Docker images for Arm.

For more details on getting started on this ticket, check out [this Discord thread](https://discord.com/channels/689900678990135345/706175572920762449/1086197158237175808).

We propose doing things in the following order:

  1. Create and debug the relevant Docker images.
  2. Add the relevant physical resources (Google Cloud).
  3. Start running CI on Arm servers: post-submit at first, then gradually more frequently as it stabilizes.
  4. Add the IREE aarch64 Python package to pypi.org.

What component(s) does this issue relate to?

Python, Compiler, Runtime

Additional context

No response

freddan80 added the awaiting-triage and enhancement labels on Mar 20, 2023
GMNGeoffrey added the infrastructure label and removed awaiting-triage on Mar 20, 2023
@GMNGeoffrey (Contributor)

For reference, our current Docker images are at https://github.com/openxla/iree/tree/main/build_tools/docker. We have a helper script that manages the dependencies between images and such, but unfortunately it requires special permissions to even run, because the only way to get a digest for a Docker image is to push it to a registry (who came up with that design?). The images all still work fine with normal docker build commands, though. We use the whole repo as the build context (with almost everything excluded) so that scripts can be shared, so you'll want to mirror the command the script uses: run it from the root of the repo and point it directly at the Dockerfile.

This comment highlights that I need to update the README on that page...

You'll probably want to fork the base image and parameterize the places that are currently hardcoded to x86_64. It would be nice to turn that into a build arg rather than having separate Dockerfiles, but that would require restructuring everything, so let's not do that right now.
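For orientation, here is a minimal sketch of building one of these images by hand in the way described above. The Dockerfile path, image tags, and the TARGET_ARCH build arg are assumptions for illustration, not the project's canonical setup:

```shell
# Minimal sketch: build from the repo root (the whole repo is the context, with
# almost everything excluded) and point -f directly at the Dockerfile.
docker build \
  -f build_tools/docker/dockerfiles/base.Dockerfile \
  -t iree-base-local \
  .

# If the hardcoded x86_64 bits were turned into a build arg (hypothetical name),
# an arm64 variant could then be selected without a separate Dockerfile:
docker build \
  -f build_tools/docker/dockerfiles/base.Dockerfile \
  --build-arg TARGET_ARCH=aarch64 \
  -t iree-base-arm64-local \
  .
```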

@allieculp

@freddan80 @GMNGeoffrey Checking, are we leaving this open?

@freddan80 (Contributor, Author)

Yes, pls. We'll start working on this soonish.

@freddan80 (Contributor, Author)

@GMNGeoffrey A couple of questions.

What set of self-hosted machines do you currently run CI on?

Out of the CI jobs, how many would you say we should run on Arm servers in addition to what is currently run on x86 servers? That is, what subset would add the most value without creating too much duplication?

I see the 'base' docker image is based on Ubuntu 18.04. Is there a plan to move to 22.04 (like the bleeding-edge image)?

@ScottTodd (Member)

I can give an initial answer to those questions.

What set of self-hosted machines do you currently run CI on?

https://github.com/openxla/iree/blob/02f85eab220dcf8044a1811f653b5cd1ea9d3653/build_tools/github_actions/runner/gcp/create_image.sh#L29-L30
(and a macOS arm64 machine I think?)

Out of the CI jobs, how many would you say we should run on Arm servers in addition to what is currently run on x86 servers? That is, what subset would add the most value without creating too much duplication?

For any platform / architecture, I think we'd either add a build_test_all_ job, like these: https://github.com/openxla/iree/blob/02f85eab220dcf8044a1811f653b5cd1ea9d3653/.github/workflows/ci.yml#L1027-L1034

or add a new cross-compile target to here: https://github.com/openxla/iree/blob/02f85eab220dcf8044a1811f653b5cd1ea9d3653/.github/workflows/ci.yml#L844-L879
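In either case the job ultimately boils down to a configure/build/test cycle run inside the appropriate Docker image. As a rough orientation only (the real jobs run wrapper scripts from ci.yml; the flags below are illustrative assumptions, apart from IREE_ENABLE_ASSERTIONS, which is a real option):

```shell
# Illustrative shape of a build_test_all_<arch> job; not copied from ci.yml.
cmake -G Ninja -B ../iree-build -S . \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_ENABLE_ASSERTIONS=ON
cmake --build ../iree-build
ctest --test-dir ../iree-build --output-on-failure
```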

I see the 'base' docker image is based on Ubuntu 18.04. Is there a plan to move to 22.04 (like the bleeding-edge image)?

See discussion on #11782

@freddan80 (Contributor, Author)

Thx @ScottTodd !

iree/build_tools/github_actions/runner/gcp/create_image.sh

Ah, there it was! Presumably, we can add an arm64 CPU machine (Tau T2A) there as well?

For any platform / architecture, I think we'd either add a build_test_all_ job, like these:

I don't know the Tau T2A machine that well, but is it possible to run any GPU tests on that for instance? Also, eventually we'd wanna run the 'python_release_packages' job as well, so we can distribute aarch64 IREE python pkgs, right? And the benchmark related jobs...

See discussion on #11782

Ack.

Hope the comments / questions make sense, my github CI experience is limited :)

Also, I'll be travelling for the coming 2w, so my next reply may take a while. After that I plan to get started on this ticket.

@GMNGeoffrey (Contributor)

Thanks @freddan80 (and thanks @ScottTodd for responding while I was at summits).

I see the 'base' docker image is based on Ubuntu 18.04. Is there a plan to move to 22.04 (like the bleeding-edge image)?

No (see above), but we could consider using base-bleeding-edge in this case if it makes things easier. As noted, we use our oldest supported version to verify that support, but I'm not sure how much coverage we get from using Ubuntu 18.04 / clang-9 with Arm versus Ubuntu 22.04 / clang-17 with Arm plus Ubuntu 18.04 / clang-9 with x86. We don't really want to test the full combinatorial explosion :-) We may also get nicer features from the newer compiler.

Ah, there it was! Presumably, we can add an arm64 CPU machine (Tau T2A) there as well?

Yup :-) You'd need to have permissions in our cloud project to run that script as written, but if you have your own cloud project you could iterate on that.
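If it helps while you don't have access to our project, iterating could be as simple as spinning up a scratch T2A VM in your own project. The instance name, zone, and image below are arbitrary choices (T2A is only available in certain regions), not what the CI project uses:

```shell
# Hypothetical scratch VM for iterating on the arm64 image/runner setup.
gcloud compute instances create iree-arm64-scratch \
  --machine-type=t2a-standard-8 \
  --zone=us-central1-a \
  --image-family=ubuntu-2204-lts-arm64 \
  --image-project=ubuntu-os-cloud
```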

I don't know the Tau T2A machine that well, but is it possible to run any GPU tests on that for instance?

Would we get significantly more coverage with that? I wouldn't expect there to be a lot of bugs that would be Arm + GPU specific. It also doesn't look like there are any Arm CPU + GPU combinations: https://cloud.google.com/compute/docs/gpus

Also, eventually we'd wanna run the 'python_release_packages' job as well, so we can distribute aarch64 IREE python pkgs, right?

I think cross-compiling may be reasonable here. We can only get 48 vCPU on the T2A and there are also more concerns around hosting our own release machines for supply chain security. But maybe compiling on the Arm machine would be better, IDK. We can explore that once we've got CI set up :-)

And the benchmark related jobs...

This might be tricky. For x86 benchmarks we use the C2 compute-optimized machines and you can basically get a machine to yourself with 1 logical CPU for each physical CPU. We might have very noisy CPU numbers if these machines don't get us as close to the bare metal. For Arm benchmarks we may be better off using a physical device. We've got an M1 Mini in a lab somewhere, but it's currently tied up in regular CI builds.

I would love to also have some Android emulator tests running on Arm. We disabled the Android arm64 emulator on x86_64 hosts the last time we looked at it because it was unusably slow. Then we could reduce the tests we run on our physical lab devices, which are quite limited.
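As a rough sketch of what that could look like natively on an arm64 host (the system image, API level, and AVD name are arbitrary choices; the commands are the standard Android SDK tools):

```shell
# Create and boot an arm64 Android emulator natively on an Arm host.
sdkmanager "system-images;android-33;google_apis;arm64-v8a" "emulator" "platform-tools"
avdmanager create avd -n iree-arm64-test -k "system-images;android-33;google_apis;arm64-v8a"
emulator -avd iree-arm64-test -no-window -no-audio &
adb wait-for-device
```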

@freddan80 (Contributor, Author)

Ok, I'm back in office again. Thx @GMNGeoffrey ! Some comments from my side.

No (see above), but we could consider using base-bleeding-edge in this case if it makes things easier. As noted, we use our oldest supported version to verify that support, but I'm not sure how much coverage we get from using Ubuntu 18.04 / clang-9 with Arm versus Ubuntu 22.04 / clang-17 with Arm plus Ubuntu 18.04 / clang-9 with x86. We don't really want to test the full combinatorial explosion :-) We may also get nicer features from the newer compiler.

Ack. I'll try to use 18.04. If I run into issues, we'll take it from there.

Yup :-) You'd need to have permissions in our cloud project to run that script as written, but if you have your own cloud project you could iterate on that.

Ack. I have an AWS arm64 instance to test things on for now. I think that'll get me far enough on my own.

Would we get significantly more coverage with that? I wouldn't expect there to be a lot of bugs that would be Arm + GPU specific. It also doesn't look like there are any Arm CPU + GPU combinations: https://cloud.google.com/compute/docs/gpus

Agree. Let's not do the GPU tests on Arm for now (don't think it's possible either).

I think cross-compiling may be reasonable here. We can only get 48 vCPU on the T2A and there are also more concerns around hosting our own release machines for supply chain security. But maybe compiling on the Arm machine would be better, IDK. We can explore that once we've got CI set up :-)

Ack. Let's take that in a later step.

This might be tricky. For x86 benchmarks we use the C2 compute-optimized machines and you can basically get a machine to yourself with 1 logical CPU for each physical CPU. We might have very noisy CPU numbers if these machines don't get us as close to the bare metal. For Arm benchmarks we may be better off using a physical device. We've got an M1 Mini in a lab somewhere, but it's currently tied up in regular CI builds.

Ok, I need to dig into that. Do you mean that there's no 'metal' option for the T2A? And therefore no way of getting reliable benchmark numbers... Did I get that right?

I'd like to have benchmarks running on an Arm server somehow, but perhaps we can take that in a later step. See also a similar discussion I'm in here: https://github.com/openxla/community/pull/75/files#r1188545800

I would love to also have some Android emulator tests running on Arm. We disabled the Android arm64 emulator on x86_64 hosts the last time we looked at it because it was unusably slow. Then we could reduce the tests we run on our physical lab devices, which are quite limited.

Sounds interesting! Let me look into that. But this would be a later step too I guess.

So what I'll do now is try to build the Docker images relevant to the tests we intend to run on T2A. I'm not sure which tests make the most sense, but here's a first stab from my side:

  • build_all (base)
  • test_all (swiftshader)
  • build_test_all_bazel? (swiftshader_bleeding_edge)
  • build_test_runtime? (base)

Hence, I'll make base, swiftshader and swiftshader-bleeding-edge work for Arm and push that. Next step, I'll debug whatever tests fail on these... Or maybe it's better to just start with 'build_all'? WDYT?

@GMNGeoffrey (Contributor)

Ok, I need to dig into that. Do you mean that there's no 'metal' option for the T2A? And therefore no way of getting reliable benchmark numbers... Did I get that right?

Well, there's no "metal" option at all on GCP, but based on the advice of others and our own testing, we've been using the compute-optimized machines for our CPU benchmarks on x86-64. The reasoning is that in order to give consistently high performance you have to be giving consistent performance :-) There's no analogous machine type for Arm (only Intel and AMD x86-64 machines, AFAICT). My impression from the marketing is that those machines are focused on compute per dollar rather than maximally fast single-threaded performance. We can test it out though and see how noisy the numbers are.

So what I'll do now is try to build the Docker images relevant to the tests we intend to run on T2A. I'm not sure which tests make the most sense, but here's a first stab from my side:

  • build_all (base)
  • test_all (swiftshader)
  • build_test_all_bazel? (swiftshader_bleeding_edge)
  • build_test_runtime? (base)

Hence, I'll make base, swiftshader and swiftshader-bleeding-edge work for Arm and push that. Next step, I'll debug whatever tests fail on these... Or maybe it's better to just start with 'build_all'? WDYT?

I think I would go in this order build_test_runtime -> build_all -> test_all -> build_test_all_bazel. The runtime is much smaller and lighter weight, so you should have an easier time with that. Also Vulkan+Swiftshader can be a bit tricky to get configured (maybe it will just work). Personally, I would probably get the first docker image and build working end-to-end before trying the next build. That has the advantage that we can start running it and then you have something to prevent regressions :-)

@freddan80 (Contributor, Author)

My impression from the marketing is that those machines are focused on compute per dollar rather than maximally fast single-threaded performance. We can test it out though and see how noisy the numbers are.

Ok. Let me look into this on my end.

I think I would go in this order build_test_runtime -> build_all -> test_all -> build_test_all_bazel.

Sounds good to me.

@powderluv (Collaborator)

I think if we map vCPUs to cores we should be in an OK state (with some acceptable variance), based on what we have observed for the Icelake benchmarks.

@freddan80 did you get around to creating the Docker image yet? I am trying to get the build functional today (I need an aarch64 whl today :) ), so if you haven't gotten to it yet I can create a Docker image and test it.

@freddan80 (Contributor, Author)

Hi @powderluv, I managed to get the base and base-bleeding-edge images to work with some minor tweaks. I didn't manage to run all tests yet, so I'm working on that. This week I'm preoccupied with other work unfortunately; I hope to get more focus time from next week onwards. Of course :) feel free to create the Docker image and test it if you haven't already.

@powderluv (Collaborator)

I have a PR that builds aarch64 whls in docker with the latest manylinux 2_28. #13831
Once it lands we just need to spin up a T2A and build it nightly.
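For reference, the rough shape of that approach is below. The manylinux image name is the real quay.io one; the in-container command is a simplified stand-in for what the PR's build scripts actually do:

```shell
# Simplified sketch: build an aarch64 compiler wheel inside a manylinux_2_28
# container on an Arm host. Treat the in-container command as an approximation;
# the real flow goes through IREE's packaging scripts.
docker run --rm -v "$PWD:/iree" -w /iree \
  quay.io/pypa/manylinux_2_28_aarch64 \
  /opt/python/cp311-cp311/bin/python -m pip wheel -v --no-deps -w wheelhouse ./compiler
```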

@powderluv (Collaborator)

@GMNGeoffrey can you please add a self-hosted T2A instance or two to the runner group? Then we can try to set up the job to run on them once #13831 lands. We will need to have Docker set up on the VM image.

FWIW, I am using the Ubuntu 23.04 base image on the T2A instance. It is plenty fast to compile / run etc., and I can do what I normally do on my x86 VM (modulo GPU tests).
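On the "Docker set up on the VM image" point, a minimal sketch of what that could involve on an Ubuntu arm64 VM (using the distro package; the CI images may well install docker-ce from Docker's own repository instead):

```shell
# Minimal Docker setup on an Ubuntu arm64 VM, as an illustration only.
sudo apt-get update && sudo apt-get install -y docker.io
sudo usermod -aG docker "$USER"   # log out/in for the group change to apply
```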

@freddan80 (Contributor, Author)

@powderluv nice! I'll check this out... I haven't been able to get to this work yet, but it's high on our priority list. Let me know if you run into any issues; we're happy to help.

bjacob added a commit that referenced this issue Jun 20, 2023
In the process of adding the ukernels bitcode build, we dropped all
CMake configure-checks for toolchain support for CPU-feature-enabling
flags, and the configured headers. I didn't properly think that through:
it worked essentially because no one had tried building with an older
toolchain. On x86-64 that was OK because we didn't use any recent flags,
but on Arm it was more problematic with the `+i8mm` target feature.
@freddan80 ran into this on
#12684.

So this brings back configure-checks and configured-headers, but only
where they are specifically needed and not interfering with the bitcode
build --- only in the arch/ subdirs and only in the system-build. Some
`#if defined(IREE_DEVICE_STANDALONE)` lets the bitcode build opt out of
including the configured headers.

This has a couple of side benefits. We get to drop the clumsy `#if`'s
trying to do version checks on compiler version tokens, and we get to
conditionally add those feature-specific dependencies instead of having
those `#if`'s around the entire files, which confuse syntax highlighting
when the feature token is not defined in the IDE build.
@freddan80 (Contributor, Author)

Hi! I pushed a rough draft in PR #14372 and highlighted some open questions from my side.

I have verified this to work on AWS x86 and arm64 instances. Currently I've only tested build_test_runtime. I'd like it to be tested on the GCP instances as well, but I believe I need some help there.

Any comments are welcome. I'm quite inexperienced with Docker and GitHub CI flows, so assume total ignorance :)

@GMNGeoffrey, I read your mail this morning. Good luck with your future assignments!

nhasabni pushed a commit to plaidml/iree that referenced this issue Aug 24, 2023
@freddan80 (Contributor, Author)

Update: #14372 is merged and there's a T2A runner available in the project now. The next step is to get the other jobs (build_all -> test_all -> build_test_all_bazel) running on Arm. I'll get to that within a couple of weeks.

After that, enable the PyPI job that @powderluv delivered, so we can push weekly (?) aarch64 distributions to https://pypi.org/project/iree-compiler/.

Sounds reasonable?

@powderluv (Collaborator)

Thank you. We can run it nightly; we use it regularly. Stable promotion to PyPI I will defer to @stellaraccident and @ScottTodd, but in general it's probably good to keep that updated regularly.

@stellaraccident (Collaborator)

Pushing to the nightly release page should definitely be automated. We currently push to PyPI manually from there, so that is more a matter of folks knowing to do so.

@ScottTodd (Member)

The next step is to get the other jobs (build_all -> test_all -> build_test_all_bazel) running on Arm. I'll get to that within a couple of weeks.

That order SGTM, though I'd personally not put energy towards Bazel unless there's a specific request for it (for any given platform/configuration).

After that, enable the PyPI job that @powderluv delivered, so we can push weekly (?) aarch64 distributions to https://pypi.org/project/iree-compiler/.

Yep. We can take a patch of nod-ai@f55375e or equivalent upstream and then release builds will make their way into nightly releases automatically. Stable releases are a manual process and we should push another soon. (docs for pushing to pypi are at https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/releasing.md#releasing-1 - mostly just a matter of running https://github.com/openxla/iree/blob/main/build_tools/python_deploy/pypi_deploy.sh with credentials these days)
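(For orientation only: the manual push the script wraps is essentially a twine upload of the release wheels. The wheelhouse/ path below is an illustration; the actual artifact handling lives in pypi_deploy.sh.)

```shell
# Generic sketch of what a manual PyPI push amounts to; not the project's
# canonical procedure, which is the pypi_deploy.sh script linked above.
python -m pip install twine
python -m twine upload wheelhouse/iree_compiler-*.whl wheelhouse/iree_runtime-*.whl
```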

@freddan80 (Contributor, Author)

That order SGTM, though I'd personally not put energy towards Bazel unless there's a specific request for it (for any given platform/configuration).

Ack. No, there's no particular reason for that from my side; it was suggested here at some point, but that was a while ago, so things may have changed.

@freddan80 (Contributor, Author)

Sorry for the delay on this, but I'll pick it up now. A couple of questions.

I'll add a nektos/act environment to be able to test stuff locally. Does anyone have a working config for that?

A question w.r.t. the test_all job: IIUC, the only difference from the build_test_runtime job is that it runs the Vulkan tests on the Swiftshader Docker image. Vulkan is not yet installed in the Arm image and there's no distro package for it. Does it make sense to run this job?

@ScottTodd

@ScottTodd (Member)

I'll add a nektos/act environment to be able to test stuff locally. Does anyone have a working config for that?

I think Geoffrey did at some point, though he is no longer working on these projects. Digging through Discord chat history, he gave this advice:

Yeah nektos doesn't like non-standard "runs-on"
Unless you're debugging the syntax of GitHub actions themselves, I recommend just issuing the same commands
FYI github_actions/docker_run.sh is a thin wrapper around docker/docker_run.sh with the latter having generic environment variable names


A question w.r.t. the test_all job: IIUC, the only difference from the build_test_runtime job is that it runs the Vulkan tests on the Swiftshader Docker image. Vulkan is not yet installed in the Arm image and there's no distro package for it. Does it make sense to run this job?

  • test_all depends on the compiler build (build_all) and runs compiler tests. Compiler builds can take upwards of 20-30 minutes on medium-sized machines with an empty build cache.
  • build_test_runtime just builds the runtime (without waiting on build_all) and runs runtime tests. That is expected to take < 5 minutes even with an empty cache on a small machine.

For Arm servers, building and testing the runtime without GPU (Vulkan, CUDA, etc.) tests would be simplest. Depending on how useful it would be to run the compiler itself on that platform, build_all -> test_all (excluding GPU tests?) could also make sense.
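As a rough orientation for that simplest path, a runtime-only build and test might look like the sketch below. IREE_BUILD_COMPILER is the real CMake option; the build directory and the choice of driver flags to disable are assumptions for a CPU-only Arm runner:

```shell
# Runtime-only build and test, skipping the compiler and GPU drivers.
cmake -G Ninja -B ../iree-build-runtime -S . \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_HAL_DRIVER_VULKAN=OFF \
  -DIREE_HAL_DRIVER_CUDA=OFF
cmake --build ../iree-build-runtime
ctest --test-dir ../iree-build-runtime --output-on-failure
```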

I'm a little context switched on this part of the CI, so I hope that makes general sense. I'd need to refresh myself a bit more or see some code to translate that to specific implementation advice.

@freddan80 (Contributor, Author)

I think Geoffrey did at some point, though he is no longer working on these projects. Digging through Discord chat history, he gave this advice:

Thx. I'm using it to debug GitHub Actions; I'm far from fluent at that 😄 For reference, here's something that worked for me, combined with some minor tweaks in the YAML files:

act -W .github/workflows/ci.yml -j setup -j build_all -j test_all --rm --bind -P self-hosted=catthehacker/ubuntu:act-20.04 -P ubuntu-20.04=catthehacker/ubuntu:act-20.04 --env-file my_act.env

Depending on how useful it would be to run the compiler itself on that platform, build_all -> test_all (excluding GPU tests?) could also make sense.

Thx. I must have missed something. Let me work on a PR so we have something concrete to discuss.

@ScottTodd (Member)

Here are some results from the build_test_all_arm64 CI job.

Note that we're currently running on a "t2a-standard-8" machine (16-, 32-, and 48-vCPU variants are also available: https://cloud.google.com/compute/docs/general-purpose-machines).

| workflow logs | timing | cache |
| --- | --- | --- |
| run 1 | 72 minutes | 0% hits (empty cache) |
| run 2 | 27 minutes | 64% hits |
| run 3 | 7 minutes | 99% hits (nice!) |
| run 4 | 7 minutes | 99% hits (nice!) |

We may still want to use a larger runner and add a few extra instances, but this is pretty encouraging.

@freddan80 (Contributor, Author)

Nice, looks promising. Thx for sharing the stats!

@freddan80 (Contributor, Author)

With the stability fixes for #15488, we'll soon be able to flip the switch for the Arm pkgs (with #15402).

@stellaraccident (Collaborator)

This is great! Let me know if I can help in a Discord thread. I'll also look at the different patches a bit later when I'm at my keyboard (GitHub's Android app is terrible for anything detailed).

@freddan80 (Contributor, Author)

Update: all patches landed. The arm64 wheels can currently be found here:
https://github.com/openxla/iree/releases

These will roll into

https://pypi.org/project/iree-compiler/
https://pypi.org/project/iree-runtime/

if I understand correctly.

When do we expect the next update?

@stellaraccident (Collaborator)

Nice! The push to PyPI is currently manual and we are doing it monthly, roughly the second week of the month, give or take CI weather.

I've been driving/tracking it from here: nod-ai/SHARK-ModelDev#121

@freddan80 (Contributor, Author)

All done: https://pypi.org/project/iree-compiler/#files

Thx everyone involved!

@freddan80 (Contributor, Author)

The ticket can be closed.
