Enable Arm server support in CI, and create corresponding Python package #12684
For reference, our current docker images are at https://github.com/openxla/iree/tree/main/build_tools/docker. We have a helper script that manages the dependencies between images and such, but unfortunately it requires special permissions to even run, because the only way to get a digest for a docker image is to push it to a registry (who came up with that design?). The images all still work fine with normal docker commands, though. This comment highlights that I need to update the README on that page... You'll probably want to fork the base image and parameterize the places that are currently hardcoded to x86_64. It would be nice to turn that into a build arg and not have separate Dockerfiles, but that would require restructuring everything, so let's not do that right now.
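As a rough illustration of that direction, a parameterized rebuild of the base image for Arm could look something like the sketch below. The `BASE_ARCH` build arg, the Dockerfile path, and the image tag are all assumptions about how things might be wired up, not the current setup:

```bash
# Sketch only: assumes the base Dockerfile has been given a build arg
# (here called BASE_ARCH) for the spots currently hardcoded to x86_64.
docker buildx build \
  --platform linux/arm64 \
  --build-arg BASE_ARCH=aarch64 \
  -f build_tools/docker/dockerfiles/base.Dockerfile \
  -t iree-base-arm64:latest \
  build_tools/docker
```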
@freddan80 @GMNGeoffrey Checking, are we leaving this open?
Yes, pls. We'll start working on this soonish.
@GMNGeoffrey A couple of questions. What set of self-hosted machines do you currently run CI on? Out of the CI jobs, how many would you say we should run on Arm servers in addition to what is currently run on x86 servers? That is, what subset would add the most value without creating too much duplication? I see the 'base' docker image is based on Ubuntu 18.04. Is there a plan to move to 22.04 (like the bleeding-edge image)?
I can give an initial answer to those questions.
https://github.com/openxla/iree/blob/02f85eab220dcf8044a1811f653b5cd1ea9d3653/build_tools/github_actions/runner/gcp/create_image.sh#L29-L30
For any platform / architecture, I think we'd either add a native build job or add a new cross-compile target here: https://github.com/openxla/iree/blob/02f85eab220dcf8044a1811f653b5cd1ea9d3653/.github/workflows/ci.yml#L844-L879
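For the cross-compile route, the core of such a job is a CMake configure pointing at an aarch64 cross toolchain. A minimal sketch using generic CMake cross settings (not IREE's actual workflow steps, which add host tool directories, ccache, etc.) might look like:

```bash
# Sketch of a cross-compile configure on an x86_64 host, assuming an
# aarch64-linux-gnu-capable clang and sysroot are available.
cmake -G Ninja -B ../iree-build-arm64 \
  -DCMAKE_SYSTEM_NAME=Linux \
  -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_C_COMPILER_TARGET=aarch64-linux-gnu \
  -DCMAKE_CXX_COMPILER_TARGET=aarch64-linux-gnu \
  .
cmake --build ../iree-build-arm64
```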
See discussion on #11782
Thx @ScottTodd ! Ah, there it was! Presumably, we can add an arm64 CPU machine, Tau T2A, there as well?
I don't know the Tau T2A machine that well, but is it possible to run any GPU tests on that for instance? Also, eventually we'd wanna run the 'python_release_packages' job as well, so we can distribute aarch64 IREE python pkgs, right? And the benchmark related jobs...
Ack. Hope the comments / questions make sense; my github CI experience is limited :) Also, I'll be travelling for the coming 2w, so my next reply may take a while. After that I plan to get started on this ticket.
Thanks @freddan80 (and thanks @ScottTodd for responding while I was at summits).
No (see above), but we could consider using base-bleeding-edge in this case if it makes things easier. As noted, we use our oldest supported version to verify that support, but I'm not sure how much coverage we get from using Ubuntu 18.04, clang-9 with Arm vs Ubuntu 22.04, clang-17 with Arm + Ubuntu 18.04, clang-9 with x86. We don't really want to test the full combinatorial explosion :-) We may also get nicer features from the newer compiler.
Yup :-) You'd need to have permissions in our cloud project to run that script as written, but if you have your own cloud project you could iterate on that.
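For reference, iterating on that script from a personal cloud project would look roughly like this (the project name is a placeholder, and the script may need additional permissions or environment variables beyond what is shown):

```bash
# Point gcloud at your own project, then run the image-creation script
# from an IREE checkout (path taken from the link above).
gcloud auth login
gcloud config set project my-own-gcp-project
./build_tools/github_actions/runner/gcp/create_image.sh
```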
Would we get significantly more coverage with that? I wouldn't expect there to be a lot of bugs that would be Arm + GPU specific. It also doesn't look like there are any Arm CPU + GPU combinations: https://cloud.google.com/compute/docs/gpus
I think cross-compiling may be reasonable here. We can only get 48 vCPU on the T2A and there are also more concerns around hosting our own release machines for supply chain security. But maybe compiling on the Arm machine would be better, IDK. We can explore that once we've got CI set up :-)
This might be tricky. For x86 benchmarks we use the C2 compute-optimized machines, and you can basically get a machine to yourself with 1 logical CPU for each physical CPU. We might get very noisy CPU numbers if these machines don't get us as close to bare metal. For Arm benchmarks we may be better off using a physical device. We've got an M1 Mini in a lab somewhere, but it's currently tied up in regular CI builds. I would also love to have some Android emulator tests running on Arm. The last time we looked, the Android arm64 emulator on x86_64 was unusably slow, so we just disabled it. Then we could reduce the tests we run on our physical lab devices, which are quite limited.
Ok, I'm back in office again. Thx @GMNGeoffrey ! Some comments from my side.
Ack. I'll try to use 18.04. If I run into issues, we'll take it from there.
Ack. I have an AWS arm64 instance to test things on for now. I think that'll get me far enough on my own.
Agree. Let's not do the GPU tests on Arm for now (don't think it's possible either).
Ack. Let's take that in a later step.
Ok, I need to dig into that. Do you mean that there's no 'metal' option for the T2A? And therefore no way of getting reliable benchmark numbers... Did I get that right? I'd like to have benchmarks running on an Arm server somehow, but perhaps we can take that in a later step. See also a similar discussion I'm in here: https://github.com/openxla/community/pull/75/files#r1188545800
Sounds interesting! Let me look into that. But this would be a later step too, I guess. So what I'll do now is try to build the docker images relevant to the tests we intend to run on T2A. Not sure which tests make the most sense, but here's a long shot from my side:
Hence, I'll make base, swiftshader and swiftshader-bleeding-edge work for Arm and push that. Next step, I'll debug whatever tests fail on these... Or maybe it's better to just start with 'build_all'...? WDYT?
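As a sketch of that first step, building the relevant images directly on an arm64 host might look like the following. The Dockerfile paths and tags are assumptions based on the build_tools/docker layout linked above, and the real images chain off each other, so the actual invocation goes through the helper script:

```bash
# On an arm64 machine, build the images bottom-up with local tags.
docker build -f build_tools/docker/dockerfiles/base.Dockerfile \
  -t iree-base-arm64 build_tools/docker
docker build -f build_tools/docker/dockerfiles/swiftshader.Dockerfile \
  -t iree-swiftshader-arm64 build_tools/docker
```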
Well there's no "metal" option at all on GCP, but based on the advice of others and our own testing, we've been using the compute-optimized machines for our CPU benchmarks on x86-64. The reasoning is that in order to give consistently high performance you have to be giving consistent performance :-) There's no analogous machine type for Arm (only Intel and AMD x86-64 machines, AFAICT). My impression from the marketing is that those machines are focused on compute per dollar rather than maximally fast single-threaded performance. We can test it out, though, and see how noisy the numbers are.
I think I would go in this order:
Ok. Let me look into this on my end.
Sounds good to me.
I think if we map vCPUs to cores we should be in an OK state (with some acceptable variance), based on what we have observed for the Icelake benchmarks. @freddan80 did you get to creating the docker image yet? I am trying to get the build functional today (I need an aarch64 whl today :) ), so if you haven't gotten to it yet I can create a docker image and test.
Hi @powderluv, I managed to get the base + base-bleeding-edge images to work with some minor tweaks. I didn't manage to run all tests yet, so I'm working on that. This week I'm preoccupied with other work unfortunately; I hope to get more focus time from next week onwards. Of course :) Feel free to create the docker image and test if you haven't already.
I have a PR that builds aarch64 whls in docker with the latest manylinux 2_28. #13831 |
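For anyone following along, the general shape of such a build (not the PR's exact scripts, which I haven't reproduced here, and with an assumed package path and Python version) is to run it inside the official manylinux_2_28 aarch64 container and repair the wheel with auditwheel:

```bash
# Rough sketch of a manylinux_2_28 aarch64 wheel build; the PR's actual
# scripts and CMake options will differ.
docker run --rm -v "$PWD":/iree -w /iree \
  quay.io/pypa/manylinux_2_28_aarch64 \
  bash -c '
    /opt/python/cp311-cp311/bin/pip wheel ./compiler -w /iree/dist &&
    auditwheel repair /iree/dist/*.whl \
      --plat manylinux_2_28_aarch64 -w /iree/wheelhouse
  '
```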
@GMNGeoffrey can you please add a self-hosted T2A instance or two to the runner-group, and we can try to set up the job to run on it once #13831 lands. We will need to have docker set up on the VM image. FWIW - I am using the Ubuntu 23.04 base image on the T2A instance. It is plenty fast to compile / run etc., and I can do what I normally do on my x86 VM (modulo gpu tests).
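For reference, spinning up such a T2A VM manually looks roughly like this (the instance name and zone are placeholders, and the real runners are created through the scripts under build_tools/github_actions/runner/gcp rather than by hand):

```bash
# Create an Arm (Tau T2A) VM with an arm64 Ubuntu image; docker still
# needs to be installed on it afterwards.
gcloud compute instances create iree-arm64-runner-test \
  --zone=us-central1-a \
  --machine-type=t2a-standard-8 \
  --image-family=ubuntu-2204-lts-arm64 \
  --image-project=ubuntu-os-cloud
```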
@powderluv nice! I'll check this out... I haven't been able to get to this work yet, but it's high on our prio list. Let me know if you run into any issues you need help with; we're happy to help.
In the process of adding the ukernels bitcode build, we dropped all CMake configure-checks for toolchain support for CPU-feature-enabling flags, as well as the configured headers. I didn't properly think that through: it only worked because no one had tried building with an older toolchain. On x86-64 that was OK because we didn't use any recent flags, but on Arm it was more problematic with the `+i8mm` target feature. @freddan80 ran into this on #12684. So this brings back configure-checks and configured headers, but only where they are specifically needed and without interfering with the bitcode build --- only in the arch/ subdirs and only in the system build. Some `#if defined(IREE_DEVICE_STANDALONE)` lets the bitcode build opt out of including the configured headers. This has a couple of side benefits: we get to drop the clumsy `#if`'s trying to do version checks on compiler version tokens, and we get to conditionally add those feature-specific dependencies instead of having those `#if`'s around the entire files, which confuse syntax highlighting when the feature token is not defined in the IDE build.
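The toolchain capability that the configure-check guards against can be reproduced by hand; a quick way to see whether a given compiler accepts the `+i8mm` feature (a sketch of the idea, not the actual CMake check) is:

```bash
# Succeeds on toolchains that know about +i8mm, fails on older ones
# (e.g. the clang-9 used in the Ubuntu 18.04 base image).
echo 'int main(void) { return 0; }' \
  | clang --target=aarch64-linux-gnu -march=armv8.2-a+i8mm -x c - -c -o /dev/null \
  && echo "i8mm flag supported" \
  || echo "i8mm flag not supported"
```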
Hi! I pushed a rough draft in PR #14372. I highlighted some open questions from my side. I have verified this to work on AWS x86 and arm64 instances. Currently I've only tested … Any comments are welcome. I'm quite inexperienced with docker and github CI flows, so assume total ignorance :) @GMNGeoffrey, I read your mail this morning. Good luck with your future assignments!
Update. #14372 is merged and there's a T2A runner available in the project now. Next step is to get the other jobs … After that, enable the PyPI job that @powderluv delivered, so that aarch64 distributions get pushed weekly (?) to https://pypi.org/project/iree-compiler/. Sounds reasonable?
Thank you. We can run it nightly; we use it regularly. Stable promotion to PyPI I will defer to @stellaraccident @ScottTodd, but in general it's probably good to keep that updated regularly.
Pushing to the nightly release page should definitely be automated. We are currently pushing to PyPI manually from there, so that is more a matter of folks knowing to do so.
That order SGTM, though I'd personally not put energy towards Bazel unless there's a specific request for it (for any given platform/configuration).
Yep. We can take a patch of nod-ai@f55375e, or an equivalent upstream one, and then release builds will make their way into nightly releases automatically. Stable releases are a manual process and we should push another soon. (Docs for pushing to pypi are at https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/releasing.md#releasing-1 - mostly just a matter of running https://github.com/openxla/iree/blob/main/build_tools/python_deploy/pypi_deploy.sh with credentials these days.)
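For completeness, the manual stable push boils down to running that deploy script with PyPI credentials available. A hedged sketch follows; whether the script takes arguments or reads the standard twine environment variables is an assumption, the releasing doc linked above is authoritative:

```bash
# Sketch: run the deploy script from an IREE checkout with twine
# credentials in the environment (assuming the script honors them).
python -m pip install twine
export TWINE_USERNAME=__token__
export TWINE_PASSWORD="<pypi-api-token>"
./build_tools/python_deploy/pypi_deploy.sh
```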
Ack. No, there's no particular reason for that from my side. It was suggested here, but that was some time ago, so things may have changed.
Sorry for the delay on this, but I'll pick it up now. A couple of questions. I'll add a nektos/act environment to be able to test stuff locally. Does anyone have a working config for that? Question wrt the …
I think Geoffrey did at some point, though he is no longer working on these projects. Digging through Discord chat history, he gave this advice:
For Arm servers, building and testing the runtime without GPU (Vulkan, CUDA, etc.) tests would be simplest. Depending on how useful it would be to run the compiler itself on that platform, … I'm a little context-switched on this part of the CI, so I hope that makes general sense. I'd need to refresh myself a bit more or see some code to translate that into specific implementation advice.
Thx. I'm using it to debug GitHub Actions; I'm far from fluent at that 😄 For reference, here's something that worked for me, combined with some minor tweaks in the yml files:
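(The exact command from that comment isn't reproduced here; as an illustrative sketch, a local act invocation for a single job typically looks something like the following, with the workflow path, job name, and runner-label-to-image mapping as assumptions.)

```bash
# Run one workflow job locally with nektos/act, mapping an ubuntu runner
# label to a container image and forcing an arm64 container.
act -W .github/workflows/ci.yml -j build_all \
  -P ubuntu-20.04=catthehacker/ubuntu:act-20.04 \
  --container-architecture linux/arm64
```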
Thx. I must have missed something. Let me work on a PR to have something to discuss around.
Here are some results from the … Note that we're currently running on a "t2a-standard-8" machine (16, 32, and 48 vCPU variants are also available: https://cloud.google.com/compute/docs/general-purpose-machines).
We may still want to use a larger runner and add a few extra instances, but this is pretty encouraging.
Nice, looks promising. Thx for sharing the stats!
This is great! Let me know if I can help in a discord thread. I'll also look at the different patches a bit later when I'm at my keyboard (gh's android app is terrible for anything detailed).
Update: All patches landed. Arm64 wheels can be found here currently: … These will roll into https://pypi.org/project/iree-compiler/ if I understand correctly. When do we expect the next update?
Nice! The push to PyPI is currently manual and we are doing it monthly, roughly the second week of the month, give or take CI weather. I've been driving/tracking it from here: nod-ai/SHARK-ModelDev#121
All done: https://pypi.org/project/iree-compiler/#files Thx everyone involved!
The ticket can be closed.
Request description
It would be great to have a CI pipeline for Arm servers running on a regular basis, as well as being able to just "pip install" IREE tooling on Arm-based platforms. For these two things to happen, we need:
In that order.
For 1) to happen, a) physical resources and b) docker images for Arm are needed.
For more details to get started on this ticket, check out [this discord thread](https://discord.com/channels/689900678990135345/706175572920762449/1086197158237175808).
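The end state for the "pip install" goal would look like this on an aarch64 host (using the iree-compiler package name referenced in the comments; the exact set of published packages is an assumption):

```bash
# On an aarch64 Linux machine, this should pull a prebuilt manylinux
# aarch64 wheel once such wheels are published to PyPI.
python -m pip install iree-compiler
python -c "import platform; print(platform.machine())"  # expect 'aarch64'
```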
We propose doing things in the following order:
What component(s) does this issue relate to?
Python, Compiler, Runtime
Additional context
No response