Support arm64 for Huggingface trainer #1986
Comments
/good-first-issue |
@tenzen-y: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I am interested in this issue. I am wondering if this requires further triage before work can begin. |
Feel free to assign this to yourself with /assign. |
/assign |
PR #1987 appears to have fixed the image publication job by removing arm64 from the list of platforms supported for the image. How was it determined that arm64 is the root cause of the issue? The logs do not appear to be descriptive in that regard.
|
That error was caused by multi-arch image building, which uses more storage than single-arch building. However, GitHub recently increased the computing resources for OSS project CI: https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/ |
The following is the disk usage for amd64.
The following is the disk usage for arm64.
The issue has to do with the size of the downloaded parent image. |
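For readers who want to reproduce this kind of measurement, here is a minimal sketch of reporting disk usage on a runner, for example before and after pulling the parent image. It relies only on Python's standard `shutil.disk_usage`; the helper name `report_disk_usage` and the default path `/` are assumptions for illustration and are not part of the project's CI.

```python
import shutil


def report_disk_usage(path: str = "/") -> dict:
    """Return total, used, and free space for `path`, in GiB."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    gib = 1024 ** 3
    return {
        "total_gib": round(usage.total / gib, 2),
        "used_gib": round(usage.used / gib, 2),
        "free_gib": round(usage.free / gib, 2),
    }


if __name__ == "__main__":
    # Call once before and once after the image pull, then compare "used_gib".
    print(report_disk_usage("/"))
```

Comparing the `used_gib` value across the two calls gives a rough estimate of how much space the parent image consumes for a given architecture.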
@tariq-hasan No, I meant the multi-arch image. Your logs show the single-arch image. |
But I would think that even when Docker Buildx is used with the Docker container driver for multi-platform builds, the base image is downloaded separately for each architecture. I presume Docker Buildx creates a separate container for each platform specified in the build, and each container runs the build process for its architecture. That would mean the disk space is used up in a multi-architecture build as well. Should we create a matrix of supported platforms so that we can distribute the builds across parallel runners and mitigate the disk usage issue? |
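The idea raised in the comment above, one build per platform so that each could run on its own runner, can be sketched as follows. The snippet only constructs `docker buildx build` argument lists using the `--platform` and `--tag` flags; the image name, build context, and the helper `buildx_commands` are hypothetical, and this is an illustration of the proposal under discussion rather than something the project adopted.

```python
from typing import List

# Platforms discussed in this issue.
PLATFORMS = ["linux/amd64", "linux/arm64"]


def buildx_commands(image: str, context: str, platforms: List[str]) -> List[List[str]]:
    """Build one `docker buildx build` argument list per target platform."""
    commands = []
    for platform in platforms:
        arch = platform.split("/")[-1]  # e.g. "arm64"
        commands.append([
            "docker", "buildx", "build",
            "--platform", platform,
            "--tag", f"{image}-{arch}",  # hypothetical per-arch tag scheme
            context,
        ])
    return commands


if __name__ == "__main__":
    # "example/trainer:latest" is a placeholder image name.
    for cmd in buildx_commands("example/trainer:latest", ".", PLATFORMS):
        print(" ".join(cmd))
```

Each argument list could be handed to a separate runner, so that no single machine has to hold the layers for both architectures at once.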
That is correct, but when we faced the disk pressure issue, the increased resources had not yet been applied to the kubeflow project. The Actions resources were increased after that, so I guess we shouldn't face the same issue again. Have you tried running CI with multi-platform image building?
No, we shouldn't do it as described above. |
Currently, arm64 support for the Hugging Face trainer image is removed due to limited resources in GitHub CI. It is to be re-enabled later after further investigation.