Work-around aarch64 conda installed numpy 2.x version. #1984
.github/scripts/validate_binaries.sh
@@ -15,8 +15,9 @@ else
 conda install -y conda=23.11.0
 fi
 # Please note ffmpeg is required for torchaudio, see https://github.com/pytorch/pytorch/issues/96159
- conda create -y -n ${ENV_NAME} python=${MATRIX_PYTHON_VERSION} numpy ffmpeg
+ conda create -y -n ${ENV_NAME} python=${MATRIX_PYTHON_VERSION} ffmpeg
Please restrict this change only for linux-aarch64 GPU builds. We want to continue testing the numpy from conda on all other builds
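A minimal sketch of what such a restriction might look like. The variable names `TARGET_OS` and `MATRIX_GPU_ARCH_TYPE`, the label values compared against, and the helper `conda_packages_for` are assumptions made for illustration; none of them is confirmed in this thread.

```shell
# Sketch only: TARGET_OS / MATRIX_GPU_ARCH_TYPE and the label values
# compared below are assumptions, not names confirmed in this thread.
conda_packages_for() {
  # $1: target OS label, $2: GPU arch type label
  if [ "$1" = "linux-aarch64" ] && [ "$2" = "cuda" ]; then
    # aarch64 CUDA builds: leave numpy out of the conda env (the workaround)
    echo "ffmpeg"
  else
    # every other build keeps testing numpy from conda
    echo "numpy ffmpeg"
  fi
}

# Intended use inside the script (commented out so the sketch has no side effects):
# conda create -y -n ${ENV_NAME} python=${MATRIX_PYTHON_VERSION} \
#   $(conda_packages_for "${TARGET_OS}" "${MATRIX_GPU_ARCH_TYPE}")
```

Keeping the decision in one small function makes it easy to print the chosen package list in the CI log, which helps confirm that the condition matches the labels the jobs actually carry.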
Pushed a fix. Fingers crossed.
Lgtm! Thank you very much!
The fix is not effective, most likely because CUDA jobs are incorrectly labeled with the cpu gpu_arch_type (see https://github.com/pytorch/builder/actions/runs/10820014857/job/30020147808#step:1:71).
Alternatively, the aarch64 CUDA CI tests torch 2.4.1 by default, so it did not catch an issue that was only tested against the main branch.
Background:
The PyTorch Nightly Binary Validation workflow and the PyTorch 2.5.0 RC1 Binary Validation workflow both failed for aarch64, which seems to correlate with the CUDA bump from 12.4.0 to 12.4.1 (see this).
Example failed GitHub Actions jobs: https://github.com/pytorch/builder/actions/runs/10794919545/job/29940441536 and, for 2.5.0 RC1, https://github.com/pytorch/builder/actions/runs/10794919545/job/29944860153
I reproduced this locally with the following critical step:
/opt/conda/bin/conda create -y -n conda-env-10794919545 python=3.10 numpy ffmpeg
Then running pip3 install torch --index-url https://download.pytorch.org/whl/test/cu124 and the smoke test reproduces the following error (also shown in the failed GitHub Actions jobs linked above):
2024-09-10T16:08:19.4727026Z ++ python3 ./test/smoke_test/smoke_test.py --package torchonly
2024-09-10T16:08:19.4727531Z Traceback (most recent call last):
2024-09-10T16:08:19.4728089Z File "/pytorch/builder/./test/smoke_test/smoke_test.py", line 9, in <module>
2024-09-10T16:08:19.4728654Z import torch._dynamo
2024-09-10T16:08:19.4729527Z File "/opt/conda/envs/conda-env-10794919545/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 3, in <module>
2024-09-10T16:08:19.4730459Z from . import convert_frame, eval_frame, resume_execution
2024-09-10T16:08:19.4731531Z File "/opt/conda/envs/conda-env-10794919545/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 53, in <module>
2024-09-10T16:08:19.4732512Z from . import config, exc, trace_rules
2024-09-10T16:08:19.4733556Z File "/opt/conda/envs/conda-env-10794919545/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 45, in <module>
2024-09-10T16:08:19.4734616Z from .utils import getfile, hashable, NP_SUPPORTED_MODULES, unwrap_if_wrapper
2024-09-10T16:08:19.4736024Z ImportError: cannot import name 'NP_SUPPORTED_MODULES' from 'torch._dynamo.utils' (/opt/conda/envs/conda-env-10794919545/lib/python3.10/site-packages/torch/_dynamo/utils.py)
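The reproduction above can be sketched as a small script. This is an illustration, not the exact commands used: the `run` and `repro` helpers, the `DRY_RUN` switch, the default `ENV_NAME`, and the use of `conda run` to execute inside the environment are all assumptions added here. By default it only prints the commands; set `DRY_RUN=0` on an aarch64 CUDA machine to actually execute them.

```shell
# Sketch only: helper names, DRY_RUN, and the default ENV_NAME are hypothetical.
ENV_NAME="${ENV_NAME:-repro-env}"

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

repro() {
  # 1. Create the env with numpy from conda, as in the failing CI step.
  run /opt/conda/bin/conda create -y -n "${ENV_NAME}" python=3.10 numpy ffmpeg
  # 2. Install the test wheel from the cu124 index.
  run /opt/conda/bin/conda run -n "${ENV_NAME}" \
    pip3 install torch --index-url https://download.pytorch.org/whl/test/cu124
  # 3. Trigger the import that fails in the smoke test traceback.
  run /opt/conda/bin/conda run -n "${ENV_NAME}" python3 -c "import torch._dynamo"
}
```

Calling `repro` with the default settings lists the three steps without touching the machine, which is convenient for pasting into an issue before running them for real.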
Two possible workarounds identified:
I do not yet know why, on ARM64, the Anaconda numpy package appears to be incompatible with our generated PyTorch wheel. As a follow-up, we could check whether the CUDA 12.4.0 ARM nightly wheel is compatible with this numpy version.
Update: the CUDA 12.4.0 aarch64 wheel seems to work well with conda numpy 2.1.1, so the CUDA bump likely introduced the incompatibility with conda's numpy.
Since we cannot prevent users from using conda's numpy 2.x, ideally we should fix this on the PyTorch aarch64 CUDA wheel side.
cc @atalman @malfet @ptrblck @tinglvv