Work-around aarch64 conda installed numpy 2.x version. #1984

Merged
merged 4 commits into from
Sep 11, 2024

Conversation

nWEIdia
Collaborator

@nWEIdia nWEIdia commented Sep 11, 2024

Background:
The PyTorch Nightly Binary Validation workflow and the PyTorch 2.5.0 RC1 Binary Validation workflow both failed for aarch64, which seems to correlate with the CUDA bump from 12.4.0 to 12.4.1 (see this )

Example failed GitHub Actions jobs: https://github.com/pytorch/builder/actions/runs/10794919545/job/29940441536 and, for v2.5.0 RC1, https://github.com/pytorch/builder/actions/runs/10794919545/job/29944860153

Locally reproduced this by following the critical step below:

/opt/conda/bin/conda create -y -n conda-env-10794919545 python=3.10 numpy ffmpeg

Then running pip3 install torch --index-url https://download.pytorch.org/whl/test/cu124 easily reproduces the following error (also shown in the GitHub Actions failure links above):

2024-09-10T16:08:19.4727026Z ++ python3 ./test/smoke_test/smoke_test.py --package torchonly
2024-09-10T16:08:19.4727531Z Traceback (most recent call last):
2024-09-10T16:08:19.4728089Z File "/pytorch/builder/./test/smoke_test/smoke_test.py", line 9, in <module>
2024-09-10T16:08:19.4728654Z import torch._dynamo
2024-09-10T16:08:19.4729527Z File "/opt/conda/envs/conda-env-10794919545/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 3, in <module>
2024-09-10T16:08:19.4730459Z from . import convert_frame, eval_frame, resume_execution
2024-09-10T16:08:19.4731531Z File "/opt/conda/envs/conda-env-10794919545/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 53, in <module>
2024-09-10T16:08:19.4732512Z from . import config, exc, trace_rules
2024-09-10T16:08:19.4733556Z File "/opt/conda/envs/conda-env-10794919545/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 45, in <module>
2024-09-10T16:08:19.4734616Z from .utils import getfile, hashable, NP_SUPPORTED_MODULES, unwrap_if_wrapper
2024-09-10T16:08:19.4736024Z ImportError: cannot import name 'NP_SUPPORTED_MODULES' from 'torch._dynamo.utils' (/opt/conda/envs/conda-env-10794919545/lib/python3.10/site-packages/torch/_dynamo/utils.py)

Two possible workarounds identified:

  1. This PR: do not install conda's numpy (conda-forge vs. anaconda made no difference); use PyPI's numpy instead.
  2. Do not install conda's numpy 2.x; install conda's numpy 1.x instead (e.g. 1.24.4 works).
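As a rough illustration of the two workarounds, the sketch below only prints the commands each one would use; it does not run conda. ENV_NAME and MATRIX_PYTHON_VERSION come from the patched script, while the function name workaround_commands, the default values, and the explicit pip3 install numpy step are assumptions made for this example.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for each workaround instead of running them.
# The default values below are illustrative assumptions, not taken from the PR.
: "${ENV_NAME:=conda-env-demo}"
: "${MATRIX_PYTHON_VERSION:=3.10}"

# workaround_commands N: print the commands for workaround N (1 or 2).
workaround_commands() {
  case "$1" in
    1)
      # Workaround 1: leave numpy out of the conda env, take it from PyPI.
      echo "conda create -y -n ${ENV_NAME} python=${MATRIX_PYTHON_VERSION} ffmpeg"
      echo "conda run -n ${ENV_NAME} pip3 install numpy"
      ;;
    2)
      # Workaround 2: keep conda's numpy but pin it to a 1.x release.
      echo "conda create -y -n ${ENV_NAME} python=${MATRIX_PYTHON_VERSION} numpy=1.24.4 ffmpeg"
      ;;
    *)
      echo "unknown workaround: $1" >&2
      return 1
      ;;
  esac
}

workaround_commands 1
workaround_commands 2
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed lines can be reviewed and pasted into a shell on an aarch64 machine.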

I do not yet know why, on ARM64, the Anaconda numpy package does not seem to be compatible with our generated PyTorch wheel. As a follow-up, maybe we can check whether the CUDA 12.4.0 ARM nightly wheel is compatible with this numpy version.

Update: the CUDA 12.4.0 aarch64 wheel seems to get along well with conda's numpy 2.1.1, so it is likely that the CUDA bump introduced an incompatibility with conda's numpy.

Since we cannot prevent users from using conda's numpy 2.x, ideally we should come up with a fix on the pytorch aarch64 cuda wheel side.

cc @atalman @malfet @ptrblck @tinglvv

@@ -15,8 +15,9 @@ else
 conda install -y conda=23.11.0
 fi
 # Please note ffmpeg is required for torchaudio, see https://github.com/pytorch/pytorch/issues/96159
-conda create -y -n ${ENV_NAME} python=${MATRIX_PYTHON_VERSION} numpy ffmpeg
+conda create -y -n ${ENV_NAME} python=${MATRIX_PYTHON_VERSION} ffmpeg
Contributor


Please restrict this change to linux-aarch64 GPU builds only. We want to continue testing conda's numpy on all other builds.
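One way the requested restriction could look is sketched below. This is not the code that was pushed; the helper name conda_packages and the variable names TARGET_OS and MATRIX_GPU_ARCH_TYPE in the usage comment are assumptions, as is treating every value other than cpu as a GPU build.

```shell
#!/bin/sh
# Sketch: choose the conda package list per build type.
# $1: target OS (e.g. linux-aarch64), $2: GPU arch type (e.g. cpu).
# Assumption: any GPU arch type other than "cpu" counts as a GPU build.
conda_packages() {
  if [ "$1" = "linux-aarch64" ] && [ "$2" != "cpu" ]; then
    # linux-aarch64 GPU builds: skip conda's numpy.
    echo "ffmpeg"
  else
    # All other builds keep testing conda's numpy.
    echo "numpy ffmpeg"
  fi
}

# Hypothetical usage in the validation script (variable names are assumed):
#   conda create -y -n "${ENV_NAME}" python="${MATRIX_PYTHON_VERSION}" \
#     $(conda_packages "${TARGET_OS}" "${MATRIX_GPU_ARCH_TYPE}")
conda_packages linux-aarch64 cuda
conda_packages linux cuda
```

Keeping the decision in one small function makes the condition easy to check in isolation before wiring it into the conda create call.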

Collaborator Author


Pushed a fix. Fingers crossed.

Contributor

@atalman atalman left a comment


Lgtm! Thank you very much!

@atalman atalman merged commit 8748ef7 into pytorch:main Sep 11, 2024
28 of 51 checks passed
Collaborator Author

nWEIdia commented Sep 11, 2024

The fix is not effective, mainly because:

https://github.com/pytorch/builder/actions/runs/10820014857/job/30020147808#step:1:71

CUDA jobs are incorrectly labeled with a gpu_arch_type of cpu.

Or perhaps the value is cuda-aarch64, which is not == 'cuda'.
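The second possibility can be illustrated with a small comparison. This is a sketch under the assumption that the arch type for these jobs is the string cuda-aarch64; the helper names is_cuda_exact and is_cuda_family are made up for the example and are not part of the builder scripts.

```shell
#!/bin/sh
# Sketch: exact string equality versus a prefix match on the GPU arch type.

# True only for the literal value "cuda".
is_cuda_exact() {
  [ "$1" = "cuda" ]
}

# True for "cuda" and for any value starting with "cuda", such as "cuda-aarch64".
is_cuda_family() {
  case "$1" in
    cuda*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print how each check treats a few candidate values.
for value in cpu cuda cuda-aarch64; do
  if is_cuda_exact "$value"; then exact=yes; else exact=no; fi
  if is_cuda_family "$value"; then family=yes; else family=no; fi
  echo "$value exact=$exact family=$family"
done
```

If the job reports cpu, as the linked log line suggests, neither check helps and the labeling itself would need to be fixed.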

Collaborator Author

nWEIdia commented Sep 11, 2024

By default, the aarch64 CUDA CI tests torch 2.4.1, so it did not catch the issue, which only shows up when testing against the main branch.
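A guard along the following lines could make this kind of mismatch visible in a job log. It is only a sketch: the helper name version_in_series is hypothetical, and the commented usage assumes the installed version is read from torch.__version__.

```shell
#!/bin/sh
# Sketch: check that a version string belongs to an expected release series.
# version_in_series VERSION SERIES
#   succeeds for VERSION equal to SERIES or starting with "SERIES.",
#   e.g. "2.5.0+cu124" is in series "2.5", while "2.4.1" is not.
version_in_series() {
  case "$1" in
    "$2"|"$2".*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical usage in a validation job (requires torch to be installed):
#   installed="$(python3 -c 'import torch; print(torch.__version__)')"
#   version_in_series "$installed" "2.5" || {
#     echo "expected torch 2.5.x but found $installed" >&2
#     exit 1
#   }
```

Failing early when the installed wheel is not from the series under test avoids reading a green run as evidence about a build that was never exercised.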

3 participants