
implement accelerate for osx-arm64 #88

Closed
ngam wants to merge 4 commits

Conversation

@ngam
Contributor

ngam commented Feb 4, 2022

This PR only changes the netlib implementation for osx-arm64. Everything else remains the same. Skipping all builds but osx for confirmation.

Fixes #82

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@conda-forge-linter

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@ngam
Contributor Author

ngam commented Feb 4, 2022

@hmaarrfk please review whether the changes are acceptable.

@isuruf could you please double-check if I did this correctly?

Thank you both!!

@ngam
Contributor Author

ngam commented Feb 4, 2022

(I will test this locally and put the results here in a bit)

Okay, results: essentially, the build seems to hardcode one or the other, so in this case it just goes to Accelerate. However, the good news is that it doesn't bomb like before. Note that I am not really sure whether this is actually hardcoded --- it could just be that the config is printed this way, and I have no idea how to check this further. I can only say that it seems hardcoded because of this: https://github.com/pytorch/pytorch/blob/v1.10.2/cmake/Modules/FindBLAS.cmake
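
For the record, a quick way to surface those fields is to grep the build-settings string torch already exposes (a minimal sketch; it only shows what the build reports, not whether the choice was hardcoded):

python -c "import torch; print(torch.__config__.show())" | grep -oE "(BLAS|LAPACK)_INFO=[a-z]+"
# prints BLAS_INFO=accelerate / LAPACK_INFO=accelerate for the builds below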

Anyway, imo, this could/should be merged.


Details with Accelerate:
(ptps_ac) ~$ mamba list
# packages in environment at /Users/ngam/.Mambaforge-MacOSX-arm64/envs/ptps_ac:
#
# Name                    Version                   Build  Channel
bzip2                     1.0.8                h3422bc3_4    conda-forge
ca-certificates           2021.10.8            h4653dfc_0    conda-forge
cffi                      1.15.0           py39h52b1de0_0    conda-forge
future                    0.18.2           py39h2804cbe_4    conda-forge
libblas                   3.9.0           13_osxarm64_accelerate    conda-forge
libcblas                  3.9.0           13_osxarm64_accelerate    conda-forge
libcxx                    12.0.1               h168391b_1    conda-forge
libffi                    3.4.2                h3422bc3_5    conda-forge
libgfortran               5.0.0.dev0      11_0_1_hf114ba7_23    conda-forge
libgfortran5              11.0.1.dev0         hf114ba7_23    conda-forge
liblapack                 3.9.0           13_osxarm64_accelerate    conda-forge
liblapacke                3.9.0           13_osxarm64_accelerate    conda-forge
libprotobuf               3.19.4               hccf11d3_0    conda-forge
libzlib                   1.2.11            hee7b306_1013    conda-forge
llvm-openmp               12.0.1               hf3c4609_1    conda-forge
ncurses                   6.3                  hc470f4d_0    conda-forge
ninja                     1.10.2               hc021e02_1    conda-forge
numpy                     1.22.2           py39h61a45d2_0    conda-forge
openssl                   3.0.0                h3422bc3_2    conda-forge
pip                       22.0.3             pyhd8ed1ab_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
python                    3.9.10          h38ef502_2_cpython    conda-forge
python_abi                3.9                      2_cp39    conda-forge
pytorch                   1.10.2          cpu_py39h0d1fb64_0    ngam
readline                  8.1                  hedafd6a_0    conda-forge
setuptools                60.7.1           py39h2804cbe_0    conda-forge
sleef                     3.5.1                h156473d_2    conda-forge
sqlite                    3.37.0               h72a2b83_0    conda-forge
tk                        8.6.11               he1e0b03_1    conda-forge
typing_extensions         4.0.1              pyha770c72_0    conda-forge
tzdata                    2021e                he74cb21_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h642e427_1    conda-forge
zlib                      1.2.11            hee7b306_1013    conda-forge
(ptps_ac) ~$ python
Python 3.9.10 | packaged by conda-forge | (main, Feb  1 2022, 21:25:34) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__config__.show()
'PyTorch built with:\n  - GCC 4.2\n  - C++ Version: 201402\n  - clang 11.1.0\n  - OpenMP 201811\n  - LAPACK is enabled (usually provided by MKL)\n  - NNPACK is enabled\n  - CPU capability usage: NO AVX\n  - Build settings: BLAS_INFO=accelerate, BUILD_TYPE=Release, CXX_COMPILER=/Users/ngam/Repos/pytorch-cpu-feedstock/miniforge3/conda-bld/pytorch-recipe_1644000430681/_build_env/bin/arm64-apple-darwin20.0.0-clang++, CXX_FLAGS=-ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++  -std=c++14 -fmessage-length=0 -isystem /Users/ngam/Repos/pytorch-cpu-feedstock/miniforge3/conda-bld/pytorch-recipe_1644000430681/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol/include -fdebug-prefix-map=/Users/ngam/Repos/pytorch-cpu-feedstock/miniforge3/conda-bld/pytorch-recipe_1644000430681/work=/usr/local/src/conda/pytorch-1.10.2 -fdebug-prefix-map=/Users/ngam/Repos/pytorch-cpu-feedstock/miniforge3/conda-bld/pytorch-recipe_1644000430681/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol=/usr/local/src/conda-prefix -Wno-deprecated-declarations -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp=libomp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-typedef-redefinition -Wno-unknown-warning-option -Wno-unused-private-field -Wno-inconsistent-missing-override -Wno-aligned-allocation-unavailable -Wno-c++14-extensions -Wno-constexpr-not-const -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-unused-private-field -Wno-missing-braces -Wno-c++14-extensions -Wno-constexpr-not-const, LAPACK_INFO=accelerate, TORCH_VERSION=1.10.2, USE_CUDA=OFF, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, \n'
>>> 


Details with OpenBLAS:
(ptps_op) ~$ mamba list
# packages in environment at /Users/ngam/.Mambaforge-MacOSX-arm64/envs/ptps_op:
#
# Name                    Version                   Build  Channel
bzip2                     1.0.8                h3422bc3_4    conda-forge
ca-certificates           2021.10.8            h4653dfc_0    conda-forge
cffi                      1.15.0           py39h52b1de0_0    conda-forge
future                    0.18.2           py39h2804cbe_4    conda-forge
libblas                   3.9.0           13_osxarm64_openblas    conda-forge
libcblas                  3.9.0           13_osxarm64_openblas    conda-forge
libcxx                    12.0.1               h168391b_1    conda-forge
libffi                    3.4.2                h3422bc3_5    conda-forge
libgfortran               5.0.0.dev0      11_0_1_hf114ba7_23    conda-forge
libgfortran5              11.0.1.dev0         hf114ba7_23    conda-forge
liblapack                 3.9.0           13_osxarm64_openblas    conda-forge
liblapacke                3.9.0           13_osxarm64_openblas    conda-forge
libopenblas               0.3.18          openmp_h5dd58f0_0    conda-forge
libprotobuf               3.19.4               hccf11d3_0    conda-forge
libzlib                   1.2.11            hee7b306_1013    conda-forge
llvm-openmp               12.0.1               hf3c4609_1    conda-forge
ncurses                   6.3                  hc470f4d_0    conda-forge
ninja                     1.10.2               hc021e02_1    conda-forge
numpy                     1.22.2           py39h61a45d2_0    conda-forge
openssl                   3.0.0                h3422bc3_2    conda-forge
pip                       22.0.3             pyhd8ed1ab_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
python                    3.9.10          h38ef502_2_cpython    conda-forge
python_abi                3.9                      2_cp39    conda-forge
pytorch                   1.10.2          cpu_py39h0d1fb64_0    ngam
readline                  8.1                  hedafd6a_0    conda-forge
setuptools                60.7.1           py39h2804cbe_0    conda-forge
sleef                     3.5.1                h156473d_2    conda-forge
sqlite                    3.37.0               h72a2b83_0    conda-forge
tk                        8.6.11               he1e0b03_1    conda-forge
typing_extensions         4.0.1              pyha770c72_0    conda-forge
tzdata                    2021e                he74cb21_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h642e427_1    conda-forge
zlib                      1.2.11            hee7b306_1013    conda-forge
(ptps_op) ~$ python
Python 3.9.10 | packaged by conda-forge | (main, Feb  1 2022, 21:25:34) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__config__.show()
'PyTorch built with:\n  - GCC 4.2\n  - C++ Version: 201402\n  - clang 11.1.0\n  - OpenMP 201811\n  - LAPACK is enabled (usually provided by MKL)\n  - NNPACK is enabled\n  - CPU capability usage: NO AVX\n  - Build settings: BLAS_INFO=accelerate, BUILD_TYPE=Release, CXX_COMPILER=/Users/ngam/Repos/pytorch-cpu-feedstock/miniforge3/conda-bld/pytorch-recipe_1644000430681/_build_env/bin/arm64-apple-darwin20.0.0-clang++, CXX_FLAGS=-ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++  -std=c++14 -fmessage-length=0 -isystem /Users/ngam/Repos/pytorch-cpu-feedstock/miniforge3/conda-bld/pytorch-recipe_1644000430681/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol/include -fdebug-prefix-map=/Users/ngam/Repos/pytorch-cpu-feedstock/miniforge3/conda-bld/pytorch-recipe_1644000430681/work=/usr/local/src/conda/pytorch-1.10.2 -fdebug-prefix-map=/Users/ngam/Repos/pytorch-cpu-feedstock/miniforge3/conda-bld/pytorch-recipe_1644000430681/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol=/usr/local/src/conda-prefix -Wno-deprecated-declarations -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp=libomp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-typedef-redefinition -Wno-unknown-warning-option -Wno-unused-private-field -Wno-inconsistent-missing-override -Wno-aligned-allocation-unavailable -Wno-c++14-extensions -Wno-constexpr-not-const -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-unused-private-field -Wno-missing-braces -Wno-c++14-extensions -Wno-constexpr-not-const, LAPACK_INFO=accelerate, TORCH_VERSION=1.10.2, USE_CUDA=OFF, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, \n'
>>> 

@hmaarrfk
Contributor

hmaarrfk commented Feb 5, 2022

Could you add a constraint like:

- blas *=accelerate   # [osx and arm64 and py==38]
- blas *=openblas     # [osx and arm64 and py==39]

And run the tests locally? This would build up a somewhat minimal test matrix.
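
For a local run, building a single variant against one of the CI configs is one way to do it (a sketch; the config file name is a placeholder for an actual file under .ci_support/):

# build one variant locally; substitute a real file from .ci_support/
conda build recipe/ -m .ci_support/<osx_arm64_variant>.yaml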

@ngam
Contributor Author

ngam commented Feb 5, 2022

Could you add a constraint like:

- blas *=accelerate   # [osx and arm64 and py==38]
- blas *=openblas     # [osx and arm64 and py==39]

And run the tests locally? This would build up a somewhat minimal test matrix.

Yes, but what's the point? We already know what this will lead to: the same exact outcome. The reason: 568f298.

If we want to build a matrix, then we need to base it on the CMake flag. Happy to go that route, but I believe it's just too much work for nothing -- this is already a challenging build. I think it is better to just default osx-arm64 to Accelerate for now. But if you want me to go ahead and build a matrix for arm64, I am happy to do so --- again, I believe it would need to be done with a control flow on the CMake flag.
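
Concretely, that control flow would look something like this in build.sh (a sketch: blas_impl and target_platform follow conda-forge conventions, WITH_BLAS is the cache variable pytorch's FindBLAS.cmake checks, and the exact wiring is assumed):

# branch the configure step on the requested BLAS variant (hypothetical)
if [[ "${target_platform}" == "osx-arm64" && "${blas_impl}" == "accelerate" ]]; then
    CMAKE_ARGS="${CMAKE_ARGS} -DWITH_BLAS=accelerate"
else
    CMAKE_ARGS="${CMAKE_ARGS} -DWITH_BLAS=open"   # FindBLAS.cmake's name for OpenBLAS
fi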

@ngam
Contributor Author

ngam commented Feb 5, 2022

To clarify: 568f298 removes the instruction to find a specific BLAS. So it goes through its list: MKL, BLIS, Accelerate, and then OpenBLAS (and maybe some others, see https://github.com/pytorch/pytorch/blob/v1.10.2/cmake/Modules/FindBLAS.cmake). So unless we specify that CMake flag, it will just repeat the same process again: MKL, BLIS, Accelerate, OpenBLAS, etc. --- note: Accelerate is always part of the macOS SDK, so it will always be discovered before OpenBLAS unless instructed otherwise.

@hmaarrfk
Contributor

hmaarrfk commented Feb 5, 2022

Don't we want it to therefore find netlib?

@ngam
Contributor Author

ngam commented Feb 5, 2022

Don't we want it to therefore find netlib?

Maybe we can force it to find 'generic'?

# Generic BLAS library?
if((NOT BLAS_LIBRARIES)
    AND ((NOT WITH_BLAS) OR (WITH_BLAS STREQUAL "generic")))
  check_fortran_libraries(
  BLAS_LIBRARIES
  BLAS
  sgemm
  ""
  "blas")
  if (BLAS_LIBRARIES)
    set(BLAS_INFO "generic")
  endif(BLAS_LIBRARIES)
endif()

https://github.com/pytorch/pytorch/blob/71f889c7d265b9636b93ede9d651c0a9c4bee191/cmake/Modules/FindBLAS.cmake#L279-L291
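
If that works, pinning it at configure time should be as simple as (a sketch; "generic" is exactly the value the snippet above tests WITH_BLAS against):

# force the generic branch of FindBLAS.cmake instead of vendor probing
cmake -DWITH_BLAS=generic ...   # plus the usual configure arguments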

@ngam
Contributor Author

ngam commented Feb 5, 2022

If this 'generic' trick works, we can implement it for all builds...

@ngam
Contributor Author

ngam commented Feb 5, 2022

@isuruf does this count as netlib?

-- Using BLAS: /Applications/Xcode_12.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk/usr/lib/libblas.tbd

Compare this to:

  --   Library Accelerate: /Applications/Xcode_12.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk/System/Library/Frameworks/Accelerate.framework

@ngam
Contributor Author

ngam commented Feb 5, 2022

Trying to find the way isuruf implemented this, I remember seeing .tbd in his hacks :)

@ngam ngam changed the title from "implement netlib for osx-arm64" to "implement accelerate for osx-arm64" Feb 5, 2022
@ngam
Contributor Author

ngam commented Feb 5, 2022

lrwxr-xr-x    1 root  wheel     114 Dec 14 19:47 libblas.tbd -> ../../System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.tbd
lrwxr-xr-x    1 root  wheel     114 Dec 14 19:47 libcblas.tbd -> ../../System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.tbd

I am going to revert back to the default since I think this won't work. I will give it some time for people to weigh in.

@ngam
Contributor Author

ngam commented Feb 5, 2022

Or I can build a matrix... will revisit later

@conda-forge-linter

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe) and found some lint.

Here's what I've got...

For recipe:

  • Failed to even lint the recipe, probably because of a conda-smithy bug 😢. This likely indicates a problem in your meta.yaml, though. To get a traceback to help figure out what's going on, install conda-smithy and run conda smithy recipe-lint . from the recipe directory.

@ngam
Contributor Author

ngam commented Feb 8, 2022

  • conda smithy recipe-lint .
pytorch-cpu-feedstock/recipe$ conda smithy recipe-lint .
. is in fine form

🤷

@ngam
Contributor Author

ngam commented Feb 8, 2022

@hmaarrfk, I thought about this and did a little more looking around. I now believe this is likely not worth it, so I am going to abandon it for now. The reason is that the benefits of defaulting to Accelerate don't really materialize for PyTorch (at least not yet). They're supposedly working on supporting the new Apple GPUs (Metal), and we can try again then. For now, I just don't think this should be a priority.

Closing. Happy to revisit again if other people really want this...

@ngam ngam closed this Feb 8, 2022
@hmaarrfk
Contributor

hmaarrfk commented Feb 8, 2022

Interesting findings!

@ngam
Contributor Author

ngam commented Feb 8, 2022

I think as part of their Metal effort they might reorganize things, so we are just better off waiting until then, especially since this is an osx-arm64-only issue/improvement. They have a big central issue upstream with a lot of annoying Apple fans asking for updates incessantly 😆

@jkleckner

At the risk of necroposting, this approach in numpy with wrappers is interesting [1].
Some of the benchmarks show huge improvements.

Note that macOS 13.3 improved netlib compatibility [2]:

The BLAS and LAPACK libraries under the Accelerate framework are now inline with reference version 3.9.1. These new interfaces provide additional functionality and a new ILP64 interface. To use the new interfaces, define ACCELERATE_NEW_LAPACK before including the Accelerate or vecLib headers. For ILP64 interfaces, also define ACCELERATE_LAPACK_ILP64.
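
In practice that means defining the macros before the Accelerate headers are included, e.g. via compiler flags (an illustrative compile line; the macro names come from the release note above, and my_prog.c is a placeholder):

# opt in to the new LAPACK 3.9.1 interfaces when building against Accelerate
clang -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -framework Accelerate my_prog.c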

[1] numpy/numpy#24053
[2] https://developer.apple.com/documentation/macos-release-notes/macos-13_3-release-notes#New-Features

Cross links to:

[3] #85
[4] #82

@jkleckner

Closing. Happy to revisit again if other people really want this...

A conda environment that includes pytorch forces the use of openblas rather than the up-to-10x-faster macOS 13.3 Accelerate BLAS [1]. If you don't use pytorch in your conda environment, numpy works great with the Accelerate BLAS, but you end up having two environments: one for pytorch with the GPU, and one for numpy without it.

How hard/complex would it be to add pytorch builds specific to macOS 13.3+ that enable the flags to use the Accelerate BLAS? It is a bit unfortunate that Apple ties this important performance improvement to the OS version...

[1] numpy/numpy#24053

@isuruf
Member

isuruf commented Jul 30, 2023

You can already use conda install pytorch blas=*=*accelerate. It gives you only the BLAS functions from Accelerate; LAPACK is still provided by netlib.
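
One way to sanity-check where the BLAS symbols resolve after that install (a sketch; otool ships with the Xcode command-line tools, and the dylib name matches the conda-forge layout shown further down):

# show what the conda-forge blas shim actually links against
otool -L "${CONDA_PREFIX}/lib/libblas.3.dylib"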

@jkleckner

You can already use conda install pytorch blas=*=*accelerate. It gives you only the BLAS functions from Accelerate; LAPACK is still provided by netlib.

@isuruf I thought I had tried that. I just re-ran this script and got the same output as before my post, with the error below.
Attaching the script --- perhaps you can point out my error?

Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/__init__.py", line 229, in <module>
    from torch._C import *  # noqa: F403
ImportError: dlopen(/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/_C.cpython-310-darwin.so, 0x0002): Library not loaded: @rpath/libopenblas.0.dylib
  Referenced from: <51247470-C91F-3562-843D-7DDA002EA002> /opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.dylib
  Reason: tried: '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/../../../../libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/../../../../libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/../../../../libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/../../../../libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/../../../libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/lib/libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/python3.10/site-packages/torch/../../../libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/bin/../lib/libopenblas.0.dylib' (no such file), '/opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/bin/../lib/libopenblas.0.dylib' (no such file), '/usr/local/lib/libopenblas.0.dylib' (no such file), '/usr/lib/libopenblas.0.dylib' (no such file, not in dyld cache)

try_pytorch_alone.txt

@jkleckner

Side note: this is a great read for context on these libraries: https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp/

@ngam
Contributor Author

ngam commented Jul 31, 2023

I didn’t review the above carefully… but my recollection: we tried this before and things didn’t turn out super well. I’d say we need to rebuild more carefully… and we likely need to take care of deps like numpy and scipy as well… @jkleckner, if you’re interested in helping, please try to follow the logic here and elsewhere, and we/I can try to help.

@isuruf
Member

isuruf commented Jul 31, 2023

@jkleckner, ah, I thought we had #175 merged. Until that PR is merged, you can do ln -sf $CONDA_PREFIX/lib/libcblas.3.dylib $CONDA_PREFIX/lib/libopenblas.0.dylib
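
With the symlink in place, the failing import from the traceback above should go through (a quick smoke test):

python -c "import torch; print(torch.__version__)"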

@jkleckner

@isuruf Wow, thanks! That got it going. Now numpy runs fast and pytorch still uses the GPU. I used one of these benchmarks [1] to try it out. Hopefully pytorch's fallback from GPU to CPU will reap the speed benefits. You mention that it is still LAPACK via netlib rather than Accelerate, true? This [2] suggests a 4x difference in speed when netlib is the BLAS, but if netlib LAPACK is using the underlying BLAS, the difference might not be so big. Those benchmarks [2] don't really exercise the LAPACK APIs directly.

[1] https://github.com/lucadiliello/pytorch-apple-silicon-benchmarks

python tests/transformers_sequence_classification.py --device mps --pre_trained_name bert-base-cased --batch_size 64 --mode training --steps 100 --sequence_length 128
Took 86s on M1 Max and saturated the GPUs.

[2] https://stackoverflow.com/a/70255105

@jkleckner

After your symlink, these are the dylibs. (The symlink could be local).

$ ll  *vec*.dylib *blas*.dylib  *lapa*.dylib
lrwxr-xr-x  1 jim  admin    22B Jul 30 19:29 libblas.3.dylib -> libvecLibFort-ng.dylib
lrwxr-xr-x  1 jim  admin    22B Jul 30 19:29 libblas.dylib -> libvecLibFort-ng.dylib
lrwxr-xr-x  1 jim  admin    22B Jul 30 19:29 libcblas.3.dylib -> libvecLibFort-ng.dylib
lrwxr-xr-x  1 jim  admin    22B Jul 30 19:29 libcblas.dylib -> libvecLibFort-ng.dylib
-rwxrwxr-x  5 jim  admin   5.8M Jun  4 19:10 liblapack-netlib.3.9.0.dylib
lrwxr-xr-x  1 jim  admin    22B Jul 30 19:29 liblapack.3.dylib -> libvecLibFort-ng.dylib
lrwxr-xr-x  1 jim  admin    22B Jul 30 19:29 liblapack.dylib -> libvecLibFort-ng.dylib
-rwxrwxr-x  5 jim  admin   1.6M Jun  4 19:10 liblapacke-netlib.3.9.0.dylib
lrwxr-xr-x  1 jim  admin    22B Jul 30 19:29 liblapacke.3.dylib -> libvecLibFort-ng.dylib
lrwxr-xr-x  1 jim  admin    22B Jul 30 19:29 liblapacke.dylib -> libvecLibFort-ng.dylib
lrwxr-xr-x  1 jim  admin    80B Jul 30 21:00 libopenblas.0.dylib -> /opt/homebrew/Caskroom/miniforge/base/envs/try_pytorch_3_10/lib/libcblas.3.dylib
-rwxrwxr-x  5 jim  admin    72K Jun  4 19:10 libvecLibFort-ng.dylib

@jkleckner

I can confirm that the script runs correctly after the merge of #175, with numpy executing fast and the GPU working on arm64.
Thank you.
