implement accelerate for osx-arm64 #88
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR (…).
(I will test this locally and put the results here in a bit.)

Okay, results: essentially, the build seems to hardcode to one or the other, so in this case it just goes to Accelerate. However, the good news is that it doesn't bomb like before. Note that I am not really sure whether this is actually hardcoding --- it could just be that the … Anyway, IMO this could/should be merged.

Details with Accelerate: …
Details with OpenBLAS: …
Could you add a constraint like: … and run the tests locally? This would somewhat build up a minimal test matrix.
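The constraint snippet itself didn't survive the page scrape; as an illustration only (an assumption, not the snippet that was actually proposed), pinning conda-forge's libblas metapackage build string is the usual way to select a BLAS flavor per environment:

```bash
# Hypothetical sketch: pin the BLAS flavor via the libblas build string,
# then run the test suite in each environment to build a minimal matrix.
conda create -n test-accelerate "libblas=*=*accelerate" pytorch
conda create -n test-openblas "libblas=*=*openblas" pytorch
```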
Yes, but what's the point? We already know what this will lead to: the same exact outcome. The reason: 568f298. If we want to build a matrix, then we need to base it on the CMake flag. Happy to go that route, but I believe it's just too much work for nothing -- this is already a challenging build. I think it is better to just default osx-arm64 to Accelerate for now. But if you want me to go ahead and build a matrix for arm64, I am happy to do so --- again, it would need to be done with a control flow on the CMake flag, I believe.
To clarify: 568f298 removes instructing it to find a specific BLAS. So it goes through its list: MKL, BLIS, Accelerate, and then OpenBLAS (and maybe some others; see https://github.com/pytorch/pytorch/blob/v1.10.2/cmake/Modules/FindBLAS.cmake). So unless we specify that CMake flag, it will just repeat the same process again: MKL, BLIS, Accelerate, OpenBLAS, etc. Note: Accelerate is always part of the macOS SDK, so it will always be discovered before OpenBLAS unless instructed otherwise.
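For readers following along, a hedged sketch of what "specify that CMake flag" could look like, assuming PyTorch's BLAS selection variable in cmake/Dependencies.cmake (values such as MKL, OpenBLAS, vecLib, Generic):

```bash
# Force a specific BLAS instead of letting FindBLAS.cmake walk its
# vendor list (MKL, BLIS, Accelerate, OpenBLAS, ...). PyTorch's setup.py
# forwards the BLAS environment variable to its CMake configure step.
export BLAS=OpenBLAS   # or: vecLib (Accelerate), MKL, Generic, ...
python setup.py install
```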
Don't we want it to therefore find netlib?
Maybe we can force it to find 'generic'?
If this 'generic' trick works, we can implement it for all builds...
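A sketch of the 'generic' trick being discussed (assumptions: PyTorch's Generic BLAS option and conda-forge's netlib-compatible libblas shim; not necessarily the exact flags used in this PR):

```bash
# Bypass vendor detection entirely: tell PyTorch to link a "generic"
# BLAS and point it at conda-forge's libblas shim, whose actual backend
# (netlib, OpenBLAS, Accelerate, ...) can then be swapped at install time.
cmake -DBLAS=Generic \
      -DBLAS_LIBRARIES="${PREFIX}/lib/libblas.dylib" \
      ..
```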
@isuruf does this count as netlib?
Compare this to: …
Trying to find the way isuruf implemented this, I remember seeing …
I am going to revert back to the default since this won't work, I think. I will give it some time for people to weigh in.
Or I can build a matrix... will revisit later.
Hi! This is the friendly automated conda-forge-linting service. I wanted to let you know that I linted all conda-recipes in your PR (…). Here's what I've got... For recipe: …
…conda-forge-pinning 2022.02.07.06.12.16
🤷
@hmaarrfk, I thought about this and did a little more looking around. Now I believe this is likely not worth it, so I am going to abandon it for now. The reason is that the benefits of defaulting to Accelerate don't really materialize for PyTorch (at least not yet). They're supposedly working on supporting the new Apple GPUs (Metal), and we can try again then. For now, I just think this shouldn't be a priority. Closing. Happy to revisit if other people really want this...
Interesting findings!
I think as part of their Metal effort they might reorganize things, so we are just better off waiting until then, especially since this is an osx-arm64-only issue/improvement. They have a big central issue upstream with a lot of annoying Apple fans asking for updates incessantly 😆
At the risk of necroposting, this approach in numpy with wrappers is interesting [1]. Note that macOS 13.3 improved netlib compatibility [2]: …
[1] numpy/numpy#24053 (cross links to: …)
A conda environment that includes pytorch forces the use of openblas rather than the up-to-10x-faster macOS 13.3 Accelerate BLAS [1]. If you don't use pytorch in your conda environment, numpy works great with the Accelerate BLAS, but then you end up having two environments: one for pytorch and the GPU, and one for numpy and no GPU. How hard/complex would it be to add pytorch builds specific to macOS 13.3+ that enable the flags to use the Accelerate BLAS? It is a bit unfortunate that Apple ties this important performance improvement to the OS version...
You can already use …
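The inline snippet was lost in scraping; presumably (an assumption based on conda-forge's standard BLAS-switching mechanism) the suggestion was along these lines:

```bash
# Switch the BLAS backing an existing environment to Accelerate by
# pinning the libblas metapackage build string.
conda install "libblas=*=*accelerate"
```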
@isuruf I thought I tried that. I just re-ran this script and got the same output as before my post, with the error below: …
Side note, this is a great read for context on these libraries: https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp/
I didn't review the above carefully… but my recollection: we tried this before and things didn't turn out super well. I'd say we need to rebuild more carefully… and we likely need to take care of deps like numpy and scipy as well… @jkleckner if you're interested in helping, please try to follow the logic here and elsewhere and we/I can try to help.
@jkleckner, ah I thought we had #175 merged. Until that PR is merged you can do …
@isuruf Wow, thanks! That got it going. Now numpy runs fast and pytorch still uses the GPU. I used one of these benchmarks [1] to try it out. Hopefully, pytorch's fallback from GPU to CPU will reap the speed benefits. You mention that it is still LAPACK via netlib rather than Accelerate, true? This [2] suggests a 4x difference in speed when netlib is the BLAS, but if netlib LAPACK is using the underlying BLAS then it might not be so big a difference. Those benchmarks [2] don't really exercise LAPACK APIs directly.

[1] https://github.com/lucadiliello/pytorch-apple-silicon-benchmarks
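A generic way to verify which BLAS an environment actually resolved to (standard numpy tooling, not a command from the thread):

```bash
# Print numpy's build/runtime configuration; the output names the BLAS
# implementation (accelerate, openblas, netlib, ...) that numpy is using.
python -c "import numpy; numpy.show_config()"
```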
After your symlink, these are the dylibs (the symlink could be local): …
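The listing itself was lost; commands like these reproduce that kind of inspection (generic macOS/conda tooling, with library filenames that are an assumption about the env layout):

```bash
# List the BLAS/LAPACK shims in the active conda environment ...
ls -l "$CONDA_PREFIX"/lib/libblas*.dylib "$CONDA_PREFIX"/lib/liblapack*.dylib
# ... and show what a shim actually links against (Accelerate vs. openblas):
otool -L "$CONDA_PREFIX"/lib/libblas.3.dylib
```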
I can confirm that the script runs correctly after the merge of #175, with numpy executing fast and the GPU on arm64 working.
This PR only changes the netlib implementation for osx-arm64. Everything else remains the same. Skipping all builds but osx for confirmation.
Fixes #82
Checklist
- Used a personal fork of the feedstock to propose changes
- Bumped the build number (if the version is unchanged)
- Reset the build number to 0 (if the version changed)
- Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)