
[PyTorch] Update to 2.1 #1426

Merged · 26 commits · Nov 10, 2023

Conversation

@HGuillemet (Collaborator) commented Oct 16, 2023

Included in this PR:

  • Update to PyTorch 2.1
  • Rationalization of function pointer classes: group all of them in the functions package
  • Add Tensor.item_bool and Tensor.item_byte (see the sketch after this list)
  • Add Tensor.data_ptr_byte and Tensor.data_ptr_bool
  • Remove unneeded classes that are not part of the libtorch API
  • Add CUDACachingAllocator
  • Map most of the missing data loader API, for Example<Tensor, Tensor> and Example<Tensor, NoTarget>
  • Make all methods taking ArrayRef arguments of primitive types accept a primitive Java array or variadic arguments
  • Make register_module generic, with its return type the class of the registered module
  • Other minor improvements
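
A minimal Java sketch of how these additions might look in user code (names follow the bullets above, but exact signatures in the merged presets may differ, so treat this as an illustration rather than the final API):

```java
import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.pytorch.LinearImpl;
import org.bytedeco.pytorch.Module;
import org.bytedeco.pytorch.Tensor;
import org.bytedeco.pytorch.global.torch;

public class Pytorch21Sketch {
    public static void main(String[] args) {
        // ArrayRef parameters accept primitive Java arrays or varargs,
        // so no explicit LongArrayRef wrapping is needed anymore:
        Tensor t = torch.zeros(2, 3);

        // Primitive item accessors:
        boolean allZero = t.eq(torch.zeros(2, 3)).all().item_bool();

        // Raw data pointer accessors (for a tensor of matching dtype):
        BytePointer raw = t.to(torch.ScalarType.Byte).data_ptr_byte();

        // register_module is generic: its return type is the class of the
        // registered module, so no cast is needed:
        LinearImpl fc = new Module().register_module("fc", new LinearImpl(4, 2));
    }
}
```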

@saudet (Member) commented Oct 20, 2023

@sbrunk Could you review this pull request?

@HGuillemet (Collaborator, Author)

I still plan to add another improvement: support for stateless datasets, dataloaders, etc...
Hopefully coming later today or tomorrow.
Please hold on before starting a review.

saudet requested a review from sbrunk, October 20, 2023 22:44
@sbrunk (Contributor) commented Oct 22, 2023

> @saudet: shouldn't we embed libtorch binaries now that it links with CUDA 12, or do we continue building our version with limited compute capabilities?

If the upstream libtorch binaries don't cause any issues, I'm in favor of switching. I think it has a number of advantages:

  1. Less native building -> less strain on limited CI resources
  2. It can reduce friction regarding PTX JIT compilation (see [Pytorch] New version of the presets #1360)
  3. Somewhat relaxed driver version constraints due to (currently) targeting CUDA 12.1 instead of always the latest CUDA version.

I will do some tests like running the test suite of Storch using this branch and the upstream binaries to see if I run into any issues.

One thing that might be interesting to look into is MPS acceleration on macOS, as soon most Macs will run on ARM with GPU support.
I'm not sure if the upstream binaries have MPS acceleration enabled; I know the Python wheels do. I'm not even sure they provide ARM builds at all for pure libtorch (and I don't have an M1/M2 machine to test). I think we don't have support for ARM builds right now either, right?

@HGuillemet (Collaborator, Author)

The libtorch binary for mac that can be downloaded from PyTorch main page is still x86_64 only.
I think we could build from source for Mac and use libtorch binaries for CUDA (and ROCm).

@saudet (Member) commented Oct 22, 2023

There's no need to switch to anything, the presets already support LibTorch:
https://github.com/bytedeco/javacpp-presets/tree/master/pytorch#documentation

@HGuillemet (Collaborator, Author)

Right, but what exactly does the user gain from using our binary compared to libtorch, now that libtorch links with CUDA 12? If nothing significant, it seems logical to embed libtorch, which supports more hardware, for the reasons listed by @sbrunk.

One problem I can think of: the CUDA presets could be updated to a version for which no libtorch build is available, or the other way around.
Anything else?

@saudet (Member) commented Oct 22, 2023

As I've mentioned many times before, LibTorch links statically with MKL, so it can't be used together with other libraries that use BLAS and LAPACK, like GSL, OpenCV, NumPy, SciPy, Smile, etc.

@HGuillemet (Collaborator, Author)

OK. Maybe we could use only libtorch_cuda from the provided binaries and compile libtorch_cpu ourselves?

I'm done with what I planned to add to this PR.
You can start reviewing. Comments or RFEs welcome.

@HGuillemet (Collaborator, Author)

The remaining error during build on Windows is:

D:\a\javacpp-presets\javacpp-presets\pytorch\target\native\org\bytedeco\pytorch\windows-x86_64\jnitorch.cpp(98904): error C2526: 'JavaCPP_org_bytedeco_pytorch_functions_GatheredContextSupplier_allocate_callback': C linkage function cannot return C++ class 'std::shared_ptr<c10::GatheredContext>'

This is related to issue bytedeco/javacpp#720.
@saudet Any idea if this can be fixed? With some @Convention, maybe?

Otherwise we can skip the single function needing this function pointer: CUDAAllocator.recordHistory.

@saudet (Member) commented Oct 24, 2023

> The remaining error during build on Windows is:

> D:\a\javacpp-presets\javacpp-presets\pytorch\target\native\org\bytedeco\pytorch\windows-x86_64\jnitorch.cpp(98904): error C2526: 'JavaCPP_org_bytedeco_pytorch_functions_GatheredContextSupplier_allocate_callback': C linkage function cannot return C++ class 'std::shared_ptr<c10::GatheredContext>'

> This is related to issue bytedeco/javacpp#720.
> @saudet Any idea if this can be fixed? With some @Convention, maybe?

Sounds like a bug in MSVC 2019:
https://stackoverflow.com/questions/57429064/c-linkage-function-cannot-return-c-class-in-visual-studio-2019

> Otherwise we can skip the single function needing this function pointer: CUDAAllocator.recordHistory.

If it doesn't look like an important function, we can do that, sure.

@saudet (Member) commented Oct 25, 2023

> OK. Maybe we could use only libtorch_cuda from the provided binaries and compile libtorch_cpu ourselves?

Where can we get libtorch_cuda for CUDA 12.3?

@HGuillemet (Collaborator, Author)

According to https://developer.nvidia.com/blog/cuda-toolkit-12-0-released-for-general-availability/, we could run a binary compiled for 12.1 with 12.3?
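
(Aside: one way to check which versions are actually in play at runtime is to query them through the JavaCPP cuda presets. A hedged sketch, assuming the org.bytedeco.cuda dependency is on the classpath:)

```java
import org.bytedeco.cuda.global.cudart;

public class CudaVersions {
    public static void main(String[] args) {
        int[] driver = new int[1], runtime = new int[1];
        // Both calls encode versions as 1000*major + 10*minor, e.g. 12030 for 12.3:
        cudart.cudaDriverGetVersion(driver);
        cudart.cudaRuntimeGetVersion(runtime);
        System.out.println("driver supports CUDA " + driver[0]
                + ", bundled runtime is CUDA " + runtime[0]);
    }
}
```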

@saudet (Member) commented Oct 25, 2023 via email

@HGuillemet (Collaborator, Author)

Have you seen the "compatibility" paragraph at the link above? It seems to be new since CUDA 11.

@saudet (Member) commented Oct 25, 2023

Well, try it and tell me if it works! I don't think it does. NVIDIA says many things, but reality is often different.

@sbrunk (Contributor) commented Oct 25, 2023

I'm trying to test this locally, but I'm running into issues.

[ERROR] Failed to execute JavaCPP Builder: Could not parse "c10/cuda/impl/cuda_cmake_macros.h": File does not exist

This is a fresh pytorch clone from the cppbuild. There's a cuda_cmake_macros.h.in. Any idea what I might be missing?

@HGuillemet (Collaborator, Author)

Have you run the cppbuild?

@sbrunk (Contributor) commented Oct 25, 2023

> Have you run the cppbuild?

Yes, sorry, I should have mentioned that. I'm running mvn install --projects pytorch -Djavacpp.platform=linux-x86_64 -Djavacpp.platform.extension=-gpu and the native build is working fine.

Update: I re-checked the native build and I think I did not put CUDA in the right place. Sorry for the noise.

@sbrunk (Contributor) commented Oct 27, 2023

I managed to build it locally now, and the Storch tests are passing (with minimal changes) with 2.1. :)

Testing with CUDA is causing issues, though. My current assumption is that we might need to extend the CUDA arch list to support Ampere GPUs. See sbrunk/storch#62.

@sbrunk (Contributor) commented Oct 28, 2023

I did some more testing. All tests were done on an Ampere GPU (compute capability 8.6) on Ubuntu 22.04, with the latest NVIDIA driver currently available in the official package sources, 535.104.12, which means CUDA driver version 12.2:

  • Snapshot from current master (targeting PyTorch 2.0.1 and CUDA 12.3), with cuda-redist-12.3-8.9 or locally installed CUDA 12.3 -> PTX error
  • Locally built libtorch from this branch. (targeting PyTorch 2.1.0 and CUDA 12.3) -> PTX error
    • Added compute capabilities 8.0 and 8.6 to TORCH_CUDA_ARCH_LIST -> PTX error
  • Switched to CUDA 12.1 for the local build:
    • cuda-redist-12.3-8.9 -> works
    • Locally installed CUDA versions 12.1 and 12.3 -> works
    • Upstream libtorch (libtorch-cxx11-abi-shared-with-deps-2.1.0+cu121) when setting pathsFirst and lib path -> works

To make sure that the correct native libraries are chosen I always checked the symlinks in $HOME/.javacpp/cache and deleted the cache folder between runs.

So it looks like building against 12.1 makes it work again and is compatible with 12.3 at runtime. I still need to verify this for 2.0.1.
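
(Aside: the cache check mentioned above can also be done programmatically with JavaCPP's Loader API. A small sketch, assuming the pytorch presets are on the classpath:)

```java
import java.io.IOException;
import org.bytedeco.javacpp.Loader;
import org.bytedeco.pytorch.global.torch;

public class WhichNativeLibs {
    public static void main(String[] args) throws IOException {
        // Loader.load extracts jnitorch and its dependencies into the cache,
        // loads them, and returns the path of the loaded JNI library:
        System.out.println("loaded: " + Loader.load(torch.class));
        // Root of the cache that the symlinks mentioned above live under:
        System.out.println("cache:  " + Loader.getCacheDir());
    }
}
```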

@HGuillemet (Collaborator, Author) commented Oct 28, 2023

Thanks for reporting all these tests.

> 535.104.12 which means CUDA driver version 12.2

That explains the whole thing.
So they released CUDA 12.3, but the compatible 545 driver is still in beta...

Hopefully the upstream libtorch_cuda and libtorch_cuda_linalg will work with the JavaCPP builds of the other libraries, so we can get rid of the PTX nightmares.

sbrunk mentioned this pull request, Oct 29, 2023
@saudet (Member) commented Oct 30, 2023

It looks like either the GitHub runners or NVCC from CUDA 12.3 is a bit faster. In any case, we have an extra build hour that we could use for the 8.0 arch.

@sbrunk (Contributor) commented Oct 31, 2023

With both 41a97d1 and a3102e5 it seems to be working with my local build (built with locally installed CUDA 12.3 and linked against cuda-redist (also 12.3) at runtime). 🚀

Would be great if someone else could verify this.

@HGuillemet (Collaborator, Author)

I tried to replace libtorch_cuda in the JavaCPP pytorch build with the "official" binary libtorch_cuda, but that binary is linked against a libcudart provided in the archive, and we would need to patch the ELF and DLL files to make it link with the JavaCPP cuda presets instead.
Doable, but a bit more complex.

Maybe there is another option: copying the cubins from one library file to another, but I don't know how to do that.

@saudet You might as well replace +PTX with 9.0, or drop it if that helps, since there is almost no 9.0-architecture hardware running out there for now.

@saudet (Member) commented Nov 1, 2023

@sbrunk Apart from that, anything else that needs to be fixed?

@saudet (Member) commented Nov 3, 2023

> 7,812 additions, 5,691 deletions not shown because the diff is too large. Please use a local Git client to view these changes.

A lot of those "changes" are simply because you're changing the parsing order. If there are no good reasons to change the order, please revert it to how it was before. It makes it hard to see the differences.

@HGuillemet (Collaborator, Author)

torch_include.h is generated by a script from the output of g++ -H. This ensures that we don't miss any new includes and that the includes are parsed in the order g++ processes them. If the order changed since 2.0.1, that is probably due to upstream changes (or maybe I fixed a bug in my script).
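
(For context: g++ -H prints every header it opens, one per line, prefixed with dots indicating include depth. A hypothetical Java sketch of the idea behind such a script, for illustration only; the real script in the repo may work differently:)

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class IncludeOrder {
    // Reads the saved stderr of `g++ -H` and prints the headers as #include
    // directives, deduplicated, in the order g++ first processed them.
    public static void main(String[] args) throws IOException {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            if (!line.startsWith(".")) continue;          // not an include line
            seen.add(line.replaceFirst("^\\.+\\s*", "")); // strip the depth dots
        }
        seen.forEach(h -> System.out.println("#include \"" + h + "\""));
    }
}
```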

@saudet (Member) commented Nov 3, 2023

Where is that script?

@HGuillemet (Collaborator, Author)

Just added it to the repo.

@saudet (Member) commented Nov 3, 2023

OK, so can you please run it against master and make sure nothing changes? If something does change, please fix it so that nothing does.

saudet merged commit 5507552 into bytedeco:master on Nov 10, 2023. 6 checks passed.