torchvision breaks in official pytorch Docker image: RuntimeError: Couldn't load custom C++ ops. #4222

Closed
joek13 opened this issue Jul 29, 2021 · 7 comments

@joek13

joek13 commented Jul 29, 2021

🐛 Bug

I'm using the pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime Docker image and trying to install torchvision on top. The installation proceeds as expected, but if I try to call a function that uses custom C++ ops (such as torchvision.ops.nms), I get the following error message:

RuntimeError: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not please reinstall torchvision so that it matches your PyTorch install.

I can confirm that the installed versions are compatible by bashing into the container and opening a Python prompt:

>>> import torch
>>> torch.__version__
'1.9.0'
>>> import torchvision
>>> torchvision.__version__
'0.10.0'
>>> import torchvision.ops
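
The version check above can also be scripted. This is a minimal sketch, assuming only the two pairings seen in this thread (torch 1.9 with torchvision 0.10, torch 1.10 with torchvision 0.11); `KNOWN_PAIRS`, `major_minor`, and `versions_match` are hypothetical names, and the table is deliberately incomplete. A passing version check does not prove that the C++ ops loaded, which is exactly the failure reported here.

```python
import re

# Excerpt of the torch -> torchvision pairing, compared at major.minor level.
# Assumption: only the two release series mentioned in this thread are listed.
KNOWN_PAIRS = {(1, 9): (0, 10), (1, 10): (0, 11)}


def major_minor(version):
    """Return (major, minor) from strings like '1.9.0' or '0.10.0+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(torch_version, torchvision_version):
    """True if the pair is listed, False if mismatched, None if torch is unlisted."""
    expected = KNOWN_PAIRS.get(major_minor(torch_version))
    if expected is None:
        return None
    return expected == major_minor(torchvision_version)


if __name__ == "__main__":
    try:
        import torch
        import torchvision
    except ImportError:
        print("torch and/or torchvision is not installed")
    else:
        print(versions_match(torch.__version__, torchvision.__version__))
```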

This issue occurs regardless of whether I install torchvision by:

  • Using pip, i.e., RUN pip install torchvision
  • Using conda without a version pin, i.e., RUN conda install -c pytorch torchvision
  • Using conda with a version pin, i.e., RUN conda install -c pytorch torchvision=0.10.0

To Reproduce

Steps to reproduce the behavior:

In a new directory:

  1. Create a minimal Dockerfile with the following content:
FROM pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime

RUN conda install -c pytorch torchvision

COPY ./test.py ./test.py

ENTRYPOINT ["python", "test.py"]
  2. Create a minimal test.py with the following content:
import torchvision.ops

torchvision.ops.nms(None, None, 0.0)
  3. Build and run the container:
docker build -t torchvisiondockerbug . && docker run torchvisiondockerbug
  4. Observe the following output:
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    torchvision.ops.nms(None, None, 0.0)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 34, in nms
    _assert_has_ops()
  File "/opt/conda/lib/python3.7/site-packages/torchvision/extension.py", line 63, in _assert_has_ops
    "Couldn't load custom C++ ops. This can happen if your PyTorch and "
RuntimeError: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not please reinstall torchvision so that it matches your PyTorch install.

Expected behavior

I expect to be able to load custom C++ ops, because torch 1.9.0 and torchvision 0.10.0 are marked as compatible in torchvision's compatibility matrix.

In a working environment, the output of test.py looks like this:

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    torchvision.ops.nms(None, None, 0.0)
  File "/home/joe/.pyenv/versions/pytorch_problem/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 35, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
RuntimeError: torchvision::nms() Expected a value of type 'Tensor' for argument 'dets' but instead found type 'NoneType'.
Position: 0
Value: None
Declaration: torchvision::nms(Tensor dets, Tensor scores, float iou_threshold) -> (Tensor)
Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)

(Yes, this is still an error, but it at least demonstrates that _assert_has_ops is successful.)
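
Because test.py raises a RuntimeError in both environments, the two cases can only be told apart by the error text. A minimal sketch of that distinction follows; `classify_nms_error`, `probe`, and the returned labels are hypothetical names, and the matched substrings are copied from the two tracebacks above, so this is only expected to hold for the torchvision versions discussed here.

```python
def classify_nms_error(message):
    """Classify the RuntimeError text raised by torchvision.ops.nms(None, None, 0.0)."""
    if "Couldn't load custom C++ ops" in message:
        # _assert_has_ops() failed: the compiled extension was never loaded.
        return "ops-missing"
    if "torchvision::nms()" in message:
        # The call reached the registered C++ operator and failed on its arguments.
        return "ops-loaded"
    return "unknown"


def probe():
    """Run the reproduction call and classify the outcome (requires torchvision)."""
    import torchvision.ops

    try:
        torchvision.ops.nms(None, None, 0.0)
    except RuntimeError as exc:
        return classify_nms_error(str(exc))
    return "unknown"
```

Calling `probe()` inside the container should report "ops-missing" for the failing image and "ops-loaded" for a working environment, under the assumption that the messages match the ones quoted in this report.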

Environment

Output of running collect_env.py inside the Docker container:

Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.10

Python version: 3.7.10 (default, Feb 26 2021, 18:47:35)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-debian-buster-sid
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torchelastic==0.2.0
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.2.89              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.2.0           h06a4308_296
[conda] mkl-service               2.3.0            py37h27cfd23_1
[conda] mkl_fft                   1.3.0            py37h42c9631_2
[conda] mkl_random                1.2.1            py37ha9443f7_2
[conda] numpy                     1.20.2           py37h2d18471_0
[conda] numpy-base                1.20.2           py37hfae3a4d_0
[conda] pytorch                   1.9.0           py3.7_cuda10.2_cudnn7.6.5_0    pytorch
[conda] torchelastic              0.2.0                    pypi_0    pypi
[conda] torchtext                 0.10.0                     py37    pytorch
[conda] torchvision               0.10.0               py37_cu102    pytorch
@joek13
Author

joek13 commented Aug 2, 2021

In case anyone else struggles with this, the workaround I'm using is to start with the base nvidia/cuda image and install Python, torch, and torchvision on top.

The beginning of my Dockerfile looks like this:

FROM nvidia/cuda:11.4.0-runtime-ubuntu20.04

WORKDIR /app

# Setting DEBIAN_FRONTEND=noninteractive allows installation
# of some packages to complete without user input.
# Install Python3.8
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y python3.8 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# copy Python dependencies
COPY requirements.txt .
# install them
RUN pip install -r requirements.txt

@indam

indam commented Aug 10, 2021

I'm having the same issue with 1.9.0-cuda11.1-cudnn8-runtime.

@vfdev-5
Collaborator

vfdev-5 commented Aug 11, 2021

Yes, same for me.

Cc @seemethere

@sberryman

FYI: I was able to get torchvision to work using the pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel container.

RUN pip3 install \
    torchvision==0.10.0+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html
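
The `+cu111` suffix in the pin above mirrors the CUDA version in the image tag. Here is a sketch of that mapping, assuming tags have the shape `<torch>-cuda<X.Y>-cudnn<N>-<flavor>` like the ones quoted in this thread; `parse_image_tag` and `pinned_requirement` are hypothetical helpers, and nothing here checks that a wheel with the resulting tag actually exists on the index.

```python
import re

# Assumed tag layout, based only on the image tags quoted in this thread.
TAG_PATTERN = re.compile(
    r"^(?P<torch>\d+\.\d+\.\d+)"
    r"-cuda(?P<cuda>\d+\.\d+)"
    r"-cudnn(?P<cudnn>\d+)"
    r"-(?P<flavor>runtime|devel)$"
)


def parse_image_tag(tag):
    """Split a tag such as '1.9.0-cuda11.1-cudnn8-devel' into its parts."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unexpected image tag: {tag!r}")
    return match.groupdict()


def pinned_requirement(tag, torchvision_version):
    """Build a pip requirement whose local suffix matches the image's CUDA version."""
    cuda = parse_image_tag(tag)["cuda"]
    suffix = "cu" + cuda.replace(".", "")
    return f"torchvision=={torchvision_version}+{suffix}"


if __name__ == "__main__":
    print(pinned_requirement("1.9.0-cuda11.1-cudnn8-devel", "0.10.0"))
```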

@KimSangYeon-DGU

KimSangYeon-DGU commented Feb 15, 2022

In my case, the workaround was to uninstall torchvision and reinstall it. The reinstall also upgraded PyTorch from 1.9.0 to 1.10.2 (torchvision: 0.11.3).

pip uninstall torchvision
pip install torchvision

@malfet
Contributor

malfet commented Mar 2, 2022

I cannot reproduce the problem using 1.10.0-cuda11.3-cudnn8-runtime:

$ docker build -t torchvisiondockerbug . && docker run torchvisiondockerbug
Sending build context to Docker daemon  3.072kB
Step 1/4 : FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
 ---> c3f17e5ac010
Step 2/4 : RUN conda install -c pytorch torchvision
 ---> Running in 0fb646354e70
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - torchvision


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2022.2.1   |       h06a4308_0         122 KB
    certifi-2021.10.8          |   py37h06a4308_2         151 KB
    conda-4.11.0               |   py37h06a4308_0        14.4 MB
    openssl-1.1.1m             |       h7f8727e_0         2.5 MB
    torchvision-0.11.1         |       py37_cu113        30.3 MB  pytorch
    ------------------------------------------------------------
                                           Total:        47.6 MB

The following packages will be UPDATED:

  ca-certificates                      2021.9.30-h06a4308_1 --> 2022.2.1-h06a4308_0
  certifi                          2021.10.8-py37h06a4308_0 --> 2021.10.8-py37h06a4308_2
  conda                               4.10.3-py37h06a4308_0 --> 4.11.0-py37h06a4308_0
  openssl                                 1.1.1l-h7f8727e_0 --> 1.1.1m-h7f8727e_0
  torchvision                             0.11.0-py37_cu113 --> 0.11.1-py37_cu113


Proceed ([y]/n)? 

Downloading and Extracting Packages
certifi-2021.10.8    | 151 KB    | ########## | 100% 
ca-certificates-2022 | 122 KB    | ########## | 100% 
torchvision-0.11.1   | 30.3 MB   | ########## | 100% 
openssl-1.1.1m       | 2.5 MB    | ########## | 100% 
conda-4.11.0         | 14.4 MB   | ########## | 100% 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Removing intermediate container 0fb646354e70
 ---> 44b573a1432a
Step 3/4 : COPY ./test.py ./test.py
 ---> 7f91b82fa28a
Step 4/4 : ENTRYPOINT ["python", "test.py"]
 ---> Running in 917ee4855033
Removing intermediate container 917ee4855033
 ---> 14aed0ea9819
Successfully built 14aed0ea9819
Successfully tagged torchvisiondockerbug:latest
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    torchvision.ops.nms(None, None, 0.0)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 35, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
RuntimeError: torchvision::nms() Expected a value of type 'Tensor' for argument 'dets' but instead found type 'NoneType'.
Position: 0
Value: None
Declaration: torchvision::nms(Tensor dets, Tensor scores, float iou_threshold) -> (Tensor)
Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)
$ cat Dockerfile 
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

RUN conda install -c pytorch torchvision

COPY ./test.py ./test.py

ENTRYPOINT ["python", "test.py"]

Also, please note that torchvision is already pre-installed in the container, so running something like

$ docker run -it pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime python -c "import torchvision;torchvision.ops.nms(None, None, 0.0)"

produces the same result. Closing. Please do not hesitate to open a new issue if this reproduces in new builds.
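
Since the image already ships torchvision, an extra install step can be made conditional on whether the package is present. A minimal sketch using only the standard library; `is_installed` is a hypothetical helper, and finding the package on the import path says nothing about whether its C++ ops load.

```python
import importlib.util


def is_installed(package):
    """True if a top-level package is on the import path (it is not imported)."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    for name in ("torch", "torchvision"):
        print(name, "found" if is_installed(name) else "missing")
```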

@malfet malfet closed this as completed Mar 2, 2022
@nepeta2o

nepeta2o commented Mar 4, 2022

@malfet This issue still exists in pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime.
I'm not able to use any newer image because the NVIDIA driver on my machine only supports CUDA up to 10.2. Could you please provide any suggestions?

To reproduce:
Running

docker run -it --gpus all pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime python -c "import torchvision;torchvision.ops.nms(None, None, 0.0)"

produces this error message:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 34, in nms
    _assert_has_ops()
  File "/opt/conda/lib/python3.7/site-packages/torchvision/extension.py", line 63, in _assert_has_ops
    "Couldn't load custom C++ ops. This can happen if your PyTorch and "
RuntimeError: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not please reinstall torchvision so that it matches your PyTorch install.
