Torchvision does not build PYTORCH_VERSION_TAG:-v1.10.2 TORCHVISION_VERSION_TAG:-v0.11.3 #38

Closed
skandermoalla opened this issue Dec 1, 2022 · 5 comments

@skandermoalla
Contributor

Hello,

I'm building the Docker image with the following config (all other settings are left at their defaults):

BUILD_MODE=include
CCA=3.5
CUDA_VERSION:-11.3.1
PYTHON_VERSION:-3.8
PYTORCH_VERSION_TAG:-v1.10.2
TORCHVISION_VERSION_TAG:-v0.11.3

The build fails at the build-vision stage with the following error:

#0 85.06 [14/16] c++ -MMD -MF /opt/vision/build/temp.linux-x86_64-3.8/opt/vision/torchvision/csrc/io/video_reader/video_reader.o.d -pthread -B /opt/conda/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/include -fPIC -O2 -isystem /opt/conda/include -fPIC -I/opt/vision/torchvision/csrc/io/decoder -I/opt/vision/torchvision/csrc/io/video_reader -I/opt/vision/torchvision/csrc/io/video -I/opt/vision/torchvision/csrc -I/opt/conda/include -I/opt/conda/include/x86_64-linux-gnu -I/opt/vision/torchvision/csrc -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/include/python3.8 -c -c /opt/vision/torchvision/csrc/io/video_reader/video_reader.cpp -o /opt/vision/build/temp.linux-x86_64-3.8/opt/vision/torchvision/csrc/io/video_reader/video_reader.o -std=c++14 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=video_reader -D_GLIBCXX_USE_CXX11_ABI=1
#0 85.06 ninja: build stopped: subcommand failed.
#0 85.06 Traceback (most recent call last):
#0 85.06   File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1717, in _run_ninja_build
#0 85.06     subprocess.run(
#0 85.06   File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
#0 85.06     raise CalledProcessError(retcode, process.args,
#0 85.06 subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
#0 85.06
#0 85.06 The above exception was the direct cause of the following exception:                                                    
#0 85.06
#0 85.06 Traceback (most recent call last):
#0 85.06   File "setup.py", line 484, in <module>
#0 85.06     setup(
#0 85.06   File "/opt/conda/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
#0 85.06     return distutils.core.setup(**attrs)
#0 85.06   File "/opt/conda/lib/python3.8/distutils/core.py", line 148, in setup
#0 85.06     dist.run_commands()
#0 85.06   File "/opt/conda/lib/python3.8/distutils/dist.py", line 966, in run_commands
#0 85.06     self.run_command(cmd)
#0 85.06   File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 85.06     cmd_obj.run()
#0 85.06   File "/opt/conda/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 299, in run
#0 85.06     self.run_command('build')
#0 85.06   File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
#0 85.06     self.distribution.run_command(command)
#0 85.06   File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 85.06     cmd_obj.run()
#0 85.06   File "/opt/conda/lib/python3.8/distutils/command/build.py", line 135, in run
#0 85.06     self.run_command(cmd_name)
#0 85.06   File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
#0 85.06     self.distribution.run_command(command)
#0 85.06   File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 85.06     cmd_obj.run()
#0 85.06   File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
#0 85.07     _build_ext.run(self)
#0 85.07   File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 340, in run
#0 85.07     self.build_extensions()
#0 85.07   File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
#0 85.07     build_ext.build_extensions(self)
#0 85.07   File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
#0 85.07     self._build_extensions_serial()
#0 85.07   File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
#0 85.07     self.build_extension(ext)
#0 85.07   File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
#0 85.07     _build_ext.build_extension(self, ext)
#0 85.07   File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
#0 85.07     objects = self.compiler.compile(sources,
#0 85.07   File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 556, in unix_wrap_ninja_compile
#0 85.07     _write_ninja_file_and_compile_objects(
#0 85.07   File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1399, in _write_ninja_file_and_compile_objects
#0 85.07     _run_ninja_build(
#0 85.07   File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
#0 85.07     raise RuntimeError(message) from e
#0 85.07 RuntimeError: Error compiling objects for extension
------
failed to solve: executor failed running [/bin/sh -c python setup.py bdist_wheel -d /tmp/dist]: exit code: 1
make: *** [Makefile:67: build] Error 17

When I skip the build-vision stage, the build works fine, i.e. when I change the train-builds-include stage as follows:

FROM ${BUILD_IMAGE} AS train-builds-include

...

COPY --link --from=install-base /opt/conda /opt/conda
COPY --link --from=build-pillow /tmp/dist  /tmp/dist

COPY --link --from=build-torch /tmp/dist  /tmp/dist <--- here 

COPY --link --from=fetch-pure   /opt/zsh   /opt

I get the same error with different versions of torchvision. I tried 0.11.3, 0.11.2 and 0.11.1.

Could you suggest any help?

Thanks a lot!

@veritas9872
Collaborator

@skandermoalla Hello. Because I do not have a GPU with compute capability 3.5, I probably cannot reproduce the problem exactly, though I will try to do so. Please run export BUILDKIT_PROGRESS=plain in the terminal before running make build to get the full build log.
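
A minimal sketch of capturing the full log while building. The helper name build_with_full_log and the log file name build.log are hypothetical and only for illustration; BUILDKIT_PROGRESS=plain and make build are the ones mentioned in this thread.

```shell
# Hypothetical helper: run a build command with plain BuildKit progress
# output and keep a copy of everything it prints in a log file.
build_with_full_log() {
    log_file="$1"
    shift
    # BUILDKIT_PROGRESS=plain asks BuildKit to print each step's output
    # line by line instead of the collapsed interactive view.
    BUILDKIT_PROGRESS=plain "$@" 2>&1 | tee "$log_file"
}

# Example (not executed here): build_with_full_log build.log make build
```

Keeping the log in a file makes it easier to search for the first compiler error, which usually appears well above the final Python traceback.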

@veritas9872
Collaborator

The most likely explanation is that a build dependency has an incompatible version.
In this case, I suspect that pytorch/pytorch#69894 is the problem, but the full log provided by BUILDKIT_PROGRESS=plain is necessary for a complete explanation.
The issue arises because PyTorch always uses the latest version of all dependencies at the time of the release, making the build process vulnerable to build-time dependency incompatibilities.

@veritas9872
Collaborator

I believe that the simplest solution would be to specify setuptools==59.5.0 in the conda-build-requirements.txt file, since distutils.version is the problem.
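
A sketch of applying that pin, assuming the requirements file lists one package spec per line. The helper name pin_requirement and the path in the usage comment are hypothetical; only the file name conda-build-requirements.txt and the setuptools==59.5.0 pin come from this thread.

```shell
# Hypothetical helper: make sure a requirements file contains exactly one
# line for a package, set to the given pinned spec.
# Assumes one spec per line and a package name without regex metacharacters.
pin_requirement() {
    file="$1"
    spec="$2"
    name="${spec%%=*}"   # "setuptools==59.5.0" -> "setuptools"
    # Drop any existing line for this package (bare or versioned).
    grep -Ev "^${name}([=<>! ]|\$)" "$file" > "$file.tmp" || true
    # Append the pinned spec and replace the original file.
    printf '%s\n' "$spec" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example (path is an assumption):
# pin_requirement reqs/conda-build-requirements.txt "setuptools==59.5.0"
```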

@veritas9872
Collaborator

veritas9872 commented Dec 4, 2022

@skandermoalla After some digging, I have found the cause of the problem.
Although pinning the setuptools version is necessary to get started, I believe that was already taken care of, since PyTorch compiled successfully.
The problem turned out to be the ffmpeg dependency. Removing it from the conda build requirements file allowed the TorchVision stage to build without problems. Also, please set USE_FFMPEG=0 in the build-vision stage to disable FFMPEG search.
This is probably because the ffmpeg binary installed through Anaconda does not have the same version metadata as the version installed via apt. More experimentation is required to be sure, since the error message was very unhelpful.
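
A sketch of the workaround described above, assuming the conda build requirements file lists one package per line. The helper name drop_requirement and the file path in the usage comment are hypothetical. USE_FFMPEG=0 is the variable named in this comment and the setup.py command is the one shown in the failing log; whether a given TorchVision version reads that variable has not been checked here.

```shell
# Hypothetical helper: remove a package from a requirements file.
# Assumes one spec per line and a package name without regex metacharacters.
drop_requirement() {
    file="$1"
    name="$2"
    # Keep every line except the one for this package (bare or versioned).
    # Names that merely share the prefix, e.g. "ffmpeg-python", are kept.
    grep -Ev "^${name}([=<>! ]|\$)" "$file" > "$file.tmp" || true
    mv "$file.tmp" "$file"
}

# Example (path is an assumption):
# drop_requirement reqs/conda-build-requirements.txt ffmpeg
#
# Then, in the build-vision stage, disable the FFmpeg search as suggested:
# USE_FFMPEG=0 python setup.py bdist_wheel -d /tmp/dist
```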

@skandermoalla
Contributor Author

skandermoalla commented Dec 5, 2022

@veritas9872 Indeed, I had setuptools==59.5.0 pinned; however, without BUILDKIT_PROGRESS=plain the error message was not descriptive enough. Thanks for pointing that out!
Thanks so much for looking into the problem!
I'm closing the issue. Should we add a few things to the README based on this?
E.g. a debugging guide where we mention BUILDKIT_PROGRESS=plain.
