Get Stuck at Building Wheel #976

Open
kingformatty opened this issue Jun 27, 2024 · 5 comments
Labels
build Build system

Comments

@kingformatty

Hi, has anyone else run into the installation getting stuck at "Building wheel"?

@timmoon10 timmoon10 added the build Build system label Jun 27, 2024
@timmoon10
Collaborator

timmoon10 commented Jun 27, 2024

Can you share more information on your configuration, especially which DL framework you're building with? Passing the --verbose flag to pip install would also provide more useful build logs. A hang makes me suspect your system is over-parallelizing the build process:

  • If the hang happens while building Flash Attention or transformer_engine_torch, then it's a failure while building a PyTorch extension. Try setting MAX_JOBS=1 in the environment (see this note). Note that building Flash Attention is especially resource-intensive and can experience problems even on relatively powerful systems.
  • If the hang happens in CMake, then it's a failure in a Ninja build. We currently don't have a nice way to reduce the number of parallel Ninja jobs, but it is something we should prioritize if it is causing a problem (pinging @phu0ngng). You could try setting CMAKE_BUILD_PARALLEL_LEVEL=1 in the environment.
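The two suggestions above could be tried together as follows. This is a minimal sketch, assuming a POSIX shell; the echo line is only there to confirm what the build will inherit.

```shell
# Limit PyTorch extension builds (Flash Attention, transformer_engine_torch) to one job.
export MAX_JOBS=1
# Limit `cmake --build` to one job when no explicit --parallel value is passed.
export CMAKE_BUILD_PARALLEL_LEVEL=1
# Confirm the values that child processes such as pip will see.
echo "MAX_JOBS=$MAX_JOBS CMAKE_BUILD_PARALLEL_LEVEL=$CMAKE_BUILD_PARALLEL_LEVEL"
```

With both variables exported, running pip install --verbose . from the same shell (in the source checkout) passes them on to the build.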

@timmoon10
Collaborator

With #987, you can control the number of parallel build jobs with the MAX_JOBS environment variable.

@ZSL98

ZSL98 commented Aug 1, 2024

Same problem.
Specifically, it gets stuck at: Running command /usr/lib/cmake-3.22.6-linux-x86_64/bin/cmake --build /opt/tiger/TransformerEngine/build/cmake --parallel 1

@timmoon10
Collaborator

Hm, I'd expect most systems could handle building with MAX_JOBS=1. I wonder if we could get more clues if you build with verbose output (pip install -v -v .).
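To make the verbose output easy to attach to the issue, it can be copied to a file while it is printed. The sketch below assumes a POSIX shell; run_logged is a hypothetical helper name, not part of pip or Transformer Engine.

```shell
# Hypothetical helper: run a command, showing its output and saving a copy to a log file.
# Usage: run_logged <log-file> <command> [args...]
run_logged() {
    log_file="$1"
    shift
    # Merge stderr into stdout so warnings and errors end up in the log too.
    # Note: the exit status is that of tee, not of the wrapped command.
    "$@" 2>&1 | tee "$log_file"
}
```

For example, run_logged build.log pip install -v -v . would keep the full build output in build.log.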

@AdrLfv

AdrLfv commented Aug 14, 2024

I have a similar problem. With MAX_JOBS=1 it gets stuck after 6/24; otherwise it gets stuck after 8/24 while building transpose_fusion.cu.o. My whole computer freezes and I have to reboot manually. I use CUDA 12.5 and have an RTX 3060.
I also tried to limit the number of threads with export MAKEFLAGS="-j2", but without success.

CMake Warning:
Manually-specified variables were not used by the project:

  pybind11_DIR

-- Build files have been written to: /home/adrlfv/Téléchargements/TransformerEngine/build/cmake
Running command /usr/bin/cmake --build /home/adrlfv/Téléchargements/TransformerEngine/build/cmake
[1/32] Building CXX object CMakeFiles/transformer_engine.dir/transformer_engine.cpp.o
[2/32] Building CUDA object CMakeFiles/transformer_engine.dir/gemm/cublaslt_gemm.cu.o
[3/32] Building CXX object CMakeFiles/transformer_engine.dir/layer_norm/ln_api.cpp.o
[4/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose.cu.o
[5/32] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o
[6/32] Building CXX object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_api.cpp.o
[7/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/cast_transpose.cu.o
[8/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose_fusion.cu.o
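
When reporting a hang like the one in the log above, the last progress line shows which object was being built. A minimal sketch for pulling it out of a saved log, assuming a POSIX shell and progress lines in the "[n/m] ..." form shown above; last_build_step is a hypothetical helper name.

```shell
# Hypothetical helper: print the last "[n/m] ..." progress line found in a saved build log.
# Usage: last_build_step <log-file>
last_build_step() {
    grep -E '^\[[0-9]+/[0-9]+\] ' "$1" | tail -n 1
}
```

Calling last_build_step build.log on a saved copy of the build output prints the step the build had reached when the log ends.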


4 participants