
[Bug]: Gloo Connection reset by peer #6308

Closed
thies1006 opened this issue Jul 10, 2024 · 15 comments
Labels: bug (Something isn't working)

@thies1006 commented Jul 10, 2024:

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
GPU 4: NVIDIA L4
GPU 5: NVIDIA L4
GPU 6: NVIDIA L4
GPU 7: NVIDIA L4

Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True


Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

I'm running Llama3-70B on two nodes with 8 GPUs each, using TP=16. I tried adding the eager-mode and disable-custom-all-reduce options, without any success. The first ~100 queries always run fine, but after a while I get this RuntimeError:

(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] Traceback (most recent call last):
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 340, in execute_method
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 64, in start_worker_execution_loop
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     output = self.execute_model(execute_model_req=None)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 249, in execute_model
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     broadcast_data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 528, in broadcast_tensor_dict
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     metadata_list = self.broadcast_object(None, src=src)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 390, in broadcast_object
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     torch.distributed.broadcast_object_list(recv,
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     work.wait()
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [172.26.161.177]:50407: Connection reset by peer
thies1006 added the bug label on Jul 10, 2024
@youkaichao (Member) commented:

> running Llama3-70B on two nodes with 8 GPUs each using TP=16

It may be worthwhile to try the new pipeline parallelism; check out https://docs.vllm.ai/en/latest/serving/distributed_serving.html for more details. Basically, use --pipeline-parallel-size 2 --tensor-parallel-size 8.
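
For concreteness, here is a rough sketch of the same configuration through the Python engine API; treat it as a sketch only (the model name is a placeholder and the exact engine API can vary between vLLM versions). The usual route is simply to start the OpenAI-compatible server with the flags above on a Ray cluster spanning both nodes.

```python
# Sketch: assumes a Ray cluster already spans both nodes
# (`ray start --head` on node 0, `ray start --address=<head_ip>:6379` on node 1).
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,    # 8 GPUs per node
    pipeline_parallel_size=2,  # 2 nodes
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```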

@thies1006 (Author) commented Jul 11, 2024:

I tried this (--tensor-parallel-size 8 --pipeline-parallel-size 2) as well; after a couple of successful requests I get this error:

Exception in callback functools.partial(<function _log_task_completion at 0x7f9175e82050>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f915e435690>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f9175e82050>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f915e435690>>)>
Traceback (most recent call last):
  File "/secondary/thies/vllm/vllm/engine/async_llm_engine.py", line 43, in _log_task_completion
    return_value = task.result()
  File "/secondary/thies/vllm/vllm/engine/async_llm_engine.py", line 595, in run_engine_loop
    result = task.result()
  File "/secondary/thies/vllm/vllm/engine/async_llm_engine.py", line 540, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/secondary/thies/vllm/vllm/engine/async_llm_engine.py", line 241, in step_async
    output = await self.model_executor.execute_model_async(
  File "/secondary/thies/vllm/vllm/executor/distributed_gpu_executor.py", line 173, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/secondary/thies/vllm/vllm/executor/ray_gpu_executor.py", line 401, in _driver_execute_model_async
    results = await asyncio.gather(*tasks)
  File "/secondary/thies/vllm/vllm/executor/ray_gpu_executor.py", line 386, in _run_task_with_lock
    return await task(*args, **kwargs)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: RayWorkerWrapper
	actor_id: 43c7e94b11d26b2a300de88801000000
	pid: 269002
	namespace: d337399f-13bb-40d2-beca-e7d2dc3c5b64
	ip: 10.10.10.179
The actor died because its node has died. Node Id: 9fdcaaa408b93f5e19397604d822498c90bbbdfca64b643a83bfa6b1
	the actor's node was terminated expectedly: received SIGTERM

@youkaichao (Member) commented:

cc @andoorve for pipeline parallel, and @rkooo567 for the Ray-related part.

@andoorve (Contributor) commented:

The issue in both cases seems similar: running for a few (<100) iterations and then hitting this type of error suggests to me that there's something wrong with the instances/network setup itself.

@thies1006 (Author) commented Jul 12, 2024:

Ok, thank you. I'll check the HW and network then. Are there any pointers (best practices) for checking this? I'm aware of nccl-tests, and those seem to run fine on this setup.

@thies1006 (Author) commented:

> The issue in both cases seems similar: running for a few (<100) iterations and then hitting this type of error suggests to me that there's something wrong with the instances/network setup itself.

Indeed. I changed instances and network, and now it seems stable (no crash in 1h of nonstop testing). In particular, I now use different networks for Ray/Gloo and NCCL (an IB network), though I'm not sure whether that's the reason. Anyway, closing this, thanks @andoorve!

@FangxuLiu commented:

I also encountered the same problem. How can it be solved? @youkaichao

@FangxuLiu commented Jul 17, 2024:

> The issue in both cases seems similar: running for a few (<100) iterations and then hitting this type of error suggests to me that there's something wrong with the instances/network setup itself.

> Indeed. I changed instances and network, and now it seems stable (no crash in 1h of nonstop testing). In particular, I now use different networks for Ray/Gloo and NCCL (an IB network), though I'm not sure whether that's the reason. Anyway, closing this, thanks @andoorve!

Did you solve this problem? How did you solve it? @thies1006

@youkaichao (Member) commented:

Usually this is caused by a network setup problem. Try setting GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME to appropriate values, and run the test script from https://docs.vllm.ai/en/latest/getting_started/debugging.html until it completes successfully.
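
For reference, here is a minimal connectivity sketch along the lines of that test script (not the exact script from the docs; the filename check.py, the interface name eth0, and the port below are placeholders to adapt to your setup). It performs one all-reduce over NCCL and one over Gloo, the transport that raised the errors above.

```python
# Rough two-node connectivity check (sketch, loosely based on the vLLM debugging docs).
# Launch on every node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#            --master_addr=<head node IP> --master_port=29500 check.py
# (pick a free --master_port; the default 29500 may already be taken by another process)
import os

import torch
import torch.distributed as dist

# Pin both backends to the interface you intend to use (name is an example).
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# torchrun provides MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK via env vars.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# NCCL all-reduce across all GPUs.
data = torch.ones(1, device="cuda")
dist.all_reduce(data)
assert data.item() == dist.get_world_size()

# Gloo all-reduce on CPU tensors: the code path that failed with
# "Connection reset by peer" in the traces above.
gloo_group = dist.new_group(backend="gloo")
cpu_data = torch.ones(1)
dist.all_reduce(cpu_data, group=gloo_group)
assert cpu_data.item() == dist.get_world_size()

print(f"rank {dist.get_rank()}: sanity check passed")
dist.destroy_process_group()
```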

@FangxuLiu commented:

> Usually this is caused by a network setup problem. Try setting GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME to appropriate values, and run the test script from https://docs.vllm.ai/en/latest/getting_started/debugging.html until it completes successfully.

@youkaichao I set NCCL_SOCKET_IFNAME=eth0 and GLOO_SOCKET_IFNAME=eth0, but the issue is not resolved. Do you have any suggestions?

@youkaichao (Member) commented:

Does the sanity check script run normally?

@FangxuLiu commented Jul 17, 2024:

> Does the sanity check script run normally?

@youkaichao No. I executed the corresponding check script, and the error message is below:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).

Context

Right after the service is deployed, requests are served normally. But after the service has been idle for more than 1 hour, it becomes abnormal and no longer responds to new requests, and the following error is reported:

Exception in thread Thread-9:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/qs_service/model_compiled/ecom-llm-56B-A14B-chat/handler.py", line 37, in run
    step_outputs = self.llm.llm_engine.step()
  File "/usr/local/lib/python3.9/dist-packages/vllm/engine/llm_engine.py", line 773, in step
    output = self.model_executor.execute_model(
  File "/usr/local/lib/python3.9/dist-packages/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
    return self._driver_execute_model(execute_model_req)
  File "/usr/local/lib/python3.9/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 84, in _driver_execute_model
    return self.driver_worker.execute_model(
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/vllm/worker/worker.py", line 264, in execute_model
    broadcast_tensor_dict(data, src=0)
  File "/usr/local/lib/python3.9/dist-packages/vllm/distributed/communication_op.py", line 258, in broadcast_tensor_dict
    torch.distributed.broadcast_object_list([metadata_list],
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.122.156.106]:21171

My model is Mixtral-8x7B (MoE), and I deployed it on 2 * A800-40G with vllm==0.5.0.

I also tried setting NCCL_SOCKET_IFNAME=eth0 and GLOO_SOCKET_IFNAME=eth0; it does not work.

@youkaichao (Member) commented:

> Right after the service is deployed, requests are served normally. But after the service has been idle for more than 1 hour, it becomes abnormal and no longer responds to new requests, and the following error is reported.

This is a different problem, similar to #5084. I think #5399 should help; you can try the latest version.

cc @njhill: it fails in the broadcast operation even though no requests are running. Looks strange.

@thies1006 (Author) commented:

> The issue in both cases seems similar: running for a few (<100) iterations and then hitting this type of error suggests to me that there's something wrong with the instances/network setup itself.

> Indeed. I changed instances and network, and now it seems stable (no crash in 1h of nonstop testing). In particular, I now use different networks for Ray/Gloo and NCCL (an IB network), though I'm not sure whether that's the reason. Anyway, closing this, thanks @andoorve!

> Did you solve this problem? How did you solve it? @thies1006

I used different networks for Gloo and NCCL. Could you try this if possible? I'm not sure whether the problem was really solved by this (I'm out of office right now), but at least it got much better. I also saw a significant improvement in the metrics from this.
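
In case it helps others trying the same split: below is a minimal sketch of pinning Gloo and NCCL to different interfaces via environment variables. The interface names (eth0 for the Ethernet/Ray side, ib0 for an IPoIB interface) and the model name are placeholders; check your nodes for the real interface names, and make sure the variables are present in the environment of every node's workers (e.g., exported before starting Ray and vLLM), not only in the driver process.

```python
import os

# Placeholder interface names; check `ip addr` / `ibdev2netdev` on each node.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # Gloo (CPU object broadcasts) over Ethernet
os.environ["NCCL_SOCKET_IFNAME"] = "ib0"   # NCCL socket/bootstrap traffic over IPoIB

# The variables must be set before the distributed groups are created, i.e.
# before the engine is constructed, and on every node when using Ray.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
    tensor_parallel_size=16,
)
```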

@njhill (Member) commented Jul 17, 2024:

> My model is Mixtral-8x7B (MoE), and I deployed it on 2 * A800-40G with vllm==0.5.0.

@FangxuLiu this is a known issue that was fixed by #5987, which is included in 0.5.1. Please upgrade.
