
[Bug]: Gloo Connection reset by peer #6308

Closed
thies1006 opened this issue Jul 10, 2024 · 15 comments
Labels: bug (Something isn't working)

@thies1006 commented Jul 10, 2024:

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
GPU 4: NVIDIA L4
GPU 5: NVIDIA L4
GPU 6: NVIDIA L4
GPU 7: NVIDIA L4

Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True


Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

I'm running Llama3-70B on two nodes with 8 GPUs each, using TP=16. I tried adding the eager-mode and disable-custom-all-reduce options, without any success. The first ~100 queries always run fine, but after a while I get this RuntimeError:

(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] Traceback (most recent call last):
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 340, in execute_method
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 64, in start_worker_execution_loop
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     output = self.execute_model(execute_model_req=None)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 249, in execute_model
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     broadcast_data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 528, in broadcast_tensor_dict
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     metadata_list = self.broadcast_object(None, src=src)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 390, in broadcast_object
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     torch.distributed.broadcast_object_list(recv,
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     work.wait()
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [172.26.161.177]:50407: Connection reset by peer
thies1006 added the bug label on Jul 10, 2024
@youkaichao (Member) commented:

> running Llama3-70B on two nodes with 8 GPUs each using TP=16

It may be worthwhile to try the new pipeline parallelism; check out https://docs.vllm.ai/en/latest/serving/distributed_serving.html for more details. Basically, use --pipeline-parallel-size 2 --tensor-parallel-size 8.
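
For concreteness, here is a rough sketch of the same configuration through the Python engine API; treat it as a sketch only (the model name is a placeholder and the exact engine API can vary between vLLM versions). The usual route is simply to start the OpenAI-compatible server with the flags above on a Ray cluster spanning both nodes.

```python
# Sketch: assumes a Ray cluster already spans both nodes
# (`ray start --head` on node 0, `ray start --address=<head_ip>:6379` on node 1).
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,    # 8 GPUs per node
    pipeline_parallel_size=2,  # 2 nodes
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```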

@thies1006 (Author) commented Jul 11, 2024:

I tried this (--tensor-parallel-size 8 --pipeline-parallel-size 2) as well; after a couple of successful requests I get this error:

Exception in callback functools.partial(<function _log_task_completion at 0x7f9175e82050>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f915e435690>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f9175e82050>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f915e435690>>)>
Traceback (most recent call last):
  File "/secondary/thies/vllm/vllm/engine/async_llm_engine.py", line 43, in _log_task_completion
    return_value = task.result()
  File "/secondary/thies/vllm/vllm/engine/async_llm_engine.py", line 595, in run_engine_loop
    result = task.result()
  File "/secondary/thies/vllm/vllm/engine/async_llm_engine.py", line 540, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/secondary/thies/vllm/vllm/engine/async_llm_engine.py", line 241, in step_async
    output = await self.model_executor.execute_model_async(
  File "/secondary/thies/vllm/vllm/executor/distributed_gpu_executor.py", line 173, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/secondary/thies/vllm/vllm/executor/ray_gpu_executor.py", line 401, in _driver_execute_model_async
    results = await asyncio.gather(*tasks)
  File "/secondary/thies/vllm/vllm/executor/ray_gpu_executor.py", line 386, in _run_task_with_lock
    return await task(*args, **kwargs)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: RayWorkerWrapper
	actor_id: 43c7e94b11d26b2a300de88801000000
	pid: 269002
	namespace: d337399f-13bb-40d2-beca-e7d2dc3c5b64
	ip: 10.10.10.179
The actor died because its node has died. Node Id: 9fdcaaa408b93f5e19397604d822498c90bbbdfca64b643a83bfa6b1
	the actor's node was terminated expectedly: received SIGTERM

@youkaichao (Member) commented:

cc @andoorve for pipeline parallel, and @rkooo567 for the Ray-related part.

@andoorve (Contributor) commented:

The issue in both cases seems similar: running for a few (<100) iterations and then hitting this type of error suggests to me that there's something wrong with the instances/network setup itself.

@thies1006 (Author) commented Jul 12, 2024:

Ok, thank you. I'll check the HW and network then. Are there any pointers (best practices) for checking this? I'm aware of nccl-tests, and those seem to run fine on this setup.

@thies1006 (Author) commented:

> The issue in both cases seems similar: running for a few (<100) iterations and then hitting this type of error suggests to me that there's something wrong with the instances/network setup itself.

Indeed. I changed instances and network, and now it seems stable (no crash in 1h of nonstop testing). In particular, I now use different networks for Ray/Gloo and NCCL (an IB network), though I'm not sure whether that's the reason. Anyway, closing this, thanks @andoorve!

@FangxuLiu commented:

I also encountered the same problem. How can it be solved? @youkaichao

@FangxuLiu commented Jul 17, 2024:

> The issue in both cases seems similar: running for a few (<100) iterations and then hitting this type of error suggests to me that there's something wrong with the instances/network setup itself.

> Indeed. I changed instances and network, and now it seems stable (no crash in 1h of nonstop testing). In particular, I now use different networks for Ray/Gloo and NCCL (an IB network), though I'm not sure whether that's the reason. Anyway, closing this, thanks @andoorve!

Did you solve this problem? How did you solve it? @thies1006

@youkaichao (Member) commented:

Usually this is caused by a network setup problem. Try setting GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME to appropriate values, and run the test script from https://docs.vllm.ai/en/latest/getting_started/debugging.html until it completes successfully.
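
For reference, here is a minimal connectivity sketch along the lines of that test script (not the exact script from the docs; the filename check.py, the interface name eth0, and the port below are placeholders to adapt to your setup). It performs one all-reduce over NCCL and one over Gloo, the transport that raised the errors above.

```python
# Rough two-node connectivity check (sketch, loosely based on the vLLM debugging docs).
# Launch on every node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#            --master_addr=<head node IP> --master_port=29500 check.py
# (pick a free --master_port; the default 29500 may already be taken by another process)
import os

import torch
import torch.distributed as dist

# Pin both backends to the interface you intend to use (name is an example).
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# torchrun provides MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK via env vars.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# NCCL all-reduce across all GPUs.
data = torch.ones(1, device="cuda")
dist.all_reduce(data)
assert data.item() == dist.get_world_size()

# Gloo all-reduce on CPU tensors: the code path that failed with
# "Connection reset by peer" in the traces above.
gloo_group = dist.new_group(backend="gloo")
cpu_data = torch.ones(1)
dist.all_reduce(cpu_data, group=gloo_group)
assert cpu_data.item() == dist.get_world_size()

print(f"rank {dist.get_rank()}: sanity check passed")
dist.destroy_process_group()
```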

@FangxuLiu commented:

> Usually this is caused by a network setup problem. Try setting GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME to appropriate values, and run the test script from https://docs.vllm.ai/en/latest/getting_started/debugging.html until it completes successfully.

@youkaichao I set NCCL_SOCKET_IFNAME=eth0 and GLOO_SOCKET_IFNAME=eth0, but the issue is not resolved. Do you have any suggestions?

@youkaichao (Member) commented:

Does the sanity check script run normally?

@FangxuLiu commented Jul 17, 2024:

> Does the sanity check script run normally?

@youkaichao No. I executed the corresponding check script, and the error message is below:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).

Context

Right after the service is deployed, requests are served normally. But after the service has been idle for more than 1 hour, it becomes abnormal and no longer responds to new requests, and the following error is reported:

Exception in thread Thread-9:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/qs_service/model_compiled/ecom-llm-56B-A14B-chat/handler.py", line 37, in run
    step_outputs = self.llm.llm_engine.step()
  File "/usr/local/lib/python3.9/dist-packages/vllm/engine/llm_engine.py", line 773, in step
    output = self.model_executor.execute_model(
  File "/usr/local/lib/python3.9/dist-packages/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
    return self._driver_execute_model(execute_model_req)
  File "/usr/local/lib/python3.9/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 84, in _driver_execute_model
    return self.driver_worker.execute_model(
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/vllm/worker/worker.py", line 264, in execute_model
    broadcast_tensor_dict(data, src=0)
  File "/usr/local/lib/python3.9/dist-packages/vllm/distributed/communication_op.py", line 258, in broadcast_tensor_dict
    torch.distributed.broadcast_object_list([metadata_list],
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.122.156.106]:21171

My model is Mixtral-8x7B (MoE), and I deployed it on 2 * A800-40G with vllm==0.5.0.

I also tried setting NCCL_SOCKET_IFNAME=eth0 and GLOO_SOCKET_IFNAME=eth0; it does not work.

@youkaichao (Member) commented:

> Right after the service is deployed, requests are served normally. But after the service has been idle for more than 1 hour, it becomes abnormal and no longer responds to new requests, and the following error is reported.

This is a different problem, similar to #5084. I think #5399 should help; you can try the latest version.

cc @njhill: it fails in the broadcast operation even though no requests are running. Looks strange.

@thies1006 (Author) commented:

> The issue in both cases seems similar: running for a few (<100) iterations and then hitting this type of error suggests to me that there's something wrong with the instances/network setup itself.

> Indeed. I changed instances and network, and now it seems stable (no crash in 1h of nonstop testing). In particular, I now use different networks for Ray/Gloo and NCCL (an IB network), though I'm not sure whether that's the reason. Anyway, closing this, thanks @andoorve!

> Did you solve this problem? How did you solve it? @thies1006

I used different networks for Gloo and NCCL. Could you try this if possible? I'm not sure whether the problem was really solved by this (I'm out of office right now), but at least it got much better. I also saw a significant improvement in the metrics from this.
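
In case it helps others trying the same split: below is a minimal sketch of pinning Gloo and NCCL to different interfaces via environment variables. The interface names (eth0 for the Ethernet/Ray side, ib0 for an IPoIB interface) and the model name are placeholders; check your nodes for the real interface names, and make sure the variables are present in the environment of every node's workers (e.g., exported before starting Ray and vLLM), not only in the driver process.

```python
import os

# Placeholder interface names; check `ip addr` / `ibdev2netdev` on each node.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # Gloo (CPU object broadcasts) over Ethernet
os.environ["NCCL_SOCKET_IFNAME"] = "ib0"   # NCCL socket/bootstrap traffic over IPoIB

# The variables must be set before the distributed groups are created, i.e.
# before the engine is constructed, and on every node when using Ray.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
    tensor_parallel_size=16,
)
```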

@njhill (Member) commented Jul 17, 2024:

> My model is Mixtral-8x7B (MoE), and I deployed it on 2 * A800-40G with vllm==0.5.0.

@FangxuLiu this is a known issue that was fixed by #5987, which is included in 0.5.1. Please upgrade.
