[Bug]: vLLM 0.4.2 8xH100 init failed #5785

Closed
xiejibing opened this issue Jun 24, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@xiejibing

Your current environment

environment:
vLLM 0.4.2
Python 3.10
CUDA 11.8
CPU: 52
Memory: 375Gi

model:
Llama-3-70B

🐛 Describe the bug

description:
vLLM engine init failed when using 8xH100 with tensor_parallel_size=8.
Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
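
For context, here is a minimal reproduction sketch (assumed; the actual serving code is unpickled from /model/func.pkl and not shown here) of an engine init matching the config reported in the log below:

```python
# Hypothetical sketch only: mirrors the engine config in the log below
# (model path, dtype, tensor_parallel_size), not the reporter's actual code.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="/mms/download/models",  # local Llama-3-70B-Instruct checkpoint
    tensor_parallel_size=8,        # one shard per H100
    dtype="bfloat16",
)
# Crashes while the Ray GPU executor spawns its workers (traceback below).
engine = AsyncLLMEngine.from_engine_args(engine_args)
```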

log:
2024-06-24 02:48:44,378 INFO MainProcess logger.py:58 darwin logger initialized
Starting Python Closure container
2024-06-24 02:48:44,438 INFO MainProcess python_closure_container.py:77 load mms model from: /mms/download/models, mms_info: llm-demo-project:llama-3-70B-Instruct:1
2024-06-24 02:48:44,438 INFO MainProcess python_closure_container.py:91 darwin entrypoint not found in MMS path /mms/download/models
2024-06-24 02:48:44,438 INFO MainProcess python_closure_container.py:97 load model from /model
2024-06-24 02:48:44,438 INFO MainProcess python_closure_container.py:17 load serialized func from /model/func.pkl
/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
2024-06-24 02:48:45,687 DEBUG MainProcess _api.py:254 Attempting to acquire lock 140146051556848 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/node_ip_address.json.lock
2024-06-24 02:48:45,687 DEBUG MainProcess _api.py:257 Lock 140146051556848 acquired on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/node_ip_address.json.lock
2024-06-24 02:48:45,687 DEBUG MainProcess _api.py:286 Attempting to release lock 140146051556848 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/node_ip_address.json.lock
2024-06-24 02:48:45,687 DEBUG MainProcess _api.py:289 Lock 140146051556848 released on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/node_ip_address.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:254 Attempting to acquire lock 140146051556896 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:257 Lock 140146051556896 acquired on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:286 Attempting to release lock 140146051556896 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:289 Lock 140146051556896 released on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:254 Attempting to acquire lock 140146051549504 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:257 Lock 140146051549504 acquired on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:286 Attempting to release lock 140146051549504 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:289 Lock 140146051549504 released on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:254 Attempting to acquire lock 140146051550080 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:257 Lock 140146051550080 acquired on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:286 Attempting to release lock 140146051550080 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:289 Lock 140146051550080 released on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:254 Attempting to acquire lock 140146051552048 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:257 Lock 140146051552048 acquired on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:286 Attempting to release lock 140146051552048 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:289 Lock 140146051552048 released on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,688 DEBUG MainProcess _api.py:254 Attempting to acquire lock 140146051552912 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,689 DEBUG MainProcess _api.py:257 Lock 140146051552912 acquired on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,689 DEBUG MainProcess _api.py:286 Attempting to release lock 140146051552912 on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:45,689 DEBUG MainProcess _api.py:289 Lock 140146051552912 released on /tmp/ray/session_2024-06-24_02-48-45_687047_49975/ports_by_node.json.lock
2024-06-24 02:48:47,351 WARNING services.py:2009 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 1073229824 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-06-24 02:48:47,493 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-24 02:48:48 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/mms/download/models', speculative_config=None, tokenizer='/mms/download/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mms/download/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=53316) /opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
(pid=53316) warnings.warn(
(pid=53827) /opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead. [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(pid=53827) warnings.warn( [repeated 4x across cluster]
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff8e00e96dff7db9bcf575d3b201000000 Worker ID: a3def51d719ae881346f4f13a1fd1d82a0f2c59c0f173334047f9805 Node ID: 2b3b79e87ea6a056b18c58b0da3509637cb7472549014e2ad22fb63a Worker IP address: 10.177.152.13 Worker port: 36385 Worker PID: 54163 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
File "/container/python_closure_container.py", line 156, in
py_model, model_path = unpickle_model(model_name, model_version)
File "/container/python_closure_container.py", line 71, in unpickle_model
model, model_file_path = unpickle_model_from_local(model_name, model_version, model_file_path)
File "/container/python_closure_container.py", line 109, in unpickle_model_from_local
model = get_verified_darwin_model(model_name, model_version, predict_model_loaded)
File "/container/python_closure_container.py", line 124, in get_verified_darwin_model
model = predict_model_loaded(model_name, model_version)
File "/container/predict.py", line 47, in init
logger.info('model initialized')
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
engine = cls(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 324, in init
self.engine = self._init_engine(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
return engine_class(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 160, in init
self.model_executor = executor_class(
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 300, in init
super().init(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in init
self._init_executor()
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 43, in _init_executor
self._init_workers_ray(placement_group)
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 104, in _init_workers_ray
worker_ip = ray.get(worker.get_node_ip.remote())
File "/opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 863, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: RayWorkerWrapper
actor_id: 8e00e96dff7db9bcf575d3b201000000
pid: 54163
namespace: 1cf18ed4-f200-47ab-bd68-b09a0115686d
ip: 10.177.152.13
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.
(RayWorkerWrapper pid=54163) [2024-06-24 02:49:00,245 E 54163 54163] logging.cc:101: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
(RayWorkerWrapper pid=54163) [2024-06-24 02:49:00,314 E 54163 54163] logging.cc:108: Stack trace:
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x101a47a) [0x7fa623b3a47a] ray::operator<<()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x101cf38) [0x7fa623b3cf38] ray::TerminateHandler()
(RayWorkerWrapper pid=54163) /opt/conda/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fa6229ab35a] __cxxabiv1::__terminate()
(RayWorkerWrapper pid=54163) /opt/conda/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fa6229ab3c5]
(RayWorkerWrapper pid=54163) /opt/conda/bin/../lib/libstdc++.so.6(+0xb1658) [0x7fa6229ab658]
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x567f58) [0x7fa623087f58] boost::throw_exception<>()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x1106b1b) [0x7fa623c26b1b] boost::asio::detail::do_throw_error()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x110753b) [0x7fa623c2753b] boost::asio::detail::posix_thread::start_thread()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x110799c) [0x7fa623c2799c] boost::asio::thread_pool::thread_pool()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0xa49444) [0x7fa623569444] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7fa6235694d9] ray::rpc::GetServerCallExecutor()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_15PushTaskRequestENS6_13PushTaskReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ+0xe2) [0x7fa623283d32] std::_Function_handler<>::_M_invoke()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x7dad3f) [0x7fa6232fad3f] ray::core::CoreWorkerDirectTaskReceiver::HandleTask()::{lambda()#1}::operator()()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x7dbe1a) [0x7fa6232fbe1a] std::_Function_handler<>::_M_invoke()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x7f2d1e) [0x7fa623312d1e] ray::core::InboundRequest::Accept()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x7c5150) [0x7fa6232e5150] ray::core::NormalSchedulingQueue::ScheduleRequests()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0xa5acee) [0x7fa62357acee] EventTracker::RecordExecution()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0xa540de) [0x7fa6235740de] std::_Function_handler<>::_M_invoke()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0xa54556) [0x7fa623574556] boost::asio::detail::completion_handler<>::do_complete()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x11041ab) [0x7fa623c241ab] boost::asio::detail::scheduler::do_run_one()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x1105b29) [0x7fa623c25b29] boost::asio::detail::scheduler::run()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x1106232) [0x7fa623c26232] boost::asio::io_context::run()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0xcd) [0x7fa62329996d] ray::core::CoreWorker::RunTaskExecutionLoop()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x8c) [0x7fa6232dec2c] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x7fa6232deddd] ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(RayWorkerWrapper pid=54163) /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x5c2a07) [0x7fa6230e2a07] __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper() [0x500884] method_vectorcall_NOARGS
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper(_PyEval_EvalFrameDefault+0x731) [0x4ee5e1] _PyEval_EvalFrameDefault
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper(_PyFunction_Vectorcall+0x6f) [0x4fdedf] _PyFunction_Vectorcall
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper(_PyEval_EvalFrameDefault+0x731) [0x4ee5e1] _PyEval_EvalFrameDefault
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper() [0x5953f2] _PyEval_Vector
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper(PyEval_EvalCode+0x87) [0x595337] PyEval_EvalCode
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper() [0x5c5f47] run_eval_code_obj
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper() [0x5c10b0] run_mod
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper() [0x45971a] pyrun_file.cold
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper(_PyRun_SimpleFileObject+0x19f) [0x5bb63f] _PyRun_SimpleFileObject
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper(_PyRun_AnyFileObject+0x43) [0x5bb3a3] _PyRun_AnyFileObject
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper(Py_RunMain+0x38d) [0x5b815d] Py_RunMain
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper(Py_BytesMain+0x39) [0x588629] Py_BytesMain
(RayWorkerWrapper pid=54163) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fa624801083] __libc_start_main
(RayWorkerWrapper pid=54163) ray::RayWorkerWrapper() [0x5884de]
(RayWorkerWrapper pid=54163)
(RayWorkerWrapper pid=54163) *** SIGABRT received at time=1719222540 on cpu 223 ***
(RayWorkerWrapper pid=54163) PC: @ 0x7fa62482000b (unknown) raise
(RayWorkerWrapper pid=54163) @ 0x7fa624b3d420 286446400 (unknown)
(RayWorkerWrapper pid=54163) @ 0x7fa6229ab35a (unknown) __cxxabiv1::__terminate()
(RayWorkerWrapper pid=54163) @ 0x7fa6229ab580 (unknown) (unknown)
(RayWorkerWrapper pid=54163) [2024-06-24 02:49:00,315 E 54163 54163] logging.cc:365: *** SIGABRT received at time=1719222540 on cpu 223 ***
(RayWorkerWrapper pid=54163) [2024-06-24 02:49:00,315 E 54163 54163] logging.cc:365: PC: @ 0x7fa62482000b (unknown) raise
(RayWorkerWrapper pid=54163) [2024-06-24 02:49:00,315 E 54163 54163] logging.cc:365: @ 0x7fa624b3d420 286446400 (unknown)
(RayWorkerWrapper pid=54163) [2024-06-24 02:49:00,315 E 54163 54163] logging.cc:365: @ 0x7fa6229ab35a (unknown) __cxxabiv1::__terminate()
(RayWorkerWrapper pid=54163) [2024-06-24 02:49:00,315 E 54163 54163] logging.cc:365: @ 0x7fa6229ab580 (unknown) (unknown)
(RayWorkerWrapper pid=54163) Fatal Python error: Aborted
(RayWorkerWrapper pid=54163)
(RayWorkerWrapper pid=54163) Stack (most recent call first):
(RayWorkerWrapper pid=54163) File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 876 in main_loop
(RayWorkerWrapper pid=54163) File "/opt/conda/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 289 in
(RayWorkerWrapper pid=54163)
(RayWorkerWrapper pid=54163) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _cffi_backend, uvloop.loop, ray._raylet, pvectorc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece (total: 31)
(pid=54163) /opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead. [repeated 3x across cluster]
(pid=54163) warnings.warn( [repeated 3x across cluster]

@xiejibing xiejibing added the bug Something isn't working label Jun 24, 2024
@youkaichao
Member

worker_ip = ray.get(worker.get_node_ip.remote())

It seems to die on this line of code. I assume this is a Ray setup problem. You can try the latest code, or the latest released version, and try the multiprocessing backend.

@xiejibing
Author

xiejibing commented Jun 25, 2024

Thanks @youkaichao, it works now with vllm 0.5.0.post1 and distributed_executor_backend: mp.
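
For reference, a minimal sketch (assumed, not the exact production code) of that working setup, switching the executor from Ray to multiprocessing:

```python
# Assumed sketch: vLLM 0.5.0.post1 with the multiprocessing executor,
# which sidesteps the Ray worker startup that crashed above.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="/mms/download/models",
    tensor_parallel_size=8,
    distributed_executor_backend="mp",  # multiprocessing backend instead of Ray
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```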

@xiejibing
Author

xiejibing commented Jun 26, 2024

Update: vllm 0.5.0.post1
When I ran the benchmark for more than 1 hour, the engine crashed. @youkaichao could you please take a look?

Here is the error log:
(pid=561)
(pid=561)
(pid=560)
(pid=560)
@ 0x7f1c5e1b7580 (unknown) (unknown)
[2024-06-25 20:36:23,253 E 97 544] logging.cc:440: *** SIGABRT received at time=1719372983 on cpu 207 ***
[2024-06-25 20:36:23,253 E 97 544] logging.cc:440: PC: @ 0x7f1c5e9e100b (unknown) raise
[2024-06-25 20:36:23,253 E 97 544] logging.cc:440: @ 0x7f1c5ecfe420 5184 (unknown)
[2024-06-25 20:36:23,253 E 97 544] logging.cc:440: @ 0x7f1c5e1b735a (unknown) __cxxabiv1::__terminate()
(pid=559)
(pid=559)
(pid=563)
(pid=563)
(pid=563) Extension modules: msgpack._cmsgpack
(pid=587)
(pid=587)
(pid=549)
(pid=549)
(pid=569)
(pid=569)
(pid=572)
[2024-06-25 20:36:23,256 E 97 544] logging.cc:440: @ 0x7f1c5e1b7580 (unknown) (unknown)
Fatal Python error: Aborted

Extension modules: msgpack._cmsgpack, yaml._yaml, google._upb._message, grpc._cython.cygrpc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, (pid=548)
numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, (pid=583)
torch._C._nested, (pid=583)
torch._C._nn, torch._C._sparse, torch._C._special, _cffi_backend, psutil._psutil_linux, psutil._psutil_posix, setproctitle, uvloop.loop, ray._raylet, pvectorc, sentencepiece._sentencepiece, zstandard.backend_c, PIL._imaging (total: 34)
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.177.136.142]:24490: Connection reset by peer, Traceback (most recent call last):
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 294, in start_worker_execution_loop
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] while self._execute_model_non_driver():
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in _execute_model_non_driver
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 399, in broadcast_tensor_dict
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] work.wait()
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.177.136.142]:24490: Connection reset by peer
(VllmWorkerProcess pid=3399) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.177.136.142]:24490: Connection reset by peer, Traceback (most recent call last):
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 294, in start_worker_execution_loop
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] while self._execute_model_non_driver():
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in _execute_model_non_driver
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 399, in broadcast_tensor_dict
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] work.wait()
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.177.136.142]:24490: Connection reset by peer
(VllmWorkerProcess pid=3402) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.177.136.142]:24490: Connection reset by peer, Traceback (most recent call last):
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 294, in start_worker_execution_loop
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] while self._execute_model_non_driver():
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in _execute_model_non_driver
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 399, in broadcast_tensor_dict
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] work.wait()
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.177.136.142]:24490: Connection reset by peer
(VllmWorkerProcess pid=3400) ERROR 06-25 20:36:44 multiproc_worker_utils.py:226]
Aborted (core dumped)

@youkaichao
Member

Can you try the latest main code? The error shows up somewhere in broadcast_object_list, which #5399 should help alleviate.

@mgoin
Sponsor Collaborator

mgoin commented Sep 19, 2024

This should be resolved, lmk if not!

@mgoin mgoin closed this as completed Sep 19, 2024