
akash-provider: report Pod Exit Code, Restart Count, Time running since last restart #246

Open · 2 tasks done
andy108369 opened this issue Aug 11, 2024 · 0 comments
Labels: P2, repo/provider (Akash provider-services repo issues)

andy108369 commented Aug 11, 2024

Is your feature request related to a problem? Please describe.

vLLM deployments currently just exit with code -15 (SIGTERM), making it hard for users to realize that the root cause is the Pod reaching the maximum memory limit set in the SDL.

The following chain of events happens:

  1. The app gets restarted and the user can only see the following in the lease logs (exit code -15 means SIGTERM), continued with a bunch of additional, rather misleading lines. Complete deployment log:
ERROR 08-11 16:59:41 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 143 died, exit code: -15
INFO 08-11 16:59:41 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 35, in __init__
    self.socket.bind(f"tcp://127.0.0.1:{port}")
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 311, in bind
    super().bind(addr)
  File "_zmq.py", line 917, in zmq.backend.cython._zmq.Socket.bind
  File "_zmq.py", line 179, in zmq.backend.cython._zmq._check_rc
zmq.error.ZMQError: Address already in use (addr='tcp://127.0.0.1:8000')
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
  2. However, the real reason (OOM) why the Pod received SIGTERM is only visible from Kubernetes directly, which Akash users do not have access to, so they cannot see it themselves:
  • exit code 137 at the pod level
$ kubectl -n $ns describe pods | grep -B2 -A4 'Last State:'
    State:          Running
      Started:      Sun, 11 Aug 2024 19:00:04 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sun, 11 Aug 2024 17:16:14 +0200
      Finished:     Sun, 11 Aug 2024 19:00:03 +0200

What exit code 137 means for Kubernetes: exit code 137 occurs when a container's memory usage exceeds the memory limit set in the pod specification. When a container consumes too much memory, Kubernetes kills it to keep it from exhausting resources on the node. The value follows the 128 + N convention, i.e. 137 = 128 + 9, meaning the container's process was killed with SIGKILL (signal 9).
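For illustration only (not code from provider-services): a small Go helper decoding this convention. Note that the container-level code uses 128 + N (137 = 128 + 9, SIGKILL), while Python's multiprocessing logs a signal death as a negative code (-15 = SIGTERM), which is why the two logs above show different numbers.

// Illustrative sketch: decode container exit codes above 128 into
// the signal that killed the process (137 -> SIGKILL, 143 -> SIGTERM).
package main

import (
	"fmt"
	"syscall"
)

func describeExitCode(code int32) string {
	if code > 128 {
		sig := syscall.Signal(code - 128)
		return fmt.Sprintf("exit code %d: killed by signal %d (%v)", code, code-128, sig)
	}
	return fmt.Sprintf("exit code %d", code)
}

func main() {
	fmt.Println(describeExitCode(137)) // exit code 137: killed by signal 9 (killed)
	fmt.Println(describeExitCode(143)) // exit code 143: killed by signal 15 (terminated)
}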

Describe the solution you'd like

The provider should read the pod's last Exit Code (as well as the timestamps) and the Restart Count.
Ideally, it should also report how long the deployment has been running since the last restart.

All of this information should be obtained via the akash-provider process itself (akash-provider needs to query K8s for this data).
This data should of course not be recorded on the blockchain, as that would bloat the chain and incur unnecessary txs/fees. A client-go sketch of how the provider could collect these fields is shown after the examples below.

  • Exit Code
$ kubectl -n $ns describe pods | grep -B2 -A4 'Last State:'
    State:          Running
      Started:      Sun, 11 Aug 2024 19:00:04 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sun, 11 Aug 2024 17:16:14 +0200
      Finished:     Sun, 11 Aug 2024 19:00:03 +0200
  • Restart Count
$ kubectl -n $ns describe pods | grep -i 'restart count'
    Restart Count:  1
  • Time running since last restart (32m ago):
$ kubectl -n $ns get pods -o wide
NAME     READY   STATUS    RESTARTS      AGE    IP               NODE    NOMINATED NODE   READINESS GATES
vllm-0   1/1     Running   1 (32m ago)   136m   10.233.102.161   node1   <none>           <none>
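A minimal client-go sketch of how the provider could collect all three fields, assuming in-cluster config; the namespace is a placeholder, since the real akash-provider would iterate over its lease namespaces. It reads the same containerStatuses data that kubectl prints above.

// Minimal sketch: query K8s for last exit code, restart count, and
// time running since the last restart, per container in a namespace.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "<lease-namespace>" // placeholder: one namespace per lease
	pods, err := clientset.CoreV1().Pods(ns).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			// Restart Count
			fmt.Printf("%s/%s restarts=%d\n", pod.Name, cs.Name, cs.RestartCount)

			// Exit Code, reason, and timestamps of the last terminated run
			if t := cs.LastTerminationState.Terminated; t != nil {
				fmt.Printf("  last exit code=%d reason=%s finished=%s\n",
					t.ExitCode, t.Reason, t.FinishedAt.Time)
			}

			// Time running since the last (re)start
			if r := cs.State.Running; r != nil {
				fmt.Printf("  running for %s\n", time.Since(r.StartedAt.Time).Round(time.Second))
			}
		}
	}
}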

Describe alternatives you've considered

N/A

Search

  • I did search for other open and closed issues before opening this

Code of Conduct

  • I agree to follow this project's Code of Conduct

Additional context

Unfortunately, lease-events does not report this information.

andy108369 changed the title from "akash-provider: report Pod **Exit Code**, **Restart Count**, **Time running since last restart**" to "akash-provider: report Pod Exit Code, Restart Count, Time running since last restart" on Aug 11, 2024
chainzero added the repo/provider (Akash provider-services repo issues) and P2 labels and removed the awaiting-triage label on Aug 21, 2024