
akash-provider: report Pod Exit Code, Restart Count, Time running since last restart #246

Open · 2 tasks done
andy108369 opened this issue Aug 11, 2024 · 0 comments
Labels: P2, repo/provider (Akash provider-services repo issues)

andy108369 commented Aug 11, 2024

Is your feature request related to a problem? Please describe.

vLLM deployments currently just exit with code -15 (SIGTERM), making it hard for users to realize that the root cause is the Pod reaching the maximum memory limit set in the SDL.

The following chain of events happens:

  1. The app gets restarted and the user can only see the following in the lease logs (exit code -15 means SIGTERM), continued with a bunch of additional, rather misleading lines. Complete deployment log:
ERROR 08-11 16:59:41 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 143 died, exit code: -15
INFO 08-11 16:59:41 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 35, in __init__
    self.socket.bind(f"tcp://127.0.0.1:{port}")
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 311, in bind
    super().bind(addr)
  File "_zmq.py", line 917, in zmq.backend.cython._zmq.Socket.bind
  File "_zmq.py", line 179, in zmq.backend.cython._zmq._check_rc
zmq.error.ZMQError: Address already in use (addr='tcp://127.0.0.1:8000')
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
  2. However, the real reason (OOM) why the Pod received SIGTERM is only visible from Kubernetes directly, which Akash users do not have access to, so they cannot see it themselves:
  • exit code 137 at the pod level
$ kubectl -n $ns describe pods | grep -B2 -A4 'Last State:'
    State:          Running
      Started:      Sun, 11 Aug 2024 19:00:04 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sun, 11 Aug 2024 17:16:14 +0200
      Finished:     Sun, 11 Aug 2024 19:00:03 +0200

What exit code 137 means for Kubernetes: exit code 137 occurs when a container's memory usage exceeds the memory limit set in the pod specification. When a container consumes too much memory, Kubernetes kills it to keep it from exhausting resources on the node. The value follows the 128 + N convention, i.e. 137 = 128 + 9, meaning the container's process was killed with SIGKILL (signal 9).
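For illustration only (not code from provider-services): a small Go helper decoding this convention. Note that the container-level code uses 128 + N (137 = 128 + 9, SIGKILL), while Python's multiprocessing logs a signal death as a negative code (-15 = SIGTERM), which is why the two logs above show different numbers.

// Illustrative sketch: decode container exit codes above 128 into
// the signal that killed the process (137 -> SIGKILL, 143 -> SIGTERM).
package main

import (
	"fmt"
	"syscall"
)

func describeExitCode(code int32) string {
	if code > 128 {
		sig := syscall.Signal(code - 128)
		return fmt.Sprintf("exit code %d: killed by signal %d (%v)", code, code-128, sig)
	}
	return fmt.Sprintf("exit code %d", code)
}

func main() {
	fmt.Println(describeExitCode(137)) // exit code 137: killed by signal 9 (killed)
	fmt.Println(describeExitCode(143)) // exit code 143: killed by signal 15 (terminated)
}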

Describe the solution you'd like

The provider should read the pod's last Exit Code (as well as the timestamps) and the Restart Count.
Ideally, it should also report how long the deployment has been running since the last restart.

All of this information should be obtained via the akash-provider process itself (akash-provider needs to query K8s for this data).
This data should of course not be recorded on the blockchain, as that would bloat the chain and incur unnecessary txs/fees. A client-go sketch of how the provider could collect these fields is shown after the examples below.

  • Exit Code
$ kubectl -n $ns describe pods | grep -B2 -A4 'Last State:'
    State:          Running
      Started:      Sun, 11 Aug 2024 19:00:04 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sun, 11 Aug 2024 17:16:14 +0200
      Finished:     Sun, 11 Aug 2024 19:00:03 +0200
  • Restart Count
$ kubectl -n $ns describe pods | grep -i 'restart count'
    Restart Count:  1
  • Time running since last restart (32m ago):
$ kubectl -n $ns get pods -o wide
NAME     READY   STATUS    RESTARTS      AGE    IP               NODE    NOMINATED NODE   READINESS GATES
vllm-0   1/1     Running   1 (32m ago)   136m   10.233.102.161   node1   <none>           <none>
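A minimal client-go sketch of how the provider could collect all three fields, assuming in-cluster config; the namespace is a placeholder, since the real akash-provider would iterate over its lease namespaces. It reads the same containerStatuses data that kubectl prints above.

// Minimal sketch: query K8s for last exit code, restart count, and
// time running since the last restart, per container in a namespace.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "<lease-namespace>" // placeholder: one namespace per lease
	pods, err := clientset.CoreV1().Pods(ns).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			// Restart Count
			fmt.Printf("%s/%s restarts=%d\n", pod.Name, cs.Name, cs.RestartCount)

			// Exit Code, reason, and timestamps of the last terminated run
			if t := cs.LastTerminationState.Terminated; t != nil {
				fmt.Printf("  last exit code=%d reason=%s finished=%s\n",
					t.ExitCode, t.Reason, t.FinishedAt.Time)
			}

			// Time running since the last (re)start
			if r := cs.State.Running; r != nil {
				fmt.Printf("  running for %s\n", time.Since(r.StartedAt.Time).Round(time.Second))
			}
		}
	}
}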

Describe alternatives you've considered

N/A

Search

  • I did search for other open and closed issues before opening this

Code of Conduct

  • I agree to follow this project's Code of Conduct

Additional context

Unfortunately, lease-events does not report this information.

andy108369 changed the title from "akash-provider: report Pod **Exit Code**, **Restart Count**, **Time running since last restart**" to "akash-provider: report Pod Exit Code, Restart Count, Time running since last restart" on Aug 11, 2024
chainzero added the repo/provider (Akash provider-services repo issues) and P2 labels and removed the awaiting-triage label on Aug 21, 2024