[serve] Include full traceback in deployment update error message #23752
Thanks for the quick fix! Do you mind retrying the 1k test on product to see what the actual underlying error was in #23747? I think it might actually be flakiness, but this at least helps us get to the root cause, and we might be able to close that P0.
@jiaodong Do I need to wait for this PR to get merged before retrying?
@shrekris-anyscale I don't think so. Our e2e.py in releaser runs on user code that doesn't have to be merged.
I tried re-running the serve_single_deployment_1k_noop_replica test w/ the release tooling using the wheel from 21ee7e4 but ran into some import errors. Will give it another try later after we have the rebased wheel on this commit.
@jiaodong the issue you ran into looks like a version mismatch -- need to make sure you're running the same commit on the cluster and locally.
…y-project#23752)

When deployments fail to update, [Serve sets their status to UNHEALTHY and logs the error message](https://github.com/ray-project/ray/blob/46465abd6d866c3903b17c601e84e81b46c67190/python/ray/serve/deployment_state.py#L1507-L1511). However, the message lacks a traceback, making it impossible to find what caused it. [For example](https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_SfGPJq8WWJUhAvmHHsDgJWUe?command-history-section=command_history):

```
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment echo is UNHEALTHY: Failed to update deployment: '>' not supported between instances of 'NoneType' and 'int'.
```

It's not clear where `'>' not supported between instances of 'NoneType' and 'int'.` is being triggered.

The change includes the full traceback for this type of update failure. The new status message is easier to debug:

```
  File "/Users/shrekris/Desktop/ray/python/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment A is UNHEALTHY: Failed to update deployment: Traceback (most recent call last):
  File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1503, in update
    running_replicas_changed |= self._check_and_update_replicas()
  File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1396, in _check_and_update_replicas
    a = 1/0
ZeroDivisionError: division by zero
```

(I forced a divide-by-zero error to get this traceback).
Why are these changes needed?
When deployments fail to update, Serve sets their status to UNHEALTHY and logs the error message. However, the message lacks a traceback, making it impossible to find what caused it. For example:
It's not clear where `'>' not supported between instances of 'NoneType' and 'int'.` is being triggered.
The change includes the full traceback for this type of update failure. The new status message is easier to debug:
(I forced a divide-by-zero error to get this traceback).
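The difference between the old and new status messages can be sketched in plain Python. This is a minimal illustration, not the actual Serve code: `update_replicas`, `status_message_without_traceback`, and `status_message_with_traceback` are hypothetical names, and the only library call assumed is the standard library's `traceback.format_exc()`.

```python
import traceback


def update_replicas():
    # Hypothetical stand-in for the failing update step. Comparing None
    # with an int raises the same kind of TypeError quoted in this PR.
    max_replicas = None
    return max_replicas > 0


def status_message_without_traceback():
    try:
        update_replicas()
    except Exception as e:
        # Only the exception text survives; the failing line is lost.
        return f"Failed to update deployment: {e}"


def status_message_with_traceback():
    try:
        update_replicas()
    except Exception:
        # format_exc() renders the full traceback of the exception
        # currently being handled, including file, line, and function.
        return f"Failed to update deployment:\n{traceback.format_exc()}"


if __name__ == "__main__":
    print(status_message_without_traceback())
    print(status_message_with_traceback())
```

With the second variant, the status message names the function and line that raised, which is the information missing from the original error.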
Related issue number
Addresses the vague error from #23747.
Checks
I've run `scripts/format.sh` to lint the changes in this PR.