[serve] Include full traceback in deployment update error message #23752

shrekris-anyscale · 2022-04-06T18:33:02Z

Why are these changes needed?

When deployments fail to update, Serve sets their status to UNHEALTHY and logs the error message. However, the message lacks a traceback, which makes it hard to find where the error originated. For example:

File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment echo is UNHEALTHY: Failed to update deployment:
'>' not supported between instances of 'NoneType' and 'int'.

It's not clear where the error '>' not supported between instances of 'NoneType' and 'int' is being raised.

The change includes the full traceback for this type of update failure. The new status message is easier to debug:

File "/Users/shrekris/Desktop/ray/python/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment A is UNHEALTHY: Failed to update deployment:
Traceback (most recent call last):
  File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1503, in update
    running_replicas_changed |= self._check_and_update_replicas()
  File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1396, in _check_and_update_replicas
    a = 1/0
ZeroDivisionError: division by zero

(I forced a divide-by-zero error to get this traceback).
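For illustration, here is a minimal sketch of the idea, not the actual Serve code. It assumes the status message is built inside an `except` block; `check_and_update_replicas` and `build_status_message` are hypothetical names. It contrasts formatting only the exception text with using the standard library's `traceback.format_exc()`:

```python
import traceback


def check_and_update_replicas():
    # Hypothetical stand-in for the real replica check; it forces an error.
    return 1 / 0


def build_status_message(include_traceback):
    """Return a status message if the update fails, or None on success."""
    try:
        check_and_update_replicas()
    except Exception as e:
        if include_traceback:
            # format_exc() returns the full formatted traceback of the
            # exception currently being handled, as a string.
            detail = traceback.format_exc()
        else:
            # Only the exception text: no file, line, or call chain.
            detail = str(e)
        return f"Failed to update deployment:\n{detail}"
    return None


short_message = build_status_message(include_traceback=False)
full_message = build_status_message(include_traceback=True)
print(short_message)
print(full_message)
```

The short message carries only the exception text, while the full one also names the file, line, and function where the error was raised.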

Related issue number

Addresses the vague error from #23747.

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- N/A

jiaodong · 2022-04-06T18:48:22Z

Thanks for the quick fix! Do you mind retrying the 1k test on product to see what the actual underlying error was in #23747? I think it might actually be flakiness, but this at least helps us get to the root cause, and we might be able to close that P0.

shrekris-anyscale · 2022-04-06T19:18:30Z

@jiaodong Do I need to wait for this PR to get merged before retrying?

jiaodong · 2022-04-06T19:20:33Z

@shrekris-anyscale I don't think so. Our e2e.py in releaser runs on user code that doesn't have to be merged.

jiaodong · 2022-04-07T01:38:36Z

I tried re-running the serve_single_deployment_1k_noop_replica test with the release tooling using the wheel from 21ee7e4, but ran into some import errors. I'll give it another try later, once we have the rebased wheel on this commit.

Traceback (most recent call last):
  File "run_release_test.py", line 137, in main
    no_terminate=no_terminate,
  File "/Users/jiaodong/Workspace/ray/release/ray_release/glue.py", line 314, in run_release_test
    raise pipeline_exception
  File "/Users/jiaodong/Workspace/ray/release/ray_release/glue.py", line 204, in run_release_test
    command_runner.prepare_remote_env()
  File "/Users/jiaodong/Workspace/ray/release/ray_release/command_runner/sdk_runner.py", line 58, in prepare_remote_env
    ) from e
ray_release.exception.RemoteEnvSetupError: Error setting up remote environment: No module named 'ray.job_submission'

edoakes · 2022-04-07T15:32:58Z

@jiaodong The issue you ran into looks like a version mismatch. You need to make sure you're running the same commit on the cluster and locally.


shrekris-anyscale added 2 commits April 6, 2022 11:07

Print entire traceback on update error

f080da2

Use correct exc function

21ee7e4

shrekris-anyscale requested review from jiaodong and edoakes April 6, 2022 18:33

shrekris-anyscale assigned jiaodong and edoakes Apr 6, 2022

jiaodong approved these changes Apr 6, 2022


edoakes approved these changes Apr 6, 2022


Merge branch 'master' of github.com:ray-project/ray into refine_err

5851a45

edoakes merged commit 0902ec5 into ray-project:master Apr 7, 2022
