[serve] Include full traceback in deployment update error message #23752

Merged: 3 commits, Apr 7, 2022

Conversation

shrekris-anyscale
Contributor

Why are these changes needed?

When deployments fail to update, Serve sets their status to UNHEALTHY and logs the error message. However, the message lacks a traceback, so there is no way to tell from the log where the error originated. For example:

File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment echo is UNHEALTHY: Failed to update deployment:
'>' not supported between instances of 'NoneType' and 'int'.

It's not clear where the error "'>' not supported between instances of 'NoneType' and 'int'" is being raised.

This change includes the full traceback in the status message for this type of update failure. The new status message is easier to debug:

File "/Users/shrekris/Desktop/ray/python/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment A is UNHEALTHY: Failed to update deployment:
Traceback (most recent call last):
  File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1503, in update
    running_replicas_changed |= self._check_and_update_replicas()
  File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1396, in _check_and_update_replicas
    a = 1/0
ZeroDivisionError: division by zero

(I forced a divide-by-zero error to get this traceback).
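
For illustration, here is a minimal sketch of the general technique, not the actual Serve code. It assumes the failure is caught in an `except` block and that the status message is a plain string. The names `check_and_update_replicas` and `build_status_messages` are hypothetical stand-ins. It contrasts a message built from `str(exc)` with one built from the standard library's `traceback.format_exc()`:

```python
import traceback


def check_and_update_replicas():
    # Hypothetical stand-in for the real update step; it fails on purpose,
    # mirroring the forced divide-by-zero above.
    return 1 / 0


def build_status_messages():
    """Return (short, full) status messages for a failed update.

    Returns two empty strings if the update succeeds.
    """
    try:
        check_and_update_replicas()
    except Exception as exc:
        # Old style: only the exception text, with no location information.
        short = f"Failed to update deployment:\n{exc}"
        # New style: the formatted traceback of the exception being handled.
        full = f"Failed to update deployment:\n{traceback.format_exc()}"
        return short, full
    return "", ""


short_message, full_message = build_status_messages()
print(short_message)
print(full_message)
```

The short message contains only the exception text, while the full message also names the file, line, and function where the exception was raised.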

Related issue number

Addresses the vague error from #23747.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • N/A

@jiaodong
Member

jiaodong commented Apr 6, 2022

Thanks for the quick fix! Do you mind retrying the 1k test on product to see what the actual underlying error was in #23747? I think it might actually be flakiness, but this at least helps us get to the root cause, and we might be able to close that P0.

@shrekris-anyscale
Contributor Author

@jiaodong Do I need to wait for this PR to get merged before retrying?

@jiaodong
Member

jiaodong commented Apr 6, 2022

@shrekris-anyscale I don't think so. Our e2e.py in releaser runs on user code, which doesn't have to be merged.

@jiaodong
Member

jiaodong commented Apr 7, 2022

I tried re-running the serve_single_deployment_1k_noop_replica test with the release tooling, using the wheel from 21ee7e4, but ran into some import errors. I'll give it another try later, once we have the rebased wheel for this commit.

Traceback (most recent call last):
  File "run_release_test.py", line 137, in main
    no_terminate=no_terminate,
  File "/Users/jiaodong/Workspace/ray/release/ray_release/glue.py", line 314, in run_release_test
    raise pipeline_exception
  File "/Users/jiaodong/Workspace/ray/release/ray_release/glue.py", line 204, in run_release_test
    command_runner.prepare_remote_env()
  File "/Users/jiaodong/Workspace/ray/release/ray_release/command_runner/sdk_runner.py", line 58, in prepare_remote_env
    ) from e
ray_release.exception.RemoteEnvSetupError: Error setting up remote environment: No module named 'ray.job_submission'

@edoakes
Contributor

edoakes commented Apr 7, 2022

@jiaodong the issue you ran into looks like a version mismatch: you need to make sure you're running the same commit on the cluster and locally.

@edoakes edoakes merged commit 0902ec5 into ray-project:master Apr 7, 2022
edoakes pushed a commit to edoakes/ray that referenced this pull request Apr 7, 2022
…y-project#23752)
