[serve] Better surfacing of errors in serve status #34773

Merged: 13 commits into ray-project:master from status-message, May 4, 2023

Conversation


@zcin zcin commented Apr 26, 2023

Why are these changes needed?

  1. Surface tracebacks from constructor failures.
    Output from `serve status`:
name: default
app_status:
  status: DEPLOY_FAILED
  message: |-
    Deploying app 'default' failed:
    ray::deploy_serve_application() (pid=15878, ip=192.168.1.14)
      File "/Users/cindyz/ray/python/ray/serve/controller.py", line 947, in deploy_serve_application
        serve.run(app, name=name, route_prefix=route_prefix)
      File "/Users/cindyz/ray/python/ray/serve/api.py", line 539, in run
        client.deploy_application(
      File "/Users/cindyz/ray/python/ray/serve/_private/client.py", line 43, in check
        return f(self, *args, **kwargs)
      File "/Users/cindyz/ray/python/ray/serve/_private/client.py", line 299, in deploy_application
        self._wait_for_deployment_healthy(deployment_name)
      File "/Users/cindyz/ray/python/ray/serve/_private/client.py", line 183, in _wait_for_deployment_healthy
        raise RuntimeError(
    RuntimeError: Deployment default_Fail is UNHEALTHY: The Deployment failed to start 3 times in a row. This may be due to a problem with the deployment constructor or the initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
    ray::ServeReplica:default_Fail.is_initialized() (pid=15919, ip=192.168.1.14, repr=<ray.serve._private.replica.ServeReplica:default_Fail object at 0x1257772e0>)
      File "/Users/cindyz/miniforge3/envs/ray/lib/python3.8/concurrent/futures/_base.py", line 437, in result
        return self.__get_result()
      File "/Users/cindyz/miniforge3/envs/ray/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
        raise self._exception
      File "/Users/cindyz/ray/python/ray/serve/_private/replica.py", line 234, in is_initialized
        await self._initialize_replica()
      File "/Users/cindyz/ray/python/ray/serve/_private/replica.py", line 150, in initialize_replica
        await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
      File "/Users/cindyz/Desktop/constructor_fail.py", line 16, in __init__
        raise Exception("I need to know about this!")
    Exception: I need to know about this!
  deployment_timestamp: 1682476137.8513532
deployment_statuses:
- name: default_Fail
  status: UNHEALTHY
  message: |-
    The Deployment failed to start 3 times in a row. This may be due to a problem with the deployment constructor or the initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
    ray::ServeReplica:default_Fail.is_initialized() (pid=15919, ip=192.168.1.14, repr=<ray.serve._private.replica.ServeReplica:default_Fail object at 0x1257772e0>)
      File "/Users/cindyz/miniforge3/envs/ray/lib/python3.8/concurrent/futures/_base.py", line 437, in result
        return self.__get_result()
      File "/Users/cindyz/miniforge3/envs/ray/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
        raise self._exception
      File "/Users/cindyz/ray/python/ray/serve/_private/replica.py", line 234, in is_initialized
        await self._initialize_replica()
      File "/Users/cindyz/ray/python/ray/serve/_private/replica.py", line 150, in initialize_replica
        await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
      File "/Users/cindyz/Desktop/constructor_fail.py", line 16, in __init__
        raise Exception("I need to know about this!")
    Exception: I need to know about this!
  2. Serializes exceptions from the replica actor, so that they are displayed properly when surfaced through the controller.
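The second point can be sketched in plain Python: capture the traceback at the raise site as a string, so it survives being shipped across a process or actor boundary. The helper below is a hypothetical illustration of the idea, not Serve's actual API.

```python
import traceback


def capture_init_error(constructor, *args, **kwargs):
    """Run a constructor; on failure, return the full traceback as a string.

    Plain strings serialize cleanly, so the error detail can be returned
    to a controller process without losing frame information.
    """
    try:
        return constructor(*args, **kwargs), None
    except Exception:
        return None, traceback.format_exc()


class Fail:
    def __init__(self):
        raise Exception("I need to know about this!")


obj, err = capture_init_error(Fail)
assert obj is None
assert "I need to know about this!" in err
```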

Related issue number

Closes #34712
Closes #34683

Checks

  • I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cindy Zhang <[email protected]>
@zcin zcin changed the title from "surface constructor errors" to "[serve] Surface errors from constructor failure" Apr 26, 2023
@zcin zcin marked this pull request as ready for review April 26, 2023 17:16
@@ -446,7 +446,9 @@ def recover(self):
        else:
            self._ready_obj_ref = self._actor_handle.get_metadata.remote()

-    def check_ready(self) -> Tuple[ReplicaStartupStatus, Optional[DeploymentVersion]]:
+    def check_ready(

nit: Update the docstring for return value.
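A minimal sketch of the pattern this signature change moves toward, returning the startup status together with an optional error string. The enum values and function body here are assumed stand-ins, not Serve's actual implementation.

```python
from enum import Enum
from typing import Optional, Tuple


class ReplicaStartupStatus(Enum):
    # Minimal stand-in for Serve's enum; values assumed.
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


def check_ready(error: Optional[Exception]) -> Tuple[ReplicaStartupStatus, Optional[str]]:
    """Return (status, error_message) so callers can surface the failure."""
    if error is not None:
        return ReplicaStartupStatus.FAILED, str(error)
    return ReplicaStartupStatus.SUCCEEDED, None
```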


zcin commented Apr 26, 2023

FYI: preferably #34430 is merged first, since it touches the same code and I'd prefer to resolve the merge conflicts here.


edoakes commented Apr 26, 2023

@zcin the other PR is merged, please fix conflicts


edoakes commented Apr 26, 2023

@zcin should we fix #34683 at the same time by doing the str(e) inside the replica and returning that somehow to controller?

@zcin zcin force-pushed the status-message branch 2 times, most recently from d0c7078 to 887d07d on April 26, 2023 20:03
@ollie-iterators

I think the doc tests will be fixed by merging in recent changes from the main branch

@ollie-iterators

Lint Error: python/ray/serve/tests/test_config_files/sqlalchemy.py:5:1: E302 expected 2 blank lines, found 1
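For reference, E302 asks for two blank lines before a top-level definition; a minimal example that satisfies it:

```python
# pycodestyle E302: "expected 2 blank lines, found 1".
# A top-level def/class must be preceded by two blank lines.
x = 1


def f():  # two blank lines above, so E302 is satisfied
    return x
```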


zcin commented Apr 29, 2023

> @zcin should we fix #34683 at the same time by doing the str(e) inside the replica and returning that somehow to controller?

I've added the fix into this PR!

@zcin zcin changed the title from "[serve] Surface errors from constructor failure" to "[serve] Better surfacing of errors in serve status" Apr 29, 2023
# here because the exception may be a RayTaskError, in which case the
# full details of the error is not displayed properly with
# traceback.format_exc().
return ReplicaStartupStatus.FAILED, str(e)

I think we might want to use repr(e) instead of str(e), but there's likely no difference in practice with common exception types
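For common exception types the two differ only in whether the type name is included:

```python
# str() gives just the message; repr() includes the exception type.
e = ValueError("replica constructor failed")
print(str(e))   # replica constructor failed
print(repr(e))  # ValueError('replica constructor failed')
```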


@zcin zcin May 2, 2023


For RayTaskErrors specifically, repr(e) seems to cause weird formatting:

[Screenshot, 2023-05-01: `repr(e)` on a RayTaskError produces oddly formatted output]


Got it. Btw, can we unwrap the RayTaskError so we only show the user their underlying exception instead? This can be done with e.as_instanceof_cause()

# NOTE(zcin): we should use str(e) instead of traceback.format_exc()
# here because the full details of the error is not displayed properly
# with traceback.format_exc().
return ReplicaStartupStatus.FAILED, str(e.as_instanceof_cause())
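A rough pure-Python analog of this unwrapping; `TaskError` below is a stand-in, not Ray's `RayTaskError`, whose real `as_instanceof_cause()` is more involved:

```python
class TaskError(Exception):
    """Stand-in wrapper, analogous in spirit to RayTaskError."""

    def __init__(self, cause: Exception):
        super().__init__(f"task failed: {cause!r}")
        self.cause = cause

    def as_instanceof_cause(self) -> Exception:
        # Rebuild an exception of the cause's own type, so the user sees
        # their underlying error rather than the wrapper.
        return type(self.cause)(str(self.cause))


wrapped = TaskError(Exception("I need to know about this!"))
assert str(wrapped.as_instanceof_cause()) == "I need to know about this!"
```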

nice :)


zcin commented May 4, 2023

@edoakes Not sure what's happening with the GPU test, but I pulled an instance with repro-ci and confirmed the serve doc tests work. Should be ready to merge.

@edoakes edoakes merged commit adae990 into ray-project:master May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023