-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serve] Better surfacing of errors in serve status #34773
Conversation
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@@ -446,7 +446,9 @@ def recover(self): | |||
else: | |||
self._ready_obj_ref = self._actor_handle.get_metadata.remote() | |||
|
|||
def check_ready(self) -> Tuple[ReplicaStartupStatus, Optional[DeploymentVersion]]: | |||
def check_ready( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Update the docstring for return value.
FYI: Preferably #34430 is merged first, since this touches the same code and would prefer to resolve merge conflicts here. |
Signed-off-by: Cindy Zhang <[email protected]>
@zcin other PR is merged please fix conflicts |
d0c7078
to
887d07d
Compare
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
I think the doc tests will be fixed by merging in recent changes from the main branch |
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Lint Error: python/ray/serve/tests/test_config_files/sqlalchemy.py:5:1: E302 expected 2 blank lines, found 1 |
Signed-off-by: Cindy Zhang <[email protected]>
# here because the exception may be a RayTaskError, in which case the | ||
# full details of the error is not displayed properly with | ||
# traceback.format_exc(). | ||
return ReplicaStartupStatus.FAILED, str(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might want to use repr(e)
instead of str(e)
, but there's likely no difference in practice with common exception types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Btw, can we unwrap the RayTaskError
so we only show the user their underlying exception instead? This can be done with e.as_instanceof_cause()
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
# NOTE(zcin): we should use str(e) instead of traceback.format_exc() | ||
# here because the full details of the error is not displayed properly | ||
# with traceback.format_exc(). | ||
return ReplicaStartupStatus.FAILED, str(e.as_instanceof_cause()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice :)
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@edoakes Not sure whats happening with the GPU test but I pulled an instance with |
1. Surface tracebacks from constructor failures. Output from `serve status`: ``` name: default app_status: status: DEPLOY_FAILED message: |- Deploying app 'default' failed: ray::deploy_serve_application() (pid=15878, ip=192.168.1.14) File "/Users/cindyz/ray/python/ray/serve/controller.py", line 947, in deploy_serve_application serve.run(app, name=name, route_prefix=route_prefix) File "/Users/cindyz/ray/python/ray/serve/api.py", line 539, in run client.deploy_application( File "/Users/cindyz/ray/python/ray/serve/_private/client.py", line 43, in check return f(self, *args, **kwargs) File "/Users/cindyz/ray/python/ray/serve/_private/client.py", line 299, in deploy_application self._wait_for_deployment_healthy(deployment_name) File "/Users/cindyz/ray/python/ray/serve/_private/client.py", line 183, in _wait_for_deployment_healthy raise RuntimeError( RuntimeError: Deployment default_Fail is UNHEALTHY: The Deployment failed to start 3 times in a row. This may be due to a problem with the deployment constructor or the initial health check failing. See controller logs for details. Retrying after 1 seconds. Error: ray::ServeReplica:default_Fail.is_initialized() (pid=15919, ip=192.168.1.14, repr=<ray.serve._private.replica.ServeReplica:default_Fail object at 0x1257772e0>) File "/Users/cindyz/miniforge3/envs/ray/lib/python3.8/concurrent/futures/_base.py", line 437, in result return self.__get_result() File "/Users/cindyz/miniforge3/envs/ray/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result raise self._exception File "/Users/cindyz/ray/python/ray/serve/_private/replica.py", line 234, in is_initialized await self._initialize_replica() File "/Users/cindyz/ray/python/ray/serve/_private/replica.py", line 150, in initialize_replica await sync_to_async(_callable.__init__)(*init_args, **init_kwargs) File "/Users/cindyz/Desktop/constructor_fail.py", line 16, in __init__ raise Exception("I need to know about this!") Exception: I need to know about this! deployment_timestamp: 1682476137.8513532 deployment_statuses: - name: default_Fail status: UNHEALTHY message: |- The Deployment failed to start 3 times in a row. This may be due to a problem with the deployment constructor or the initial health check failing. See controller logs for details. Retrying after 1 seconds. Error: ray::ServeReplica:default_Fail.is_initialized() (pid=15919, ip=192.168.1.14, repr=<ray.serve._private.replica.ServeReplica:default_Fail object at 0x1257772e0>) File "/Users/cindyz/miniforge3/envs/ray/lib/python3.8/concurrent/futures/_base.py", line 437, in result return self.__get_result() File "/Users/cindyz/miniforge3/envs/ray/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result raise self._exception File "/Users/cindyz/ray/python/ray/serve/_private/replica.py", line 234, in is_initialized await self._initialize_replica() File "/Users/cindyz/ray/python/ray/serve/_private/replica.py", line 150, in initialize_replica await sync_to_async(_callable.__init__)(*init_args, **init_kwargs) File "/Users/cindyz/Desktop/constructor_fail.py", line 16, in __init__ raise Exception("I need to know about this!") Exception: I need to know about this! ``` 2. Serializes exceptions from the replica actor, so that they are displayed properly when surfaced through the controller.
Why are these changes needed?
Output from
serve status
:Related issue number
Closes #34712
Closes #34683
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.