Commit
[Serve] Recover PENDING_INITIALIZATION status actor (ray-project#33890)
When a replica is still initializing and the controller dies, the user sees:

```
(ServeController pid=458318) ERROR 2023-03-29 08:44:42,547 controller 458318 deployment_state.py:500 - Exception in deployment 'xqHWInctmH'
(ServeController pid=458318) Traceback (most recent call last):
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/deployment_state.py", line 489, in check_ready
(ServeController pid=458318)     deployment_config, version = ray.get(self._ready_obj_ref)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(ServeController pid=458318)     return func(*args, **kwargs)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2508, in get
(ServeController pid=458318)     raise value.as_instanceof_cause()
(ServeController pid=458318) ray.exceptions.RayTaskError(AttributeError): ray::ServeReplica:xqHWInctmH.get_metadata() (pid=458148, ip=172.31.126.222, repr=<ray.serve._private.replica.ServeReplica:xqHWInctmH object at 0x7f57ae0f0110>)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
(ServeController pid=458318)     return self.__get_result()
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
(ServeController pid=458318)     raise self._exception
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/replica.py", line 249, in get_metadata
(ServeController pid=458318)     return self.replica.deployment_config, self.replica.version
(ServeController pid=458318) AttributeError: 'NoneType' object has no attribute 'deployment_config'
```

- Fix the NoneType bug that occurs during recovery. Whatever state the actor is in, the user no longer sees the traceback. Instead:
  1. The user sees the slow startup warning, then the replica is terminated and a new replica is provisioned.
  2. If the actor succeeds in `PENDING_INITIALIZATION`, no error appears.

This was observed in long_running_serve_failure: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_u7xeve33e2djg9grgr9qcs9l4x?command-history-section=command_history

**Note**: If the user's constructor simply takes a very long time to finish, it is recommended to increase `SERVE_SLOW_STARTUP_WARNING_S`. (If the user doesn't change it, the deployment manager will terminate the old replica and start a new replica after the timeout.)

Co-authored-by: shrekris-anyscale <[email protected]>
Signed-off-by: elliottower <[email protected]>
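
The `AttributeError` in the traceback arises because `get_metadata` reads `self.replica` before the replica object exists. A minimal, Ray-free sketch of that failure mode and one possible guard follows. The class `ReplicaWrapperSketch`, the `_ReplicaSketch` placeholder, and the event-based wait are all hypothetical illustrations, not the actual change made in this commit or Ray Serve's real classes.

```python
import asyncio


class _ReplicaSketch:
    """Hypothetical placeholder for the object created by the user constructor."""

    def __init__(self):
        self.deployment_config = "cfg"
        self.version = "v1"


class ReplicaWrapperSketch:
    """Hypothetical stand-in for a replica actor wrapper (not Ray Serve's class)."""

    def __init__(self):
        self.replica = None  # stays None until initialization finishes
        self._initialized = asyncio.Event()

    async def initialize(self, delay_s):
        await asyncio.sleep(delay_s)  # simulates a slow user constructor
        self.replica = _ReplicaSketch()
        self._initialized.set()

    def get_metadata_unsafe(self):
        # Mirrors the failing pattern: raises AttributeError while replica is None.
        return self.replica.deployment_config, self.replica.version

    async def get_metadata(self):
        # Guarded variant: wait for initialization before touching the replica.
        await self._initialized.wait()
        return self.replica.deployment_config, self.replica.version


async def _demo():
    wrapper = ReplicaWrapperSketch()
    init_task = asyncio.create_task(wrapper.initialize(0.05))
    try:
        wrapper.get_metadata_unsafe()
        early = "no error"
    except AttributeError as exc:
        early = str(exc)
    metadata = await wrapper.get_metadata()
    await init_task
    return early, metadata


def demo():
    return asyncio.run(_demo())


if __name__ == "__main__":
    print(demo())
```

In this sketch, a caller that recovers a handle to a still-initializing actor blocks on `get_metadata` instead of failing, which leaves any startup timeout handling to the caller.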