[serve] Controller seems to always re-deploy deployments on controller restart #34097

Closed
edoakes opened this issue Apr 5, 2023 · 5 comments
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), ray-team-created (Ray Team created), serve (Ray Serve Related Issue)

Comments

Contributor

edoakes commented Apr 5, 2023

app.py:

from ray import serve

@serve.deployment
class A:
    pass

a = A.bind()

config.yaml:

import_path: app:a

Repro:

$ ray start --head
$ serve deploy config.yaml
$ ray list actors
< get PID of controller >
$ kill -9 <controller PID>

Original controller logs:

INFO 2023-04-05 09:33:28,920 controller 67744 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-c65fcfbe44617665d7d23552a397332bec829eaefed801e7916673fa' on node 'c65fcfbe44617665d7d23552a397332bec829eaefed801e7916673fa' listening on '0.0.0.0:8000'
INFO 2023-04-05 09:33:30,140 controller 67744 deployment_state.py:1376 - Adding 1 replica to deployment 'A'.

Restarted controller logs:

INFO 2023-04-05 09:33:49,242 controller 67770 deployment_state.py:1060 - Recovering target state for deployment A from checkpoint..
INFO 2023-04-05 09:33:49,552 controller 67770 deployment_state.py:1265 - Stopping 1 replicas of deployment 'A' with outdated versions.
INFO 2023-04-05 09:33:51,692 controller 67770 deployment_state.py:1376 - Adding 1 replica to deployment 'A'.
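
One way to confirm from these logs that the controller really restarted is to compare the number that follows `controller` in each line (67744 before, 67770 after). The sketch below assumes that field is the controller's process ID and that every line follows the layout seen above; `parse_log_line` and `LogRecord` are hypothetical helper names, not part of Ray.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the log lines in this issue:
# LEVEL DATE TIME component component_id file.py:lineno - message
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+) "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<component>\S+) (?P<component_id>\S+) "
    r"(?P<source>\S+):(?P<lineno>\d+) - (?P<message>.*)$"
)


class LogRecord(NamedTuple):
    level: str
    timestamp: str
    component: str
    component_id: str  # assumed to be the process ID for the controller
    source: str
    lineno: int
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Split one log line into fields; return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["lineno"] = int(fields["lineno"])
    return LogRecord(**fields)


before = parse_log_line(
    "INFO 2023-04-05 09:33:30,140 controller 67744 "
    "deployment_state.py:1376 - Adding 1 replica to deployment 'A'."
)
after = parse_log_line(
    "INFO 2023-04-05 09:33:49,242 controller 67770 "
    "deployment_state.py:1060 - Recovering target state for deployment A from checkpoint.."
)

# Different IDs on either side of the kill suggest a new controller process.
controller_restarted = before.component_id != after.component_id
```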
edoakes added the bug, triage, and serve labels Apr 5, 2023
edoakes changed the title from "[serve] Controller seems to always re-deploy deployments on restart" to "[serve] Controller seems to always re-deploy deployments on controller restart" Apr 5, 2023
Contributor Author

edoakes commented Apr 5, 2023

Looks like this is caused by the known issue that occurs when the config has no deployments list. I ran the same steps with a deployments list included in the config and did not see the errant behavior.
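
For reference, a config that includes a deployments list might look like the text produced by the sketch below. This is an assumption about the shape of the config used in that test, not a copy of it: the `deployments` and `name` keys follow the Serve config schema as I understand it, and `render_serve_config` is a hypothetical helper written only for illustration.

```python
from typing import Sequence


def render_serve_config(import_path: str, deployment_names: Sequence[str]) -> str:
    """Render a minimal Serve config as YAML text.

    The key names (`import_path`, `deployments`, `name`) are assumed from the
    Serve config schema; check them against the Ray version in use.
    """
    lines = [f"import_path: {import_path}"]
    if deployment_names:
        lines.append("deployments:")
        lines.extend(f"  - name: {name}" for name in deployment_names)
    return "\n".join(lines) + "\n"


# The config from this issue, with deployment A listed explicitly.
config_text = render_serve_config("app:a", ["A"])
print(config_text)
```

Writing `config_text` to config.yaml and repeating the repro steps would exercise the variant described in this comment.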

akshay-anyscale added the P1 label and removed the triage label Apr 7, 2023
richardliaw added the ray-team-created label Apr 19, 2023
Contributor Author

edoakes commented May 2, 2023

@zcin is this fixed? Wasn't able to find the other tracking issue.

Contributor

zcin commented May 2, 2023

Seems to have been fixed. This is what I'm seeing on master:
Original controller logs:

INFO 2023-05-02 09:52:13,800 controller 75111 http_state.py:206 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-263b2f686a8d5ac7e2dccdab2ca8eca956c688f5c5d2c7c199ebfd75' on node '263b2f686a8d5ac7e2dccdab2ca8eca956c688f5c5d2c7c199ebfd75' listening on '0.0.0.0:8000'
INFO 2023-05-02 09:52:14,360 controller 75111 controller.py:546 - Starting deploy_serve_application task for application default.
INFO 2023-05-02 09:52:14,885 controller 75111 deployment_state.py:1205 - Deploying new version of deployment default_A.
INFO 2023-05-02 09:52:14,924 controller 75111 deployment_state.py:1444 - Adding 1 replica to deployment default_A.
INFO 2023-05-02 09:52:14,924 controller 75111 deployment_state.py:330 - Starting replica default_A#NnWMgE for deployment default_A.
INFO 2023-05-02 09:52:15,544 controller 75111 deployment_state.py:1598 - Replica default_A#NnWMgE started successfully.
INFO 2023-05-02 09:52:15,954 controller 75111 application_state.py:202 - Deploy task for app 'default' ran successfully.

Restarted controller logs:

INFO 2023-05-02 09:52:31,591 controller 75172 deployment_state.py:1085 - Recovering target state for deployment default_A from checkpoint.
INFO 2023-05-02 09:52:31,591 controller 75172 deployment_state.py:1098 - Recovering current state for deployment default_A from 1 total actors.
INFO 2023-05-02 09:52:31,591 controller 75172 deployment_state.py:469 - Recovering replica default_A#NnWMgE for deployment default_A.
INFO 2023-05-02 09:52:31,593 controller 75172 controller.py:546 - Starting deploy_serve_application task for application default.
INFO 2023-05-02 09:52:31,697 controller 75172 deployment_state.py:1598 - Replica default_A#NnWMgE started successfully.
INFO 2023-05-02 09:52:32,411 controller 75172 application_state.py:202 - Deploy task for app 'default' ran successfully.
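
The difference between the two restart logs in this thread can be checked mechanically. The sketch below is a rough heuristic that keys off the message wording seen here ("Stopping ... with outdated versions" in the buggy run, "Recovering replica ..." in the fixed run); the wording may differ in other Ray versions, and `classify_restart` is a hypothetical helper, not a Ray API.

```python
import re
from typing import Iterable

# Message patterns copied from the controller logs quoted in this issue.
STOPPING_OUTDATED = re.compile(
    r"Stopping (\d+) replicas? of deployment .* with outdated versions"
)
RECOVERING_REPLICA = re.compile(r"Recovering replica \S+ for deployment")


def classify_restart(log_lines: Iterable[str]) -> str:
    """Return 'redeployed', 'recovered', or 'unknown' for a restarted controller's log."""
    saw_recovery = False
    for line in log_lines:
        if STOPPING_OUTDATED.search(line):
            # Replicas were torn down after the restart: the behavior reported here.
            return "redeployed"
        if RECOVERING_REPLICA.search(line):
            saw_recovery = True
    return "recovered" if saw_recovery else "unknown"


buggy_log = [
    "INFO 2023-04-05 09:33:49,242 controller 67770 deployment_state.py:1060 - Recovering target state for deployment A from checkpoint..",
    "INFO 2023-04-05 09:33:49,552 controller 67770 deployment_state.py:1265 - Stopping 1 replicas of deployment 'A' with outdated versions.",
    "INFO 2023-04-05 09:33:51,692 controller 67770 deployment_state.py:1376 - Adding 1 replica to deployment 'A'.",
]
fixed_log = [
    "INFO 2023-05-02 09:52:31,591 controller 75172 deployment_state.py:1085 - Recovering target state for deployment default_A from checkpoint.",
    "INFO 2023-05-02 09:52:31,591 controller 75172 deployment_state.py:469 - Recovering replica default_A#NnWMgE for deployment default_A.",
    "INFO 2023-05-02 09:52:31,697 controller 75172 deployment_state.py:1598 - Replica default_A#NnWMgE started successfully.",
]

print(classify_restart(buggy_log), classify_restart(fixed_log))
```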

Contributor

zcin commented May 2, 2023

Closed by #34430

zcin closed this as completed May 2, 2023
Contributor Author

edoakes commented May 2, 2023

woohoo!
