-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serve] Set status message if deployment pending for too long #25861
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work so far! I added a comment about the comment.
python/ray/serve/deployment_state.py
Outdated
if _SCALING_LOG_ENABLED: | ||
print_verbose_scaling_log() | ||
# If status is UNHEALTHY, give it higher priority over the stuck |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can expand on this comment and its reasoning bit. Here's a suggestion:
# If status is UNHEALTHY, give it higher priority over the stuck | |
# If status is UNHEALTHY, leave the message as is. The issue that caused the deployment to be unhealthy should be prioritized over this resource availability issue. |
This is a bit malformatted since it's one big line, so I recommend copying it into your local code and running the formatting script on it. Feel free to edit this comment further.
python/ray/serve/deployment_state.py
Outdated
f"Deployment '{self._name}' has " | ||
f"{len(pending_initialization)} replicas that have taken " | ||
f"more than {SLOW_STARTUP_WARNING_S}s to initialize. This " | ||
f"may be caused by a slow __init__ or reconfigure method." | ||
) | ||
logger.warning(message) | ||
# If status is UNHEALTHY, give it higher priority over the stuck |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same suggestion for this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work on adding the unit test! It looks good. I've added a few suggestions.
client._controller._get_slow_startup_warning_s.remote() | ||
) | ||
# Lower slow startup warning threshold to 1 second to reduce test duration | ||
client._controller._set_slow_startup_warning_period_s.remote(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
client._controller._set_slow_startup_warning_period_s.remote(1) | |
ray.get(client._controller._set_slow_startup_warning_period_s.remote(1)) |
remote()
calls aren't necessarily executed immediately. They return a reference to the return value, so their actual execution can happen later. By calling ray.get()
on this reference, we can be sure that after this line executes, the function has actually finished running.
) | ||
# Lower slow startup warning threshold to 1 second to reduce test duration | ||
client._controller._set_slow_startup_warning_period_s.remote(1) | ||
client._controller._set_slow_startup_warning_s.remote(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
client._controller._set_slow_startup_warning_s.remote(1) | |
ray.get(client._controller._set_slow_startup_warning_s.remote(1)) |
wait_for_condition(updating_message, timeout=2) | ||
# Reset slow startup warning threshold in case bugs that cause different | ||
# tests to share state occur | ||
client._controller._set_slow_startup_warning_period_s.remote( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should wrap this in ray.get()
.
client._controller._set_slow_startup_warning_period_s.remote( | ||
original_slow_startup_warning_period_s | ||
) | ||
client._controller._set_slow_startup_warning_s.remote( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should wrap this in ray.get()
.
) | ||
|
||
wait_for_condition(updating_message, timeout=2) | ||
# Reset slow startup warning threshold in case bugs that cause different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit– can we rephrase this comment to "Reset slow startup warning threshold to prevent state sharing across unit tests."
…pending for too long
…rrided with resource availability issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! This change is well-tested, and it will be very helpful when debugging resource shortages.
…te_status_message
…te_status_message
…te_status_message
python/ray/serve/controller.py
Outdated
def _get_slow_startup_warning_period_s(self) -> float: | ||
return ray.serve.deployment_state.SLOW_STARTUP_WARNING_PERIOD_S | ||
|
||
def _get_slow_startup_warning_s(self) -> float: | ||
return ray.serve.deployment_state.SLOW_STARTUP_WARNING_S | ||
|
||
def _set_slow_startup_warning_period_s(self, period: float) -> None: | ||
ray.serve.deployment_state.SLOW_STARTUP_WARNING_PERIOD_S = period | ||
|
||
def _set_slow_startup_warning_s(self, time_limit: float) -> None: | ||
ray.serve.deployment_state.SLOW_STARTUP_WARNING_S = time_limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- looks like they are called in together. Can you group them into the one _get and one _set call?
- do we need to configure them this way? How about support configure these via environment variable and and test with environment fixture (@shrekris-anyscale can help here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
additionally, for private apis, we want to extremely pedantic, in particular, adding a _for_testing
suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modifying global variables is in general bad code smell/error prone. if we're going to do this, I'd prefer to make it a field of the controller or deployment state manager and pull the default from the constants. Also ok with Simon's suggestion to use env variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good aside from the getters/setters!
python/ray/serve/controller.py
Outdated
def _get_slow_startup_warning_period_s(self) -> float: | ||
return ray.serve.deployment_state.SLOW_STARTUP_WARNING_PERIOD_S | ||
|
||
def _get_slow_startup_warning_s(self) -> float: | ||
return ray.serve.deployment_state.SLOW_STARTUP_WARNING_S | ||
|
||
def _set_slow_startup_warning_period_s(self, period: float) -> None: | ||
ray.serve.deployment_state.SLOW_STARTUP_WARNING_PERIOD_S = period | ||
|
||
def _set_slow_startup_warning_s(self, time_limit: float) -> None: | ||
ray.serve.deployment_state.SLOW_STARTUP_WARNING_S = time_limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modifying global variables is in general bad code smell/error prone. if we're going to do this, I'd prefer to make it a field of the controller or deployment state manager and pull the default from the constants. Also ok with Simon's suggestion to use env variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I added a couple suggestions to remove defaults from the tests. The fixture is getting the env vars and resetting them after the test. It doesn't need to introduce separate defaults.
Co-authored-by: shrekris-anyscale <[email protected]>
Co-authored-by: shrekris-anyscale <[email protected]>
python/ray/serve/deployment_state.py
Outdated
SLOW_STARTUP_WARNING_S = int(os.getenv("SERVE_SLOW_STARTUP_WARNING_S", 30)) | ||
SLOW_STARTUP_WARNING_PERIOD_S = int( | ||
os.getenv("SERVE_SLOW_STARTUP_WARNING_PERIOD_S", 30) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SLOW_STARTUP_WARNING_S = int(os.getenv("SERVE_SLOW_STARTUP_WARNING_S", 30)) | |
SLOW_STARTUP_WARNING_PERIOD_S = int( | |
os.getenv("SERVE_SLOW_STARTUP_WARNING_PERIOD_S", 30) | |
) | |
SLOW_STARTUP_WARNING_S = int(os.environ.get("SERVE_SLOW_STARTUP_WARNING_S", 30)) | |
SLOW_STARTUP_WARNING_PERIOD_S = int( | |
os.environ.get("SERVE_SLOW_STARTUP_WARNING_PERIOD_S", 30) | |
) |
In Serve codebase we uses os.environ, it would be great to keep them consistent so it makes our life easier in next few months during refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks!
original_slow_startup_warning_s = os.getenv("SERVE_SLOW_STARTUP_WARNING_S") | ||
original_slow_startup_warning_period_s = os.getenv( | ||
"SERVE_SLOW_STARTUP_WARNING_PERIOD_S" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment for using os.environ
# Reset slow startup warning threshold to prevent state sharing across unit | ||
# tests | ||
os.environ["SERVE_SLOW_STARTUP_WARNING_S"] = original_slow_startup_warning_s | ||
os.environ[ | ||
"SERVE_SLOW_STARTUP_WARNING_PERIOD_S" | ||
] = original_slow_startup_warning_period_s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please move these env meddling into a fixture https://docs.pytest.org/en/6.2.x/fixture.html#yield-fixtures-recommended
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's currently in a fixture, called lower_slow_startup_threshold_and_reset
…nto update_status_message
…te_status_message
Why are these changes needed?
If a ray cluster does not have enough resources for a serve deployment, the deployment will be stuck at
updating
status. This change will set themessage
field when allocations/initializations of actors have been pending for too long.Related issue number
"Closes #25261"
Checks
scripts/format.sh
to lint the changes in this PR.