[serve] Set status message if deployment pending for too long #25861

Merged
merged 22 commits into from
Jun 28, 2022

Conversation

@zcin zcin commented Jun 16, 2022

Why are these changes needed?

If a Ray cluster does not have enough resources for a Serve deployment, the deployment gets stuck in the UPDATING status. This change sets the status message field when the allocation or initialization of actors has been pending for too long.
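The idea can be sketched roughly as follows. This is a minimal illustration and not the actual Serve implementation: `PendingReplica`, `build_slow_startup_message`, the message wording, and the 30-second default used here are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold for this sketch only.
SLOW_STARTUP_WARNING_S = 30.0


@dataclass
class PendingReplica:
    """Hypothetical stand-in for a replica that is still starting up."""

    name: str
    start_time_s: float


def build_slow_startup_message(
    deployment_name: str,
    pending: List[PendingReplica],
    now_s: Optional[float] = None,
    threshold_s: float = SLOW_STARTUP_WARNING_S,
) -> str:
    """Return a status message if any replica has been pending too long.

    Returns an empty string when no replica exceeds the threshold.
    """
    now_s = time.time() if now_s is None else now_s
    # Keep only replicas whose pending time exceeds the threshold.
    slow = [r for r in pending if now_s - r.start_time_s > threshold_s]
    if not slow:
        return ""
    return (
        f"Deployment '{deployment_name}' has {len(slow)} replicas that have "
        f"been pending for more than {threshold_s}s. The cluster may not have "
        f"enough resources to schedule them."
    )
```

A caller would write the returned string into the deployment's status message field only when it is non-empty.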

Related issue number

"Closes #25261"

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@zcin zcin added the enhancement (Request for new feature and/or capability) and serve (Ray Serve Related Issue) labels Jun 16, 2022
@zcin zcin changed the title from "Set status message if deployment pending for too long" to "[WIP][serve] Set status message if deployment pending for too long" Jun 16, 2022
@zcin zcin marked this pull request as ready for review June 16, 2022 19:07
@shrekris-anyscale shrekris-anyscale left a comment

Nice work so far! I added a comment about the comment.

if _SCALING_LOG_ENABLED:
    print_verbose_scaling_log()
# If status is UNHEALTHY, give it higher priority over the stuck

I think we can expand on this comment and its reasoning bit. Here's a suggestion:

Suggested change
- # If status is UNHEALTHY, give it higher priority over the stuck
+ # If status is UNHEALTHY, leave the message as is. The issue that caused the deployment to be unhealthy should be prioritized over this resource availability issue.

This is a bit malformatted since it's one big line, so I recommend copying it into your local code and running the formatting script on it. Feel free to edit this comment further.
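The priority rule described in the suggested comment could look roughly like this. It is a sketch only: the `DeploymentStatus` values and the `StatusInfo` container are illustrative assumptions, not Serve's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class DeploymentStatus(Enum):
    # Illustrative status values for this sketch.
    UPDATING = "UPDATING"
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"


@dataclass
class StatusInfo:
    status: DeploymentStatus
    message: str = ""


def apply_pending_message(current: StatusInfo, pending_message: str) -> StatusInfo:
    """Attach a resource-availability message unless the deployment is UNHEALTHY."""
    if current.status == DeploymentStatus.UNHEALTHY:
        # Whatever made the deployment unhealthy takes priority, so its
        # message is left untouched.
        return current
    return StatusInfo(status=current.status, message=pending_message)
```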

f"Deployment '{self._name}' has "
f"{len(pending_initialization)} replicas that have taken "
f"more than {SLOW_STARTUP_WARNING_S}s to initialize. This "
f"may be caused by a slow __init__ or reconfigure method."
)
logger.warning(message)
# If status is UNHEALTHY, give it higher priority over the stuck
@shrekris-anyscale shrekris-anyscale Jun 16, 2022

Same suggestion for this comment.

@shrekris-anyscale shrekris-anyscale left a comment

Great work on adding the unit test! It looks good. I've added a few suggestions.

client._controller._get_slow_startup_warning_s.remote()
)
# Lower slow startup warning threshold to 1 second to reduce test duration
client._controller._set_slow_startup_warning_period_s.remote(1)

Suggested change
- client._controller._set_slow_startup_warning_period_s.remote(1)
+ ray.get(client._controller._set_slow_startup_warning_period_s.remote(1))

remote() calls aren't necessarily executed immediately. They return a reference to the return value, so their actual execution can happen later. By calling ray.get() on this reference, we can be sure that after this line executes, the function has actually finished running.
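The same submit-then-wait pattern can be illustrated without a Ray cluster. The sketch below uses Python's standard `concurrent.futures` as a stand-in: `submit()` plays the role of `.remote()`, and `Future.result()` plays the role of `ray.get()`. It is an analogy rather than Ray code, and `set_threshold` is a hypothetical function.

```python
import time
from concurrent.futures import ThreadPoolExecutor

state = {"threshold_s": 30}


def set_threshold(value: int) -> None:
    """Hypothetical setter that takes a moment to run."""
    time.sleep(0.1)
    state["threshold_s"] = value


with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns a future right away; the setter may not have run yet.
    future = pool.submit(set_threshold, 1)
    # result() blocks until the call has finished, much like ray.get().
    future.result()
    threshold_after_wait = state["threshold_s"]
```

Without the blocking call, code that reads the threshold immediately after submitting could still observe the old value.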

)
# Lower slow startup warning threshold to 1 second to reduce test duration
client._controller._set_slow_startup_warning_period_s.remote(1)
client._controller._set_slow_startup_warning_s.remote(1)

Suggested change
- client._controller._set_slow_startup_warning_s.remote(1)
+ ray.get(client._controller._set_slow_startup_warning_s.remote(1))

wait_for_condition(updating_message, timeout=2)
# Reset slow startup warning threshold in case bugs that cause different
# tests to share state occur
client._controller._set_slow_startup_warning_period_s.remote(

We should wrap this in ray.get().

client._controller._set_slow_startup_warning_period_s.remote(
    original_slow_startup_warning_period_s
)
client._controller._set_slow_startup_warning_s.remote(

We should wrap this in ray.get().

)

wait_for_condition(updating_message, timeout=2)
# Reset slow startup warning threshold in case bugs that cause different

Nit: can we rephrase this comment to "Reset slow startup warning threshold to prevent state sharing across unit tests."?

@shrekris-anyscale shrekris-anyscale left a comment

Great work! This change is well-tested, and it will be very helpful when debugging resource shortages.

@zcin zcin changed the title from "[WIP][serve] Set status message if deployment pending for too long" to "[serve] Set status message if deployment pending for too long" Jun 21, 2022
Comment on lines 168 to 178
def _get_slow_startup_warning_period_s(self) -> float:
    return ray.serve.deployment_state.SLOW_STARTUP_WARNING_PERIOD_S

def _get_slow_startup_warning_s(self) -> float:
    return ray.serve.deployment_state.SLOW_STARTUP_WARNING_S

def _set_slow_startup_warning_period_s(self, period: float) -> None:
    ray.serve.deployment_state.SLOW_STARTUP_WARNING_PERIOD_S = period

def _set_slow_startup_warning_s(self, time_limit: float) -> None:
    ray.serve.deployment_state.SLOW_STARTUP_WARNING_S = time_limit

  • It looks like these are always called together. Can you group them into one _get and one _set call?
  • Do we need to configure them this way? How about supporting configuration via environment variables and testing with an environment fixture? (@shrekris-anyscale can help here)


Additionally, for private APIs, we want to be extremely pedantic; in particular, add a _for_testing suffix.


Modifying global variables is, in general, a code smell and error-prone. If we're going to do this, I'd prefer to make it a field of the controller or deployment state manager and pull the default from the constants. I'm also OK with Simon's suggestion to use env variables.
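One way to read the suggestions in this thread is to keep the thresholds as instance state seeded from module-level defaults, with a single getter and a single setter carrying a `_for_testing` suffix. The class below is a simplified, hypothetical stand-in; its name and method names are assumptions and not Serve's actual API.

```python
from typing import Tuple

# Module-level defaults; these are never reassigned at runtime.
SLOW_STARTUP_WARNING_S = 30.0
SLOW_STARTUP_WARNING_PERIOD_S = 30.0


class DeploymentStateManager:
    """Simplified, hypothetical manager that owns its own thresholds."""

    def __init__(
        self,
        slow_startup_warning_s: float = SLOW_STARTUP_WARNING_S,
        slow_startup_warning_period_s: float = SLOW_STARTUP_WARNING_PERIOD_S,
    ) -> None:
        self._slow_startup_warning_s = slow_startup_warning_s
        self._slow_startup_warning_period_s = slow_startup_warning_period_s

    def _get_slow_startup_thresholds_for_testing(self) -> Tuple[float, float]:
        """Return (warning threshold, warning period) in seconds."""
        return (self._slow_startup_warning_s, self._slow_startup_warning_period_s)

    def _set_slow_startup_thresholds_for_testing(
        self, warning_s: float, period_s: float
    ) -> None:
        """Override both thresholds on this instance only."""
        self._slow_startup_warning_s = warning_s
        self._slow_startup_warning_period_s = period_s
```

Because the override lives on the instance, a test that lowers the thresholds cannot leak the change into other tests through module state.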

@edoakes edoakes left a comment

Looks good aside from the getters/setters!

@shrekris-anyscale shrekris-anyscale left a comment

Nice work! I added a couple suggestions to remove defaults from the tests. The fixture is getting the env vars and resetting them after the test. It doesn't need to introduce separate defaults.

Comment on lines 74 to 77
SLOW_STARTUP_WARNING_S = int(os.getenv("SERVE_SLOW_STARTUP_WARNING_S", 30))
SLOW_STARTUP_WARNING_PERIOD_S = int(
    os.getenv("SERVE_SLOW_STARTUP_WARNING_PERIOD_S", 30)
)

Suggested change
- SLOW_STARTUP_WARNING_S = int(os.getenv("SERVE_SLOW_STARTUP_WARNING_S", 30))
- SLOW_STARTUP_WARNING_PERIOD_S = int(
-     os.getenv("SERVE_SLOW_STARTUP_WARNING_PERIOD_S", 30)
- )
+ SLOW_STARTUP_WARNING_S = int(os.environ.get("SERVE_SLOW_STARTUP_WARNING_S", 30))
+ SLOW_STARTUP_WARNING_PERIOD_S = int(
+     os.environ.get("SERVE_SLOW_STARTUP_WARNING_PERIOD_S", 30)
+ )

In the Serve codebase we use os.environ; it would be great to keep these consistent, which will make our lives easier over the next few months of refactoring.
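For reference, `os.getenv(key, default)` and `os.environ.get(key, default)` return the same value, so the suggestion is about consistency rather than behavior. Below is a small sketch of the pattern; the helper name `read_int_env` is an assumption made for illustration.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    # Environment values are strings, so convert explicitly.
    return int(os.environ.get(name, default))


# Variable is set: the string value is parsed.
os.environ["SERVE_SLOW_STARTUP_WARNING_S"] = "5"
configured = read_int_env("SERVE_SLOW_STARTUP_WARNING_S", 30)

# Variable is unset: the default is used.
os.environ.pop("SERVE_SLOW_STARTUP_WARNING_PERIOD_S", None)
fallback = read_int_env("SERVE_SLOW_STARTUP_WARNING_PERIOD_S", 30)
```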

Contributor Author

@zcin zcin Jun 27, 2022

Got it, thanks!

Comment on lines 57 to 60
original_slow_startup_warning_s = os.getenv("SERVE_SLOW_STARTUP_WARNING_S")
original_slow_startup_warning_period_s = os.getenv(
    "SERVE_SLOW_STARTUP_WARNING_PERIOD_S"
)

Same comment about using os.environ.

Comment on lines 73 to 78
# Reset slow startup warning threshold to prevent state sharing across unit
# tests
os.environ["SERVE_SLOW_STARTUP_WARNING_S"] = original_slow_startup_warning_s
os.environ[
    "SERVE_SLOW_STARTUP_WARNING_PERIOD_S"
] = original_slow_startup_warning_period_s

Please move this env meddling into a fixture: https://docs.pytest.org/en/6.2.x/fixture.html#yield-fixtures-recommended

Contributor Author

It's currently in a fixture, called lower_slow_startup_threshold_and_reset
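A yield-style fixture of this kind might look like the sketch below. It is written with `contextlib.contextmanager` so that it runs without pytest; the same body could be decorated with `@pytest.fixture` instead. The name `lower_slow_startup_threshold` and the value `"1"` are illustrative and not the PR's actual fixture. The sketch also restores an originally unset variable by deleting it, since assigning `None` into `os.environ` raises a `TypeError`.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator, Optional

ENV_VARS = ("SERVE_SLOW_STARTUP_WARNING_S", "SERVE_SLOW_STARTUP_WARNING_PERIOD_S")


@contextmanager
def lower_slow_startup_threshold(value: str = "1") -> Iterator[None]:
    """Temporarily lower the thresholds, then restore the original environment."""
    # Remember what was there before, including "not set at all" (None).
    originals: Dict[str, Optional[str]] = {k: os.environ.get(k) for k in ENV_VARS}
    for key in ENV_VARS:
        os.environ[key] = value
    try:
        yield
    finally:
        for key, original in originals.items():
            if original is None:
                # The variable was unset before; remove it rather than assign None.
                os.environ.pop(key, None)
            else:
                os.environ[key] = original
```

The `try`/`finally` ensures the environment is restored even if the test body raises.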

@edoakes edoakes merged commit da5366f into ray-project:master Jun 28, 2022

Successfully merging this pull request may close these issues.

Ray Serve: Return updating status detail in state message field
4 participants