Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] Add health checking for http proxy actors #34944

Merged
merged 13 commits into from
May 8, 2023

Conversation

zcin
Copy link
Contributor

@zcin zcin commented May 2, 2023

Why are these changes needed?

Add health checking to HTTP Proxy

  • Add no-op check_health() method to proxy actor
  • In controller, periodically call check_health() method
    • Before ready() returns, status is STARTING
    • If an error occurs with check_health(), set UNHEALTHY
    • If request to check_health() times out, set UNHEALTHY

Related issue number

Closes #19151

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

zcin added 5 commits May 1, 2023 23:29
@zcin zcin marked this pull request as ready for review May 5, 2023 04:26
Signed-off-by: Cindy Zhang <[email protected]>
zcin added 2 commits May 4, 2023 21:40
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Copy link
Contributor

@sihanwang41 sihanwang41 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! leave some nits and please make sure unit tests cover all the statues :)

Should only be called after confirming the object ref is ready.
Resets _health_check_obj_ref to None at the end.
"""
assert len(ray.wait([self._health_check_obj_ref], timeout=0)[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need to have this assert?

python/ray/serve/_private/http_state.py Outdated Show resolved Hide resolved
python/ray/serve/_private/http_state.py Outdated Show resolved Hide resolved
python/ray/serve/_private/http_proxy.py Show resolved Hide resolved
Signed-off-by: Cindy Zhang <[email protected]>
@edoakes
Copy link
Contributor

edoakes commented May 5, 2023

If request times out, set HEALTHY <- did you mean UNHEALTHY ?

@zcin
Copy link
Contributor Author

zcin commented May 5, 2023

If request times out, set HEALTHY <- did you mean UNHEALTHY ?

Oops yes, updated

Comment on lines 763 to 764
class HTTPProxyDetails(BaseModel):
status: HTTPProxyStatus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add node_id, ip_address here?

@alanwguo anything else that there would be useful to show in the UI?

oh... maybe logs path?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any metrics related to HTTP Proxy today? I don't think so.

If we did, it would be useful to have some sort of identifier for the HTTP Proxy so i could link or filter the metrics down to the metrics related to a particular HTTP proxy.

For HTTPProxyStatus, do we have any sort of string that can provide more details about errors or anything?

Creation date could be useful I guess also

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we do have some metrics. The unique identifier would be the node_id, though that doesn't handle restarts if they were to happen. So maybe we should add actor_id here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zcin added 4 commits May 6, 2023 12:47
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thanks for picking it up!

@edoakes edoakes merged commit b7eeda3 into ray-project:master May 8, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
Add health checking to HTTP Proxy

Add no-op check_health() method to proxy actor
- In controller, periodically call check_health() method
- Before the first health check returns, status is STARTING
- If an error occurs, set UNHEALTHY
- If request times out, set UNHEALTHY
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Properly health check HTTP proxy & report its status
4 participants