-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[1/2][serve] Use GcsClient to replace the kv client to use timeout. #25633
Conversation
@simon-mo @edoakes @shrekris-anyscale any other comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, the changes look good!
python/ray/serve/constants.py
Outdated
@@ -99,6 +101,9 @@ | |||
# Handle metric push interval. (This interval will affect the cold start time period) | |||
HANDLE_METRIC_PUSH_INTERVAL_S = 10 | |||
|
|||
# Timeout for GCS internal KV service | |||
RAY_SERVE_KV_TIMEOUT = int(os.environ.get("RAY_SERVE_KV_TIMEOUT_S", "0")) or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean by default, Serve doesn't time out during KV operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question? Why is there a None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also can you suffix this constant with _S like others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also this can be a float right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simon-mo @shrekris-anyscale maybe I didn't state it clearly. My plan is to put a non-None value in following PR since it involves more work. This is only for replacing GcsClient in serve. The WIP PR is here.
@simon-mo Good catch I'll update the int to float.
I'm not sure whether it's a good practice to add _S there because, for python, all the timeouts are measured by seconds in float type. But it's your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_S is the convention ins Serve, update will be great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, you are saying that right now gcs client only supports timeout=None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, GCS client supports timeout. I only mean, for this PR, serve is not using timeout. The timeout will be enabled in the next PR as mentioned.
Why are these changes needed?
Timeout is only introduced in GcsClient due to the reason that ray client is not defining the timeout well for their API and it's a lot of effort to make it work e2e. For built-in component, we should use GcsClient directly.
This PR use GcsClient to replace the old one to integrate GCS HA with Ray Serve.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.