Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1/2][serve] Use GcsClient to replace the kv client to use timeout. #25633

Merged
merged 8 commits into from
Jun 11, 2022

Conversation

fishbone
Copy link
Contributor

@fishbone fishbone commented Jun 9, 2022

Why are these changes needed?

Timeout is only introduced in GcsClient due to the reason that ray client is not defining the timeout well for their API and it's a lot of effort to make it work e2e. For built-in component, we should use GcsClient directly.

This PR use GcsClient to replace the old one to integrate GCS HA with Ray Serve.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

python/ray/serve/storage/kv_store.py Outdated Show resolved Hide resolved
python/ray/serve/storage/kv_store.py Outdated Show resolved Hide resolved
@fishbone fishbone added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 10, 2022
@fishbone
Copy link
Contributor Author

@simon-mo @edoakes @shrekris-anyscale any other comments?

Copy link
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, the changes look good!

@@ -99,6 +101,9 @@
# Handle metric push interval. (This interval will affect the cold start time period)
HANDLE_METRIC_PUSH_INTERVAL_S = 10

# Timeout for GCS internal KV service
RAY_SERVE_KV_TIMEOUT = int(os.environ.get("RAY_SERVE_KV_TIMEOUT_S", "0")) or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean by default, Serve doesn't time out during KV operations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question? Why is there a None?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can you suffix this constant with _S like others?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this can be a float right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simon-mo @shrekris-anyscale maybe I didn't state it clearly. My plan is to put a non-None value in following PR since it involves more work. This is only for replacing GcsClient in serve. The WIP PR is here.

@simon-mo Good catch I'll update the int to float.

I'm not sure whether it's a good practice to add _S there because, for python, all the timeouts are measured by seconds in float type. But it's your call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_S is the convention ins Serve, update will be great!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, you are saying that right now gcs client only supports timeout=None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, GCS client supports timeout. I only mean, for this PR, serve is not using timeout. The timeout will be enabled in the next PR as mentioned.

@fishbone fishbone removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 10, 2022
@fishbone fishbone merged commit 0c527b4 into ray-project:master Jun 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants