Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dashboard] Increase the RPC timeout for the snapshot API #28330

Merged
merged 6 commits into from
Sep 26, 2022

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Sep 7, 2022

Why are these changes needed?

The current RPC timeout is too short (2s), and we've discovered Ray components not responding within the current timeout range occasionally under pressure. This is something we will fix, but it's better if we can have a longer timeout. The current timeout is configured as 2X of the polling frequency.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@wuisawesome wuisawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this configurable via env var?

@wuisawesome
Copy link
Contributor

On a somewhat related note, I wonder if we should distinguish between timeouts and other error types...

@@ -28,6 +28,8 @@

routes = dashboard_optional_utils.ClassMethodRouteTable

SNAPSHOT_API_TIMEOUT_SECONDS = 30
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to pass timeout in as a query param for any of these? Similar to the component_activities route.

self.get_job_info(),
self.get_job_submission_info(),
self.get_actor_info(actor_limit),
self.get_serve_info(),
self.get_session_name(),

That way, if any clusters have long snapshots, we could use a feature flag to target specific clusters to extend the timeout period.

@jjyao
Copy link
Collaborator

jjyao commented Sep 7, 2022

I may miss some context here: will snapshot API be replaced completely by component_activities. Anything we do for snapshot, should we consider it for component_activities as well?

@wuisawesome
Copy link
Contributor

Actually, also are we sure we need this in light of #27589?

@rkooo567
Copy link
Contributor Author

@wuisawesome it is still safe to do this. I think 2 seconds are too drastic timeout in general

@rkooo567
Copy link
Contributor Author

@jjyao let me double check if this change is reflected to the component activities too.

@rkooo567
Copy link
Contributor Author

@wuisawesome @galenhwang I will go with Galen's suggestion for the timeout configuration (it can be specified via Http req argument)

@galenhwang
Copy link
Member

Would we able to backport this to previous Ray versions? This mainly affects pre-2.0 versions.

@rkooo567
Copy link
Contributor Author

cc @matthewdeng to answer #28330 (comment)

@rkooo567
Copy link
Contributor Author

@jjyao confirms it also affects timeout for component_activities . I think snapshot is still used along with component_activities

@matthewdeng
Copy link
Contributor

We could perform a patch release for older versions (perhaps just 1.13), but would be hesitant to do so unless there is a pressing need.

@rkooo567
Copy link
Contributor Author

Added the timeout support. I will try merging it once the CI passes.

@rkooo567
Copy link
Contributor Author

cc @alanwguo I need the code owner approval too

@rkooo567 rkooo567 merged commit d745939 into ray-project:master Sep 26, 2022
matthewdeng pushed a commit that referenced this pull request Sep 27, 2022
The current RPC timeout is too short (2s), and we've discovered Ray components not responding within the current timeout range occasionally under pressure. This is something we will fix, but it's better if we can have a longer timeout. The current timeout is configured as 2X of the polling frequency.
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…t#28330)

The current RPC timeout is too short (2s), and we've discovered Ray components not responding within the current timeout range occasionally under pressure. This is something we will fix, but it's better if we can have a longer timeout. The current timeout is configured as 2X of the polling frequency.

Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants