[Dashboard] Increase the RPC timeout for the snapshot API #28330
Can we make this configurable via env var?
On a somewhat related note, I wonder if we should distinguish between timeouts and other error types...
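A minimal sketch of the two ideas in this comment, not the code merged in this PR. The environment variable name `SNAPSHOT_API_TIMEOUT_SECONDS_OVERRIDE` and the helpers `timeout_from_env` and `call_with_timeout` are hypothetical names chosen for illustration; only `asyncio.wait_for` and `asyncio.TimeoutError` are real standard-library APIs.

```python
import asyncio
import math
import os

DEFAULT_TIMEOUT_SECONDS = 30.0  # mirrors SNAPSHOT_API_TIMEOUT_SECONDS in the diff


def timeout_from_env(environ=os.environ, name="SNAPSHOT_API_TIMEOUT_SECONDS_OVERRIDE"):
    """Read a timeout from an env var (hypothetical name), falling back to the default."""
    raw = environ.get(name)
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS
    # Reject zero, negative, NaN, and infinite values.
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_TIMEOUT_SECONDS
    return value


async def call_with_timeout(coro, timeout):
    """Await `coro`, reporting a timeout separately from any other failure."""
    try:
        return ("ok", await asyncio.wait_for(coro, timeout=timeout))
    except asyncio.TimeoutError:
        return ("timeout", None)
    except Exception as exc:  # any non-timeout error
        return ("error", repr(exc))
```

Returning a status tag is only one way to separate the two cases; a real handler might instead map a timeout and a generic failure to different HTTP status codes.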
@@ -28,6 +28,8 @@

routes = dashboard_optional_utils.ClassMethodRouteTable

SNAPSHOT_API_TIMEOUT_SECONDS = 30
Do we want to pass the timeout in as a query param for any of these, similar to the component_activities route?
ray/dashboard/modules/snapshot/snapshot_head.py
Lines 132 to 136 in de2db06
self.get_job_info(),
self.get_job_submission_info(),
self.get_actor_info(actor_limit),
self.get_serve_info(),
self.get_session_name(),
That way, if any clusters have long snapshots, we could use a feature flag to target specific clusters to extend the timeout period.
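A sketch of how a per-request timeout could be read from a query parameter, under stated assumptions: the parameter name `timeout`, the upper bound `MAX_TIMEOUT_SECONDS`, and the helper `timeout_from_query` are all hypothetical and are not taken from this PR. The `query` argument is any mapping of strings, such as the query dictionary a web framework exposes on a request.

```python
import math

DEFAULT_TIMEOUT_SECONDS = 30.0  # mirrors SNAPSHOT_API_TIMEOUT_SECONDS in the diff
MAX_TIMEOUT_SECONDS = 300.0     # hypothetical cap so a caller cannot request an unbounded wait


def timeout_from_query(query, default=DEFAULT_TIMEOUT_SECONDS, maximum=MAX_TIMEOUT_SECONDS):
    """Return the timeout requested in `query`, or `default` when none is given.

    Raises ValueError for values that are not positive, finite numbers.
    """
    raw = query.get("timeout")
    if raw is None:
        return default
    try:
        value = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"timeout must be a number, got {raw!r}")
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"timeout must be positive and finite, got {raw!r}")
    return min(value, maximum)
```

With this shape, a request without the parameter keeps the default, while a cluster known to produce long snapshots can be queried with a larger value, up to the cap.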
I may be missing some context here: will snapshot API be replaced completely by
Actually, also are we sure we need this in light of #27589?
@wuisawesome it is still safe to do this. I think 2 seconds is too drastic a timeout in general.
@jjyao let me double check whether this change is reflected in the component activities too.
@wuisawesome @galenhwang I will go with Galen's suggestion for the timeout configuration (it can be specified via an HTTP request argument).
Would we be able to backport this to previous Ray versions? This mainly affects pre-2.0 versions.
cc @matthewdeng to answer #28330 (comment)
@jjyao confirms it also affects timeout for
We could perform a patch release for older versions (perhaps just 1.13), but would be hesitant to do so unless there is a pressing need.
Added the timeout support. I will try merging it once the CI passes.
cc @alanwguo I need the code owner approval too
…t#28330) The current RPC timeout is too short (2s), and we've discovered Ray components not responding within the current timeout range occasionally under pressure. This is something we will fix, but it's better if we can have a longer timeout. The current timeout is configured as 2X of the polling frequency. Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
The current RPC timeout is too short (2s), and we've found that Ray components occasionally fail to respond within it under pressure. That is something we will fix, but it's better to have a longer timeout in the meantime. The current timeout is configured as 2x the polling interval.
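To illustrate how one timeout can bound several concurrent RPCs, here is a sketch that assumes the snapshot handler awaits its data sources together. The helper `collect_snapshot` and the shape of `fetchers` (a dict of zero-argument coroutine functions) are hypothetical; this is not the code in `snapshot_head.py`.

```python
import asyncio


async def collect_snapshot(fetchers, timeout):
    """Run every fetcher concurrently under one shared deadline.

    `fetchers` maps a result name to a zero-argument coroutine function.
    Raises asyncio.TimeoutError if the whole batch exceeds `timeout` seconds;
    in that case the still-pending fetchers are cancelled.
    """
    names = list(fetchers)
    results = await asyncio.wait_for(
        asyncio.gather(*(fetchers[name]() for name in names)),
        timeout=timeout,
    )
    return dict(zip(names, results))
```

Because the deadline covers the whole batch, a single slow component under pressure fails the entire snapshot, which is one reason a 2-second limit can be too tight for this kind of aggregated call.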
Related issue number
Checks
Signed off every commit (git commit -s) in this PR.
Ran scripts/format.sh to lint the changes in this PR.