[ray client] enable ray.get with >2 sec timeout (#21883) #22165
Commit 2cf4c72 ("[ray client] Fix ctrl-c for ray.get() by setting a short-server side timeout") introduced a short server-side timeout so that ray.get() does not block later operations. However, the fix implicitly assumes that get() completes within MAX_BLOCKING_OPERATION_TIME_S (two seconds). This becomes a problem when apps use heavy objects or have limited network I/O bandwidth, so that pushing all chunks takes more than two seconds. The current retry logic then re-pushes from the first chunk on every retry, which blocks clients in an endless re-push loop. I updated the logic to pass the timeout directly to the server if it is explicitly given. Without a timeout, it still polls with the short server-side timeout, MAX_BLOCKING_OPERATION_TIME_S.
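The behavior described above can be illustrated with a minimal sketch. This is not the actual Ray client code: `client_get`, `fetch`, and `GetTimeoutError` are hypothetical names used only for illustration, and the two-second constant is taken from the description. The sketch assumes that each server call either returns the object or raises a timeout error.

```python
from typing import Callable, Optional

# Value stated in the PR description (two seconds).
MAX_BLOCKING_OPERATION_TIME_S = 2.0


class GetTimeoutError(Exception):
    """Hypothetical stand-in for a server-side timeout on a get call."""


def client_get(fetch: Callable[[float], object], timeout: Optional[float] = None):
    """Sketch of the timeout handling described in this PR.

    `fetch(server_timeout)` is a hypothetical callable that asks the server
    for the object and raises GetTimeoutError if the server-side timeout
    expires before the object is fully delivered.
    """
    if timeout is not None:
        # Explicit timeout: pass it to the server as-is, so a large object
        # gets the full window instead of being cut off after two seconds.
        return fetch(timeout)
    # No timeout given: keep polling with the short server-side timeout,
    # which leaves room between polls to react to ctrl-c.
    while True:
        try:
            return fetch(MAX_BLOCKING_OPERATION_TIME_S)
        except GetTimeoutError:
            continue
```

With an explicit `timeout=10`, a transfer that needs five seconds succeeds in a single server call. Without a timeout, the same transfer would be retried in two-second windows, which is the case this PR leaves unchanged.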
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Looks good to me, can you rebase/merge with master? That should rerun CI and hopefully fix the failing tests.
@ckw017
Hmm, re-evaluating this, I think this overloads the meaning of ray.get's timeout a bit too much. For example, users will have to explicitly add a timeout for this to work in client, even though it would be unnecessary when attaching a driver directly.
@takeshi-yoshimura would being able to set an environment variable, i.e. something like RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S=20, to increase the timeout work for your use case instead?
Alternatively, we can try bumping MAX_BLOCKING_OPERATION_TIME to a higher value.
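A rough sketch of what the environment-variable suggestion could look like is shown here. The variable name comes from the comment above and is only a proposal there; whether Ray reads it is not confirmed by this thread. The helper `max_blocking_operation_time_s` and the fallback rules are assumptions made for illustration.

```python
import os
from typing import Mapping

# Default of two seconds, as stated in the PR description.
DEFAULT_MAX_BLOCKING_OPERATION_TIME_S = 2.0

# Name proposed in the review comment; treated here as hypothetical.
ENV_VAR = "RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S"


def max_blocking_operation_time_s(environ: Mapping[str, str] = os.environ) -> float:
    """Return the blocking-operation timeout, honoring an optional override."""
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_MAX_BLOCKING_OPERATION_TIME_S
    try:
        value = float(raw)
    except ValueError:
        # Unparseable override: fall back to the default (an assumption).
        return DEFAULT_MAX_BLOCKING_OPERATION_TIME_S
    # Non-positive values make no sense as a timeout; ignore them.
    return value if value > 0 else DEFAULT_MAX_BLOCKING_OPERATION_TIME_S
```

Under this approach, users would not need to add a timeout to each ray.get call; setting the variable once would widen the polling window for every blocking operation.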
…himura/ray into ray-get-with-more-than-2-sec
@ckw017
Looks good, looks like the only thing failing is the lint. Can you update to match what the linter is suggesting?
I fixed the format. Thanks.
Thanks! Looking into the failing tests.
Looks like we had a few problems with our testing infra when you first ran the PR. If you merge with master and push again, it should be good to go!
…get-with-more-than-2-sec
Merged again. Let's see...
Okay, looks like failures are:
cc @ijrsvt can you take a look/merge this when you get the chance
LGTM
RLLIB Tests & OSX
Initially, I thought the fix should be exponential back-off, but that may change the behavior of many apps. So, this patch focuses on enabling bigger timeouts.
Why are these changes needed?
Related issue number
Closes #21883
Checks
I've run scripts/format.sh to lint the changes in this PR.