[ray client] enable ray.get with >2 sec timeout (#21883) #22165

takeshi-yoshimura · 2022-02-07T08:08:36Z

Commit 2cf4c72 ("[ray client] Fix ctrl-c for ray.get() by setting a short-server side timeout") introduced a short server-side timeout so that ray.get() does not block later operations.

However, that fix implicitly assumes that get() completes within MAX_BLOCKING_OPERATION_TIME_S (two seconds). This becomes a problem when apps fetch large objects or have limited network I/O bandwidth, so that pushing all chunks takes more than two seconds. The current retry logic then re-pushes the chunks from the beginning on every retry, so the client blocks on an endless re-push.
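To make the failure mode concrete, here is a toy model of the retry behavior described above. It is only a sketch under simplifying assumptions: the function name, the fixed per-chunk transfer time, and the retry cap are all hypothetical and are not taken from Ray's code.

```python
def get_completes(num_chunks: int, seconds_per_chunk: float,
                  timeout_s: float, max_retries: int = 100) -> bool:
    """Toy model (hypothetical): does a chunked get() ever finish?

    Each attempt can push only as many chunks as fit into `timeout_s`.
    A retry restarts from chunk 0, so no progress carries over.
    """
    for _ in range(max_retries):
        chunks_sent = min(num_chunks, int(timeout_s // seconds_per_chunk))
        if chunks_sent == num_chunks:
            return True
        # Timed out: the next attempt re-pushes from the first chunk.
    return False


# 10 chunks at 0.5 s each need 5 s in total.
print(get_completes(10, 0.5, timeout_s=2.0))   # never fits into 2 s
print(get_completes(10, 0.5, timeout_s=20.0))  # fits into one attempt
```

Because every attempt starts over, adding retries does not help in this model; only a timeout longer than the full transfer does.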

I updated the logic to pass the timeout through directly when it is given explicitly. Without an explicit timeout, it still polls using the short server-side timeout of MAX_BLOCKING_OPERATION_TIME_S.
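The intended behavior can be sketched roughly as follows. This is an illustration, not the actual patch: `get_with_polling`, `fetch_once`, and `GetTimeoutError` are hypothetical names, and `fetch_once(op_timeout)` merely stands in for one server-side get attempt.

```python
from typing import Callable, Optional

# Value named in this PR description; the real constant lives in Ray's client code.
MAX_BLOCKING_OPERATION_TIME_S = 2.0


class GetTimeoutError(Exception):
    """Hypothetical stand-in for the error raised when one attempt times out."""


def get_with_polling(fetch_once: Callable[[float], object],
                     timeout: Optional[float] = None) -> object:
    """Illustrative only: choose the server-side timeout for a get()."""
    if timeout is not None:
        # Explicit timeout: pass it to the server unchanged, no short polling.
        return fetch_once(timeout)
    while True:
        try:
            # No timeout given: poll with the short server-side timeout so
            # the client regains control regularly (e.g. to handle ctrl-c).
            return fetch_once(MAX_BLOCKING_OPERATION_TIME_S)
        except GetTimeoutError:
            continue
```

With an explicit `timeout`, a single attempt gets the whole budget, so a slow chunk transfer is not restarted every two seconds.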

Initially, I thought the fix should be exponential back-off, but that could change the behavior of many apps, so this patch focuses on enabling larger timeouts.

Why are these changes needed?

Related issue number

Closes #21883

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

stale · 2022-03-24T23:44:15Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

ckw017

Looks good to me, can you rebase/merge with master? That should rerun CI and hopefully fix the failing tests.

takeshi-yoshimura · 2022-03-28T01:26:45Z

@ckw017
Thanks. I merged with master.

ckw017

Hmm, re-evaluating this, I think this overloads the meaning of ray.get's timeout a bit too much. For example, users would have to explicitly add a timeout for this to work in client mode, even though it would be unnecessary when attaching a driver directly.

@takeshi-yoshimura would being able to set an environment variable, e.g. something like RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S=20, to increase the timeout work for your use case instead?

Alternatively, we can try bumping MAX_BLOCKING_OPERATION_TIME_S to a higher value.

cc @ericl (author of #14425)

takeshi-yoshimura · 2022-04-05T03:11:11Z

@ckw017
I updated the code to introduce a new environment variable, RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S, as you suggested.
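For readers following along, an environment-variable override of this kind might look like the sketch below. The helper name and the parsing are assumptions for illustration; the thread does not show how Ray actually reads the variable. Only the variable name and the two-second default come from this PR.

```python
import os

ENV_VAR = "RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S"


def max_blocking_operation_time_s(default: float = 2.0) -> float:
    """Hypothetical helper: return the override in seconds, else the default."""
    raw = os.environ.get(ENV_VAR)
    if raw is None:
        return default
    # float() raises ValueError on a malformed value, which surfaces the
    # misconfiguration instead of silently falling back to the default.
    return float(raw)
```

A user with large objects or a slow link would then launch the client with something like `RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S=20` set in the environment, as suggested in the review above.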

ckw017 · 2022-04-05T16:59:24Z

Looks good, looks like the only thing failing is the lint. Can you update to match what the linter is suggesting?

takeshi-yoshimura · 2022-04-07T02:01:58Z

I fixed the formatting. Thanks.

ckw017 · 2022-04-07T18:00:48Z

Thanks! Looking into the failing tests

ckw017 · 2022-04-21T17:17:48Z

Looks like we had a few problems with our testing infra when you first ran the PR. If you merge with master and push again, it should be good to go!

takeshi-yoshimura · 2022-04-22T02:47:55Z

Merged again. Let's see...

ckw017 · 2022-04-22T19:08:31Z

Okay, looks like the failures are:

- LinkCheck (safe to ignore, failing on master as well)
- rllib:learning_tests_pendulum_ddppo (safe to ignore, failing on master)
- rllib:learning_tests_cartpole_ddppo (safe to ignore, failing on master at time of merge)
- mac tests (looks like something in our infra dropped a bunch of mac agents around that time, IMO safe to ignore)

ckw017 · 2022-04-22T19:09:28Z

cc @ijrsvt can you take a look/merge this when you get the chance

ijrsvt

LGTM

ijrsvt · 2022-04-25T20:06:23Z

RLLIB Tests & OSX test_shuffle are failing on master at the point of rebase.

simon-mo assigned ckw017 Feb 22, 2022

stale bot added the stale label Mar 24, 2022

ckw017 self-requested a review March 25, 2022 00:41

stale bot removed the stale label Mar 25, 2022

ckw017 approved these changes Mar 25, 2022

Merge remote-tracking branch 'upstream/master' into ray-get-with-more-than-2-sec

6a2a756

ckw017 added the tests-ok label Mar 30, 2022

ckw017 requested changes Mar 30, 2022

takeshi-yoshimura added 3 commits April 5, 2022 03:02

Merge remote-tracking branch 'upstream/master' into ray-get-with-more-than-2-sec-2

b4dcafd

Introduce RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S

7433cf4

Merge branch 'ray-get-with-more-than-2-sec' of github.com:takeshi-yoshimura/ray into ray-get-with-more-than-2-sec

b9b4ee8

reformat

6f19b92

ckw017 removed the tests-ok label Apr 7, 2022

ckw017 approved these changes Apr 7, 2022

Merge branch 'master' of https://github.com/ray-project/ray into ray-get-with-more-than-2-sec

e941049

ckw017 added the tests-ok label Apr 22, 2022

ckw017 assigned ijrsvt Apr 22, 2022

ckw017 requested a review from ijrsvt April 22, 2022 19:09

ijrsvt approved these changes Apr 22, 2022

ijrsvt merged commit e115545 into ray-project:master Apr 25, 2022
