
[tune] Never block for results #18391

Merged — 4 commits merged into ray-project:master on Sep 9, 2021
Conversation

@krfricke (Contributor) commented Sep 7, 2021

Why are these changes needed?

If at least one trial is running, Tune currently blocks until a result is received before continuing with the tuning loop.

This is an artifact of the legacy implementation, where Tune was aware of resource availability through its own resource management system.

However, since we changed to placement groups, which can become ready at any time, blocking for results is no longer a good option.

In practice, we encountered the following problems:

  1. Blocking for the results of one running trial prevents us from starting other trials when they become ready. Previously this was addressed with the staging grace period, but with long trial and resource startup times this becomes infeasible to tune.
  2. Blocking for results even when all trials are running prevents us from printing periodic updates (e.g. the status table). During that time it looks like the process is hanging, even though it isn't.

This PR introduces an environment variable controlling the maximum wait time, defaulting to one second.
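Conceptually, the fix amounts to passing a bounded timeout to the wait call instead of blocking indefinitely. Below is a minimal sketch of that pattern; the environment variable name, default, and helper are illustrative only, not necessarily what this PR adds:

```python
import os
import ray

# Hypothetical variable name for illustration; see the PR diff for the actual one.
MAX_WAIT_S = float(os.environ.get("TUNE_LOOP_MAX_WAIT_S", "1.0"))

def fetch_next_result(result_futures):
    # Wait at most MAX_WAIT_S seconds instead of blocking until a result arrives.
    ready, _ = ray.wait(list(result_futures), num_returns=1, timeout=MAX_WAIT_S)
    # With a timeout, `ready` may be empty; the loop can then start trials whose
    # placement groups became ready, or print a status update, before waiting again.
    return ready[0] if ready else None
```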

cc @AmeerHajAli

Related issue number

Addresses part of #18003

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010 (Contributor) commented Sep 7, 2021

Nooby questions:

  1. Should we start removing the legacy code and leave just the PG code path?
  2. What is the staging grace period? Why do we handle it separately? Is it a legacy concept?
  3. At a high level, the event loop wants to make progress whenever either of the following becomes true:
  • There is a result available for a running trial, in which case we want to process it.
  • Some PG is ready for a new trial to start running, in which case we want to kick off the trial.
    Is there a way we can wait on both of them at the same time, i.e.
    wait([FutureA, FutureB]) rather than
    wait(FutureA, timeout=timeout)  # plus a bunch of tweaks to this behavior
    wait(FutureB)

Thanks!

@krfricke (Contributor, Author) commented Sep 8, 2021

Hi @xwjiang2010, those are great questions!

  1. We should! Let's do this in separate PRs though, and let's include it in Q4 planning.
  2. The original purpose was to ensure that we don't block for results when placement groups become ready, to avoid exactly what happened in the issue we observed here. We could get rid of it, but I think it's still worthwhile to have lower timeouts in the beginning (during the grace period) than later. We should definitely review this behavior, though.
  3. Yes, that's a good suggestion, and it was brought up when we originally introduced placement groups. Back then we decided against it, as it would have required (even more) refactoring of the tuning loop. But yes, generally we should wait for training futures and placement group futures at the same time and process whatever gets ready first (roughly as sketched below).
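For reference, here is a rough sketch of that combined-wait idea; this is not the actual Tune loop, and the helper name and return labels are made up for illustration:

```python
import ray

def wait_for_next_event(result_futures, pg_ready_futures, timeout=1.0):
    # `pg_ready_futures` would hold the ObjectRefs returned by
    # PlacementGroup.ready(); `result_futures` the in-flight training calls.
    all_futures = list(result_futures) + list(pg_ready_futures)
    ready, _ = ray.wait(all_futures, num_returns=1, timeout=timeout)
    if not ready:
        return "timeout", None        # no event; print status, loop again
    future = ready[0]
    if future in pg_ready_futures:
        return "pg_ready", future     # start the trial waiting on this PG
    return "result", future           # process the training result
```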

@@ -225,7 +230,8 @@ def testMultiTrialReuse(self):
     config={
         "message": tune.grid_search(
             ["First", "Second", "Third", "Fourth"]),
-        "id": -1
+        "id": -1,
+        "sleep": 1,
     },
Review comment (Contributor):

Maybe add a comment on why ActorReuseMultiTest needs the sleep to work (vs. ActorReuseTest, which does not)?

I don't have a good answer yet as to how, but do you think that when we restructure the Tune code it would also be a good time to revisit testability, especially how to emulate timing in tests in a consistent and readable way, preferably with less sleeping so that tests can run faster?

@xwjiang2010 (Contributor) commented Sep 8, 2021

> 1. We should! Let's do this in separate PRs though, and let's include it in Q4 planning.

Sounds good. It's already captured in the planning doc.

For 2 & 3, I understand the PG work introduces some tech debt. +1 to reviewing the behavior. Personally I feel we're at the point where the tuning loop should be refactored to make it more robust, efficient, and future-proof (less ad-hoc logic, easier to follow). Would like to know your thoughts.

@xwjiang2010 (Contributor)

LGTM. Some minor questions on test setup.

@richardliaw merged commit 395976c into ray-project:master on Sep 9, 2021.
@krfricke deleted the tune/pg-ready branch on September 22, 2023.