[release][tune] [build_base] Fix tune_scalability_durable_trainable configuration to force experiment syncs less often #31649
Conversation
krfricke: Instead of increasing the number of checkpoints to keep, can we just increase the fake training time? This will also increase the time between checkpoint syncs, correct?
justinvyu: @krfricke Yes, we could increase the training time and increase the test pass threshold. Currently, the fake training time per iteration is 6s.
krfricke: @justinvyu I'd prefer that option then, could you update the PR? Thanks!
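For context, here is a minimal sketch of what the "fake training time" knob could look like in a durable trainable. The class and config names below are illustrative, not the actual release-test code: each iteration sleeps for a fixed time before a checkpoint is taken, so a longer sleep directly spaces out checkpoint syncs.

```python
import time

from ray import tune


class FakeDurableTrainable(tune.Trainable):
    """Illustrative stand-in for the release test's trainable (not the real code)."""

    def setup(self, config):
        # "Fake training time" per iteration; the thread above mentions ~6s.
        self.sleep_time = config.get("sleep_time_per_iter", 6)

    def step(self):
        # A longer sleep means a longer gap between checkpoints, and therefore
        # between checkpoint syncs to cloud storage.
        time.sleep(self.sleep_time)
        return {"score": self.iteration}

    def save_checkpoint(self, checkpoint_dir):
        # Nothing meaningful to persist for this sketch.
        return checkpoint_dir

    def load_checkpoint(self, checkpoint):
        pass
```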
Deflakes the tune_scalability_durable_trainable release test by configuring it so that forced experiment checkpoint syncs happen less frequently, cutting down on the test runtime.

Why are these changes needed?
The tune_scalability_durable_trainable test had a performance regression (running ~300s longer) and is flaky. This was caused by experiment checkpoint syncs to cloud being forced more often after the fix in #31131. Each forced sync makes the driver wait until that sync finishes before launching a new one, which accounts for the drastically increased runtime. This PR fixes the issue by increasing the number of checkpoints to keep, which in turn reduces the frequency of forced cloud syncs.

Question: How can we provide a better default for users? Can we cap the amount of time spent waiting on cloud syncs, so that even if a user configures keep_checkpoints_num in a way that forces syncs very frequently, their script runtime is not affected beyond a certain amount?
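As a rough illustration of the kind of change described above (all names, paths, and values here are hypothetical, not the actual release-test code): a larger keep_checkpoints_num means old checkpoints are deleted less often, and since each deletion forces an experiment-state sync to cloud, the driver blocks on forced syncs less frequently.

```python
import time

from ray import tune
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.tune.syncer import SyncConfig


def fake_train(config):
    # Mimics the release test's fake training loop: sleep, then checkpoint.
    for step in range(config["num_iters"]):
        time.sleep(config["sleep_time_per_iter"])
        session.report({"score": step}, checkpoint=Checkpoint.from_dict({"step": step}))


tune.run(
    fake_train,
    config={"num_iters": 100, "sleep_time_per_iter": 6},
    num_samples=16,
    # Hypothetical cloud storage location for the durable trainable.
    sync_config=SyncConfig(upload_dir="s3://my-bucket/tune-scalability"),
    # Keeping more checkpoints means old ones are deleted less often; each
    # deletion forces an experiment checkpoint sync to cloud, so a larger
    # value reduces how often the driver blocks waiting on a forced sync.
    keep_checkpoints_num=10,
)
```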
Related issue number

Closes #31506
Checks

- I've signed off every commit (i.e., with git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.