[tune/rllib] Fix tune cloud tests for function and rllib trainables #20536

Merged
krfricke merged 7 commits into ray-project:master from tune/fix-cloud-tests on Nov 24, 2021

Conversation

krfricke
Contributor

Why are these changes needed?

We currently enforce strict checkpoint checking in the cloud test. However, we sometimes run into a race condition when trials are interrupted: A remote trial might already have progressed to a new checkpoint (and deleted old ones), but the trial scheduler has not yet received the latest result in order to add this information to the experiment checkpoint.

Additionally, there was a bug in the trial runner where trial metadata was not updated upon receiving a new result. This also led to problems with interrupted training runs.

This PR relaxes the checkpoint requirement to also accept "too new" checkpoints if necessary, and replaces the outdated try_checkpoint_metadata method with the trial executor's trial cache set (mark_trial_to_checkpoint).
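As a rough illustration of the relaxed requirement (a hedged sketch only; the helper below is hypothetical and not the actual cloud test code), the check accepts either the checkpoint recorded in the experiment state or a strictly newer one, while a missing or older-only checkpoint still fails:

# Hypothetical sketch of a relaxed checkpoint assertion for the cloud tests.
def assert_checkpoint_present(expected_iter, found_iters):
    """Pass if the expected checkpoint exists, or a newer one does.

    A remote trial may already have produced (and rotated to) a newer
    checkpoint before the driver received the corresponding result, so a
    "too new" checkpoint is acceptable.
    """
    found = set(found_iters)
    if expected_iter in found:
        return
    if any(it > expected_iter for it in found):
        return  # Relaxed: the trial raced ahead of the experiment checkpoint.
    raise AssertionError(
        f"No checkpoint at or after iteration {expected_iter}; "
        f"found only {sorted(found)}")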

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -1127,7 +1129,8 @@ def _process_trial_save(self, trial):
                 trial=trial,
                 checkpoint=trial.saving_to)
             trial.on_checkpoint(trial.saving_to)
-            self.trial_executor.try_checkpoint_metadata(trial)
+            if trial.checkpoint.storage != Checkpoint.MEMORY:
+                self.trial_executor.mark_trial_to_checkpoint(trial)
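For context, a minimal sketch of the trial cache set idea that replaces try_checkpoint_metadata (the class and the flush helper below are simplified assumptions for illustration, not the actual executor code): instead of eagerly persisting trial metadata, the executor records which trials changed and serializes them with the next experiment checkpoint.

class TrialExecutorSketch:
    """Hypothetical, simplified stand-in for the Ray Tune trial executor."""

    def __init__(self):
        # Trials whose metadata changed since the last experiment checkpoint.
        self._trials_to_cache = set()

    def mark_trial_to_checkpoint(self, trial):
        # Remember that this trial's metadata must be written out with the
        # next experiment checkpoint (instead of persisting it eagerly).
        self._trials_to_cache.add(trial)

    def flush_trial_metadata(self):
        # Hypothetical helper: when the experiment checkpoint is written,
        # serialize all marked trials and reset the cache.
        cached = {trial: repr(trial) for trial in self._trials_to_cache}
        self._trials_to_cache.clear()
        return cached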
Contributor

Why is this functionality not part of the runner?

Contributor Author

I agree, that should be moved into the runner eventually

@krfricke
Contributor Author

Remaining flakiness seems to come from sync problems:


ray.tune.error.TuneError: Sync error. Ran command: aws s3 sync /home/ray/ray_results/cloud_durable_upload/ s3://data-test-ilr/durable_upload_rllib_trainer/cloud_durable_upload --only-show-errors --exclude '*/checkpoint_*'
Error message (1): upload failed: ../ray_results/cloud_durable_upload/experiment_state-2021-11-18_09-10-21.json to s3://data-test-ilr/durable_upload_rllib_trainer/cloud_durable_upload/experiment_state-2021-11-18_09-10-21.json An error occurred (RequestTimeout) when calling the PutObject operation (reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.

I'll kick off another run here, after which we should be able to merge. The sync problem should probably be tackled in a separate PR.

@gjoliver
Copy link
Member

Will this happen for people's production workloads as well?
Are things handled properly when resume=AUTO, too?
Just want to make sure.
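To make the resume=AUTO scenario concrete, here is a rough usage sketch (the trainable name and the S3 path are placeholders, and the arguments are assumptions based on the Ray Tune API of that time):

from ray import tune

# Sketch of a production-style run that syncs results to cloud storage.
# "my_trainable" and the bucket path are placeholders.
tune.run(
    "my_trainable",
    name="cloud_durable_upload",
    sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),
    checkpoint_freq=1,
    # On a later invocation, resume="AUTO" restores the experiment from the
    # existing experiment checkpoint (including trial metadata) if one is
    # found, and starts a fresh run otherwise.
    resume="AUTO",
)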

Comment on lines +513 to +514
    def testTrialNoCheckpointSave(self):
        """Check that non-checkpointing trials *are* saved."""
Contributor Author


@richardliaw quick question: In this PR, we change the behavior so that even non-checkpointing trials are saved. The reason is that we often got out-of-sync checkpoints (the FS sync had already synchronized new checkpoints to the driver, but the trial metadata had not yet been updated on the driver side).
Also, I don't see why we wouldn't want to save the intermediate state of non-checkpointing trials. It seems to me that the main argument would be that we can't restore these trials. However, when early-exiting an experiment via keyboard interrupt, we may sometimes want to analyze the results reported up to that point.

Thus, I quickly wanted to confirm whether there is any other reason why we specifically want trials not to be saved when they are not checkpointing.

Contributor


What does it mean to save an intermediate trial without checkpoints? Do you mean just moving the trial folder back to the local host?

Contributor Author


"Saving" in that sense is to store its metadata (e.g. last results) in the experiment checkpoint (trial runner checkpoint)

krfricke added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Nov 22, 2021
@richardliaw
Contributor

richardliaw commented Nov 22, 2021 via email

krfricke merged commit 7446269 into ray-project:master on Nov 24, 2021
krfricke deleted the tune/fix-cloud-tests branch on November 24, 2021 at 09:29
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
4 participants