[tune/rllib] Fix tune cloud tests for function and rllib trainables #20536

Merged
krfricke merged 7 commits into ray-project:master from tune/fix-cloud-tests on Nov 24, 2021

Conversation

krfricke
Contributor

Why are these changes needed?

We currently enforce strict checkpoint checking in the cloud test. However, we sometimes run into a race condition when trials are interrupted: A remote trial might already have progressed to a new checkpoint (and deleted old ones), but the trial scheduler has not yet received the latest result in order to add this information to the experiment checkpoint.

Additionally, there was a bug in the trial runner where trial metadata was not updated upon receiving a new result. This also led to problems with interrupted training runs.

This PR relaxes the checkpoint requirement to also accept "too new" checkpoints if necessary, and replaces the outdated try_checkpoint_metadata method with the trial executor's trial cache set (mark_trial_to_checkpoint).
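As a rough illustration of the relaxed requirement (a hedged sketch only; the helper below is hypothetical and not the actual cloud test code), the check accepts either the checkpoint recorded in the experiment state or a strictly newer one, while a missing or older-only checkpoint still fails:

# Hypothetical sketch of a relaxed checkpoint assertion for the cloud tests.
def assert_checkpoint_present(expected_iter, found_iters):
    """Pass if the expected checkpoint exists, or a newer one does.

    A remote trial may already have produced (and rotated to) a newer
    checkpoint before the driver received the corresponding result, so a
    "too new" checkpoint is acceptable.
    """
    found = set(found_iters)
    if expected_iter in found:
        return
    if any(it > expected_iter for it in found):
        return  # Relaxed: the trial raced ahead of the experiment checkpoint.
    raise AssertionError(
        f"No checkpoint at or after iteration {expected_iter}; "
        f"found only {sorted(found)}")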

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -1127,7 +1129,8 @@ def _process_trial_save(self, trial):
                 trial=trial,
                 checkpoint=trial.saving_to)
             trial.on_checkpoint(trial.saving_to)
-            self.trial_executor.try_checkpoint_metadata(trial)
+            if trial.checkpoint.storage != Checkpoint.MEMORY:
+                self.trial_executor.mark_trial_to_checkpoint(trial)
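For context, a minimal sketch of the trial cache set idea that replaces try_checkpoint_metadata (the class and the flush helper below are simplified assumptions for illustration, not the actual executor code): instead of eagerly persisting trial metadata, the executor records which trials changed and serializes them with the next experiment checkpoint.

class TrialExecutorSketch:
    """Hypothetical, simplified stand-in for the Ray Tune trial executor."""

    def __init__(self):
        # Trials whose metadata changed since the last experiment checkpoint.
        self._trials_to_cache = set()

    def mark_trial_to_checkpoint(self, trial):
        # Remember that this trial's metadata must be written out with the
        # next experiment checkpoint (instead of persisting it eagerly).
        self._trials_to_cache.add(trial)

    def flush_trial_metadata(self):
        # Hypothetical helper: when the experiment checkpoint is written,
        # serialize all marked trials and reset the cache.
        cached = {trial: repr(trial) for trial in self._trials_to_cache}
        self._trials_to_cache.clear()
        return cached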
Contributor

Why is this functionality not part of the runner?

Contributor Author

I agree, that should be moved into the runner eventually

@krfricke
Contributor Author

Remaining flakiness seems to come from sync problems:


ray.tune.error.TuneError: Sync error. Ran command: aws s3 sync /home/ray/ray_results/cloud_durable_upload/ s3://data-test-ilr/durable_upload_rllib_trainer/cloud_durable_upload --only-show-errors --exclude '*/checkpoint_*'
Error message (1): upload failed: ../ray_results/cloud_durable_upload/experiment_state-2021-11-18_09-10-21.json to s3://data-test-ilr/durable_upload_rllib_trainer/cloud_durable_upload/experiment_state-2021-11-18_09-10-21.json An error occurred (RequestTimeout) when calling the PutObject operation (reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.

I'll kick off another run here, after which we should be able to merge. The sync problem should probably be tackled in a separate PR.

@gjoliver
Copy link
Member

Will this happen for people's production workloads as well?
Are things handled properly when resume=AUTO, too?
Just want to make sure.
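To make the resume=AUTO scenario concrete, here is a rough usage sketch (the trainable name and the S3 path are placeholders, and the arguments are assumptions based on the Ray Tune API of that time):

from ray import tune

# Sketch of a production-style run that syncs results to cloud storage.
# "my_trainable" and the bucket path are placeholders.
tune.run(
    "my_trainable",
    name="cloud_durable_upload",
    sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),
    checkpoint_freq=1,
    # On a later invocation, resume="AUTO" restores the experiment from the
    # existing experiment checkpoint (including trial metadata) if one is
    # found, and starts a fresh run otherwise.
    resume="AUTO",
)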

Comment on lines +513 to +514
    def testTrialNoCheckpointSave(self):
        """Check that non-checkpointing trials *are* saved."""
Contributor Author


@richardliaw quick question: In this PR, we change the behavior so that even non-checkpointing trials are saved. The reason is that we often got out-of-sync checkpoints (the FS sync had already synchronized new checkpoints to the driver, but the trial metadata had not yet been updated on the driver side).
Also, I don't see why we wouldn't want to save the intermediate state of non-checkpointing trials. It seems to me that the main argument would be that we can't restore these trials. However, when early-exiting an experiment via keyboard interrupt, we may sometimes want to analyze the results reported up to that point.

Thus, I quickly wanted to confirm whether there is any other reason why we specifically want trials not to be saved when they are not checkpointing.

Contributor


What does it mean to save an intermediate trial without checkpoints? Do you mean just moving the trial folder back to the local host?

Contributor Author


"Saving" in that sense is to store its metadata (e.g. last results) in the experiment checkpoint (trial runner checkpoint)

krfricke added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Nov 22, 2021
@richardliaw
Contributor

richardliaw commented Nov 22, 2021 via email

krfricke merged commit 7446269 into ray-project:master on Nov 24, 2021
krfricke deleted the tune/fix-cloud-tests branch on November 24, 2021 at 09:29
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
4 participants