Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[train/horovod] Fix horovod long running release test #27179

Merged
merged 9 commits into from
Sep 13, 2022

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Jul 28, 2022

Signed-off-by: Kai Fricke [email protected]

Why are these changes needed?

  1. progression by reporting only on checkpoint
    The long running horovod release test only ever stays at 1 iteration for each trial. The reason for this is that it PAUSES trials to give way to other trials - but the trials only checkpoint at the end of each epoch. We have to either increase the pertubation interval massively, or report at the end of the epoch. Here I decided for the latter.
  2. fix checkpoint_dict.

Related issue number

Addresses part of #27165

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…reporting only on checkpoint

Signed-off-by: Kai Fricke <[email protected]>
@xwjiang2010 xwjiang2010 changed the title [train/horovod] Fix horovod long running release test progression by reporting only on checkpoint [train/horovod] Fix horovod long running release test Aug 1, 2022
@krfricke krfricke added the do-not-merge Do not merge this PR! label Aug 4, 2022
@krfricke
Copy link
Contributor Author

krfricke commented Aug 4, 2022

Marking as no merge as the test is still failing. If you continue on this please remove when appropriate @xwjiang2010 - otherwise I'll take a look in mid August

@stale
Copy link

stale bot commented Sep 8, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 8, 2022
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 12, 2022
Signed-off-by: Kai Fricke <[email protected]>
@krfricke
Copy link
Contributor Author

Depends on #28440 and should report more frequently

@krfricke
Copy link
Contributor Author

The test is now working as expected (perturbing configs and correctly restarting from checkpoints). Merging.

@krfricke krfricke merged commit 4b59dfb into ray-project:master Sep 13, 2022
@krfricke krfricke deleted the air/horovod-release-test branch September 13, 2022 13:03
stephanie-wang added a commit that referenced this pull request Sep 13, 2022
stephanie-wang added a commit that referenced this pull request Sep 13, 2022
…" (#28476)

This reverts commit 4b59dfb.

Looks like this breaks linux://python/ray/air:horovod_cifar_pbt_example

Signed-off-by: Stephanie Wang [email protected]
krfricke added a commit that referenced this pull request Sep 13, 2022
PaulFenton pushed a commit to PaulFenton/ray that referenced this pull request Sep 19, 2022
…oject#27179)" (ray-project#28476)

This reverts commit 4b59dfb.

Looks like this breaks linux://python/ray/air:horovod_cifar_pbt_example

Signed-off-by: Stephanie Wang [email protected]
Signed-off-by: PaulFenton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge Do not merge this PR!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants