[train/horovod] Fix horovod long running release test #27179
…reporting only on checkpoint Signed-off-by: Kai Fricke <[email protected]>
Marking as no merge as the test is still failing. If you continue on this please remove when appropriate @xwjiang2010 - otherwise I'll take a look in mid August
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Signed-off-by: Kai Fricke <[email protected]>
Depends on #28440 and should report more frequently
Signed-off-by: Kai Fricke <[email protected]>
The test is now working as expected (perturbing configs and correctly restarting from checkpoints). Merging.
This reverts commit 4b59dfb.
…" (#28476) This reverts commit 4b59dfb. Looks like this breaks linux://python/ray/air:horovod_cifar_pbt_example Signed-off-by: Stephanie Wang [email protected]
…oject#27179)" (ray-project#28476) This reverts commit 4b59dfb. Looks like this breaks linux://python/ray/air:horovod_cifar_pbt_example Signed-off-by: Stephanie Wang [email protected] Signed-off-by: PaulFenton <[email protected]>
Signed-off-by: Kai Fricke [email protected]
Why are these changes needed?
The long-running Horovod release test never gets past 1 iteration per trial. The reason is that it PAUSES trials to give way to other trials, but the trials only checkpoint at the end of each epoch, so a paused trial loses whatever progress it made since its last checkpoint. We have to either increase the perturbation interval massively or report only at the end of each epoch. Here I went with the latter.
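To make the failure mode concrete, here is a toy simulation in plain Python. It is only a sketch under stated assumptions, not Ray Tune's actual scheduler: the function `simulate` and all of its parameters are hypothetical, and it assumes a trial is paused after `perturbation_interval` reported results and always resumes from its last checkpoint.

```python
def simulate(steps_per_epoch, perturbation_interval, report_every_step, rounds):
    """Toy model (hypothetical, not Ray Tune code).

    Returns the last checkpointed training step after `rounds` pause/resume cycles.
    """
    checkpointed_step = 0  # last step that was saved in a checkpoint
    for _ in range(rounds):
        step = checkpointed_step  # assumption: resume from the last checkpoint
        reports = 0
        while reports < perturbation_interval:
            step += 1
            at_epoch_end = step % steps_per_epoch == 0
            if at_epoch_end:
                checkpointed_step = step  # checkpoint only at the end of an epoch
            if report_every_step or at_epoch_end:
                reports += 1  # each reported result counts toward the interval
        # The trial is paused here; steps past checkpointed_step are lost.
    return checkpointed_step


# Reporting every step with a short interval: the trial never reaches a checkpoint.
stuck = simulate(steps_per_epoch=10, perturbation_interval=1,
                 report_every_step=True, rounds=5)

# Option taken in this PR: report only at the end of each epoch.
epoch_reports = simulate(steps_per_epoch=10, perturbation_interval=1,
                         report_every_step=False, rounds=5)

# Alternative: keep per-step reports but raise the interval to a full epoch.
long_interval = simulate(steps_per_epoch=10, perturbation_interval=10,
                         report_every_step=True, rounds=5)
```

In this model the first configuration makes no durable progress, while both alternatives advance by one epoch per cycle; reporting at the end of the epoch gets there without having to tune the interval to the epoch length.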
Related issue number
Addresses part of #27165
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.