[Tune] [Bug] Ray checkpoint sync can sometimes fail to upload checkpoints to s3, plus log spew about sync client observed #21469
Comments
Thanks Anne.
On top of that, Tune currently does not officially support multiple Tune sessions running on the same cluster. But we could probably enhance this capability if we collect enough use cases requiring it. Let me know if I captured the issues correctly. Also cc @richardliaw
Hi @xwjiang2010, in response to your last comment: this ticket is specifically about the issue with checkpoints being lost; I have filed a separate ticket on the get_alive_node_ip issue. Wrt Tune not officially supporting multiple Tune sessions on the same cluster, it does
Yes, great. Sorry I missed the other ticket. Let's keep the discussion separate on the two tickets. And yes, these issues are orthogonal to single vs. multiple Tune jobs.
Wrt size of checkpoints, I've seen the problem with the dataset ames_housing, and I'm not setting checkpoint_period or sync_period, so my runs are getting the defaults.
(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/ames_housing/hours1/trainable_func_f6zONnP/trial_79924664/checkpoint_001459/model/
(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/forest_cover/hours1/trainable_func_fVfbX2v/trial_86415c74/checkpoint_000401/model/
(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/higgs/hours1/trainable_func_fI6YKzO/trial_30043e10/checkpoint_000016/model/
@amholler Yes, reducing the checkpoint frequency can really help here. Are you using class Trainables? If so, you can simply set
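(For illustration only: a minimal sketch of what reducing checkpoint frequency looks like for a class Trainable, assuming the standard checkpoint_freq/checkpoint_at_end arguments to tune.run. The Trainable, metric, and values below are placeholders, not the Ludwig setup.)

from ray import tune

class MyTrainable(tune.Trainable):  # placeholder Trainable for illustration
    def step(self):
        return {"loss": 1.0}

    def save_checkpoint(self, checkpoint_dir):
        return checkpoint_dir

    def load_checkpoint(self, checkpoint_path):
        pass

tune.run(
    MyTrainable,
    stop={"training_iteration": 100},
    checkpoint_freq=10,      # write a checkpoint every 10 iterations instead of every iteration
    checkpoint_at_end=True,  # still keep a final checkpoint
)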
Hi @xwjiang2010, AFAIK we aren't using class Trainables; we are using the function API. I tried setting
I have put more tracing into the ames_housing runs, and I see more about what's going on. In the steady state, the checkpoint sync'ing to the cloud is being called from _maybe_save_to_cloud,
And these sync invocations often fail for this workload, because the deletion of old/suboptimal (retention 1)
(ImplicitFunc pid=None) 2022-01-08 07:26:47,781 WARNING sync_client.py:346 -- Running sync: aws s3 sync /home/ray/ray_results/trainable_func_fk4_l_i/trial_5677d610/ s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610 --only-show-errors
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in
(ImplicitFunc pid=None) 2022-01-08 07:26:49,039 WARNING sync_client.py:346 -- Running sync: aws s3 sync /home/ray/ray_results/trainable_func_fk4_l_i/trial_5677d610/ s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610 --only-show-errors
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in
(ImplicitFunc pid=None) 2022-01-08 07:26:49,876 WARNING sync_client.py:275 -- Running delete: aws s3 rm s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610/checkpoint_000001 --recursive --only-show-errors
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in
(ImplicitFunc pid=None) 2022-01-08 07:26:50,396 WARNING sync_client.py:323 -- Last sync client cmd still in progress, skipping.
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in
Hi @amholler, let me know how it goes. I will also update our docs, as they seem unclear.
To capture some of the discussion offline, there are several paths going forward:
Whichever we choose among 2-4, or a combination of them, we should probably first understand what the current bottleneck is (i.e. 1).
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
Ray checkpoint sync can sometimes fail to upload checkpoints to s3, plus log spew about sync client observed
Problem Summary
Checkpoints seem to not always get uploaded reliably to s3 by Ray 1.9.1, so the best model produced
by the hyperparameter search can't always be pulled at the end of the search for evaluation.
In the case where I recently saw this (Ludwig tf-legacy AutoML for the ames_housing dataset), I also saw a spew
of the following warnings being output:
WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
The warning seems correlated with the problem, in that I didn't see such warnings in the
other AutoML jobs running simultaneously, which all completed w/o missing their checkpoints.
From an interaction about the problem with Kai Fricke on the Tune Slack channel:
The Ray version is 1.9.1. The sync configuration is set up in Ludwig using this code:
self.sync_config = tune.SyncConfig(
    sync_to_driver=False,        # do not sync trial directories back to the driver node
    upload_dir=output_directory  # cloud (S3) prefix that checkpoints/results are uploaded to
)
with output directory set to 's3://predibase-runs/nodeless/ames_housing/hours1/'.
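(A hedged aside: a minimal sketch of how the sync period could be made explicit in this kind of configuration. sync_period defaults to 300 seconds; the value below is an arbitrary illustration, not what Ludwig actually sets.)

from ray import tune

sync_config = tune.SyncConfig(
    sync_to_driver=False,                                            # as in the Ludwig code above
    upload_dir="s3://predibase-runs/nodeless/ames_housing/hours1/",  # the S3 prefix from this run
    sync_period=600,                                                 # space out cloud syncs (default: 300s)
)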
Some of the checkpoints from this run did get uploaded successfully (I can see them in s3), including for other trials on the same worker node, so I'm not sure what is going on.
I did not manually set the checkpointing interval. One thing I can mention is that the hyperparameter search
is running async_hyperband with a time budget, so trials can get stopped if they are considered unpromising.
Is it at all possible that a trial would get stopped by async_hyperband while it was trying to sync and the
sync could hang or something?
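(For context, a hedged sketch of this kind of setup: an async_hyperband/ASHA scheduler plus a wall-clock budget, so trials can be stopped early while a sync may still be in flight. The function, metric, and values are placeholders, not the actual Ludwig configuration.)

from ray import tune
from ray.tune.schedulers import ASHAScheduler

def trainable_func(config):  # placeholder for the Ludwig training function
    for step in range(100):
        tune.report(loss=1.0 / (step + 1))

tune.run(
    trainable_func,
    num_samples=8,
    scheduler=ASHAScheduler(metric="loss", mode="min"),  # async hyperband: stops unpromising trials early
    time_budget_s=3600,                                  # 1-hour wall-clock budget for the whole search
)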
Observations from repro of the problem
After I observed the problem, I updated Ludwig AutoML to make it more resilient to not finding the checkpoints
for some trials at the end of the run, and reran the test case. The issue with some checkpoints not getting
uploaded for some trials (2 out of 8) reproduced but Ludwig was able to analyze the checkpoints that were
uploaded & produce a usable report.
Note that for the spew problem:
The attempts to run the sync client cmd seem very close in time, which is surprising since I thought
that sync_period is by default 300 seconds. I did not set TUNE_GLOBAL_CHECKPOINT_S, so presumably it
is using the auto setting; is that what is causing the spew of attempted sync client cmds I wonder?
The spew seems at a higher rate than the epochs reporting results. Just to give a feel for this,
here's the first page of spew along with trial results reporting:
(base) ray@example-cluster-ray-head-fh8jv:~/experiments/automl/validation$ grep -e WARNING -e "Result for" ames_housing/run.1hr |more
Result for trial_79924664:
(ImplicitFunc pid=28222) 2022-01-07 10:56:39,426 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:40,037 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:40,644 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:41,192 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:41,731 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:42,273 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:42,836 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
(ImplicitFunc pid=28222) 2022-01-07 10:56:43,389 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:43,957 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:44,540 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:45,057 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:47,281 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:47,935 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:48,599 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:49,171 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:49,749 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:50,306 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:50,820 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:53,049 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:53,747 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:54,373 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:54,963 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:57,026 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:57,613 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:58,177 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:00,622 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:01,291 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:03,430 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:04,117 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:04,709 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:05,303 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:05,920 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:07,936 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:09,444 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:10,096 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:10,636 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
Versions / Dependencies
Ray 1.9.1
Reproduction script
https://github.com/ludwig-ai/experiments/blob/main/automl/validation/ames_housing/run_auto_train_1hr_nodeless.py run with Ludwig tf-legacy
Anything else
No response
Are you willing to submit a PR?