[Tune] [Bug] Ray checkpoint sync can sometimes fail to upload checkpoints to s3, plus log spew about sync client observed #21469

Open
amholler opened this issue Jan 7, 2022 · 9 comments
Labels
bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), tune (Tune-related issues)

Comments

@amholler

amholler commented Jan 7, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

Ray checkpoint sync can sometimes fail to upload checkpoints to s3, plus log spew about sync client observed


Problem Summary

Checkpoints do not always seem to get uploaded reliably to s3 by Ray 1.9.1, so the best model produced
by the hyperparameter search can't always be pulled at the end of the search for evaluation.

In the case where I recently saw this (Ludwig tf-legacy AutoML for the ames_housing dataset), I also saw a spew
of the following warning being output:
WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
The warning seems correlated with the problem, in that I didn't see such warnings in the
other AutoML jobs running simultaneously, all of which completed without missing their checkpoints.


From an interaction about the problem with Kai Fricke on the Tune Slack channel:

For syncing it would be good to know which Ray version you are using and what your sync configuration
looks like at the moment. Please feel free to add me to the discussion so I can help look into this.

The Ray version is 1.9.1. The sync configuration is set up in Ludwig using this code:

self.sync_config = tune.SyncConfig(
    sync_to_driver=False,
    upload_dir=output_directory,
)

with output_directory set to 's3://predibase-runs/nodeless/ames_housing/hours1/'.
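For reference, here is roughly how that configuration gets consumed; this is a hedged sketch of the wiring, not the actual Ludwig code (train_fn and the constant output_directory below are placeholders):

from ray import tune

# Hedged sketch; only the SyncConfig shape matches the snippet above.
output_directory = "s3://predibase-runs/nodeless/ames_housing/hours1/"

sync_config = tune.SyncConfig(
    sync_to_driver=False,         # don't rsync trial results back to the driver node
    upload_dir=output_directory,  # sync trial directories to this S3 prefix instead
)

def train_fn(config):
    tune.report(loss=0.0)  # stand-in for the real training loop

analysis = tune.run(
    train_fn,
    sync_config=sync_config,
)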

Some of the checkpoints from this run did get uploaded successfully (I can see them in s3),
including checkpoints from other trials on the same worker node, so I'm not sure what is going on.

I did not manually set the checkpointing interval. One thing I can mention is that the hyperparameter search
is running async_hyperband with a time budget, so trials can get stopped if they are considered unpromising.
Is it at all possible that a trial could get stopped by async_hyperband while it was trying to sync, and the
sync could then hang or something?
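For reference, the kind of setup being described, async_hyperband plus a wall-clock budget, looks roughly like the sketch below; the scheduler arguments and the one-hour budget are illustrative assumptions, not the actual Ludwig settings.

from ray import tune
from ray.tune.schedulers import ASHAScheduler

# async_hyperband: unpromising trials can be stopped early by the scheduler.
scheduler = ASHAScheduler(metric="loss", mode="min")

analysis = tune.run(
    train_fn,                 # function-API trainable as in the earlier sketch
    num_samples=8,
    scheduler=scheduler,
    time_budget_s=3600,       # stop the whole search after 1 hour
    sync_config=sync_config,
)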


Observations from Repro of problem

After I observed the problem, I updated Ludwig AutoML to make it more resilient to not finding the checkpoints
for some trials at the end of the run, and reran the test case. The issue of checkpoints not getting
uploaded for some trials (2 out of 8) reproduced, but Ludwig was able to analyze the checkpoints that were
uploaded and produce a usable report.

Notes on the spew problem:
The attempts to run the sync client cmd seem very close together in time, which is surprising, since I thought
that sync_period defaults to 300 seconds. I did not set TUNE_GLOBAL_CHECKPOINT_S, so presumably it
is using the auto setting; I wonder whether that is what is causing the spew of attempted sync client cmds.
The spew occurs at a higher rate than the epochs reporting results. Just to give a feel for this,
here's the first page of spew along with trial results reporting:

(base) ray@example-cluster-ray-head-fh8jv:~/experiments/automl/validation$ grep -e WARNING -e "Result for" ames_housing/run.1hr |more
Result for trial_79924664:
(ImplicitFunc pid=28222) 2022-01-07 10:56:39,426 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:40,037 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:40,644 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:41,192 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:41,731 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:42,273 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:42,836 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
(ImplicitFunc pid=28222) 2022-01-07 10:56:43,389 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:43,957 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:44,540 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:45,057 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:47,281 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:47,935 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:48,599 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:49,171 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:49,749 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:50,306 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:50,820 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:53,049 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:53,747 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:54,373 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:54,963 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:57,026 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:57,613 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:58,177 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:00,622 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:01,291 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:03,430 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:04,117 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:04,709 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:05,303 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:05,920 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:07,936 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:09,444 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:10,096 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:10,636 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.

Versions / Dependencies

Ray 1.9.1

Reproduction script

https://github.com/ludwig-ai/experiments/blob/main/automl/validation/ames_housing/run_auto_train_1hr_nodeless.py run with ludwig tf-legacy

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@amholler amholler added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 7, 2022
@xwjiang2010
Contributor

Thanks Anne.
There are two issues:

  1. Checkpoints lost. I have a few questions/comments on this.
  • What is the checkpoint frequency and how large are the checkpoints in your case? Sometimes this happens when checkpointing too often or with checkpoints that are too large.
  • We are thinking of queueing checkpoint syncing instead of just bailing out when one is still ongoing. This is something we'd like to enhance in Q1. We also probably want to shift to using a boto client instead of command-line syncing. cc @krfricke
  2. get_alive_node_ip information is stale. On this, I am looking at whether we could get rid of this logic entirely.

On top of that, Tune currently does not officially support multiple Tune sessions running on the same cluster. But we could probably enhance this capability if we collect enough use cases requiring it.

Let me know if I captured the issues correctly.

Also cc @richardliaw

@amholler
Author

amholler commented Jan 7, 2022

Hi, @xwjiang2010, In response to your last comment:

This ticket is specifically on the issue with checkpoints being lost.
Thanks for your questions on that issue; I will answer your questions in a follow-on comment.

I have filed a separate ticket on the get_alive_node_ip issue,
which is #21458.
It would be great if you could get rid of that logic entirely, because it seems too racy.

Wrt Tune not officially supporting multiple Tune sessions on the same cluster, it does
not seem to me that either of the above problems is actually related to that use case;
it seems like they could occur with a single Ray Tune job running on a cluster.

@xwjiang2010 xwjiang2010 self-assigned this Jan 7, 2022
@xwjiang2010
Contributor

Yes, great. Sorry I missed the other ticket. Let's keep the discussion on the two tickets separate.

And yes, these issues are orthogonal to single vs. multiple Tune jobs.

@clarkzinzow clarkzinzow changed the title [Bug] Ray checkpoint sync can sometimes fail to upload checkpoints to s3, plus log spew about sync client observed [Tune] [Bug] Ray checkpoint sync can sometimes fail to upload checkpoints to s3, plus log spew about sync client observed Jan 7, 2022
@clarkzinzow clarkzinzow added the tune Tune-related issues label Jan 7, 2022
@amholler
Author

amholler commented Jan 7, 2022

Wrt the size of checkpoints, I've seen the problem with the ames_housing dataset and
I haven't seen it with higgs or forest_cover; below are examples of their uploaded sizes.
Maybe ames_housing looks somewhat bigger, but the difference doesn't look dramatic to me.

I'm not setting checkpoint_period or sync_period, so my runs are getting the defaults.
What does seem to be true is that epochs complete much more quickly for ames_housing
than for forest_cover or higgs, so presumably much more epoch-end stats reporting is happening.
Perhaps the automatic adaptive tuning of the checkpoint period is too aggressive for this case.
If you think I should experiment with setting sync_period and/or checkpoint_period explicitly, let me know.
Thanks!

(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/ames_housing/hours1/trainable_func_f6zONnP/trial_79924664/checkpoint_001459/model/
PRE logs/
PRE training_checkpoints/
2022-01-07 11:55:59 83 checkpoint
2022-01-07 11:55:59 35350 model_hyperparameters.json
2022-01-07 11:55:59 584070 model_weights.data-00000-of-00001
2022-01-07 11:55:59 13165 model_weights.index
2022-01-07 11:55:59 985083 training_progress.json
2022-01-07 11:55:59 51110 training_set_metadata.json

(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/forest_cover/hours1/trainable_func_fVfbX2v/trial_86415c74/checkpoint_000401/model/
PRE logs/
PRE training_checkpoints/
2022-01-07 11:56:31 83 checkpoint
2022-01-07 11:56:31 12198 model_hyperparameters.json
2022-01-07 11:56:31 215517 model_weights.data-00000-of-00001
2022-01-07 11:56:31 8389 model_weights.index
2022-01-07 11:56:31 175827 training_progress.json
2022-01-07 11:56:31 4170 training_set_metadata.json

(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/higgs/hours1/trainable_func_fI6YKzO/trial_30043e10/checkpoint_000016/model/
PRE logs/
PRE training_checkpoints/
2022-01-07 12:05:20 83 checkpoint
2022-01-07 12:05:20 14674 model_hyperparameters.json
2022-01-07 12:05:20 226790 model_weights.data-00000-of-00001
2022-01-07 12:05:20 8273 model_weights.index
2022-01-07 12:05:20 8295 training_progress.json
2022-01-07 12:05:20 6428 training_set_metadata.json

@xwjiang2010 xwjiang2010 removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Jan 7, 2022
@xwjiang2010
Contributor

@amholler Yes, reducing the checkpoint frequency can really help here. Are you using class Trainables? If so, you can simply set checkpoint_freq through tune.run. Let me know how it goes!
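For readers using the class API, that suggestion would look roughly like the sketch below; the Trainable subclass here is a made-up placeholder, and checkpoint_freq counts training iterations between checkpoints.

from ray import tune

class MyTrainable(tune.Trainable):  # hypothetical class-API trainable
    def step(self):
        return {"loss": 0.0}

    def save_checkpoint(self, checkpoint_dir):
        return checkpoint_dir  # write model state into checkpoint_dir and return it

    def load_checkpoint(self, checkpoint_path):
        pass

tune.run(
    MyTrainable,
    checkpoint_freq=10,      # checkpoint every 10 training iterations instead of every one
    sync_config=sync_config, # as in the earlier sketch
)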

@amholler
Author

amholler commented Jan 8, 2022

Hi @xwjiang2010. AFAIK we aren't using class Trainables; we are using the function API. I tried setting
TUNE_GLOBAL_CHECKPOINT_S, but that didn't work in this case. Explanation below.

I have put more tracing into the ames_housing runs, and I can see more of what's going on.
I will give an example tracing sequence below, after some general observations about it.

In the steady state, the checkpoint syncing to the cloud is being called from _maybe_save_to_cloud,
which does not respect the checkpoint period, i.e., it calls sync_up rather than sync_up_if_needed.
It is invoked very often for this workload (presumably due to its quick epochs), like every few seconds.

And these sync invocations often fail for this workload, because the deletion of old/suboptimal
checkpoints from the cloud (retention 1) is also running very frequently and hence blocks the checkpoint sync,
since only one sync client cmd can run at a time. I don't know how/whether the checkpoint
period is expected to apply to the delete operations, but their frequent execution can choke the syncs.
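To illustrate the distinction (this is a toy sketch of the idea, not Ray's actual implementation): a period-respecting sync_up_if_needed would skip uploads until sync_period has elapsed, while an unconditional sync_up is attempted on every save and gets skipped whenever the previous client command, sync or delete, is still running.

import subprocess
import time

class ToySyncer:
    """Illustrative only; method names mirror the Ray methods discussed above."""

    def __init__(self, sync_period=300):
        self.sync_period = sync_period
        self.last_sync_time = 0.0
        self.proc = None  # at most one client command may run at a time

    def sync_up(self, source, target):
        # Unconditional attempt: skip only if the previous command is still running.
        if self.proc is not None and self.proc.poll() is None:
            print("Last sync client cmd still in progress, skipping.")
            return False
        self.proc = subprocess.Popen(
            ["aws", "s3", "sync", source, target, "--only-show-errors"])
        self.last_sync_time = time.time()
        return True

    def sync_up_if_needed(self, source, target):
        # Throttled attempt: only upload if the sync period has elapsed.
        if time.time() - self.last_sync_time < self.sync_period:
            return False
        return self.sync_up(source, target)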

(ImplicitFunc pid=None) 2022-01-08 07:26:47,781 WARNING sync_client.py:346 -- Running sync: aws s3 sync /home/ray/ray_results/trainable_func_fk4_l_i/trial_5677d610/ s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610 --only-show-errors
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in <module>
(ImplicitFunc pid=None)     ray.worker.global_worker.main_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 431, in main_loop
(ImplicitFunc pid=None)     self.core_worker.run_task_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
(ImplicitFunc pid=None)     return method(__ray_actor, *args, **kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 452, in save
(ImplicitFunc pid=None)     self._maybe_save_to_cloud()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 416, in _maybe_save_to_cloud
(ImplicitFunc pid=None)     self.remote_checkpoint_dir)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 263, in sync_up
(ImplicitFunc pid=None)     return self._execute(self.sync_up_template, source, target, exclude)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 321, in _execute
(ImplicitFunc pid=None)     stack_trace_str = "".join(traceback.extract_stack().format())

(ImplicitFunc pid=None) 2022-01-08 07:26:49,039 WARNING sync_client.py:346 -- Running sync: aws s3 sync /home/ray/ray_results/trainable_func_fk4_l_i/trial_5677d610/ s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610 --only-show-errors
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in <module>
(ImplicitFunc pid=None)     ray.worker.global_worker.main_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 431, in main_loop
(ImplicitFunc pid=None)     self.core_worker.run_task_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
(ImplicitFunc pid=None)     return method(__ray_actor, *args, **kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 452, in save
(ImplicitFunc pid=None)     self._maybe_save_to_cloud()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 416, in _maybe_save_to_cloud
(ImplicitFunc pid=None)     self.remote_checkpoint_dir)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 263, in sync_up
(ImplicitFunc pid=None)     return self._execute(self.sync_up_template, source, target, exclude)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 321, in _execute
(ImplicitFunc pid=None)     stack_trace_str = "".join(traceback.extract_stack().format())

(ImplicitFunc pid=None) 2022-01-08 07:26:49,876 WARNING sync_client.py:275 -- Running delete: aws s3 rm s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610/checkpoint_000001 --recursive --only-show-errors
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in <module>
(ImplicitFunc pid=None)     ray.worker.global_worker.main_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 431, in main_loop
(ImplicitFunc pid=None)     self.core_worker.run_task_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
(ImplicitFunc pid=None)     return method(__ray_actor, *args, **kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 504, in delete_checkpoint
(ImplicitFunc pid=None)     self.storage_client.delete(self._storage_path(checkpoint_dir))
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 269, in delete
(ImplicitFunc pid=None)     stack_trace_str = "".join(traceback.extract_stack().format())

(ImplicitFunc pid=None) 2022-01-08 07:26:50,396 WARNING sync_client.py:323 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in <module>
(ImplicitFunc pid=None)     ray.worker.global_worker.main_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 431, in main_loop
(ImplicitFunc pid=None)     self.core_worker.run_task_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
(ImplicitFunc pid=None)     return method(__ray_actor, *args, **kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 452, in save
(ImplicitFunc pid=None)     self._maybe_save_to_cloud()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 416, in _maybe_save_to_cloud
(ImplicitFunc pid=None)     self.remote_checkpoint_dir)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 263, in sync_up
(ImplicitFunc pid=None)     return self._execute(self.sync_up_template, source, target, exclude)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 321, in _execute
(ImplicitFunc pid=None)     stack_trace_str = "".join(traceback.extract_stack().format())

@xwjiang2010
Contributor

Hi @amholler,
Yes, for your use case, I don't think any throttling is happening for uploads to the cloud from the syncing perspective.
Syncing is happening at the same frequency as checkpointing.
Could you reduce the frequency of checkpointing, then? (It's not going to be checkpoint_freq, as that is only used for class Trainables.) Most likely, check how often you call with tune.checkpoint_dir...
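For the function API, the checkpoint frequency is effectively however often the training function enters the tune.checkpoint_dir context. Below is a hedged sketch of checkpointing every N epochs instead of every epoch; the value of N, run_one_epoch, and the file written are made-up placeholders.

import os
from ray import tune

CHECKPOINT_EVERY_N_EPOCHS = 10  # illustrative value

def run_one_epoch(config):
    return 0.0  # placeholder for the real per-epoch training step

def train_fn(config):
    for epoch in range(config["epochs"]):
        loss = run_one_epoch(config)
        if epoch % CHECKPOINT_EVERY_N_EPOCHS == 0:
            # Entering this context is what triggers a checkpoint (and the
            # associated cloud sync), so entering it less often reduces sync load.
            with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
                with open(os.path.join(checkpoint_dir, "state.txt"), "w") as f:
                    f.write(str(epoch))
        tune.report(loss=loss)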

Let me know how it goes. I will also update our docs, as they seem unclear.

@xwjiang2010
Contributor

To capture some of the discussion held offline:

There are several paths going forward:

  1. Try to optimize sync up/delete and see if we can keep up with per-epoch checkpointing/deleting (Xiaowei to do some benchmarking). Let's also see if deleting can be done in a batched way. Does that improve anything?
  2. Tune to throttle syncing.
  3. User to be responsible for setting a reasonable checkpointing frequency (in which case Tune should still probably provide some guidelines).
  4. User to have the choice of reliable per-epoch checkpointing at the cost of performance.

Whichever we choose among 2-4, or a combination of them, we should probably first understand what the current bottleneck is (i.e., 1).

@amholler
Author

amholler commented Jan 11, 2022

  • My patch in sync_client to wait rather than skip when a client cmd is still in progress did not remove all instances where checkpoints were lost. I haven't tried to investigate this further.

  • Turning off checkpoint deletion, i.e., setting keep_checkpoints_num to None rather than 1 (which is hard-coded in Ludwig), did make the lost-checkpoint and sync-cmd-skipping problems go away for my test (see the sketch below).
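A hedged sketch of that workaround at the tune.run level; keep_checkpoints_num=None keeps every checkpoint, so there are no delete commands competing with the sync commands (train_fn and sync_config as in the earlier sketches):

tune.run(
    train_fn,
    sync_config=sync_config,
    keep_checkpoints_num=None,  # keep all checkpoints; no deletes to block the syncs
)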

@amogkam amogkam added the P2 Important issue, but not time-critical label Apr 11, 2022
@xwjiang2010 xwjiang2010 removed their assignment Apr 17, 2023