
[tune/autoscaler] _LogSyncer cannot rsync with Docker #4403

Closed
AdamGleave opened this issue Mar 18, 2019 · 11 comments · Fixed by #8806 or #11035
Labels: tune (Tune-related issues)

Comments

@AdamGleave (Contributor) commented Mar 18, 2019:

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.4
  • Python version: 3.6.7

Describe the problem

In _LogSyncer, sync_to_worker_if_possible and sync_now use rsync to transfer logs between the local node and the worker. This breaks when using Docker, since:

  • If the local node is in Docker, it will typically have the root username, and so that is what get_ssh_user will return. But we typically cannot log in to the worker node as root.
  • The local_dir on the worker is inside the Docker container and may not even be visible outside. If it is bind-mounted, it will typically be at a different path.
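
To make the failure concrete, here is a rough illustration (hypothetical code, not Ray's actual implementation) of the values the sync logic ends up working with when the driver runs inside a container:

```python
# Hypothetical illustration only -- not Ray's actual code. Inside a Docker container,
# user detection and the default results path both resolve to container-local values.
import getpass
import os

ssh_user = getpass.getuser()                      # usually "root" inside the container
local_dir = os.path.expanduser("~/ray_results")   # e.g. /root/ray_results, container-only

# An rsync command built from these values tries to log in to the worker *host* as
# root and to read a path that may not exist (or exist elsewhere) outside the container.
sync_cmd = "rsync -avz {user}@{ip}:{path}/ {path}/".format(
    user=ssh_user, ip="<worker-ip>", path=local_dir)
print(sync_cmd)
```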

An unrelated issue: if self.sync_func is non-None, it will get executed before the worker_to_local_sync_cmd, which I think is wrong.

I'd be happy to take a stab at a PR, but I'd appreciate some suggestions on the right way of fixing this, as it's been a while since I've looked at Ray internals. This also feels like a problem that is likely to recur with slight variations; e.g., this bug is similar to #4183.

Perhaps we can make the autoscaler provide an abstract sync interface that Tune and other consumers can use. This could map to rsync in the standard case and something more complex in the Docker case (e.g. docker cp followed by rsync). ray.autoscaler.commands.rsync is already something along these lines -- would that be an appropriate place to modify?
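
For illustration, the interface I have in mind could look roughly like this (all class and method names here are hypothetical, not an existing Ray API):

```python
# Hypothetical sketch of an abstract sync interface; names are made up for illustration.
import subprocess
from abc import ABC, abstractmethod


class NodeSyncer(ABC):
    @abstractmethod
    def sync_down(self, node_ip: str, remote_dir: str, local_dir: str) -> None:
        """Copy remote_dir on node_ip into local_dir on this machine."""


class RsyncSyncer(NodeSyncer):
    """Standard (non-Docker) case: plain rsync over SSH."""

    def __init__(self, ssh_user: str, ssh_key: str):
        self.ssh_user, self.ssh_key = ssh_user, ssh_key

    def sync_down(self, node_ip, remote_dir, local_dir):
        subprocess.check_call([
            "rsync", "-avz", "-e", "ssh -i {}".format(self.ssh_key),
            "{}@{}:{}/".format(self.ssh_user, node_ip, remote_dir), local_dir,
        ])


class DockerSyncer(NodeSyncer):
    """Docker case: docker cp out of the container first, then rsync the staged copy."""

    def __init__(self, rsync: RsyncSyncer, container: str):
        self.rsync, self.container = rsync, container

    def sync_down(self, node_ip, remote_dir, local_dir):
        # Copy the directory out of the container to a host-side staging dir,
        # then rsync that staging dir back as in the standard case.
        staging = "/tmp/ray_sync_staging"
        subprocess.check_call([
            "ssh", "-i", self.rsync.ssh_key,
            "{}@{}".format(self.rsync.ssh_user, node_ip),
            "docker cp {}:{} {}".format(self.container, remote_dir, staging),
        ])
        self.rsync.sync_down(node_ip, staging, local_dir)
```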

A hackier solution would be to make get_ssh_user return the right value and line up the Docker volume binding so that we can simply ignore the difference between Docker and non-Docker instances.

Source code / logs

An MWE for this is hard to provide, but if the above description is insufficient I can try to come up with one.

@AdamGleave (Contributor, Author) commented:

@richardliaw may have thoughts, as he is already triaging #4183.

richardliaw self-assigned this Mar 19, 2019

@richardliaw (Contributor) commented Mar 19, 2019:

Hey @AdamGleave,

Thanks a bunch for opening this up!

I think the sync functionality in _LogSyncer should probably not depend on the autoscaler because it seems like the non-autoscaler cluster use case is still quite prevalent.

Given this, the cleanest solution is probably to line up (or play around with) the volume binding from the autoscaler side. This way, Tune would be agnostic to whether or not we have a Docker container.

> An unrelated issue: if self.sync_func is non-None, it will get executed before the worker_to_local_sync_cmd, which I think is wrong.

(Yeah, we should fix this - opened up an issue to track)

Let me know what you think!

@AdamGleave (Contributor, Author) commented Mar 20, 2019:

Thanks for pointing out the non-autoscaler use case; I agree this means we shouldn't introduce an autoscaler dependency. I managed to work around the problem for my particular setup just by changing the autoscaler config. This makes me confident that a solution just inside ray/autoscaler/docker.py could work around the problem. Unfortunately, it doesn't seem possible to solve the problem in the general case with this approach.

To make things work smoothly, the Docker part of the autoscaler would need to make both the results directory and the username match inside and outside of Docker. The results directory is the key obstacle: although we can set the default by setting the environment variable TUNE_RESULT_DIR to an appropriate value, the user can always override the local_dir attribute of the Trial object in code. So in general, the results could be written anywhere inside the Docker container.

The username matching also poses problems. The default username Ray runs under depends on the Dockerfile, but it will be root in most cases. We could set some environment variable, e.g. RAY_SSH_USER, that exposes the username outside of Docker. However, this feels error-prone: the code depends on the username in a bunch of places. For example, when I tried to set LOGNAME and HOME (admittedly a hack), Ray could no longer find the SSH key until I added a symlink back to /root!

Given these issues, I think any changes I make to autoscaler/docker.py would be more of a patch and would likely break in some edge cases. I'm still willing to do that -- it would at least fix the particular problem I'm running into -- but I would prefer something more robust. Perhaps each Ray worker could export some function that, when called, copies directories across? Outside of Docker this would just be rsync; when running inside of Docker, it could be docker cp + rsync.

Alternatively, we could switch from a pull to a push model: since Ray already lets us run arbitrary functions on workers, we can send a function that syncs the worker data back to the local node. This has the advantage that the local node only needs to know how it can be accessed itself -- there is no need to know the details of each Ray worker.
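
A rough sketch of the push model (illustrative only; the head address and results path below are placeholder assumptions, and the workers are assumed to be able to SSH to the head):

```python
# Illustrative only: the driver ships a sync function to workers, and each worker
# rsyncs its own copy of the results back to the head node.
import subprocess

import ray

HEAD_ADDR = "ubuntu@<head-ip>"      # placeholder: however the head node is reachable
RESULTS_DIR = "/tmp/ray_results"    # placeholder: assumed results path on every node


@ray.remote
def push_results(experiment: str) -> None:
    # Runs on the worker; assumes the worker can SSH to the head node.
    subprocess.check_call([
        "rsync", "-avz",
        "{}/{}/".format(RESULTS_DIR, experiment),
        "{}:{}/{}/".format(HEAD_ADDR, RESULTS_DIR, experiment),
    ])


# Driver side: fan the sync out to the workers, e.g.
# ray.get([push_results.remote("my_experiment") for _ in range(num_workers)])
```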

Let me know what approach sounds best to you.

@richardliaw (Contributor) commented:

Hey,

> The results directory is the key obstacle: although we can set the default by setting the environment variable TUNE_RESULT_DIR to an appropriate value, the user can always override the local_dir attribute of the Trial object in code. So in general, the results could be written anywhere inside the Docker container.

Hm, in general, I think the Trial object logdir isn't intended to be mutable; we should probably hide it and expose it as a property. IMO users generally shouldn't be mutating Trial objects.

If we do that, and then tell Tune + Autoscaler users to specify in run_options something like:

- "-v ${TUNE_RESULTS_DIR}:${TUNE_RESULTS_DIR}"

then I think this resolves part of the issue?
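
Concretely, something like the following in the cluster YAML is what I mean (the docker section and run_options field are part of the autoscaler config; the image name and mount path here are just placeholders, with the env var expanded to a concrete path):

```yaml
# Sketch of the relevant part of an autoscaler cluster config; values are placeholders.
docker:
    image: "my-org/my-image:latest"
    container_name: "ray_container"
    run_options:
        # Bind-mount the results dir at the same path inside and outside the container.
        - "-v /home/ubuntu/ray_results:/home/ubuntu/ray_results"
```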

> The username matching also poses problems. The default username Ray runs under depends on the Dockerfile, but it will be root in most cases.

OK, one really quick fix for this would be to (1) detect whether you're on an autoscaling Ray cluster and (2) if so, pull the SSH user out of ray_bootstrap_config.yaml in ray/tune/cluster_info:get_ssh_user. I think for this use case, this will let you SSH into the other nodes and rsync the mounted region. I do believe that if you're using Tune + Docker + a cluster, you're most likely using the autoscaler too. We can punt on those who are running Tune + Docker without the autoscaler for now.
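
A minimal sketch of that quick fix, assuming the autoscaler has written ~/ray_bootstrap_config.yaml on the head node with its usual auth section (the fallback argument is illustrative):

```python
# Sketch only: read the SSH user from the autoscaler's bootstrap config when present,
# otherwise fall back to the current behavior.
import os

import yaml

BOOTSTRAP_CONFIG = os.path.expanduser("~/ray_bootstrap_config.yaml")


def get_ssh_user(default=None):
    if not os.path.exists(BOOTSTRAP_CONFIG):
        return default  # not on an autoscaling cluster; keep existing behavior
    with open(BOOTSTRAP_CONFIG) as f:
        config = yaml.safe_load(f)
    return config.get("auth", {}).get("ssh_user", default)
```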

Does this work?

> Alternatively, we could switch from a pull to a push model: since Ray already lets us run arbitrary functions on workers, we can send a function that syncs the worker data back to the local node. This has the advantage that the local node only needs to know how it can be accessed itself -- there is no need to know the details of each Ray worker.

Hm, distributing the syncing would make it a bit harder to debug.

I realize this is a bit different from what you're proposing, but this would basically keep Tune/Ray agnostic of Docker (no need for docker cp, container name handling, etc.). Let me know your thoughts!

@AdamGleave (Contributor, Author) commented:

Thanks for the response. I trust your judgement, given that you have much more familiarity with the Ray codebase and with pain points in distributed development in general. I'm mostly happy with the proposed plan: in particular, pulling the SSH user out of the bootstrap config seems easy and should work in most cases. However, I don't think I was clear enough about the issue with local_dir being configurable, and I'm still not sure how to fix this.

> Hm, in general, I think the Trial object logdir isn't intended to be mutable; we should probably hide it and expose it as a property. IMO users generally shouldn't be mutating Trial objects.

The issue I was worried about wasn't users mutating the attributes, but that local_dir can be set to an arbitrary value in the constructor. For example, create_trial_from_spec in ray/tune/config_parser.py gets it from the spec dict.

It does seem useful to give users a choice of where to save results. However, right now it feels like local_dir is doing double-duty: a temporary working directory (for workers), and also the final repository of results on the driver. Perhaps we could delete local_dir (hardcode it to some directory, maybe in /tmp) and make upload_dir the canonical place to store final results? Right now upload_dir only supports uploading to S3/GCS, but it seems easy to make it also support saving to a local directory. Alternatively, we could keep local_dir but only use it as the destination to sync to on the driver, not e.g. using local_dir inside of Trial.init_logger.

@richardliaw (Contributor) commented Mar 21, 2019:

Hey,

> The issue I was worried about wasn't users mutating the attributes, but that local_dir can be set to an arbitrary value in the constructor.

Oh OK, I see. One option is to have TUNE_RESULTS_DIR forcefully override, or throw an error if both the env var and the constructor value are set.
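
For example, something along these lines (illustrative sketch only; the env var name follows the discussion above):

```python
# Illustrative sketch of "env var wins, or error out on a conflict".
import os


def resolve_local_dir(local_dir_from_user=None):
    env_dir = os.environ.get("TUNE_RESULTS_DIR")
    if env_dir and local_dir_from_user and env_dir != local_dir_from_user:
        raise ValueError(
            "local_dir is set both via TUNE_RESULTS_DIR and in code; "
            "pick one so the Docker volume mount stays consistent.")
    return env_dir or local_dir_from_user or os.path.expanduser("~/ray_results")
```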

> However, right now it feels like local_dir is doing double-duty: a temporary working directory (for workers), and also the final repository of results on the driver.

Yeah; especially in this Docker case I see the issue. Although I think this is mainly a non-issue given the above resolution?

> Alternatively, we could keep local_dir but only use it as the destination to sync to on the driver, not e.g. using local_dir inside of Trial.init_logger.

I do think we should keep local_dir and upload_dir, as the upload_dir should shadow the local_dir. Hmm, we would have to special-case the situation where the trainable worker is placed on the head node, which is probably the most common case (and hence an extra cp seems weird). But maybe there's a clean way to do this.

cc @hartikainen @ericl if you have any thoughts on this.

@richardliaw (Contributor) commented:

@ijrsvt can you post the workaround for this?

@richardliaw (Contributor) commented:

I'm going to reopen this until we actually have a workaround documented.

richardliaw reopened this Jul 21, 2020

@richardliaw (Contributor) commented:

Actually, the fix for this should be to use the DockerCommandRunner and KubernetesCommandRunner for rsync rather than the default rsync. This should be an easy fix (but is still not implemented).
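
Conceptually, something like the sketch below (illustrative only: constructing a configured DockerCommandRunner/KubernetesCommandRunner is elided, and the rsync-down call follows the CommandRunner-style interface rather than a verified signature):

```python
# Sketch only: route Tune's log syncing through whichever command runner the cluster
# config selects, instead of shelling out to raw rsync.
def sync_trial_logs(command_runner, remote_logdir: str, local_logdir: str) -> None:
    # The Docker/Kubernetes runners know how to reach files inside the container or pod,
    # which is exactly what a raw rsync from the head node cannot do here.
    command_runner.run_rsync_down(remote_logdir, local_logdir)
```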

@ijrsvt (Contributor) commented Jul 23, 2020:

Is it raw rsync, or is it the autoscaler's rsync? If it is the latter, it already uses the DockerCommandRunner or KubernetesCommandRunner.

@richardliaw (Contributor) commented:

@ijrsvt it is raw rsync; we use this in Tune. However, we should probably switch to the CommandRunner interface.
