
[tune/autoscaler] _LogSyncer cannot rsync with Docker #4403

Closed
AdamGleave opened this issue Mar 18, 2019 · 11 comments · Fixed by #8806 or #11035
Labels: tune (Tune-related issues)

Comments

@AdamGleave (Contributor) commented Mar 18, 2019:

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.4
  • Python version: 3.6.7

Describe the problem

In _LogSyncer, sync_to_worker_if_possible and sync_now use rsync to transfer logs between the local node and the worker. This breaks when using Docker, since:

  • If the local node is in Docker, it will typically have the root username, and so that is what get_ssh_user will return. But we typically cannot log in to the worker node as root.
  • The local_dir on the worker is inside the Docker container and may not even be visible outside. If it is bind-mounted, it will typically be at a different path.
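
To make the failure concrete, here is a rough illustration (hypothetical code, not Ray's actual implementation) of the values the sync logic ends up working with when the driver runs inside a container:

```python
# Hypothetical illustration only -- not Ray's actual code. Inside a Docker container,
# user detection and the default results path both resolve to container-local values.
import getpass
import os

ssh_user = getpass.getuser()                      # usually "root" inside the container
local_dir = os.path.expanduser("~/ray_results")   # e.g. /root/ray_results, container-only

# An rsync command built from these values tries to log in to the worker *host* as
# root and to read a path that may not exist (or exist elsewhere) outside the container.
sync_cmd = "rsync -avz {user}@{ip}:{path}/ {path}/".format(
    user=ssh_user, ip="<worker-ip>", path=local_dir)
print(sync_cmd)
```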

An unrelated issue: if self.sync_func is non-None, it will get executed before the worker_to_local_sync_cmd, which I think is wrong.

I'd be happy to take a stab at a PR, but I'd appreciate some suggestions on the right way of fixing this, as it's been a while since I've looked at Ray internals. This also feels like a problem that is likely to recur with slight variations; e.g., this bug is similar to #4183.

Perhaps we can make the autoscaler provide an abstract sync interface that Tune and other consumers can use. This could map to rsync in the standard case and something more complex in the Docker case (e.g. docker cp followed by rsync). ray.autoscaler.commands.rsync is already something along these lines -- would that be an appropriate place to modify?
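
For illustration, the interface I have in mind could look roughly like this (all class and method names here are hypothetical, not an existing Ray API):

```python
# Hypothetical sketch of an abstract sync interface; names are made up for illustration.
import subprocess
from abc import ABC, abstractmethod


class NodeSyncer(ABC):
    @abstractmethod
    def sync_down(self, node_ip: str, remote_dir: str, local_dir: str) -> None:
        """Copy remote_dir on node_ip into local_dir on this machine."""


class RsyncSyncer(NodeSyncer):
    """Standard (non-Docker) case: plain rsync over SSH."""

    def __init__(self, ssh_user: str, ssh_key: str):
        self.ssh_user, self.ssh_key = ssh_user, ssh_key

    def sync_down(self, node_ip, remote_dir, local_dir):
        subprocess.check_call([
            "rsync", "-avz", "-e", "ssh -i {}".format(self.ssh_key),
            "{}@{}:{}/".format(self.ssh_user, node_ip, remote_dir), local_dir,
        ])


class DockerSyncer(NodeSyncer):
    """Docker case: docker cp out of the container first, then rsync the staged copy."""

    def __init__(self, rsync: RsyncSyncer, container: str):
        self.rsync, self.container = rsync, container

    def sync_down(self, node_ip, remote_dir, local_dir):
        # Copy the directory out of the container to a host-side staging dir,
        # then rsync that staging dir back as in the standard case.
        staging = "/tmp/ray_sync_staging"
        subprocess.check_call([
            "ssh", "-i", self.rsync.ssh_key,
            "{}@{}".format(self.rsync.ssh_user, node_ip),
            "docker cp {}:{} {}".format(self.container, remote_dir, staging),
        ])
        self.rsync.sync_down(node_ip, staging, local_dir)
```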

A hackier solution would be to make get_ssh_user return the right value and line up the Docker volume binding so that we can simply ignore the difference between Docker and non-Docker instances.

Source code / logs

An MWE for this is hard to provide, but if the above description is insufficient I can try to come up with one.

@AdamGleave (Contributor, Author) commented:

@richardliaw may have thoughts, as he is already triaging #4183.

richardliaw self-assigned this Mar 19, 2019

@richardliaw (Contributor) commented Mar 19, 2019:

Hey @AdamGleave,

Thanks a bunch for opening this up!

I think the sync functionality in _LogSyncer should probably not depend on the autoscaler because it seems like the non-autoscaler cluster use case is still quite prevalent.

Given this, the cleanest solution is probably to line up (or play around with) the volume binding from the autoscaler side. This way, Tune would be agnostic to whether or not we have a Docker container.

> An unrelated issue: if self.sync_func is non-None, it will get executed before the worker_to_local_sync_cmd, which I think is wrong.

(Yeah, we should fix this - opened up an issue to track)

Let me know what you think!

@AdamGleave (Contributor, Author) commented Mar 20, 2019:

Thanks for pointing out the non-autoscaler use case; I agree this means we shouldn't introduce an autoscaler dependency. I managed to work around the problem for my particular setup just by changing the autoscaler config. This makes me confident that a solution just inside ray/autoscaler/docker.py could work around the problem. Unfortunately, it doesn't seem possible to solve the problem in the general case with this approach.

To make things work smoothly, the Docker part of the autoscaler would need to make both the results directory and the username match inside and outside of Docker. The results directory is the key obstacle: although we can set the default by setting the environment variable TUNE_RESULT_DIR to an appropriate value, the user can always override the local_dir attribute of the Trial object in code. So in general, the results could be written anywhere inside the Docker container.

The username matching also poses problems. The default username Ray runs under depends on the Dockerfile, but it will be root in most cases. We could set some environment variable, e.g. RAY_SSH_USER, that exposes the username outside of Docker. However, this feels error-prone: the code depends on the username in a bunch of places. For example, when I tried to set LOGNAME and HOME (admittedly a hack), Ray could no longer find the SSH key until I added a symlink back to /root!

Given these issues, I think any changes I make to autoscaler/docker.py would be more of a patch and would likely break in some edge cases. I'm still willing to do that -- it would at least fix the particular problem I'm running into -- but I would prefer something more robust. Perhaps each Ray worker could export some function that, when called, copies directories across? Outside of Docker this would just be rsync; when running inside of Docker, it could be docker cp + rsync.

Alternatively, we could switch from a pull to a push model: since Ray already lets us run arbitrary functions on workers, we can send a function that syncs the worker data back to the local node. This has the advantage that the local node only needs to know how it can be accessed itself -- there is no need to know the details of each Ray worker.
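
A rough sketch of the push model (illustrative only; the head address and results path below are placeholder assumptions, and the workers are assumed to be able to SSH to the head):

```python
# Illustrative only: the driver ships a sync function to workers, and each worker
# rsyncs its own copy of the results back to the head node.
import subprocess

import ray

HEAD_ADDR = "ubuntu@<head-ip>"      # placeholder: however the head node is reachable
RESULTS_DIR = "/tmp/ray_results"    # placeholder: assumed results path on every node


@ray.remote
def push_results(experiment: str) -> None:
    # Runs on the worker; assumes the worker can SSH to the head node.
    subprocess.check_call([
        "rsync", "-avz",
        "{}/{}/".format(RESULTS_DIR, experiment),
        "{}:{}/{}/".format(HEAD_ADDR, RESULTS_DIR, experiment),
    ])


# Driver side: fan the sync out to the workers, e.g.
# ray.get([push_results.remote("my_experiment") for _ in range(num_workers)])
```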

Let me know what approach sounds best to you.

@richardliaw (Contributor) commented:

Hey,

> The results directory is the key obstacle: although we can set the default by setting the environment variable TUNE_RESULT_DIR to an appropriate value, the user can always override the local_dir attribute of the Trial object in code. So in general, the results could be written anywhere inside the Docker container.

Hm, in general, I think the Trial object logdir isn't intended to be mutable; we should probably hide it and expose it as a property. IMO users generally shouldn't be mutating Trial objects.

If we do that, and then tell Tune + Autoscaler users to specify in run_options something like:

- "-v ${TUNE_RESULTS_DIR}:${TUNE_RESULTS_DIR}"

then I think this resolves part of the issue?
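
Concretely, something like the following in the cluster YAML is what I mean (the docker section and run_options field are part of the autoscaler config; the image name and mount path here are just placeholders, with the env var expanded to a concrete path):

```yaml
# Sketch of the relevant part of an autoscaler cluster config; values are placeholders.
docker:
    image: "my-org/my-image:latest"
    container_name: "ray_container"
    run_options:
        # Bind-mount the results dir at the same path inside and outside the container.
        - "-v /home/ubuntu/ray_results:/home/ubuntu/ray_results"
```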

> The username matching also poses problems. The default username Ray runs under depends on the Dockerfile, but it will be root in most cases.

OK, one really quick fix for this would be to (1) detect whether you're on an autoscaling Ray cluster and (2) if so, pull the SSH user out of ray_bootstrap_config.yaml in ray/tune/cluster_info:get_ssh_user. I think for this use case, this will let you SSH into the other nodes and rsync the mounted region. I do believe that if you're using Tune + Docker + a cluster, you're most likely using the autoscaler too. We can punt on those who are running Tune + Docker without the autoscaler for now.
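
A minimal sketch of that quick fix, assuming the autoscaler has written ~/ray_bootstrap_config.yaml on the head node with its usual auth section (the fallback argument is illustrative):

```python
# Sketch only: read the SSH user from the autoscaler's bootstrap config when present,
# otherwise fall back to the current behavior.
import os

import yaml

BOOTSTRAP_CONFIG = os.path.expanduser("~/ray_bootstrap_config.yaml")


def get_ssh_user(default=None):
    if not os.path.exists(BOOTSTRAP_CONFIG):
        return default  # not on an autoscaling cluster; keep existing behavior
    with open(BOOTSTRAP_CONFIG) as f:
        config = yaml.safe_load(f)
    return config.get("auth", {}).get("ssh_user", default)
```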

Does this work?

> Alternatively, we could switch from a pull to a push model: since Ray already lets us run arbitrary functions on workers, we can send a function that syncs the worker data back to the local node. This has the advantage that the local node only needs to know how it can be accessed itself -- there is no need to know the details of each Ray worker.

Hm, distributing the syncing would make it a bit harder to debug.

I realize this is a bit different from what you're proposing, but this would basically keep Tune/Ray agnostic of Docker (no need for docker cp, container name handling, etc.). Let me know your thoughts!

@AdamGleave (Contributor, Author) commented:

Thanks for the response. I trust your judgement, given that you have much more familiarity with the Ray codebase and with pain points in distributed development in general. I'm mostly happy with the proposed plan: in particular, pulling the SSH user out of the bootstrap config seems easy and should work in most cases. However, I don't think I was clear enough about the issue with local_dir being configurable, and I'm still not sure how to fix this.

> Hm, in general, I think the Trial object logdir isn't intended to be mutable; we should probably hide it and expose it as a property. IMO users generally shouldn't be mutating Trial objects.

The issue I was worried about wasn't users mutating the attributes, but that local_dir can be set to an arbitrary value in the constructor. For example, create_trial_from_spec in ray/tune/config_parser.py gets it from the spec dict.

It does seem useful to give users a choice of where to save results. However, right now it feels like local_dir is doing double-duty: a temporary working directory (for workers), and also the final repository of results on the driver. Perhaps we could delete local_dir (hardcode it to some directory, maybe in /tmp) and make upload_dir the canonical place to store final results? Right now upload_dir only supports uploading to S3/GCS, but it seems easy to make it also support saving to a local directory. Alternatively, we could keep local_dir but only use it as the destination to sync to on the driver, not e.g. using local_dir inside of Trial.init_logger.

@richardliaw (Contributor) commented Mar 21, 2019:

Hey,

> The issue I was worried about wasn't users mutating the attributes, but that local_dir can be set to an arbitrary value in the constructor.

Oh OK, I see. One option is to have TUNE_RESULTS_DIR forcefully override, or throw an error if both the env var and the constructor value are set.
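
For example, something along these lines (illustrative sketch only; the env var name follows the discussion above):

```python
# Illustrative sketch of "env var wins, or error out on a conflict".
import os


def resolve_local_dir(local_dir_from_user=None):
    env_dir = os.environ.get("TUNE_RESULTS_DIR")
    if env_dir and local_dir_from_user and env_dir != local_dir_from_user:
        raise ValueError(
            "local_dir is set both via TUNE_RESULTS_DIR and in code; "
            "pick one so the Docker volume mount stays consistent.")
    return env_dir or local_dir_from_user or os.path.expanduser("~/ray_results")
```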

> However, right now it feels like local_dir is doing double-duty: a temporary working directory (for workers), and also the final repository of results on the driver.

Yeah; especially in this Docker case I see the issue. Although I think this is mainly a non-issue given the above resolution?

> Alternatively, we could keep local_dir but only use it as the destination to sync to on the driver, not e.g. using local_dir inside of Trial.init_logger.

I do think we should keep local_dir and upload_dir, as the upload_dir should shadow the local_dir. Hmm, we would have to special-case the situation where the trainable worker is placed on the head node, which is probably the most common case (and hence an extra cp seems weird). But maybe there's a clean way to do this.

cc @hartikainen @ericl if you have any thoughts on this.

@richardliaw (Contributor) commented:

@ijrsvt can you post the workaround for this?

@richardliaw (Contributor) commented:

I'm going to reopen this until we actually have a workaround documented.

richardliaw reopened this Jul 21, 2020

@richardliaw (Contributor) commented:

Actually, the fix for this should be to use the DockerCommandRunner and KubernetesCommandRunner for rsync rather than the default rsync. This should be an easy fix (but is still not implemented).
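
Conceptually, something like the sketch below (illustrative only: constructing a configured DockerCommandRunner/KubernetesCommandRunner is elided, and the rsync-down call follows the CommandRunner-style interface rather than a verified signature):

```python
# Sketch only: route Tune's log syncing through whichever command runner the cluster
# config selects, instead of shelling out to raw rsync.
def sync_trial_logs(command_runner, remote_logdir: str, local_logdir: str) -> None:
    # The Docker/Kubernetes runners know how to reach files inside the container or pod,
    # which is exactly what a raw rsync from the head node cannot do here.
    command_runner.run_rsync_down(remote_logdir, local_logdir)
```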

@ijrsvt (Contributor) commented Jul 23, 2020:

Is it raw rsync, or is it the autoscaler's rsync? If it is the latter, it already uses the DockerCommandRunner or KubernetesCommandRunner.

@richardliaw (Contributor) commented:

@ijrsvt it is raw rsync; we use this in Tune. However, we should probably switch to the CommandRunner interface.
