[tune/autoscaler] _LogSyncer cannot rsync with Docker #4403
@richardliaw may have thoughts, as he is already triaging #4183.
Hey @AdamGleave, thanks a bunch for opening this up! Given the sync functionality in Tune, the cleanest solution is probably to line up (or play around with) the volume binding from the autoscaler side. This way, Tune would be agnostic to whether or not we have a Docker container.
(Yeah, we should fix this -- opened up an issue to track it.) Let me know what you think!
Thanks for pointing out the non-autoscaler use case; I agree this means we shouldn't introduce an autoscaler dependency.

To make things work smoothly, the Docker part of the autoscaler would need to make both the results directory and the username match inside and outside of Docker. The results directory is the key obstacle: although we can set the default by setting an environment variable, there is no guarantee the paths will line up. The username matching also poses problems: the default username Ray runs under depends on the Dockerfile, and will often be root.

Given these issues, lining things up from the autoscaler side looks difficult. Alternatively, we could switch from a pull to a push model: since Ray already lets us run arbitrary functions on workers, we can send a function that syncs the worker data back to the local node. This has the advantage that the local node only needs to know how it itself can be accessed -- no need to know the details of each of the Ray workers.

Let me know what approach sounds best to you.
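The push model described above could be sketched roughly as follows. This is not real Ray/Tune API; `make_push_sync_fn`, `head_user`, `head_addr`, and `results_dir` are all illustrative names, and the commented-out `ray.remote` call at the bottom only gestures at how the function would be shipped to a worker.

```python
import subprocess

def make_push_sync_fn(head_user, head_addr, results_dir, dry_run=True):
    """Build a function a worker can run to push its results to the head node."""
    def sync_to_head():
        # The worker only needs to know how the head node is reachable;
        # the head never needs SSH or Docker details for each worker.
        cmd = [
            "rsync", "-az",
            results_dir.rstrip("/") + "/",
            "{}@{}:{}".format(head_user, head_addr, results_dir),
        ]
        if not dry_run:
            subprocess.check_call(cmd)
        return cmd
    return sync_to_head

# With Ray, one would run this on the worker, roughly:
#   ray.remote(make_push_sync_fn("ubuntu", "head-ip", "/tmp/ray_results")).remote()
```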
Hey,
Hm, in general, I think the Trial object should be the source of truth here. If we do that, and then tell Tune + Autoscaler users to specify this in the cluster config, then I think this resolves part of the issue?
OK, one really quick fix for this would be to: 1) detect if you're on an autoscaling Ray cluster, and 2) if so, pull the SSH user out of the ray_bootstrap_config.yaml.

Does this work?
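The quick fix above could look roughly like this. It assumes the bootstrap config follows the standard cluster-config layout (`ssh_user` under `auth:`); the function name and the line-scan approach are illustrative -- a real implementation would use a YAML parser.

```python
import re

def ssh_user_from_bootstrap_config(config_text):
    """Extract the ssh_user value from a ray_bootstrap_config.yaml's text."""
    # In cluster configs, ssh_user lives under the auth: section; a
    # simple multiline scan for the key is enough for this sketch.
    match = re.search(r"^\s*ssh_user:\s*(\S+)", config_text, re.MULTILINE)
    return match.group(1) if match else None

example = """\
cluster_name: default
auth:
  ssh_user: ubuntu
"""
# ssh_user_from_bootstrap_config(example) -> "ubuntu"
```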
Hm, distributing the syncing would make it a bit harder to debug. I realize this is a bit different from what you're proposing, but this would basically keep Tune/Ray agnostic of Docker (no need to special-case Docker).
Thanks for the response. I trust your judgement given you have much more familiarity with the Ray codebase and pain points in distributed development in general. I'm mostly happy with the proposed plan: in particular, pulling the SSH user out of the bootstrap config seems easy and should work in most cases. However, I think I wasn't clear enough about the issue with the results directory.
The issue I was worried about wasn't users mutating the attributes, but that the environment variable and the constructor value can be set independently and conflict.

It does seem to be useful to give users an option of where to save results. However, right now it feels like the precedence between the two is unclear.
Hey,
Oh ok, I see. One option is to have TUNE_RESULTS_DIR forcefully override, or to throw an error if both the env var and the constructor value are set.
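The precedence rule being proposed could be sketched like this (not Tune's actual resolution logic; `resolve_results_dir` and the default path are illustrative, while `TUNE_RESULTS_DIR` is the env var named above):

```python
import os

DEFAULT_RESULTS_DIR = "~/ray_results"  # illustrative default

def resolve_results_dir(constructor_value=None):
    """Resolve the results dir: the env var wins, but setting both to
    conflicting values is an error rather than a silent override."""
    env_value = os.environ.get("TUNE_RESULTS_DIR")
    if env_value and constructor_value and env_value != constructor_value:
        raise ValueError(
            "TUNE_RESULTS_DIR={} conflicts with constructor value {}".format(
                env_value, constructor_value))
    return env_value or constructor_value or DEFAULT_RESULTS_DIR
```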
Yeah; especially in this Docker case I see the issue. Although I think this is mainly a non-issue given the above resolution?
I do think we should keep the constructor option.

cc @hartikainen @ericl if you have any thoughts on this.
@ijrsvt can you post the workaround for this?
I'm going to reopen this until we actually have a workaround documented.
Actually, the fix for this should be to use the DockerCommandRunner and KubernetesCommandRunner for rsync rather than the default rsync. This should be an easy fix (but is still not implemented). |
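Routing rsync through the command-runner abstraction suggested above might be dispatched roughly like this. The class names match those mentioned in the thread, but the stub classes, the `runner_class_for` helper, and the config keys used for dispatch are all illustrative assumptions:

```python
class SSHCommandRunner:
    """Default case: plain rsync over SSH."""

class DockerCommandRunner(SSHCommandRunner):
    """Docker case: copy files out of the container before syncing."""

class KubernetesCommandRunner(SSHCommandRunner):
    """Kubernetes case: sync via kubectl rather than SSH."""

def runner_class_for(cluster_config):
    """Pick a command-runner class from a cluster config dict so that
    callers (e.g. Tune) never issue raw rsync themselves."""
    if cluster_config.get("docker"):
        return DockerCommandRunner
    if cluster_config.get("provider", {}).get("type") == "kubernetes":
        return KubernetesCommandRunner
    return SSHCommandRunner
```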
Is it raw rsync?
@ijrsvt it is raw rsync; we use this in Tune. However, we should probably switch to the CommandRunner interface. |
System information
Describe the problem
In _LogSyncer, sync_to_worker_if_possible and sync_now use rsync to transfer logs between the local node and the worker. This breaks when using Docker, since:

- Docker containers typically run under the root username, and so this is what get_ssh_user will return. But we cannot typically log in to the worker node as root.
- The local_dir on the worker is inside the Docker container, and may not even be visible outside. If it is bound, then it will typically be at a different path.

An unrelated issue: if self.sync_func is non-None, it will get executed before the worker_to_local_sync_cmd, which I think is wrong.

I'd be happy to make a stab at a PR, but I'd appreciate some suggestions on the right way of fixing this, as it's been a while since I've looked at Ray internals. This also feels like a problem that is likely to reoccur with slight variation, e.g. this bug is similar to #4183.
Perhaps we can make autoscaler provide an abstract sync interface that tune and other consumers can use. This could map to rsync in the standard case, and something more complex in the Docker case (e.g. docker cp followed by rsync)? ray.autoscaler.commands.rsync is already something along these lines -- would this be an appropriate place to modify?

A more hacky solution would be to make get_ssh_user return the right value and make the Docker volume-binding line up so that we can just ignore the difference between Docker and non-Docker instances.

Source code / logs
A MWE for this is hard to provide, but if the above description is insufficient I can try to come up with one.
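The abstract sync interface floated in the issue could look roughly like the sketch below. All class names, method names, and the staging path are hypothetical; in particular, the docker cp step would have to be executed on the worker host (e.g. over SSH) before the pull-side rsync runs.

```python
import abc

class SyncClient(abc.ABC):
    """Hypothetical interface the autoscaler could expose so consumers
    like Tune stay agnostic to how a node's files are reached."""

    @abc.abstractmethod
    def sync_cmds(self, source, target):
        """Return the shell commands (each a list of args) that copy
        `source` on the worker to `target` on the local node."""

class RsyncClient(SyncClient):
    """Standard case: a single rsync over SSH."""
    def __init__(self, ssh_user, host):
        self.ssh_user, self.host = ssh_user, host

    def sync_cmds(self, source, target):
        remote = "{}@{}:{}".format(self.ssh_user, self.host, source)
        return [["rsync", "-avz", remote, target]]

class DockerRsyncClient(RsyncClient):
    """Docker case: docker cp out of the container to a staging path on
    the worker host (run remotely), then rsync the staging path."""
    STAGING = "/tmp/ray_sync_staging"  # illustrative staging path

    def __init__(self, ssh_user, host, container):
        super().__init__(ssh_user, host)
        self.container = container

    def sync_cmds(self, source, target):
        cp = ["docker", "cp",
              "{}:{}".format(self.container, source), self.STAGING]
        rsync = super().sync_cmds(self.STAGING, target)[0]
        return [cp, rsync]
```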