[tune] logger_creator fail on machines with different users (different $HOME paths) #4326

Closed
neychev opened this issue Mar 11, 2019 · 5 comments

@neychev

neychev commented Mar 11, 2019

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): from binary (pip)
  • Ray version: 0.6.3
  • Python version: 3.6.7
  • Exact command to reproduce: run an experiment with a worker on another machine that runs under a different user.

Describe the problem

Running an experiment with several workers on different machines in a small cluster fails if the user accounts on the machines differ. The master machine (the one that hosts the Redis server) broadcasts its own $HOME path to the other machines instead of letting each machine resolve $HOME locally. As a result, the workers fail to create the log folder and the experiment fails with a PermissionError.
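
A minimal sketch of the failure mode, assuming the log directory is written as `~/ray_results` and expanded once on the master. The helper `expand_as` is hypothetical and only simulates a different `$HOME` on a POSIX system; the user names are the ones that appear in the traceback below.

```python
import os


def expand_as(home, path):
    """Expand `path` the way a process with $HOME set to `home` would (POSIX)."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = home
    try:
        return os.path.expanduser(path)
    finally:
        # Restore the real environment so the sketch has no side effects.
        if saved is None:
            os.environ.pop("HOME", None)
        else:
            os.environ["HOME"] = saved


# Assumed log location, written relative to the user's home directory.
LOG_DIR = "~/ray_results"

# Expanded on the master: this string is what gets shipped to the workers.
driver_path = expand_as("/home/rads", LOG_DIR)

# What the worker would get if it expanded the same "~" path itself.
worker_path = expand_as("/home/tesq", LOG_DIR)

print("shipped by master: ", driver_path)
print("valid on the worker:", worker_path)
```

The worker user has no write access under the master user's home directory, which matches the `Permission denied: '/home/rads'` error in the log.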

Source code / logs

2019-03-11 09:23:59,683	ERROR trial_runner.py:413 -- Error processing event.
Traceback (most recent call last):
  File "/home/rads/miniconda3/envs/py3_prod/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 378, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/rads/miniconda3/envs/py3_prod/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 228, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/rads/miniconda3/envs/py3_prod/lib/python3.6/site-packages/ray/worker.py", line 2132, in get
    raise value
ray.worker.RayTaskError: ray_worker (pid=10058, host=***)
  File "/home/tesq/miniconda3/envs/py3_prod/lib/python3.6/site-packages/ray/tune/trainable.py", line 70, in __init__
    self._result_logger = logger_creator(self.config)
  File "/home/rads/miniconda3/envs/py3_prod/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 48, in logger_creator
  File "/home/tesq/miniconda3/envs/py3_prod/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/home/tesq/miniconda3/envs/py3_prod/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/home/tesq/miniconda3/envs/py3_prod/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/home/tesq/miniconda3/envs/py3_prod/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/rads'

Current workaround (technically, a crutch rather than a fix)

Specify the local_dir parameter in the tune.Experiment declaration and create the corresponding directory on every machine.
This approach still requires root privileges on the worker machines (to create a directory outside the user's $HOME), but at least it makes the scripts robust to a change of user on the machines running the experiments.
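
A rough sketch of a pre-flight check for this workaround, which could be run on each machine before launching. The function `check_local_dir` and the directory used below are hypothetical and not part of Ray; the only Ray-specific detail assumed is the `local_dir` parameter of `tune.Experiment` mentioned above.

```python
import os
import tempfile


def check_local_dir(path):
    """Return a list of reasons why `path` is unsuitable as a shared local_dir."""
    problems = []
    if path.startswith("~"):
        problems.append("starts with '~', which expands differently per user")
    if not os.path.isabs(path):
        problems.append("is not an absolute path")
    elif not os.path.isdir(path):
        problems.append("does not exist on this machine")
    elif not os.access(path, os.W_OK | os.X_OK):
        problems.append("is not writable by the current user")
    return problems


# A home-relative path is rejected because it is user-specific.
home_relative = check_local_dir("~/ray_results")

# The system temp directory stands in for a directory created on every node.
shared_dir = tempfile.gettempdir()
shared = check_local_dir(shared_dir)

print(home_relative)
print(shared)

# The validated path would then be passed as local_dir, for example:
# tune.Experiment(..., local_dir=shared_dir)
```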

@richardliaw
Contributor

Hmm, this can be done by avoiding the expanduser call until called within the Trainable.

Would you be interested in pushing a fix?
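
The idea of deferring the expansion can be sketched roughly as follows. This is not the actual Ray code or the eventual fix: `make_logger_creator` is a hypothetical factory, the returned callable hands back a directory path rather than a logger object, and `~/ray_results` is an assumed default. The worker is simulated by temporarily pointing `$HOME` at a scratch directory (POSIX behaviour).

```python
import os
import tempfile


def make_logger_creator(local_dir="~/ray_results"):
    """Build a logger_creator that keeps `local_dir` unexpanded until it is called."""

    def logger_creator(config):
        # expanduser runs in whichever process invokes this closure, so "~"
        # resolves against that process's own $HOME, not the driver's.
        logdir = os.path.expanduser(local_dir)
        os.makedirs(logdir, exist_ok=True)
        return logdir

    return logger_creator


# Built on the driver; only the unexpanded "~/ray_results" string is captured.
creator = make_logger_creator()

# Simulate a worker that runs under a different user.
saved_home = os.environ.get("HOME")
with tempfile.TemporaryDirectory() as fake_home:
    os.environ["HOME"] = fake_home
    try:
        worker_logdir = creator({})
        created = os.path.isdir(worker_logdir)
        under_worker_home = worker_logdir.startswith(fake_home)
    finally:
        # Restore the real $HOME.
        if saved_home is None:
            os.environ.pop("HOME", None)
        else:
            os.environ["HOME"] = saved_home

print(worker_logdir, created, under_worker_home)
```

Because the closure captures the literal `~` path, the directory is created under the home of whoever runs the Trainable, so no path from the master leaks to the workers.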

@richardliaw richardliaw self-assigned this Mar 14, 2019
@neychev
Author

neychev commented Mar 26, 2019

> Hmm, this can be done by avoiding the expanduser call until called within the Trainable.
>
> Would you be interested in pushing a fix?

It would be great to contribute. I'll try to dive into it over the weekend.

@goshaQ

goshaQ commented Apr 26, 2019

How is your progress on this issue? I can submit a PR if you haven't managed to fix it yet.

@neychev
Author

neychev commented May 24, 2019

It seems that PR #4806 fixes the issue.

@richardliaw
Contributor

I think this is closed by #4806.
