[Core] Ray start/stop/start does not work when using a custom temporary folder #27021

Closed
jbedorf opened this issue Jul 26, 2022 · 1 comment · Fixed by #27666
Labels: bug, core, release-blocker

Contributor

jbedorf commented Jul 26, 2022

What happened + What you expected to happen

The changes introduced here have the side effect that the following sequence no longer works:

Start a cluster with a custom temporary folder: ray start --head --temp-dir /tmp/bla
Stop any active Ray cluster with the usual command: ray stop

Nothing is running at this point, but when you next run: ray start --head --temp-dir /tmp/bla

You get this error:
ConnectionError: Ray is trying to start at 192.168.XXX.XXX:6379, but is already running at 192.168.XXX.XXX:6379.

This is because the file ray_current_cluster is not deleted when a non-default temporary directory is used. The stop command does not let you specify the temp-dir, so it never finds the ray_current_cluster file. When you then try to start Ray again, it fails because the file is still there. You can work around this by manually deleting the file or by specifying a different port.
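The failure mode described above can be modeled in a few lines. This is a toy sketch, not Ray's code: `start_head`, `stop`, `remove_stale_marker`, and the address `192.168.0.1:6379` are hypothetical names and placeholders, and the assumption that the marker file lives directly inside the temp dir is made only for illustration. It shows why a stop command that only knows the default directory leaves the marker behind, and how deleting the marker by hand unblocks the next start.

```python
import tempfile
from pathlib import Path

ADDRESS_FILE = "ray_current_cluster"  # file name taken from the report


def start_head(temp_dir: Path, address: str) -> None:
    """Toy stand-in for `ray start --head --temp-dir`.

    Refuses to start if the marker file already records the same address.
    """
    marker = temp_dir / ADDRESS_FILE
    if marker.exists() and marker.read_text().strip() == address:
        raise ConnectionError(
            f"Ray is trying to start at {address}, "
            f"but is already running at {address}."
        )
    temp_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(address)


def stop(default_temp_dir: Path) -> None:
    """Toy stand-in for `ray stop`: it only cleans the default temp dir."""
    (default_temp_dir / ADDRESS_FILE).unlink(missing_ok=True)


def remove_stale_marker(temp_dir: Path) -> None:
    """The manual workaround: delete the leftover marker file."""
    (temp_dir / ADDRESS_FILE).unlink(missing_ok=True)


def reproduce(apply_workaround: bool) -> bool:
    """Return True if the second start fails, False if it succeeds."""
    address = "192.168.0.1:6379"  # placeholder address
    with tempfile.TemporaryDirectory() as root:
        default_dir = Path(root) / "ray"  # hypothetical default location
        custom_dir = Path(root) / "bla"
        start_head(custom_dir, address)
        stop(default_dir)  # never looks inside custom_dir
        if apply_workaround:
            remove_stale_marker(custom_dir)
        try:
            start_head(custom_dir, address)
        except ConnectionError:
            return True
        return False
```

Calling `reproduce(False)` follows the start/stop/start sequence from the report, while `reproduce(True)` applies the manual cleanup before the second start.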

Versions / Dependencies

Nightly

Reproduction script

ray start --head --temp-dir /tmp/bla
ray stop
ray start --head --temp-dir /tmp/bla

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jbedorf added the bug and triage labels on Jul 26, 2022
Collaborator

jjyao commented Aug 1, 2022

@scv119 @stephanie-wang seems a release blocker since it's a regression.

jjyao added the core and release-blocker labels and removed the triage label on Aug 1, 2022
stephanie-wang self-assigned this on Aug 2, 2022
scv119 assigned scv119 and unassigned stephanie-wang on Aug 6, 2022
stephanie-wang added a commit that referenced this issue Aug 9, 2022
Signed-off-by: Stephanie Wang [email protected]

Cluster address is now written to a temp file. Previously we raised an error if ray start --head tried to reuse the old cluster address in the temp file, even if Ray was no longer running. This PR allows ray start --head to continue if it can't find any GCS process associated with the recorded cluster address.
Related issue number

Closes #27021.
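The behavior described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the commit says Ray looks for a GCS process associated with the recorded address, whereas this sketch substitutes a plain TCP connection probe as the liveness check. `is_listening`, `start_head`, and the `probe` parameter are hypothetical names introduced here.

```python
import socket
from pathlib import Path

ADDRESS_FILE = "ray_current_cluster"  # file name taken from the report


def is_listening(address: str, timeout: float = 0.5) -> bool:
    """Stand-in liveness check: can we open a TCP connection to host:port?"""
    host, port = address.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def start_head(temp_dir: Path, address: str, probe=is_listening) -> str:
    """Toy `ray start --head`: a stale marker no longer blocks startup.

    The error is raised only when the recorded address matches and the
    probe reports that something is still alive there.
    """
    marker = temp_dir / ADDRESS_FILE
    if marker.exists():
        recorded = marker.read_text().strip()
        if recorded == address and probe(recorded):
            raise ConnectionError(
                f"Ray is trying to start at {address}, "
                f"but is already running at {recorded}."
            )
    # Either no marker, a different address, or a stale one: (re)write it.
    temp_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(address)
    return address
```

Passing the probe as a parameter keeps the sketch testable without a network: a probe that always returns `False` models a stopped cluster with a leftover file, and one that returns `True` models a cluster that is still running.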
scv119 pushed a commit that referenced this issue Aug 10, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this issue Aug 18, 2022
…ect#27666)

Signed-off-by: Stefan van der Kleij <[email protected]>