[core] ray.init defaults to an existing Ray instance if there is one #26678

stephanie-wang · 2022-07-18T21:22:29Z

Signed-off-by: Stephanie Wang [email protected]

Why are these changes needed?

ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through ray start. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:

When ray start is called, the started cluster address was already written to a file called /tmp/ray/ray_current_cluster. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on ray stop. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None.
When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for ray.client use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:

(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)

Note that this changes the dashboard URL to be printed with ray.init() instead of when the dashboard is first started.

Related issue number

Closes #26329.

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Stephanie Wang <[email protected]>

ericl

IMO, it's a bit sketchy to run the existing autodetect logic by default, since it's effectively grepping ps aux output. The two issues I see are:

It's possible for there to be a false positive.
If a cluster died, then it wouldn't autodetect it correctly (false negative).

How about the following instead?

ray start writes the created $RAY_ADDRESS to /tmp/ray/default_ray_address
we check /tmp/ray/default_ray_address if $RAY_ADDRESS isn't defined
ray stop deletes $RAY_ADDRESS

stephanie-wang · 2022-07-18T22:24:57Z

IMO, it's a bit sketchy to run the existing autodetect logic by default, since it's effectively grepping ps aux output. The two issues I see are:
1. It's possible for there to be a false positive.

2. If a cluster died, then it wouldn't autodetect it correctly (false negative).
How about the following instead?
* `ray start` writes the created $RAY_ADDRESS to `/tmp/ray/default_ray_address`

* we check `/tmp/ray/default_ray_address` if $RAY_ADDRESS isn't defined

* `ray stop` deletes $RAY_ADDRESS

If there are these issues, then shouldn't we just fix autodetect to use the default_ray_address? It doesn't seem great to have all these different ways of finding which Ray address to connect to.

Also, this seems like it will work to fix (2) since we can throw an error if RAY_ADDRESS is unreachable. But it could also cause problems if Ray crashes (so no one calls ray stop), and then we later want to call ray.init() to start a new local instance. The alternative is to just log a louder warning if there is no instance found, which isn't much different from the current implementation.

ericl · 2022-07-18T23:00:04Z

If there are these issues, then shouldn't we just fix autodetect to use the default_ray_address? It doesn't seem great to have all these different ways of finding which Ray address to connect to.

Yes, I think this would be ideal.

Also, this seems like it will work to fix (2) since we can throw an error if RAY_ADDRESS is unreachable. But it could also cause problems if Ray crashes (so no one calls ray stop), and then we later want to call ray.init() to start a new local instance.

You could tell the user to run "ray stop". The main thing is avoiding the scenario where because Ray crashed, we silently start a new Ray instance, which is actually not the real Ray cluster (prefer to raise an error rather than do something confusing).

The alternative is to just log a louder warning if there is no instance found, which isn't much different from the current implementation.

This could also work, but for this to work we'd need to still implement the /tmp/ray/default_address feature right? Otherwise, there is no way to tell if there is no instance found due to a Ray failure vs just running on a laptop. We shouldn't log a warning in the laptop case, since those users will never want to connect to an existing Ray cluster.

stephanie-wang · 2022-07-18T23:08:45Z

Hmm okay, that sounds good. So let me update this PR to also update the autodetect implementation.

Signed-off-by: Stephanie Wang <[email protected]>

ericl

Just tried this out again. I think this will be a big improvement. A couple more usability suggestions on the log messages:

When starting a new cluster, we now print two lines instead of one, but the second address is not really useful for local clusters. How about we combine them?

>>> ray.init()
View the Ray dashboard at http://127.0.0.1:8265
Started a new local Ray instance at address: 192.168.5.130:61720

For example:
Started a new Ray cluster with dashboard at: http://127.0.0.1:8265

Should we have a period or "..." at the end of the connecting message? Like

Connecting to existing Ray cluster at address: 192.168.5.130:6379., or
Connecting to existing Ray cluster at address: 192.168.5.130:6379....

The current log message seems a little abrupt.

ericl · 2022-07-19T21:44:47Z

Btw, do we need to update any docs?

stephanie-wang · 2022-07-19T23:33:28Z

Just tried this out again. I think this will be a big improvement. A couple more usability suggestions on the log messages:

1. When starting a new cluster, we now print two lines instead of one, but the second address is not really useful for local clusters. How about we combine them?

>>> ray.init()
View the Ray dashboard at http://127.0.0.1:8265
Started a new local Ray instance at address: 192.168.5.130:61720

Ah I see, I actually don't get the first message because I don't have the dashboard built. But yeah, I'll remove the new log message.


For example: `Started a new Ray cluster with dashboard at: http://127.0.0.1:8265`

    2. Should we have a period or "..." at the end of the connecting message? Like


`Connecting to existing Ray cluster at address: 192.168.5.130:6379.`, or `Connecting to existing Ray cluster at address: 192.168.5.130:6379...`.

The current log message seems a little abrupt.

Good idea!

Btw, do we need to update any docs?

Hmm let me check.

…ay-project#26678) ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one. This makes two changes to the ray.init() resolution order: 1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None. 2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases. This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init: ``` (Connecting to existing Ray cluster at address: <IP>...) ...connection... (Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>) ``` Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started. Co-authored-by: Eric Liang <[email protected]> Signed-off-by: Rohan138 <[email protected]>

…ay-project#26678) ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one. This makes two changes to the ray.init() resolution order: 1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None. 2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases. This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init: ``` (Connecting to existing Ray cluster at address: <IP>...) ...connection... (Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>) ``` Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started. Co-authored-by: Eric Liang <[email protected]> Signed-off-by: Stefan van der Kleij <[email protected]>

#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after #26678

…2961) With #26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected. Signed-off-by: Jiajun Yao <[email protected]>

…y-project#32961) With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: Jack He <[email protected]>

ray-project#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after ray-project#26678 Signed-off-by: Edward Oakes <[email protected]>

…y-project#32961) With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: Edward Oakes <[email protected]>

ray-project#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after ray-project#26678

…y-project#32961) With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected. Signed-off-by: Jiajun Yao <[email protected]>

ray-project#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after ray-project#26678 Signed-off-by: elliottower <[email protected]>

…y-project#32961) With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>

…y-project#32961) With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: Jack He <[email protected]>

ray.init

3aabade

Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang assigned ericl, pcmoritz and stephanie-wang and unassigned stephanie-wang Jul 18, 2022

ericl reviewed Jul 18, 2022

View reviewed changes

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 18, 2022

stephanie-wang added 2 commits July 19, 2022 12:49

Write default ray address

29790b0

Signed-off-by: Stephanie Wang <[email protected]>

doc

931c78c

Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang force-pushed the ray-init branch from 5ca32ef to 931c78c Compare July 19, 2022 19:49

test

810ed79

Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang force-pushed the ray-init branch from 64aea0f to 810ed79 Compare July 19, 2022 20:40

stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 19, 2022

ericl reviewed Jul 19, 2022

View reviewed changes

ericl added the core-interface-change-approval-required This changes the Ray core behavior / API and requires broader approvals. label Jul 19, 2022

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 19, 2022

Merge remote-tracking branch 'upstream/master' into ray-init

52c8e01

stephanie-wang requested review from edoakes, simon-mo, jiaodong, shrekris-anyscale, sihanwang41, wuisawesome, DmitriGekhtman and maxpumperla as code owners July 20, 2022 00:37

jbedorf mentioned this pull request Jul 26, 2022

[Core] Ray start/stop/start does not work when using a custom temporary folder #27021

Closed

suquark mentioned this pull request Jul 28, 2022

[CI][Windows] test_storage failing on Windows #27175

Closed

rkooo567 mentioned this pull request Jul 29, 2022

Logging prefix no longer showing #27220

Closed

jjyao mentioned this pull request Feb 24, 2023

Remove usage_lib.LibUsageRecorder #32806

Merged

7 tasks

ericl pushed a commit that referenced this pull request Feb 25, 2023

Remove usage_lib.LibUsageRecorder (#32806)

92ef37d

#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after #26678

jjyao mentioned this pull request Mar 2, 2023

[Core] Write ray address even if ray node is started with --block #32961

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[core] ray.init defaults to an existing Ray instance if there is one #26678

[core] ray.init defaults to an existing Ray instance if there is one #26678

stephanie-wang commented Jul 18, 2022 •

edited by ericl

Loading

ericl left a comment

stephanie-wang commented Jul 18, 2022

ericl commented Jul 18, 2022

stephanie-wang commented Jul 18, 2022

ericl left a comment

ericl commented Jul 19, 2022

stephanie-wang commented Jul 19, 2022

[core] ray.init defaults to an existing Ray instance if there is one #26678

[core] ray.init defaults to an existing Ray instance if there is one #26678

Conversation

stephanie-wang commented Jul 18, 2022 • edited by ericl Loading

Why are these changes needed?

Related issue number

Checks

ericl left a comment

Choose a reason for hiding this comment

stephanie-wang commented Jul 18, 2022

ericl commented Jul 18, 2022

stephanie-wang commented Jul 18, 2022

ericl left a comment

Choose a reason for hiding this comment

ericl commented Jul 19, 2022

stephanie-wang commented Jul 19, 2022

stephanie-wang commented Jul 18, 2022 •

edited by ericl

Loading