Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] ray.init defaults to an existing Ray instance if there is one #26678

Merged
merged 32 commits into from
Jul 23, 2022

Conversation

stephanie-wang
Copy link
Contributor

@stephanie-wang stephanie-wang commented Jul 18, 2022

Signed-off-by: Stephanie Wang [email protected]

Why are these changes needed?

ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through ray start. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:

  1. When ray start is called, the started cluster address was already written to a file called /tmp/ray/ray_current_cluster. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on ray stop. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None.
  2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for ray.client use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:

(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)

Note that this changes the dashboard URL to be printed with ray.init() instead of when the dashboard is first started.

Related issue number

Closes #26329.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <[email protected]>
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it's a bit sketchy to run the existing autodetect logic by default, since it's effectively grepping ps aux output. The two issues I see are:

  1. It's possible for there to be a false positive.
  2. If a cluster died, then it wouldn't autodetect it correctly (false negative).

How about the following instead?

  • ray start writes the created $RAY_ADDRESS to /tmp/ray/default_ray_address
  • we check /tmp/ray/default_ray_address if $RAY_ADDRESS isn't defined
  • ray stop deletes $RAY_ADDRESS

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 18, 2022
@stephanie-wang
Copy link
Contributor Author

IMO, it's a bit sketchy to run the existing autodetect logic by default, since it's effectively grepping ps aux output. The two issues I see are:

1. It's possible for there to be a false positive.

2. If a cluster died, then it wouldn't autodetect it correctly (false negative).

How about the following instead?

* `ray start` writes the created $RAY_ADDRESS to `/tmp/ray/default_ray_address`

* we check `/tmp/ray/default_ray_address` if $RAY_ADDRESS isn't defined

* `ray stop` deletes $RAY_ADDRESS

If there are these issues, then shouldn't we just fix autodetect to use the default_ray_address? It doesn't seem great to have all these different ways of finding which Ray address to connect to.

Also, this seems like it will work to fix (2) since we can throw an error if RAY_ADDRESS is unreachable. But it could also cause problems if Ray crashes (so no one calls ray stop), and then we later want to call ray.init() to start a new local instance. The alternative is to just log a louder warning if there is no instance found, which isn't much different from the current implementation.

@ericl
Copy link
Contributor

ericl commented Jul 18, 2022

If there are these issues, then shouldn't we just fix autodetect to use the default_ray_address? It doesn't seem great to have all these different ways of finding which Ray address to connect to.

Yes, I think this would be ideal.

Also, this seems like it will work to fix (2) since we can throw an error if RAY_ADDRESS is unreachable. But it could also cause problems if Ray crashes (so no one calls ray stop), and then we later want to call ray.init() to start a new local instance.

You could tell the user to run "ray stop". The main thing is avoiding the scenario where because Ray crashed, we silently start a new Ray instance, which is actually not the real Ray cluster (prefer to raise an error rather than do something confusing).

The alternative is to just log a louder warning if there is no instance found, which isn't much different from the current implementation.

This could also work, but for this to work we'd need to still implement the /tmp/ray/default_address feature right? Otherwise, there is no way to tell if there is no instance found due to a Ray failure vs just running on a laptop. We shouldn't log a warning in the laptop case, since those users will never want to connect to an existing Ray cluster.

@stephanie-wang
Copy link
Contributor Author

Hmm okay, that sounds good. So let me update this PR to also update the autodetect implementation.

Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
@stephanie-wang stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 19, 2022
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried this out again. I think this will be a big improvement. A couple more usability suggestions on the log messages:

  1. When starting a new cluster, we now print two lines instead of one, but the second address is not really useful for local clusters. How about we combine them?
>>> ray.init()
View the Ray dashboard at http://127.0.0.1:8265
Started a new local Ray instance at address: 192.168.5.130:61720

For example:
Started a new Ray cluster with dashboard at: http://127.0.0.1:8265

  1. Should we have a period or "..." at the end of the connecting message? Like

Connecting to existing Ray cluster at address: 192.168.5.130:6379., or
Connecting to existing Ray cluster at address: 192.168.5.130:6379....

The current log message seems a little abrupt.

@ericl ericl added the core-interface-change-approval-required This changes the Ray core behavior / API and requires broader approvals. label Jul 19, 2022
@ericl
Copy link
Contributor

ericl commented Jul 19, 2022

Btw, do we need to update any docs?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 19, 2022
@stephanie-wang
Copy link
Contributor Author

Just tried this out again. I think this will be a big improvement. A couple more usability suggestions on the log messages:

1. When starting a new cluster, we now print two lines instead of one, but the second address is not really useful for local clusters. How about we combine them?
>>> ray.init()
View the Ray dashboard at http://127.0.0.1:8265
Started a new local Ray instance at address: 192.168.5.130:61720

Ah I see, I actually don't get the first message because I don't have the dashboard built. But yeah, I'll remove the new log message.


For example: `Started a new Ray cluster with dashboard at: http://127.0.0.1:8265`

    2. Should we have a period or "..." at the end of the connecting message? Like


`Connecting to existing Ray cluster at address: 192.168.5.130:6379.`, or `Connecting to existing Ray cluster at address: 192.168.5.130:6379...`.

The current log message seems a little abrupt.

Good idea!

Btw, do we need to update any docs?

Hmm let me check.

Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
…ay-project#26678)

ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:
1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None.
2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:
```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)
```

Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.

Co-authored-by: Eric Liang <[email protected]>
Signed-off-by: Rohan138 <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…ay-project#26678)

ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:
1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None.
2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:
```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)
```

Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.

Co-authored-by: Eric Liang <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
@jjyao jjyao mentioned this pull request Feb 24, 2023
7 tasks
ericl pushed a commit that referenced this pull request Feb 25, 2023
#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after #26678
jjyao added a commit that referenced this pull request Mar 2, 2023
…2961)

With #26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected.

Signed-off-by: Jiajun Yao <[email protected]>
jjyao added a commit that referenced this pull request Mar 2, 2023
…2961)

With #26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected.

Signed-off-by: Jiajun Yao <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
…y-project#32961)

With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected.

Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jack He <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
ray-project#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after ray-project#26678

Signed-off-by: Edward Oakes <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…y-project#32961)

With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected.

Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
ray-project#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after ray-project#26678
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
…y-project#32961)

With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected.

Signed-off-by: Jiajun Yao <[email protected]>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ray-project#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after ray-project#26678

Signed-off-by: elliottower <[email protected]>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…y-project#32961)

With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected.

Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: elliottower <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…y-project#32961)

With ray-project#26678, when a ray cluster is started, it's address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However if a node is started with ray start --block, the file is not created so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected.

Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jack He <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] Improve usability for ray.init when there is an existing cluster
8 participants