[core] ray.init defaults to an existing Ray instance if there is one #26678
Conversation
Signed-off-by: Stephanie Wang <[email protected]>
IMO, it's a bit sketchy to run the existing autodetect logic by default, since it's effectively grepping ps aux output. The two issues I see are:
- It's possible for there to be a false positive.
- If a cluster died, then it wouldn't autodetect it correctly (false negative).
How about the following instead?
- `ray start` writes the created $RAY_ADDRESS to /tmp/ray/default_ray_address
- we check /tmp/ray/default_ray_address if $RAY_ADDRESS isn't defined
- `ray stop` deletes $RAY_ADDRESS
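A minimal sketch of this proposed flow, for illustration only: the path `/tmp/ray/default_ray_address` and the `$RAY_ADDRESS` variable come from the comment above, but the function names are hypothetical and this is not Ray's actual implementation.

```python
import os
from typing import Optional

ADDRESS_FILE = "/tmp/ray/default_ray_address"

def write_address_on_start(address: str) -> None:
    """`ray start` would record the created cluster address."""
    os.makedirs(os.path.dirname(ADDRESS_FILE), exist_ok=True)
    with open(ADDRESS_FILE, "w") as f:
        f.write(address)

def resolve_address() -> Optional[str]:
    """ray.init() would use $RAY_ADDRESS if set, else the recorded file."""
    env = os.environ.get("RAY_ADDRESS")
    if env:
        return env
    if os.path.exists(ADDRESS_FILE):
        with open(ADDRESS_FILE) as f:
            recorded = f.read().strip()
        if recorded:
            return recorded
    return None  # no known cluster; the caller may start a fresh one

def delete_address_on_stop() -> None:
    """`ray stop` would forget the recorded address."""
    try:
        os.remove(ADDRESS_FILE)
    except FileNotFoundError:
        pass
```

Note that the false-negative concern above still applies in this sketch: if the cluster crashes without `ray stop`, the stale file would make `resolve_address()` return an unreachable address, which is why the discussion below prefers raising an error over silently starting a new instance.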
If there are these issues, then shouldn't we just fix autodetect to use the default_ray_address? It doesn't seem great to have all these different ways of finding which Ray address to connect to. Also, this seems like it will work to fix (2), since we can throw an error if RAY_ADDRESS is unreachable. But it could also cause problems if Ray crashes (so no one calls …)
Yes, I think this would be ideal.
You could tell the user to run "ray stop". The main thing is avoiding the scenario where because Ray crashed, we silently start a new Ray instance, which is actually not the real Ray cluster (prefer to raise an error rather than do something confusing).
This could also work, but for this to work we'd need to still implement the /tmp/ray/default_address feature, right? Otherwise, there is no way to tell if there is no instance found due to a Ray failure vs just running on a laptop. We shouldn't log a warning in the laptop case, since those users will never want to connect to an existing Ray cluster.
Hmm okay, that sounds good. So let me update this PR to also update the autodetect implementation.
Just tried this out again. I think this will be a big improvement. A couple more usability suggestions on the log messages:
- When starting a new cluster, we now print two lines instead of one, but the second address is not really useful for local clusters. How about we combine them? Currently:

```
>>> ray.init()
View the Ray dashboard at http://127.0.0.1:8265
Started a new local Ray instance at address: 192.168.5.130:61720
```

For example:

```
Started a new Ray cluster with dashboard at: http://127.0.0.1:8265
```
- Should we have a period or "..." at the end of the connecting message? Like

```
Connecting to existing Ray cluster at address: 192.168.5.130:6379.
```

or

```
Connecting to existing Ray cluster at address: 192.168.5.130:6379...
```

The current log message seems a little abrupt.
Btw, do we need to update any docs?
Ah I see, I actually don't get the first message because I don't have the dashboard built. But yeah, I'll remove the new log message.
Good idea!
Hmm, let me check.
…ay-project#26678)

ray.init() will currently start a new Ray instance even if one already exists, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:
1. When `ray start` is called, the started cluster address is written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, we autodetect any running cluster (legacy behavior) if address="auto", or start a new local Ray instance if address=None.
2. When ray.init(address="local") is called, we create a new local Ray instance, even if one already exists. This behavior seems to be necessary mainly for `ray.client` use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the logger until after connecting to Ray. So now Ray will log one of the following messages during ray.init:

```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster. | Connected to Ray cluster.)( View the dashboard at <URL>)
```

Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.

Co-authored-by: Eric Liang <[email protected]>
Signed-off-by: Rohan138 <[email protected]>
…2961) With #26678, when a Ray cluster is started, its address is written to /tmp/ray/ray_current_cluster so ray.init() can find the existing cluster and connect to it by default. However, if a node is started with `ray start --block`, the file is not created, so ray.init() will create a new cluster instead of connecting to the existing one, which is unexpected. Signed-off-by: Jiajun Yao <[email protected]>
ray-project#25842 is not needed since we will no longer accidentally create a new cluster while an existing one is running after ray-project#26678. Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Why are these changes needed?

ray.init() will currently start a new Ray instance even if one already exists, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:
1. When `ray start` is called, the started cluster address is written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, we autodetect any running cluster (legacy behavior) if address="auto", or start a new local Ray instance if address=None.
2. When ray.init(address="local") is called, we create a new local Ray instance, even if one already exists. This behavior seems to be necessary mainly for `ray.client` use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the logger until after connecting to Ray. So now Ray will log one of the following messages during ray.init:

```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster. | Connected to Ray cluster.)( View the dashboard at <URL>)
```
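The resolution order above could be sketched roughly as follows. This is a hedged illustration, not Ray's actual code: the helper names are hypothetical, only the `/tmp/ray/ray_current_cluster` path and the address values come from the PR description, and it deliberately ignores details such as the RAY_ADDRESS environment variable.

```python
import os
from typing import Optional

CLUSTER_ADDRESS_FILE = "/tmp/ray/ray_current_cluster"

def read_recorded_address() -> Optional[str]:
    # `ray start` writes the cluster address here; `ray stop` deletes it.
    if os.path.exists(CLUSTER_ADDRESS_FILE):
        with open(CLUSTER_ADDRESS_FILE) as f:
            addr = f.read().strip()
        if addr:
            return addr
    return None

def resolve_init_address(address: Optional[str]) -> str:
    """Return a description of which cluster ray.init() would use."""
    if address == "local":
        # Always start a fresh local instance, even if one exists.
        return "start new local instance"
    recorded = read_recorded_address()
    if recorded is not None:
        # Both address=None and address="auto" prefer the recorded cluster.
        return f"connect to existing cluster at {recorded}"
    if address == "auto":
        # Nothing recorded: fall back to legacy autodetection.
        return "autodetect running cluster (legacy)"
    # address=None and nothing recorded: start a new local instance.
    return "start new local instance"
```

The key behavioral change is the `address=None` branch: before this PR it unconditionally started a new instance, whereas here it first consults the recorded cluster address.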
Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.

Related issue number
Closes #26329.
Checks
- Ran `scripts/format.sh` to lint the changes in this PR.