[Feature] Connect to RayCluster via GCS port rather than Ray client in compatibility test #848

Open
2 tasks done
kevin85421 opened this issue Dec 27, 2022 · 2 comments
Assignees
Labels
enhancement New feature or request P2 Important issue, but not time critical

Comments

@kevin85421
Member

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Based on my observation, the GCS FT bugs (#634 and #638) are easily triggered when we connect to RayCluster via the GCS port. Hence, we temporarily connect to RayCluster via Ray client. However, the Ray community plans to de-emphasize Ray client. Therefore, we need to update ray.init to connect via the GCS port once #634 and #638 are fixed. See #844 (comment) for more details.
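For illustration, the two connection modes differ only in the address passed to ray.init. The sketch below is not taken from the compatibility test: the helper name ray_address and its parameters are hypothetical, and the ports are assumed to be Ray's defaults (6379 for GCS, 10001 for the Ray client server).

```python
# Assumed default ports; a cluster may be configured differently.
GCS_PORT = 6379
RAY_CLIENT_PORT = 10001


def ray_address(host, use_ray_client):
    """Build the address string to pass to ray.init (hypothetical helper)."""
    if use_ray_client:
        # Ray client addresses use the ray:// scheme.
        return f"ray://{host}:{RAY_CLIENT_PORT}"
    # A plain host:port address connects through the GCS port.
    return f"{host}:{GCS_PORT}"


# Usage (requires Ray, so it is left commented out here):
# import ray
# ray.init(address=ray_address("127.0.0.1", use_ray_client=False))
```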

Use case

No response

Related issues

#634
#638
#844

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added the enhancement New feature or request label Dec 27, 2022
@kevin85421 kevin85421 self-assigned this Dec 27, 2022
@kevin85421 kevin85421 added this to the v0.5.0 release milestone Dec 27, 2022
@kevin85421 kevin85421 added the P2 Important issue, but not time critical label Dec 27, 2022
@kevin85421
Member Author

ray.init() is flaky because, when the cluster is not ready and it fails to connect, it shuts down the process with exit code 1 rather than throwing an exception.

@kevin85421
Member Author

To clarify, sys.exit(1) raises SystemExit, which a handler for BaseException can catch. However, when the process exits via QuickExit, no exception is thrown. As a result, the process terminates after the first call to ray.init() and never attempts a retry. By contrast, if the cluster is not ready, Ray client throws a Python exception.

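import time

def retry_with_timeout(func, timeout=90):
    # Call func repeatedly until it succeeds or timeout (seconds) elapses.
    err = None
    start = time.time()
    while time.time() - start <= timeout:
        try:
            return func()
        except BaseException as e:
            # BaseException also covers SystemExit raised by sys.exit().
            err = e
        finally:
            # Runs after every attempt, including a successful one.
            time.sleep(1)
    # Re-raise the last error once the timeout is exhausted.
    raise err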
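The difference between the two exit paths can be reproduced without Ray. In this minimal sketch, os._exit stands in for a hard exit from native code; that substitution is an assumption for illustration and is not Ray's actual code path. Each snippet runs in a child process so that the hard exit does not kill the caller. The names CATCHABLE, HARD_EXIT and run_child are hypothetical.

```python
import subprocess
import sys

# Child 1: sys.exit(1) raises SystemExit, so the except clause runs.
CATCHABLE = """
import sys
try:
    sys.exit(1)
except BaseException as e:
    print("caught", type(e).__name__)
"""

# Child 2: os._exit(1) ends the process at once; no exception is raised,
# so the except clause never runs.
HARD_EXIT = """
import os
try:
    os._exit(1)
except BaseException as e:
    print("caught", type(e).__name__)
"""


def run_child(source):
    """Run source in a fresh interpreter; return (exit code, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", source], capture_output=True, text=True
    )
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    print(run_child(CATCHABLE))
    print(run_child(HARD_EXIT))
```

In the first child the handler swallows SystemExit and the process exits with code 0. In the second the process exits with code 1 and prints nothing, which is why a retry loop in the same process never gets a second attempt.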

  • Logs for ray.init failure.
    2023-08-26 11:38:30,134	INFO worker.py:1330 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
    2023-08-26 11:38:30,135	INFO worker.py:1459 -- Connecting to existing Ray cluster at address: 10.244.0.7:6379...
    2023-08-26 11:38:30,203	INFO worker.py:1640 -- Connected to Ray cluster. View the dashboard at http://10.244.0.7:8265
    [2023-08-26 11:38:39,210 E 244 258] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:  .Please see `raylet.out` for more details.
    command terminated with exit code 1
    

Ray may need to throw exceptions consistently across the different methods of connecting to a cluster.
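Until the behavior is consistent, one possible way for a test to tolerate a hard exit is to run each connection attempt in its own child process and retry on a non-zero exit code. This is only a sketch under that assumption, not what the compatibility test does. The function name retry_in_subprocess and its parameters are hypothetical.

```python
import subprocess
import sys
import time


def retry_in_subprocess(argv, timeout=90, interval=1):
    """Run argv repeatedly until it exits with code 0 or timeout elapses."""
    start = time.time()
    last_code = None
    while time.time() - start <= timeout:
        # A hard exit in the child only produces a non-zero return code
        # here; the parent process survives and can try again.
        last_code = subprocess.run(argv).returncode
        if last_code == 0:
            return last_code
        time.sleep(interval)
    raise TimeoutError(f"command kept failing, last exit code: {last_code}")


if __name__ == "__main__":
    # Placeholder command; a real test would run its connection script.
    retry_in_subprocess([sys.executable, "-c", "pass"], timeout=5)
```

The trade-off is that each attempt pays the cost of starting a new interpreter, and any state created by a successful attempt lives in the child rather than in the test process.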
