[doc] Add a session in ray core doc for tips to run large ray cluster. #30599
- `RAY_health_check_initial_delay_ms` The delay of the first health check, 5000ms by default.
- `RAY_health_check_period_ms` The interval between two health check requests, 3000ms by default.
- `RAY_health_check_timeout_ms` The timeout to consider a health check failed, 10000ms by default.
- `RAY_health_check_failure_threshold` The consecutive failure threshold for the
Since we switched to pull-based health checking, do we need to mention this? It seems misleading if it isn't an issue in practice now.
I think as long as the raylet is not overloaded, it's OK. (Open to removing this for now.)
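For readers who do want to tune the health-check settings quoted above, here is a minimal sketch of collecting them as environment-variable overrides. The helper name `health_check_env` is hypothetical and not part of Ray. The defaults mirror the values in the diff, and the failure threshold is left unset unless supplied because its default is cut off in the quoted text.

```python
import os


def health_check_env(initial_delay_ms=5000, period_ms=3000,
                     timeout_ms=10000, failure_threshold=None):
    """Build RAY_* overrides for the health-check settings quoted in the diff.

    Hypothetical helper for illustration; defaults are the ones the doc lists.
    """
    env = {
        "RAY_health_check_initial_delay_ms": str(initial_delay_ms),
        "RAY_health_check_period_ms": str(period_ms),
        "RAY_health_check_timeout_ms": str(timeout_ms),
    }
    # The default threshold is truncated in the quoted diff, so it is only
    # set when the caller provides a value explicitly.
    if failure_threshold is not None:
        env["RAY_health_check_failure_threshold"] = str(failure_threshold)
    return env


# Example: lengthen only the interval between health checks.
overrides = health_check_env(period_ms=5000)
# Environment variables must be strings, hence the str() conversions above.
launch_env = {**os.environ, **overrides}
```

The resulting `launch_env` would then be passed to whatever process starts the head node. This assumes the overrides need to be visible to the GCS process.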
- `RAY_resource_broadcast_batch_size` The maximum number of nodes in a single
  request sent by GCS, by default 512.
- `RAY_raylet_report_resources_period_milliseconds` The interval between two
Same question here, I thought this is no longer an issue with the pull model?
Resource broadcasting is still push-based for now; the pull-based version is WIP.
We need #30460 merged and fully tested first.
The new plan is to make it experimental in 2.3 and turn it on by default in 2.4.
I think we can keep this here for now so that users on Ray 2.2 can benefit from it. Once that PR is merged and passes the benchmark, we can update this to enable pull-based broadcasting.
After 2.4, we can probably delete this.
I think the point of this part is to give users some guidance on running Ray at large scale and reporting any unknown issues they find.
I see. Why don't we file a GitHub issue with planned scalability improvements, and link to it from the doc? That way there is some more communication about which of these will be improved/unneeded in the future.
One issue with documenting these at all is that they show up in search randomly, and users might not realize it doesn't apply to future Ray versions.
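To give a feel for what the batch size discussed in this thread implies at scale, here is a rough arithmetic sketch. It assumes, as the quoted description suggests, that one request from GCS covers at most `RAY_resource_broadcast_batch_size` nodes, so the number of requests per broadcast round is a ceiling division. The function name is hypothetical, and this is not a description of the actual GCS implementation.

```python
import math


def broadcast_requests(num_nodes, batch_size=512):
    """Estimate requests needed to cover num_nodes at batch_size nodes each.

    Assumption: each request carries at most batch_size nodes (512 is the
    default quoted in the diff).
    """
    if num_nodes < 0 or batch_size <= 0:
        raise ValueError("num_nodes must be >= 0 and batch_size must be > 0")
    return math.ceil(num_nodes / batch_size)


# With the default of 512, a 2000-node cluster needs 4 requests per round;
# halving the batch size doubles that to 8.
requests_default = broadcast_requests(2000)
requests_smaller = broadcast_requests(2000, batch_size=256)
```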
- `RAY_gcs_server_rpc_server_thread_num` Control the number of threads in GCS
  polling from the server completion queue, by default, 1.
- `RAY_gcs_server_rpc_client_thread_num` Control the number of threads in GCS
Any reason we haven't set these higher by default?
I think this previously broke some release tests. We are re-enabling it in this PR (https://github.com/ray-project/ray/pull/30131/files#diff-66e8718c71d8ec5383a5cc83b0ade4d1aa8bad17d7ced811928bb666b4b63165).
But for users on Ray 2.2, it won't be enabled. We can delete this later once that PR is merged and runs well.
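As an illustration of how a Ray 2.2 user might apply the GCS thread settings by hand, the sketch below builds the two overrides and prepares a `ray start --head` invocation with them in the environment. The helper names `gcs_thread_env` and `start_head` are hypothetical. The client thread count is optional because its default is cut off in the quoted diff, and the thread values in the example are arbitrary, not recommendations.

```python
import os
import subprocess


def gcs_thread_env(server_threads=1, client_threads=None):
    """Build RAY_* overrides for the GCS RPC thread counts (hypothetical helper)."""
    for name, value in (("server_threads", server_threads),
                        ("client_threads", client_threads)):
        if value is not None and value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    env = {"RAY_gcs_server_rpc_server_thread_num": str(server_threads)}
    # Only set the client thread count when the caller chooses one.
    if client_threads is not None:
        env["RAY_gcs_server_rpc_client_thread_num"] = str(client_threads)
    return env


def start_head(overrides, dry_run=True):
    """Prepare (and optionally run) `ray start --head` with the overrides applied."""
    cmd = ["ray", "start", "--head"]
    env = {**os.environ, **overrides}
    if not dry_run:
        # Requires Ray to be installed; the GCS process inherits this environment.
        subprocess.run(cmd, env=env, check=True)
    return cmd, env


overrides = gcs_thread_env(server_threads=4, client_threads=2)
cmd, env = start_head(overrides)  # dry run: nothing is launched
```

Keeping `dry_run=True` as the default makes it easy to inspect the command and environment before actually starting a head node.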
@ericl I think the goal of this doc is to help users on Ray 2.2 run large-scale Ray clusters if they want to, report the issues they find, and give feedback (they might not want to use the master build). As work gets merged, we'll update this doc accordingly. It's also hidden very deep, so regular users will not see it. What do you think?
I think it's okay, but another option to consider is writing an umbrella GitHub issue and linking out to it for some of the stuff that will change in the future (e.g., the vars that won't be needed in 2.3).
SG. Let me update the doc.
ray-project#30599) To run a large ray cluster, some parameters have to be tuned and also some OS configs have to be set. It requires a lot of experience for the users to figure out everything. This PR add a session for this to help the user setup their large scale ray cluster. Co-authored-by: Eric Liang <[email protected]> Signed-off-by: Weichen Xu <[email protected]>
… cluster. (ray-project#30599)" (ray-project#30718) Reverts ray-project#30599 Breaks the bk://book Documentation build. Signed-off-by: Weichen Xu <[email protected]>
…arge ray cluster."" (ray-project#30722) * Revert "Revert "[doc] Add a session in ray core doc for tips to run large ray cluster. (ray-project#30599)" (ray-project#30718)" This reverts commit 373c30c. Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
To run a large Ray cluster, some parameters have to be tuned and some OS configs have to be set. It takes a lot of experience for users to figure all of this out.
This PR adds a section to help users set up their large-scale Ray clusters.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.