[doc] Add a session in ray core doc for tips to run large ray cluster. #30599

fishbone · 2022-11-22T22:52:41Z

Why are these changes needed?

To run a large ray cluster, some parameters have to be tuned and also some OS configs have to be set. It requires a lot of experience for the users to figure out everything.
This PR add a session for this to help the user setup their large scale ray cluster.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

ericl · 2022-11-23T03:59:05Z

doc/source/ray-core/miscellaneous.rst

+- `RAY_health_check_initial_delay_ms` The delay of the first health check, 5000ms by default.
+- `RAY_health_check_period_ms` The interval between two health check requests, 3000ms by default.
+- `RAY_health_check_timeout_ms` The timeout to consider a health check failed, 10000ms by default.
+- `RAY_health_check_failure_threshold` The consecutive failure threshold for the


Since we switched to pull-based health checking, do we need to mention this? It seems misleading if it isn't an issue in practice now.

I think as long as raylet is not overloaded, it's ok. (open to remote this for now)

ericl · 2022-11-23T03:59:19Z

doc/source/ray-core/miscellaneous.rst

+
+- `RAY_resource_broadcast_batch_size` The maximum number of nodes in a single
+  request sent by GCS, by default 512.
+- `RAY_raylet_report_resources_period_milliseconds` The interval between two


Same question here, I thought this is no longer an issue with the pull model?

For resource broadcasting, it's still push mode now. The pull based is wip.
We need this #30460 merged and do complete tests.
The new plan is to make it experimental in 2.3 and turn it on in 2.4 by default.

I think maybe we can keep it here for now so that users who is using ray 2.2 can benefit this. Once we have that PR merged, and pass the benchmark, we can update this to enable pull based broadcasting.

After 2.4, we maybe can delete this.

I think this part is to give the user some instruction to run ray in large scale and report the unknown issues found.

I see. Why don't we file a GitHub issue with planned scalability improvements, and link to it from the doc? That way there is some more communication about which of these will be improved/unneeded in the future.

One issue with documenting these at all is that they show up in search randomly, and users might not realize it doesn't apply to future Ray versions.

ericl · 2022-11-23T03:59:45Z

doc/source/ray-core/miscellaneous.rst

+
+- `RAY_gcs_server_rpc_server_thread_num` Control the number of threads in GCS
+  polling from the server completion queue, by default, 1. 
+- `RAY_gcs_server_rpc_client_thread_num` Control the number of threads in GCS


Any reason we haven't set these higher by default?

I think we previously got some release test broken. We are reenabling it in this PR (https://github.com/ray-project/ray/pull/30131/files#diff-66e8718c71d8ec5383a5cc83b0ade4d1aa8bad17d7ced811928bb666b4b63165)

But for users who are using ray 2.2, it won't be enabled. We can delete it later once it's merged and runs well.

fishbone · 2022-11-23T07:16:30Z

@ericl I think the goal of this doc is to help the users who are using ray 2.2 able to run large scale ray cluster if they want and report the issues found and collect feedbacks. (they might not want to use the master build).

I think as work bing merged, we'll just update this doc accordingly. It's also in hidden very deep, so regular users will not see this.

How do you think?

ericl · 2022-11-23T07:26:03Z

I think it's okay, but another option to consider is writing an umbrella GitHub issue and link out to it for some of the stuff that will change in the future (e.g., the vars that won't be needed in 2.3).

fishbone · 2022-11-23T16:58:48Z

SG. Let me update the doc.

… cluster. (#30599)" This reverts commit ba4af8e.

ray-project#30599) To run a large ray cluster, some parameters have to be tuned and also some OS configs have to be set. It requires a lot of experience for the users to figure out everything. This PR add a session for this to help the user setup their large scale ray cluster. Co-authored-by: Eric Liang <[email protected]> Signed-off-by: Weichen Xu <[email protected]>

… cluster. (ray-project#30599)" (ray-project#30718) Reverts ray-project#30599 Breaks the bk://book Documentation build. Signed-off-by: Weichen Xu <[email protected]>

…arge ray cluster."" (ray-project#30722) * Revert "Revert "[doc] Add a session in ray core doc for tips to run large ray cluster. (ray-project#30599)" (ray-project#30718)" This reverts commit 373c30c. Signed-off-by: Weichen Xu <[email protected]>

… cluster. (ray-project#30599)" (ray-project#30718) Reverts ray-project#30599 Breaks the bk://book Documentation build. Signed-off-by: Capiru <[email protected]>

…arge ray cluster."" (ray-project#30722) * Revert "Revert "[doc] Add a session in ray core doc for tips to run large ray cluster. (ray-project#30599)" (ray-project#30718)" This reverts commit 373c30c. Signed-off-by: Capiru <[email protected]>

ray-project#30599) To run a large ray cluster, some parameters have to be tuned and also some OS configs have to be set. It requires a lot of experience for the users to figure out everything. This PR add a session for this to help the user setup their large scale ray cluster. Co-authored-by: Eric Liang <[email protected]> Signed-off-by: tmynn <[email protected]>

… cluster. (ray-project#30599)" (ray-project#30718) Reverts ray-project#30599 Breaks the bk://book Documentation build. Signed-off-by: tmynn <[email protected]>

…arge ray cluster."" (ray-project#30722) * Revert "Revert "[doc] Add a session in ray core doc for tips to run large ray cluster. (ray-project#30599)" (ray-project#30718)" This reverts commit 373c30c. Signed-off-by: tmynn <[email protected]>

fishbone added 2 commits November 22, 2022 22:52

up

e584777

fix

14592f8

fishbone changed the title up [doc] Add a session in ray core doc for tips to run large ray cluster. Nov 23, 2022

fishbone marked this pull request as ready for review November 23, 2022 00:44

fishbone requested a review from a team as a code owner November 23, 2022 00:44

fishbone assigned scv119 Nov 23, 2022

fix title

df60065

fishbone assigned ericl, stephanie-wang and richardliaw and unassigned ericl and richardliaw Nov 23, 2022

Update miscellaneous.rst

11e7a38

ericl reviewed Nov 23, 2022

View reviewed changes

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 23, 2022

fishbone added 6 commits November 23, 2022 13:52

up

e1a6dc1

fix

85dd3ee

fix typo

8cfde5c

fix

16c5b1c

up

dc3d2cd

fix

3c9b3f9

fishbone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 24, 2022

ericl approved these changes Nov 28, 2022

View reviewed changes

ericl merged commit ba4af8e into ray-project:master Nov 28, 2022

stephanie-wang added a commit that referenced this pull request Nov 29, 2022

Revert "[doc] Add a session in ray core doc for tips to run large ray…

66bc2a9

… cluster. (#30599)" This reverts commit ba4af8e.

stephanie-wang mentioned this pull request Nov 29, 2022

Revert "[doc] Add a session in ray core doc for tips to run large ray cluster." #30718

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[doc] Add a session in ray core doc for tips to run large ray cluster. #30599

[doc] Add a session in ray core doc for tips to run large ray cluster. #30599

fishbone commented Nov 22, 2022 •

edited

Loading

ericl Nov 23, 2022

fishbone Nov 23, 2022

ericl Nov 23, 2022

fishbone Nov 23, 2022

ericl Nov 23, 2022

ericl Nov 23, 2022

fishbone Nov 23, 2022

fishbone commented Nov 23, 2022

ericl commented Nov 23, 2022

fishbone commented Nov 23, 2022

[doc] Add a session in ray core doc for tips to run large ray cluster. #30599

[doc] Add a session in ray core doc for tips to run large ray cluster. #30599

Conversation

fishbone commented Nov 22, 2022 • edited Loading

Why are these changes needed?

Related issue number

Checks

ericl Nov 23, 2022

Choose a reason for hiding this comment

fishbone Nov 23, 2022

Choose a reason for hiding this comment

ericl Nov 23, 2022

Choose a reason for hiding this comment

fishbone Nov 23, 2022

Choose a reason for hiding this comment

ericl Nov 23, 2022

Choose a reason for hiding this comment

ericl Nov 23, 2022

Choose a reason for hiding this comment

fishbone Nov 23, 2022

Choose a reason for hiding this comment

fishbone commented Nov 23, 2022

ericl commented Nov 23, 2022

fishbone commented Nov 23, 2022

fishbone commented Nov 22, 2022 •

edited

Loading