Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[doc] Add a session in ray core doc for tips to run large ray cluster. #30599

Merged
merged 10 commits into from
Nov 28, 2022

Conversation

fishbone
Copy link
Contributor

@fishbone fishbone commented Nov 22, 2022

Why are these changes needed?

To run a large ray cluster, some parameters have to be tuned and also some OS configs have to be set. It requires a lot of experience for the users to figure out everything.
This PR add a session for this to help the user setup their large scale ray cluster.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone changed the title up [doc] Add a session in ray core doc for tips to run large ray cluster. Nov 23, 2022
@fishbone fishbone marked this pull request as ready for review November 23, 2022 00:44
@fishbone fishbone requested a review from a team as a code owner November 23, 2022 00:44
- `RAY_health_check_initial_delay_ms` The delay of the first health check, 5000ms by default.
- `RAY_health_check_period_ms` The interval between two health check requests, 3000ms by default.
- `RAY_health_check_timeout_ms` The timeout to consider a health check failed, 10000ms by default.
- `RAY_health_check_failure_threshold` The consecutive failure threshold for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we switched to pull-based health checking, do we need to mention this? It seems misleading if it isn't an issue in practice now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as long as raylet is not overloaded, it's ok. (open to remote this for now)


- `RAY_resource_broadcast_batch_size` The maximum number of nodes in a single
request sent by GCS, by default 512.
- `RAY_raylet_report_resources_period_milliseconds` The interval between two
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here, I thought this is no longer an issue with the pull model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For resource broadcasting, it's still push mode now. The pull based is wip.
We need this #30460 merged and do complete tests.
The new plan is to make it experimental in 2.3 and turn it on in 2.4 by default.

I think maybe we can keep it here for now so that users who is using ray 2.2 can benefit this. Once we have that PR merged, and pass the benchmark, we can update this to enable pull based broadcasting.

After 2.4, we maybe can delete this.

I think this part is to give the user some instruction to run ray in large scale and report the unknown issues found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Why don't we file a GitHub issue with planned scalability improvements, and link to it from the doc? That way there is some more communication about which of these will be improved/unneeded in the future.

One issue with documenting these at all is that they show up in search randomly, and users might not realize it doesn't apply to future Ray versions.


- `RAY_gcs_server_rpc_server_thread_num` Control the number of threads in GCS
polling from the server completion queue, by default, 1.
- `RAY_gcs_server_rpc_client_thread_num` Control the number of threads in GCS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we haven't set these higher by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we previously got some release test broken. We are reenabling it in this PR (https://github.com/ray-project/ray/pull/30131/files#diff-66e8718c71d8ec5383a5cc83b0ade4d1aa8bad17d7ced811928bb666b4b63165)

But for users who are using ray 2.2, it won't be enabled. We can delete it later once it's merged and runs well.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 23, 2022
@fishbone
Copy link
Contributor Author

@ericl I think the goal of this doc is to help the users who are using ray 2.2 able to run large scale ray cluster if they want and report the issues found and collect feedbacks. (they might not want to use the master build).

I think as work bing merged, we'll just update this doc accordingly. It's also in hidden very deep, so regular users will not see this.

How do you think?

@ericl
Copy link
Contributor

ericl commented Nov 23, 2022

I think it's okay, but another option to consider is writing an umbrella GitHub issue and link out to it for some of the stuff that will change in the future (e.g., the vars that won't be needed in 2.3).

@fishbone
Copy link
Contributor Author

SG. Let me update the doc.

@fishbone fishbone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 24, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
ray-project#30599)

To run a large ray cluster, some parameters have to be tuned and also some OS configs have to be set. It requires a lot of experience for the users to figure out everything.
This PR add a session for this to help the user setup their large scale ray cluster.

Co-authored-by: Eric Liang <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
… cluster. (ray-project#30599)" (ray-project#30718)

Reverts ray-project#30599

Breaks the bk://book Documentation build.

Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…arge ray cluster."" (ray-project#30722)

* Revert "Revert "[doc] Add a session in ray core doc for tips to run large ray cluster. (ray-project#30599)" (ray-project#30718)"

This reverts commit 373c30c.

Signed-off-by: Weichen Xu <[email protected]>
Capiru pushed a commit to Capiru/ray that referenced this pull request Dec 22, 2022
… cluster. (ray-project#30599)" (ray-project#30718)

Reverts ray-project#30599

Breaks the bk://book Documentation build.

Signed-off-by: Capiru <[email protected]>
Capiru pushed a commit to Capiru/ray that referenced this pull request Dec 22, 2022
…arge ray cluster."" (ray-project#30722)

* Revert "Revert "[doc] Add a session in ray core doc for tips to run large ray cluster. (ray-project#30599)" (ray-project#30718)"

This reverts commit 373c30c.

Signed-off-by: Capiru <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
ray-project#30599)

To run a large ray cluster, some parameters have to be tuned and also some OS configs have to be set. It requires a lot of experience for the users to figure out everything.
This PR add a session for this to help the user setup their large scale ray cluster.

Co-authored-by: Eric Liang <[email protected]>
Signed-off-by: tmynn <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
… cluster. (ray-project#30599)" (ray-project#30718)

Reverts ray-project#30599

Breaks the bk://book Documentation build.

Signed-off-by: tmynn <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…arge ray cluster."" (ray-project#30722)

* Revert "Revert "[doc] Add a session in ray core doc for tips to run large ray cluster. (ray-project#30599)" (ray-project#30718)"

This reverts commit 373c30c.


Signed-off-by: tmynn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants