[doc] Add a section to the Ray Core doc with tips for running a large Ray cluster. (ray-project#30599)

To run a large Ray cluster, some parameters have to be tuned and some OS settings have to be changed, and it takes a lot of experience for users to figure all of this out.
This PR adds a section to help users set up their large-scale Ray clusters.

Co-authored-by: Eric Liang <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
2 people authored and WeichenXu123 committed Dec 19, 2022
1 parent 3abc6fa commit 5cb4399
Showing 2 changed files with 137 additions and 0 deletions.
130 changes: 130 additions & 0 deletions doc/source/ray-core/miscellaneous.rst
@@ -216,3 +216,133 @@ To get information about the current available resource capacity of your cluster
.. autofunction:: ray.available_resources
:noindex:
Running Large Ray Clusters
--------------------------
Here are some tips for running Ray with more than 1k nodes. At that scale, several
system settings may need to be tuned to enable communication between so many machines.
Tuning Operating System Settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because all nodes and workers connect to the GCS, many network connections are
created, and the operating system has to support that number of connections.
Maximum open files
******************
The OS has to be configured to support opening many TCP connections, since every
worker and raylet connects to the GCS. On POSIX systems, the current limit can
be checked with ``ulimit -n``; if it is small, it should be increased according to
the OS manual.
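As a rough sketch (the specific value below is the one used in the benchmark at
the end of this section and is only illustrative), the limit can be checked and
raised for the current shell before starting Ray:

.. code-block:: bash

   # Check the current per-process limit on open file descriptors.
   ulimit -n

   # Raise the limit for the current shell before starting Ray.
   # 1048576 is the value used in the benchmark below; pick a value that
   # suits your cluster. Persistent limits are typically configured in
   # /etc/security/limits.conf or in the service manager's unit settings.
   ulimit -n 1048576
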
ARP cache
*********
Another thing that needs to be configured is the ARP cache. In a large cluster,
all the worker nodes connect to the head node, which adds a lot of entries to
the ARP table. Ensure that the ARP cache size is large enough to handle this
many nodes.
Failure to do this will result in the head node hanging. When this happens,
``dmesg`` will show ``neighbor table overflow`` errors.
On Ubuntu, the ARP cache size can be tuned in ``/etc/sysctl.conf`` by increasing
the values of ``net.ipv4.neigh.default.gc_thresh1`` through ``net.ipv4.neigh.default.gc_thresh3``.
For more details, please refer to the OS manual.
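A minimal sketch, using the values from the benchmark at the end of this section
(treat them as a starting point rather than a recommendation; the hard limit
``gc_thresh3`` should comfortably exceed the number of nodes in the cluster):

.. code-block:: bash

   # Increase the ARP cache thresholds at runtime.
   sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
   sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
   sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192

   # To persist across reboots, add the same keys to /etc/sysctl.conf
   # and reload the configuration.
   sudo sysctl -p
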
Tuning Ray Settings
~~~~~~~~~~~~~~~~~~~
.. note::
There is an ongoing `project <https://github.com/ray-project/ray/projects/15>`_ focusing on
improving Ray's scalability and stability. Feel free to share your thoughts and use cases.
To run a large cluster, several parameters need to be tuned in Ray.
Resource broadcasting
*********************
.. note::
There is an ongoing `project <https://github.com/ray-project/ray/issues/30631>`_ changing the
algorithm to a pull-based one, which doesn't require tuning these parameters.
Another piece of functionality the GCS provides is ensuring that each worker node
has a view of the available resources of every other node in the Ray cluster. Each
raylet pushes its local available resources to the GCS, and the GCS periodically
broadcasts them to all raylets. The time complexity is O(N^2), so in a large Ray
cluster this becomes an issue: most of the time is spent broadcasting resources.
There are several settings that can be used to tune this:
- ``RAY_resource_broadcast_batch_size``: the maximum number of nodes included in a
  single broadcast request sent by the GCS; 512 by default.
- ``RAY_raylet_report_resources_period_milliseconds``: the interval between two
  resource reports from a raylet; 100 ms by default.
Be aware that this is a trade-off between scheduling performance and GCS load:
decreasing the resource broadcasting frequency might make scheduling slower.
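As a sketch of how these might be applied (the values are illustrative assumptions,
not recommendations; Ray reads configuration overrides from ``RAY_``-prefixed
environment variables when a node starts):

.. code-block:: bash

   # On the head node: cap how many node entries the GCS packs into a
   # single broadcast request (default 512; value below is illustrative).
   RAY_resource_broadcast_batch_size=256 ray start --head

   # On each worker node: report local resources less frequently
   # (default 100 ms; value below is illustrative).
   RAY_raylet_report_resources_period_milliseconds=1000 \
       ray start --address=<head-node-ip>:6379
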
gRPC threads for GCS
********************
.. note::
There is an ongoing `PR <https://github.com/ray-project/ray/pull/30131>`_
setting it to vCPUs/4 by default. It's not necessary to set this up in Ray 2.3+.
By default, only one gRPC thread is used for server and client polling from the
completion queues. This might become a bottleneck if the QPS is too high.
- ``RAY_gcs_server_rpc_server_thread_num``: controls the number of threads in the GCS
  polling the server completion queue; 1 by default.
- ``RAY_gcs_server_rpc_client_thread_num``: controls the number of threads in the GCS
  polling the client completion queue; 1 by default.
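For example, a sketch of setting these on the head node (the values match the
benchmark setup below; they are not a general recommendation):

.. code-block:: bash

   # On the head node only; these control the GCS-side gRPC polling threads.
   export RAY_gcs_server_rpc_server_thread_num=3
   export RAY_gcs_server_rpc_client_thread_num=3
   ray start --head
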
Benchmark
~~~~~~~~~
The machine setup:
- 1 head node: m5.4xlarge (16 vCPUs/64GB mem)
- 2000 worker nodes: m5.large (2 vCPUs/8GB mem)
The OS setup:
- Set the maximum number of open files to 1048576
- Increase the ARP cache size:
- ``net.ipv4.neigh.default.gc_thresh1=2048``
- ``net.ipv4.neigh.default.gc_thresh2=4096``
- ``net.ipv4.neigh.default.gc_thresh3=8192``
The Ray setup:
- ``RAY_gcs_server_rpc_client_thread_num=3``
- ``RAY_gcs_server_rpc_server_thread_num=3``
- ``RAY_event_stats=false``
- ``RAY_gcs_resource_report_poll_period_ms=1000``
Test workload:
- Test script: `code <https://github.com/ray-project/ray/blob/master/release/nightly_tests/many_nodes_tests/actor_test.py>`_
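The exact launch commands are not part of this write-up, but a sketch of how the
OS and Ray settings above might be combined when starting the head node could look
like this (all values are taken from the setup lists above):

.. code-block:: bash

   # OS settings (see the sections above).
   ulimit -n 1048576
   sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
   sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
   sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192

   # Ray settings used for the benchmark, then start the head node.
   export RAY_gcs_server_rpc_client_thread_num=3
   export RAY_gcs_server_rpc_server_thread_num=3
   export RAY_event_stats=false
   export RAY_gcs_resource_report_poll_period_ms=1000
   ray start --head
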
.. list-table:: Benchmark result
:header-rows: 1
* - Number of actors
- Actor launch time
- Actor ready time
- Total time
* - 20k (10 actors / node)
- 5.8s
- 146.1s
- 151.9s
* - 80k (40 actors / node)
- 21.1s
- 583.9s
- 605s
7 changes: 7 additions & 0 deletions src/ray/common/ray_config_def.h
@@ -714,9 +714,16 @@ RAY_CONFIG(std::string, testing_asio_delay_us, "")

/// A feature flag to enable pull based health check.
RAY_CONFIG(bool, pull_based_healthcheck, true)
/// The following are configs for the health check. They are borrowed
/// from the k8s health probe (shorturl.at/jmTY3).

/// The delay to send the first health check.
RAY_CONFIG(int64_t, health_check_initial_delay_ms, 5000)
/// The interval between two health checks.
RAY_CONFIG(int64_t, health_check_period_ms, 3000)
/// The timeout for a health check.
RAY_CONFIG(int64_t, health_check_timeout_ms, 10000)
/// The threshold to consider a node dead.
RAY_CONFIG(int64_t, health_check_failure_threshold, 5)

/// The pool size for grpc server call.
