ray-project · ericl · Nov 28, 2022 · Nov 22, 2022 · Nov 23, 2022 · Nov 23, 2022
@@ -216,3 +216,133 @@ To get information about the current available resource capacity of your cluster
 
 .. autofunction:: ray.available_resources
     :noindex:
+
+Running Large Ray Clusters
+--------------------------
+
+Here are some tips to run Ray with more than 1k nodes. When running Ray with such
+a large number of nodes, several system settings may need to be tuned to enable
+communication between such a large number of machines.
+
+Tuning Operating System Settings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because all nodes and workers connect to the GCS, many network connections will
+be created and the operating system has to support that number of connections.
+
+Maximum open files
+******************
+
+The OS has to be configured to support opening many TCP connections since every
+worker and raylet connects to the GCS. In POSIX systems, the current limit can
+be checked by ``ulimit -n`` and if it's small, it should be increased according to
+the OS manual. 
+
+ARP cache
+*********
+
+Another thing that needs to be configured is the ARP cache. In a large cluster,
+all the worker nodes connect to the head node, which adds a lot of entries to
+the ARP table. Ensure that the ARP cache size is large enough to handle this
+many nodes.
+Failure to do this will result in the head node hanging. When this happens,
+``dmesg`` will show errors like ``neighbor table overflow message``.
+
+In Ubuntu, the ARP cache size can be tuned in ``/etc/sysctl.conf`` by increasing
+the value of ``net.ipv4.neigh.default.gc_thresh1`` - ``net.ipv4.neigh.default.gc_thresh3``. 
+For more details, please refer to the OS manual.
+
+Tuning Ray Settings
+~~~~~~~~~~~~~~~~~~~
+
+.. note::
+  There is an ongoing `project <https://github.com/ray-project/ray/projects/15>`_ focusing on
+  improving Ray's scalability and stability. Feel free to share your thoughts and use cases.
+
+To run a large cluster, several parameters need to be tuned in Ray.
+
+Resource broadcasting
+*********************
+
+.. note::
+  There is an ongoing `project <https://github.com/ray-project/ray/issues/30631>`_ changing the
+  algorithm to pull-based which doesn't require tuning these parameters.
+
+
+Another functionality GCS provided is to ensure each worker node has a view of
+available resources of each other in the Ray cluster. Each raylet is going to
+push its local available resource to GCS and GCS will broadcast it to all the
+raylet periodically. The time complexity is O(N^2). In a large Ray cluster, this
+is going to be an issue, since most of the time is spent on broadcasting the
+resources. There are several settings we can use to tune this: 
+
+- ``RAY_resource_broadcast_batch_size`` The maximum number of nodes in a single
+  request sent by GCS, by default 512.
+- ``RAY_raylet_report_resources_period_milliseconds`` The interval between two
+  resources report in raylet, 100ms by default.
+
+Be aware that this is a trade-off between scheduling performance and GCS loads.
+Decreasing the resource broadcasting frequency might make scheduling slower.
+
+gRPC threads for GCS
+********************
+
+.. note::
+   There is an ongoing `PR <https://github.com/ray-project/ray/pull/30131>`_
+   setting it to vCPUs/4 by default. It's not necessary to set this up in ray 2.3+.
+
+
+By default, only one gRPC thread is used for server and client polling from the
+completion queue. This might become the bottleneck if QPS is too high.
+
+- ``RAY_gcs_server_rpc_server_thread_num`` Control the number of threads in GCS
+  polling from the server completion queue, by default, 1. 
+- ``RAY_gcs_server_rpc_client_thread_num`` Control the number of threads in GCS
+  polling from the client completion queue, by default, 1. 
+
+
+Benchmark
+~~~~~~~~~
+
+The machine setup:
+
+- 1 head node: m5.4xlarge (16 vCPUs/64GB mem)
+- 2000 worker nodes: m5.large (2 vCPUs/8GB mem)
+
+The OS setup:
+
+- Set the maximum number of opening files to 1048576
+- Increase the ARP cache size:
+    - ``net.ipv4.neigh.default.gc_thresh1=2048``
+    - ``net.ipv4.neigh.default.gc_thresh2=4096``
+    - ``net.ipv4.neigh.default.gc_thresh3=8192``
+
+
+The Ray setup:
+
+- ``RAY_gcs_server_rpc_client_thread_num=3``
+- ``RAY_gcs_server_rpc_server_thread_num=3``
+- ``RAY_event_stats=false``
+- ``RAY_gcs_resource_report_poll_period_ms=1000``
+
+Test workload:
+
+- Test script: `code <https://github.com/ray-project/ray/blob/master/release/nightly_tests/many_nodes_tests/actor_test.py>`_
+
+
+
+.. list-table:: Benchmark result
+   :header-rows: 1
+
+   * - Number of actors
+     - Actor launch time
+     - Actor ready time
+     - Total time
+   * - 20k (10 actors / node)
+     - 5.8s
+     - 146.1s
+     - 151.9s
+   * - 80k (40 actors / node)
+     - 21.1s
+     - 583.9s
+     - 605s
diff --git a/src/ray/common/ray_config_def.h b/src/ray/common/ray_config_def.h
@@ -710,7 +710,14 @@ RAY_CONFIG(std::string, testing_asio_delay_us, "")
 
 /// A feature flag to enable pull based health check.
 RAY_CONFIG(bool, pull_based_healthcheck, true)
+/// The following are configs for the health check. They are borrowed
+/// from k8s health probe (shorturl.at/jmTY3)
+
+/// The delay to send the first health check.
 RAY_CONFIG(int64_t, health_check_initial_delay_ms, 5000)
+/// The interval between two health check.
 RAY_CONFIG(int64_t, health_check_period_ms, 3000)
+/// The timeout for a health check.
 RAY_CONFIG(int64_t, health_check_timeout_ms, 10000)
+/// The threshold to consider a node dead.
 RAY_CONFIG(int64_t, health_check_failure_threshold, 5)