From 66bc2a90d5209b2a78b5ad13d6edcaf4d96dfbfe Mon Sep 17 00:00:00 2001
From: Stephanie Wang
Date: Mon, 28 Nov 2022 16:25:31 -0800
Subject: [PATCH] Revert "[doc] Add a session in ray core doc for tips to run
 large ray cluster. (#30599)"

This reverts commit ba4af8ef16265b2a13cec149d27265516d3884bb.
---
 doc/source/ray-core/miscellaneous.rst | 130 --------------------------
 src/ray/common/ray_config_def.h       |   7 --
 2 files changed, 137 deletions(-)

diff --git a/doc/source/ray-core/miscellaneous.rst b/doc/source/ray-core/miscellaneous.rst
index dcf9092dc9fb..9c20ce31282c 100644
--- a/doc/source/ray-core/miscellaneous.rst
+++ b/doc/source/ray-core/miscellaneous.rst
@@ -216,133 +216,3 @@ To get information about the current available resource capacity of your cluster
 
 .. autofunction:: ray.available_resources
     :noindex:
-
-Running Large Ray Clusters
---------------------------
-
-Here are some tips for running Ray with more than 1k nodes. When running Ray at
-such a scale, several system settings may need to be tuned to enable
-communication between that many machines.
-
-Tuning Operating System Settings
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Because all nodes and workers connect to the GCS, many network connections will
-be created, and the operating system has to support that number of connections.
-
-Maximum open files
-******************
-
-The OS has to be configured to support opening many TCP connections since every
-worker and raylet connects to the GCS. On POSIX systems, the current limit can
-be checked with ``ulimit -n``; if it is too small, increase it according to
-the OS manual.
-
-ARP cache
-*********
-
-Another thing that needs to be configured is the ARP cache. In a large cluster,
-all the worker nodes connect to the head node, which adds many entries to
-the ARP table. Ensure that the ARP cache size is large enough to handle this
-many nodes.
-Failure to do so will cause the head node to hang. When this happens,
-``dmesg`` will show ``neighbor table overflow`` errors.
-
-In Ubuntu, the ARP cache size can be tuned in ``/etc/sysctl.conf`` by increasing
-the values of ``net.ipv4.neigh.default.gc_thresh1`` through ``net.ipv4.neigh.default.gc_thresh3``.
-For more details, please refer to the OS manual.
-
-Tuning Ray Settings
-~~~~~~~~~~~~~~~~~~~
-
-.. note::
-  There is an ongoing `project `_ focusing on
-  improving Ray's scalability and stability. Feel free to share your thoughts and use cases.
-
-To run a large cluster, several parameters need to be tuned in Ray.
-
-Resource broadcasting
-*********************
-
-.. note::
-  There is an ongoing `project `_ changing the
-  algorithm to pull-based, which doesn't require tuning these parameters.
-
-
-Another function the GCS provides is giving each worker node a view of the
-available resources of every other node in the Ray cluster. Each raylet
-pushes its locally available resources to the GCS, and the GCS periodically
-broadcasts them to all raylets. The time complexity is O(N^2). In a large Ray
-cluster, this becomes an issue, since most of the time is spent broadcasting
-resources. There are several settings to tune this:
-
-- ``RAY_resource_broadcast_batch_size`` The maximum number of nodes in a single
-  request sent by the GCS, 512 by default.
-- ``RAY_raylet_report_resources_period_milliseconds`` The interval between two
-  resource reports from a raylet, 100ms by default.
-
-Be aware that this is a trade-off between scheduling performance and GCS load.
-Decreasing the resource broadcasting frequency might make scheduling slower.
-
-gRPC threads for GCS
-********************
-
-.. note::
-  There is an ongoing `PR `_
-  setting it to vCPUs/4 by default. Setting this manually is not necessary in Ray 2.3+.
-
-
-By default, only one gRPC thread is used for server and client polling from the
-completion queue. This might become a bottleneck if the QPS is too high.
-
-- ``RAY_gcs_server_rpc_server_thread_num`` Controls the number of GCS threads
-  polling the server completion queue, 1 by default.
-- ``RAY_gcs_server_rpc_client_thread_num`` Controls the number of GCS threads
-  polling the client completion queue, 1 by default.
-
-
-Benchmark
-~~~~~~~~~
-
-The machine setup:
-
-- 1 head node: m5.4xlarge (16 vCPUs/64GB mem)
-- 2000 worker nodes: m5.large (2 vCPUs/8GB mem)
-
-The OS setup:
-
-- Set the maximum number of open files to 1048576
-- Increase the ARP cache size:
-  - ``net.ipv4.neigh.default.gc_thresh1=2048``
-  - ``net.ipv4.neigh.default.gc_thresh2=4096``
-  - ``net.ipv4.neigh.default.gc_thresh3=8192``
-
-
-The Ray setup:
-
-- ``RAY_gcs_server_rpc_client_thread_num=3``
-- ``RAY_gcs_server_rpc_server_thread_num=3``
-- ``RAY_event_stats=false``
-- ``RAY_gcs_resource_report_poll_period_ms=1000``
-
-Test workload:
-
-- Test script: `code `_
-
-
-
-.. list-table:: Benchmark result
-   :header-rows: 1
-
-   * - Number of actors
-     - Actor launch time
-     - Actor ready time
-     - Total time
-   * - 20k (10 actors / node)
-     - 5.8s
-     - 146.1s
-     - 151.9s
-   * - 80k (40 actors / node)
-     - 21.1s
-     - 583.9s
-     - 605s
diff --git a/src/ray/common/ray_config_def.h b/src/ray/common/ray_config_def.h
index c32e733c39e4..c54512e7c62d 100644
--- a/src/ray/common/ray_config_def.h
+++ b/src/ray/common/ray_config_def.h
@@ -714,16 +714,9 @@ RAY_CONFIG(std::string, testing_asio_delay_us, "")
 
 /// A feature flag to enable pull based health check.
 RAY_CONFIG(bool, pull_based_healthcheck, true)
-/// The following are configs for the health check. They are borrowed
-/// from the k8s health probe (shorturl.at/jmTY3).
-
-/// The delay before sending the first health check.
 RAY_CONFIG(int64_t, health_check_initial_delay_ms, 5000)
-/// The interval between two health checks.
 RAY_CONFIG(int64_t, health_check_period_ms, 3000)
-/// The timeout for a health check.
 RAY_CONFIG(int64_t, health_check_timeout_ms, 10000)
-/// The threshold to consider a node dead.
 RAY_CONFIG(int64_t, health_check_failure_threshold, 5)
 
 /// The pool size for grpc server call.
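
The OS tuning advice in the reverted section (raise the open-file limit, enlarge the ARP cache) can be sanity-checked from Python before launching a large cluster. The sketch below is illustrative only; it is not part of the reverted docs or the Ray codebase. It assumes a Linux node where the ARP thresholds are exposed under /proc/sys, and the 65536 floor for RLIMIT_NOFILE is an arbitrary example value::

    import resource
    from pathlib import Path

    def check_open_file_limit(minimum: int = 65536) -> None:
        """Warn if the per-process open-file limit (ulimit -n) looks too low."""
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft < minimum:
            print(f"RLIMIT_NOFILE soft limit is {soft}; consider raising it to >= {minimum}")

    def check_arp_gc_thresholds() -> None:
        """Print the ARP cache thresholds; gc_thresh3 caps the neighbor table size."""
        for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
            path = Path("/proc/sys/net/ipv4/neigh/default") / name
            if path.exists():
                print(f"net.ipv4.neigh.default.{name} = {path.read_text().strip()}")

    if __name__ == "__main__":
        check_open_file_limit()
        check_arp_gc_thresholds()

Run this on the head node (and ideally on each worker node) before starting the cluster, and adjust the limits via ``ulimit``/``/etc/sysctl.conf`` as the reverted section describes.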
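Similarly, the ``RAY_*`` overrides listed under "The Ray setup" only take effect if they are present in the environment of the GCS and raylet processes. Below is a minimal sketch of one way to apply them when launching the head node; the values are copied from the benchmark section above, while invoking ``ray start --head`` through ``subprocess`` (and any extra flags your deployment needs) is an assumption, not something prescribed by the patch::

    import os
    import subprocess

    # Values from the benchmark's "The Ray setup" list; treat them as a
    # starting point to tune, not a recommendation for every cluster.
    TUNING = {
        "RAY_gcs_server_rpc_client_thread_num": "3",
        "RAY_gcs_server_rpc_server_thread_num": "3",
        "RAY_event_stats": "false",
        "RAY_gcs_resource_report_poll_period_ms": "1000",
    }

    def start_head_node() -> None:
        """Launch the Ray head node with the tuning variables in its environment."""
        # Merge with the current environment so PATH and other settings survive.
        env = {**os.environ, **TUNING}
        subprocess.run(["ray", "start", "--head"], env=env, check=True)

    if __name__ == "__main__":
        start_head_node()

The same environment variables would need to be set before ``ray start`` on each worker node for the raylet-side settings to apply.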