Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[doc] Add a session in ray core doc for tips to run large ray cluster. #30599

Merged
merged 10 commits into from
Nov 28, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
130 changes: 130 additions & 0 deletions doc/source/ray-core/miscellaneous.rst
Original file line number Diff line number Diff line change
Expand Up @@ -216,3 +216,133 @@ To get information about the current available resource capacity of your cluster

.. autofunction:: ray.available_resources
:noindex:

Running Large Ray Clusters
--------------------------

Here are some tips to run Ray with more than 1k nodes. When running Ray with such
a large number of nodes, several system settings may need to be tuned to enable
communication between such a large number of machines.

Tuning Operating System Settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because all nodes and workers connect to the GCS, many network connections will
be created and the operating system has to support that number of connections.

Maximum open files
******************

The OS has to be configured to support opening many TCP connections since every
worker and raylet connects to the GCS. In POSIX systems, the current limit can
be checked by ``ulimit -n`` and if it's small, it should be increased according to
the OS manual.

ARP cache
*********

Another thing that needs to be configured is the ARP cache. In a large cluster,
all the worker nodes connect to the head node, which adds a lot of entries to
the ARP table. Ensure that the ARP cache size is large enough to handle this
many nodes.
Failure to do this will result in the head node hanging. When this happens,
``dmesg`` will show errors like ``neighbor table overflow message``.

In Ubuntu, the ARP cache size can be tuned in ``/etc/sysctl.conf`` by increasing
the value of ``net.ipv4.neigh.default.gc_thresh1`` - ``net.ipv4.neigh.default.gc_thresh3``.
For more details, please refer to the OS manual.

Tuning Ray Settings
~~~~~~~~~~~~~~~~~~~

.. note::
There is an ongoing `project <https://github.com/ray-project/ray/projects/15>`_ focusing on
improving Ray's scalability and stability. Feel free to share your thoughts and use cases.

To run a large cluster, several parameters need to be tuned in Ray.

Resource broadcasting
*********************

.. note::
There is an ongoing `project <https://github.com/ray-project/ray/issues/30631>`_ changing the
algorithm to pull-based which doesn't require tuning these parameters.


Another functionality GCS provided is to ensure each worker node has a view of
available resources of each other in the Ray cluster. Each raylet is going to
push its local available resource to GCS and GCS will broadcast it to all the
raylet periodically. The time complexity is O(N^2). In a large Ray cluster, this
is going to be an issue, since most of the time is spent on broadcasting the
resources. There are several settings we can use to tune this:

- ``RAY_resource_broadcast_batch_size`` The maximum number of nodes in a single
request sent by GCS, by default 512.
- ``RAY_raylet_report_resources_period_milliseconds`` The interval between two
resources report in raylet, 100ms by default.

Be aware that this is a trade-off between scheduling performance and GCS loads.
Decreasing the resource broadcasting frequency might make scheduling slower.

gRPC threads for GCS
********************

.. note::
There is an ongoing `PR <https://github.com/ray-project/ray/pull/30131>`_
setting it to vCPUs/4 by default. It's not necessary to set this up in ray 2.3+.


By default, only one gRPC thread is used for server and client polling from the
completion queue. This might become the bottleneck if QPS is too high.

- ``RAY_gcs_server_rpc_server_thread_num`` Control the number of threads in GCS
polling from the server completion queue, by default, 1.
- ``RAY_gcs_server_rpc_client_thread_num`` Control the number of threads in GCS
polling from the client completion queue, by default, 1.


Benchmark
~~~~~~~~~

The machine setup:

- 1 head node: m5.4xlarge (16 vCPUs/64GB mem)
- 2000 worker nodes: m5.large (2 vCPUs/8GB mem)

The OS setup:

- Set the maximum number of opening files to 1048576
- Increase the ARP cache size:
- ``net.ipv4.neigh.default.gc_thresh1=2048``
- ``net.ipv4.neigh.default.gc_thresh2=4096``
- ``net.ipv4.neigh.default.gc_thresh3=8192``


The Ray setup:

- ``RAY_gcs_server_rpc_client_thread_num=3``
- ``RAY_gcs_server_rpc_server_thread_num=3``
- ``RAY_event_stats=false``
- ``RAY_gcs_resource_report_poll_period_ms=1000``

Test workload:

- Test script: `code <https://github.com/ray-project/ray/blob/master/release/nightly_tests/many_nodes_tests/actor_test.py>`_



.. list-table:: Benchmark result
:header-rows: 1

* - Number of actors
- Actor launch time
- Actor ready time
- Total time
* - 20k (10 actors / node)
- 5.8s
- 146.1s
- 151.9s
* - 80k (40 actors / node)
- 21.1s
- 583.9s
- 605s
7 changes: 7 additions & 0 deletions src/ray/common/ray_config_def.h
Original file line number Diff line number Diff line change
Expand Up @@ -710,7 +710,14 @@ RAY_CONFIG(std::string, testing_asio_delay_us, "")

/// A feature flag to enable pull based health check.
RAY_CONFIG(bool, pull_based_healthcheck, true)
/// The following are configs for the health check. They are borrowed
/// from k8s health probe (shorturl.at/jmTY3)

/// The delay to send the first health check.
RAY_CONFIG(int64_t, health_check_initial_delay_ms, 5000)
/// The interval between two health check.
RAY_CONFIG(int64_t, health_check_period_ms, 3000)
/// The timeout for a health check.
RAY_CONFIG(int64_t, health_check_timeout_ms, 10000)
/// The threshold to consider a node dead.
RAY_CONFIG(int64_t, health_check_failure_threshold, 5)