[Tune] [Bug] When Ray autoscaler adds node to cluster, Ray Tune can sometimes report new node as stale #21458

Closed · amholler opened this issue Jan 7, 2022 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), tune (Tune-related issues)

Comments

amholler commented Jan 7, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

I have observed this problem intermittently when running multiple Ray Tune jobs simultaneously
on a Ray 1.9.1 cluster with autoscaling enabled, and I provide some details on my latest repro below.
That said, I believe the problem is not dependent on having multiple Ray Tune jobs as in my setup,
so let me first present my theory of the problem, potential solution approaches, and potential
approaches to make the problem less intermittent.

Problem theory
AFAICT the basic problem is that a trial can be assigned to a new node added to the cluster,
but get_next_failed_trial can report that node as failed/stale because the new node's
IP has not yet been added to self._last_ip_addresses by get_alive_node_ips; that addition is
only done every _refresh_period (TUNE_STATE_REFRESH_PERIOD).
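
To illustrate the suspected race, here is a minimal, hypothetical sketch (not Ray Tune's actual code) patterned on the names above: a staleness check that consults an IP cache refreshed only every _refresh_period can misreport a node that joined between refreshes.

```python
import time

# Hypothetical sketch of the suspected race; the names mirror the Tune
# internals mentioned above (_last_ip_addresses, _refresh_period), but this
# is not the actual Ray Tune implementation.
class NodeIpCache:
    def __init__(self, query_cluster_ips, refresh_period_s=10.0):
        self._query_cluster_ips = query_cluster_ips  # callable that queries the cluster
        self._refresh_period = refresh_period_s      # TUNE_STATE_REFRESH_PERIOD
        self._last_refresh = float("-inf")
        self._last_ip_addresses = set()

    def get_alive_node_ips(self):
        # Re-query the cluster only if the refresh period has elapsed;
        # otherwise return the cached (possibly stale) set of IPs.
        now = time.monotonic()
        if now - self._last_refresh >= self._refresh_period:
            self._last_ip_addresses = set(self._query_cluster_ips())
            self._last_refresh = now
        return self._last_ip_addresses

    def trial_node_looks_stale(self, trial_ip):
        # A trial placed on a node that joined *after* the last refresh has an
        # IP missing from the cache, so it is reported as stale even though
        # the node is alive. This is the race described above.
        return trial_ip not in self.get_alive_node_ips()
```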

Potential solution approaches
If the problem theory is correct, some potential solution approaches are:
Add a "force" parameter (default False) to get_alive_node_ips and either:
a) When get_next_failed_trial is about to return a trial as failed, reinvoke get_alive_node_ips
with force set to True and recheck that the trial's IP is not in the resulting list (sketched below).
b) When a trial is assigned to a node, always invoke get_alive_node_ips with force=True,
in case the trial is being assigned to a new node.
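
A rough sketch of option (a), using hypothetical method shapes patterned on the names above (the real Tune signatures and attributes may differ):

```python
def get_alive_node_ips(self, force=False):
    # Proposed "force" parameter: bypass the refresh-period check and
    # re-query the cluster for the current set of alive node IPs.
    now = time.monotonic()
    if force or now - self._last_refresh >= self._refresh_period:
        self._last_ip_addresses = set(self._query_cluster_ips())
        self._last_refresh = now
    return self._last_ip_addresses

def get_next_failed_trial(self):
    # self._live_trials and trial.node_ip are hypothetical attributes used
    # here only for illustration.
    for trial in self._live_trials:
        if trial.node_ip not in self.get_alive_node_ips():
            # Option (a): before reporting the trial as failed, force a
            # refresh and recheck, so a node added after the last refresh
            # is not misreported as stale.
            if trial.node_ip not in self.get_alive_node_ips(force=True):
                return trial
    return None
```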

Potential approach to make the problem less intermittent
If the problem theory is correct, the following should make the problem less intermittent:
Create a Ray cluster with minWorkers=0 and maxWorkers=non-zero, with an environment that sets
TUNE_STATE_REFRESH_PERIOD to a large value to make the race window as big as possible.
Run a Ray Tune job that requires the autoscaler to add nodes to accommodate some trials;
a rough driver sketch is shown below.
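
For example, a driver script along these lines should widen the race window. This is only a sketch: the trainable, sample count, and per-trial resources are placeholders, and it assumes TUNE_STATE_REFRESH_PERIOD is honored as an environment variable read when the Tune run starts.

```python
import os

# Widen the race window: make Tune refresh its cached alive-node IPs rarely.
# The value is in seconds and deliberately large for this repro.
os.environ["TUNE_STATE_REFRESH_PERIOD"] = "600"

import ray
from ray import tune


def trainable(config):
    # Placeholder workload; any trainable that needs a GPU works here.
    for step in range(100):
        tune.report(step=step)


# Connect to a cluster started with minWorkers=0 and maxWorkers > 0 so the
# autoscaler has to add worker nodes once trials request GPUs.
ray.init(address="auto")

tune.run(
    trainable,
    num_samples=8,                             # more trials than the head node can host
    resources_per_trial={"cpu": 4, "gpu": 1},  # forces new worker nodes to be added
)
```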


Details on my latest repro [Disclaimer: I don't think this complex setup is required to see the problem]

I was running tf-legacy Ludwig AutoML using a 1.9.1 Ray cluster with minWorkers=0 and maxWorkers=8
on AWS K8s EKS with Elotl Nodeless technology available to add nodes to the K8s cluster. The Ray
cluster started out with just the head node running on a g4dn.4xlarge node. I started this script:
https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh
The Ray autoscaler requested additional workers, which were satisfied by Nodeless as g4dn.2xlarge nodes.
The forest_cover run completed without any problem; the higgs run hit this problem (a previous run did not).

The following is seen in the higgs run console output:
(bundle_reservation_check_func pid=268, ip=10.0.2.50) 2022-01-06 08:41:25.571914: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-06 08:41:27,127 INFO trial_runner.py:853 -- trial_66740572 (IP: 10.0.2.50) detected as stale. This is likely because the node was lost
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.441166: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.923189: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.923782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
(ImplicitFunc pid=268, ip=10.0.2.50) pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
(ImplicitFunc pid=268, ip=10.0.2.50) coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.923829: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.927267: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.927385: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.928547: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.928891: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.930052: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.930992: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.931171: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.931276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.931874: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.932379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.946541: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(ImplicitFunc pid=268, ip=10.0.2.50) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.946996: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.947568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
(ImplicitFunc pid=268, ip=10.0.2.50) pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
(ImplicitFunc pid=268, ip=10.0.2.50) coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.947663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.948225: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.948741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.948775: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.126607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.126638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.126645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.126856: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.127522: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.128085: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.128600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13777 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2022-01-06 08:41:43,819 WARNING worker.py:1245 -- The actor or task with ID ffffffffffffffff8209a476664cb082ce51148503000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {GPU_group_37f8519f8463afd2d2047bfcded29219: 1.000000}, {GPU_group_0_37f8519f8463afd2d2047bfcded29219: 1.000000}, {CPU_group_0_37f8519f8463afd2d2047bfcded29219: 4.000000}, {CPU_group_37f8519f8463afd2d2047bfcded29219: 4.000000}
Available resources on this node: {3.000000/7.000000 CPU, 1027604479.980469 GiB/1027604479.980469 GiB memory, 0.000000/1.000000 GPU, 437413619.970703 GiB/437413619.970703 GiB object_store_memory, 0.000000/1.000000 GPU_group_0_37f8519f8463afd2d2047bfcded29219, 1000.000000/1000.000000 bundle_group_0_37f8519f8463afd2d2047bfcded29219, 0.000000/4.000000 CPU_group_0_37f8519f8463afd2d2047bfcded29219, 0.000000/4.000000 CPU_group_37f8519f8463afd2d2047bfcded29219, 0.000000/1.000000 GPU_group_37f8519f8463afd2d2047bfcded29219, 1.000000/1.000000 example-resource-a, 1.000000/1.000000 accelerator_type:T4, 1.000000/1.000000 example-resource-b, 1000.000000/1000.000000 bundle_group_37f8519f8463afd2d2047bfcded29219, 1.000000/1.000000 node:10.0.2.50}
In total there are 0 pending tasks and 1 pending actors on this node.

And then we see this:
+----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+
| Trial name | status | loc | combiner.bn_momentum | combiner.bn_virtual_bs | combiner.num_steps | combiner.output_size | combiner.relaxation_factor | combiner.size | combiner.sparsity | training.batch_size | training.decay_rate | training.decay_steps | training.learning_rate |
|----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------|
| trial_2eb4482c | RUNNING | 10.0.2.95:275 | 0.7 | 2048 | 4 | 8 | 1 | 32 | 0.0001 | 8192 | 0.95 | 500 | 0.025 |
| trial_66740572 | RUNNING | 10.0.2.50:268 | 0.8 | 256 | 9 | 128 | 1 | 32 | 0 | 2048 | 0.95 | 10000 | 0.005 |
| trial_7c94fa00 | PENDING | | 0.98 | 1024 | 3 | 8 | 1 | 8 | 1e-06 | 1024 | 0.8 | 500 | 0.005 |
+----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+

Result for trial_66740572:
date: 2022-01-06_08-41-27
experiment_id: 11ba8403663842a492d8ce5779368899
hostname: example-cluster-ray-worker-7lxh5
node_ip: 10.0.2.50
pid: 268
timestamp: 1641487287
trial_id: '66740572'

And then we see this:

Current time: 2022-01-06 08:57:05 (running for 00:17:51.84)
Memory usage on this node: 11.2/62.0 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 1800.000: None | Iter 360.000: None | Iter 72.000: None
Resources requested: 8.0/49 CPUs, 2.0/7 GPUs, 0.0/152.6 GiB heap, 0.0/65.04 GiB objects (0.0/7.0 accelerator_type:T4, 0.0/6.0 example-resource-b, 0.0/6.0 example-resource-a)
Result logdir: /home/ray/ray_results/trainable_func_fI6YKzO
Number of trials: 4/10 (1 ERROR, 1 PENDING, 2 RUNNING)

+----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+
| Trial name | status | loc | combiner.bn_momentum | combiner.bn_virtual_bs | combiner.num_steps | combiner.output_size | combiner.relaxation_factor | combiner.size | combiner.sparsity | training.batch_size | training.decay_rate | training.decay_steps | training.learning_rate |
|----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------|
| trial_2eb4482c | RUNNING | 10.0.2.95:275 | 0.7 | 2048 | 4 | 8 | 1 | 32 | 0.0001 | 8192 | 0.95 | 500 | 0.025 |
| trial_7c94fa00 | RUNNING | 10.0.2.50:267 | 0.98 | 1024 | 3 | 8 | 1 | 8 | 1e-06 | 1024 | 0.8 | 500 | 0.005 |
| trial_7e08e7d4 | PENDING | | 0.8 | 1024 | 7 | 24 | 2 | 8 | 0.001 | 256 | 0.95 | 20000 | 0.01 |
| trial_66740572 | ERROR | 10.0.2.50:268 | 0.8 | 256 | 9 | 128 | 1 | 32 | 0 | 2048 | 0.95 | 10000 | 0.005 |
+----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+

So trial_66740572 is marked as failed due to a stale node, and the next trial, trial_7c94fa00, is scheduled on the same node and runs with no problem.

Versions / Dependencies

Ray 1.9.1

Reproduction script

Information on repro included above

Anything else

Intermittent; occurs when the Ray autoscaler is used with Ray Tune and a new node is added to the Ray cluster.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@amholler amholler added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 7, 2022
amholler commented Jan 7, 2022

I guess one way to reduce the likelihood of this problem having an impact is to set max_failures to 1,
with the hope that retrying the trial will succeed the second time (sketched below), along with further
reducing the value of TUNE_STATE_REFRESH_PERIOD.
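
For example, a sketch of the workaround (my_trainable is a trivial placeholder for the actual trainable):

```python
from ray import tune


def my_trainable(config):
    # Placeholder trainable used only for illustration.
    tune.report(done=True)


# Possible mitigation, not a fix: allow one retry so a trial misreported as
# stale right after a node joins gets rescheduled instead of erroring out.
tune.run(
    my_trainable,
    max_failures=1,  # retry a "failed" trial once
)
```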

@xwjiang2010 xwjiang2010 self-assigned this Jan 7, 2022
@xwjiang2010 xwjiang2010 added tune Tune-related issues and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 7, 2022
xwjiang2010 commented

Working on assessing whether to remove get_alive_node_ips entirely. Note to myself: we can potentially borrow what is used here.

@clarkzinzow clarkzinzow changed the title [Bug] When Ray autoscaler adds node to cluster, Ray Tune can sometimes report new node as stale [Tune] [Bug] When Ray autoscaler adds node to cluster, Ray Tune can sometimes report new node as stale Jan 7, 2022
krfricke added a commit that referenced this issue Jan 19, 2022
See #21458. Currently, Tune keeps its own list of alive node IPs, but this information is only updated every 10 seconds and is usually stale when a new node is added. Because of this, the first trial scheduled on this node is usually marked as failed. This PR adds a test confirming this behavior and gets rid of the unneeded code path.

Co-authored-by: Xiaowei Jiang <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
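
For reference, alive node IPs can also be read directly from the cluster's node table rather than from a periodically refreshed cache; below is a minimal sketch using the public ray.nodes() API (an illustration only, not necessarily what the commit above does):

```python
import ray


def alive_node_ips():
    # ray.nodes() returns the cluster's node table; keep only nodes that are
    # currently marked alive and take their node-manager IP addresses.
    return {
        node["NodeManagerAddress"]
        for node in ray.nodes()
        if node.get("Alive", False)
    }


if __name__ == "__main__":
    ray.init(address="auto")  # or ray.init() for a local test cluster
    print(alive_node_ips())
```
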
xwjiang2010 commented

Fixed by 8fd5b7a. Closing out.
