[Tune] [Bug] When Ray autoscaler adds node to cluster, Ray Tune can sometimes report new node as stale #21458

Closed · amholler opened this issue Jan 7, 2022 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), tune (Tune-related issues)

Comments

amholler commented Jan 7, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

I have observed this problem intermittently when running multiple Ray Tune jobs simultaneously
on a Ray 1.9.1 cluster with autoscaling enabled, and I provide some details on my latest repro below.
That said, I believe the problem is not dependent on having multiple Ray Tune jobs as in my setup,
so let me first present my theory of the problem, potential solution approaches, and potential
approaches to make the problem less intermittent.

Problem theory
AFAICT the basic problem is that a trial can be assigned to a new node added to the cluster,
but get_next_failed_trial can report that node as failed/stale because the new node's
IP has not yet been added to self._last_ip_addresses by get_alive_node_ips; that addition is
only done every _refresh_period (TUNE_STATE_REFRESH_PERIOD).
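
To illustrate the suspected race, here is a minimal, hypothetical sketch (not Ray Tune's actual code) patterned on the names above: a staleness check that consults an IP cache refreshed only every _refresh_period can misreport a node that joined between refreshes.

```python
import time

# Hypothetical sketch of the suspected race; the names mirror the Tune
# internals mentioned above (_last_ip_addresses, _refresh_period), but this
# is not the actual Ray Tune implementation.
class NodeIpCache:
    def __init__(self, query_cluster_ips, refresh_period_s=10.0):
        self._query_cluster_ips = query_cluster_ips  # callable that queries the cluster
        self._refresh_period = refresh_period_s      # TUNE_STATE_REFRESH_PERIOD
        self._last_refresh = float("-inf")
        self._last_ip_addresses = set()

    def get_alive_node_ips(self):
        # Re-query the cluster only if the refresh period has elapsed;
        # otherwise return the cached (possibly stale) set of IPs.
        now = time.monotonic()
        if now - self._last_refresh >= self._refresh_period:
            self._last_ip_addresses = set(self._query_cluster_ips())
            self._last_refresh = now
        return self._last_ip_addresses

    def trial_node_looks_stale(self, trial_ip):
        # A trial placed on a node that joined *after* the last refresh has an
        # IP missing from the cache, so it is reported as stale even though
        # the node is alive. This is the race described above.
        return trial_ip not in self.get_alive_node_ips()
```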

Potential solution approaches
If the problem theory is correct, some potential solution approaches are:
Add a "force" parameter (default False) to get_alive_node_ips and either:
a) When get_next_failed_trial is about to return a trial as failed, reinvoke get_alive_node_ips
with force set to True and recheck that the trial's IP is not in the resulting list (sketched below).
b) When a trial is assigned to a node, always invoke get_alive_node_ips with force=True,
in case the trial is being assigned to a new node.
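
A rough sketch of option (a), using hypothetical method shapes patterned on the names above (the real Tune signatures and attributes may differ):

```python
def get_alive_node_ips(self, force=False):
    # Proposed "force" parameter: bypass the refresh-period check and
    # re-query the cluster for the current set of alive node IPs.
    now = time.monotonic()
    if force or now - self._last_refresh >= self._refresh_period:
        self._last_ip_addresses = set(self._query_cluster_ips())
        self._last_refresh = now
    return self._last_ip_addresses

def get_next_failed_trial(self):
    # self._live_trials and trial.node_ip are hypothetical attributes used
    # here only for illustration.
    for trial in self._live_trials:
        if trial.node_ip not in self.get_alive_node_ips():
            # Option (a): before reporting the trial as failed, force a
            # refresh and recheck, so a node added after the last refresh
            # is not misreported as stale.
            if trial.node_ip not in self.get_alive_node_ips(force=True):
                return trial
    return None
```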

Potential approach to make the problem less intermittent
If the problem theory is correct, the following should make the problem less intermittent:
Create a Ray cluster with minWorkers=0 and maxWorkers=non-zero, with an environment that sets
TUNE_STATE_REFRESH_PERIOD to a large value to make the race window as big as possible.
Run a Ray Tune job that requires the autoscaler to add nodes to accommodate some trials;
a rough driver sketch is shown below.
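
For example, a driver script along these lines should widen the race window. This is only a sketch: the trainable, sample count, and per-trial resources are placeholders, and it assumes TUNE_STATE_REFRESH_PERIOD is honored as an environment variable read when the Tune run starts.

```python
import os

# Widen the race window: make Tune refresh its cached alive-node IPs rarely.
# The value is in seconds and deliberately large for this repro.
os.environ["TUNE_STATE_REFRESH_PERIOD"] = "600"

import ray
from ray import tune


def trainable(config):
    # Placeholder workload; any trainable that needs a GPU works here.
    for step in range(100):
        tune.report(step=step)


# Connect to a cluster started with minWorkers=0 and maxWorkers > 0 so the
# autoscaler has to add worker nodes once trials request GPUs.
ray.init(address="auto")

tune.run(
    trainable,
    num_samples=8,                             # more trials than the head node can host
    resources_per_trial={"cpu": 4, "gpu": 1},  # forces new worker nodes to be added
)
```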


Details on my latest repro [Disclaimer: I don't think this complex setup is required to see the problem]

I was running tf-legacy Ludwig AutoML using a 1.9.1 Ray cluster with minWorkers=0 and maxWorkers=8
on AWS K8s EKS with Elotl Nodeless technology available to add nodes to the K8s cluster. The Ray
cluster started out with just the head node running on a g4dn.4xlarge node. I started this script:
https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh
The Ray autoscaler requested additional workers, which were satisfied by Nodeless as g4dn.2xlarge nodes.
The forest_cover run completed without any problem; the higgs run hit this problem (a previous run did not).

The following is seen in the higgs run console output:
(bundle_reservation_check_func pid=268, ip=10.0.2.50) 2022-01-06 08:41:25.571914: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-06 08:41:27,127 INFO trial_runner.py:853 -- trial_66740572 (IP: 10.0.2.50) detected as stale. This is likely because the node was lost
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.441166: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.923189: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.923782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
(ImplicitFunc pid=268, ip=10.0.2.50) pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
(ImplicitFunc pid=268, ip=10.0.2.50) coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.923829: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.927267: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.927385: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.928547: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.928891: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.930052: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.930992: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.931171: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.931276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.931874: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.932379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.946541: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(ImplicitFunc pid=268, ip=10.0.2.50) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.946996: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.947568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
(ImplicitFunc pid=268, ip=10.0.2.50) pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
(ImplicitFunc pid=268, ip=10.0.2.50) coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.947663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.948225: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.948741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:28.948775: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.126607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.126638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.126645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.126856: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.127522: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.128085: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
(ImplicitFunc pid=268, ip=10.0.2.50) 2022-01-06 08:41:30.128600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13777 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2022-01-06 08:41:43,819 WARNING worker.py:1245 -- The actor or task with ID ffffffffffffffff8209a476664cb082ce51148503000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {GPU_group_37f8519f8463afd2d2047bfcded29219: 1.000000}, {GPU_group_0_37f8519f8463afd2d2047bfcded29219: 1.000000}, {CPU_group_0_37f8519f8463afd2d2047bfcded29219: 4.000000}, {CPU_group_37f8519f8463afd2d2047bfcded29219: 4.000000}
Available resources on this node: {3.000000/7.000000 CPU, 1027604479.980469 GiB/1027604479.980469 GiB memory, 0.000000/1.000000 GPU, 437413619.970703 GiB/437413619.970703 GiB object_store_memory, 0.000000/1.000000 GPU_group_0_37f8519f8463afd2d2047bfcded29219, 1000.000000/1000.000000 bundle_group_0_37f8519f8463afd2d2047bfcded29219, 0.000000/4.000000 CPU_group_0_37f8519f8463afd2d2047bfcded29219, 0.000000/4.000000 CPU_group_37f8519f8463afd2d2047bfcded29219, 0.000000/1.000000 GPU_group_37f8519f8463afd2d2047bfcded29219, 1.000000/1.000000 example-resource-a, 1.000000/1.000000 accelerator_type:T4, 1.000000/1.000000 example-resource-b, 1000.000000/1000.000000 bundle_group_37f8519f8463afd2d2047bfcded29219, 1.000000/1.000000 node:10.0.2.50}
In total there are 0 pending tasks and 1 pending actors on this node.

And then we see this:
+----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+
| Trial name | status | loc | combiner.bn_momentum | combiner.bn_virtual_bs | combiner.num_steps | combiner.output_size | combiner.relaxation_factor | combiner.size | combiner.sparsity | training.batch_size | training.decay_rate | training.decay_steps | training.learning_rate |
|----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------|
| trial_2eb4482c | RUNNING | 10.0.2.95:275 | 0.7 | 2048 | 4 | 8 | 1 | 32 | 0.0001 | 8192 | 0.95 | 500 | 0.025 |
| trial_66740572 | RUNNING | 10.0.2.50:268 | 0.8 | 256 | 9 | 128 | 1 | 32 | 0 | 2048 | 0.95 | 10000 | 0.005 |
| trial_7c94fa00 | PENDING | | 0.98 | 1024 | 3 | 8 | 1 | 8 | 1e-06 | 1024 | 0.8 | 500 | 0.005 |
+----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+

Result for trial_66740572:
date: 2022-01-06_08-41-27
experiment_id: 11ba8403663842a492d8ce5779368899
hostname: example-cluster-ray-worker-7lxh5
node_ip: 10.0.2.50
pid: 268
timestamp: 1641487287
trial_id: '66740572'

And then we see this:

Current time: 2022-01-06 08:57:05 (running for 00:17:51.84)
Memory usage on this node: 11.2/62.0 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 1800.000: None | Iter 360.000: None | Iter 72.000: None
Resources requested: 8.0/49 CPUs, 2.0/7 GPUs, 0.0/152.6 GiB heap, 0.0/65.04 GiB objects (0.0/7.0 accelerator_type:T4, 0.0/6.0 example-resource-b, 0.0/6.0 example-resource-a)
Result logdir: /home/ray/ray_results/trainable_func_fI6YKzO
Number of trials: 4/10 (1 ERROR, 1 PENDING, 2 RUNNING)

+----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+
| Trial name | status | loc | combiner.bn_momentum | combiner.bn_virtual_bs | combiner.num_steps | combiner.output_size | combiner.relaxation_factor | combiner.size | combiner.sparsity | training.batch_size | training.decay_rate | training.decay_steps | training.learning_rate |
|----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------|
| trial_2eb4482c | RUNNING | 10.0.2.95:275 | 0.7 | 2048 | 4 | 8 | 1 | 32 | 0.0001 | 8192 | 0.95 | 500 | 0.025 |
| trial_7c94fa00 | RUNNING | 10.0.2.50:267 | 0.98 | 1024 | 3 | 8 | 1 | 8 | 1e-06 | 1024 | 0.8 | 500 | 0.005 |
| trial_7e08e7d4 | PENDING | | 0.8 | 1024 | 7 | 24 | 2 | 8 | 0.001 | 256 | 0.95 | 20000 | 0.01 |
| trial_66740572 | ERROR | 10.0.2.50:268 | 0.8 | 256 | 9 | 128 | 1 | 32 | 0 | 2048 | 0.95 | 10000 | 0.005 |
+----------------+----------+---------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+

So trial_66740572 is marked as failed due to a stale node, and the next trial, trial_7c94fa00, is scheduled on the same node and runs with no problem.

Versions / Dependencies

Ray 1.9.1

Reproduction script

Information on repro included above

Anything else

Intermittent; occurs when the Ray autoscaler is used with Ray Tune and a new node is added to the Ray cluster.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@amholler amholler added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 7, 2022
amholler commented Jan 7, 2022

I guess one way to reduce the likelihood of this problem having an impact is to set max_failures to 1,
with the hope that retrying the trial will succeed the second time (sketched below), along with further
reducing the value of TUNE_STATE_REFRESH_PERIOD.
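
For example, a sketch of the workaround (my_trainable is a trivial placeholder for the actual trainable):

```python
from ray import tune


def my_trainable(config):
    # Placeholder trainable used only for illustration.
    tune.report(done=True)


# Possible mitigation, not a fix: allow one retry so a trial misreported as
# stale right after a node joins gets rescheduled instead of erroring out.
tune.run(
    my_trainable,
    max_failures=1,  # retry a "failed" trial once
)
```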

@xwjiang2010 xwjiang2010 self-assigned this Jan 7, 2022
@xwjiang2010 xwjiang2010 added tune Tune-related issues and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 7, 2022
xwjiang2010 commented

Working on assessing whether to remove get_alive_node_ips entirely. Note to myself: we can potentially borrow what is used here.

@clarkzinzow clarkzinzow changed the title [Bug] When Ray autoscaler adds node to cluster, Ray Tune can sometimes report new node as stale [Tune] [Bug] When Ray autoscaler adds node to cluster, Ray Tune can sometimes report new node as stale Jan 7, 2022
krfricke added a commit that referenced this issue Jan 19, 2022
See #21458. Currently, Tune keeps its own list of alive node IPs, but this information is only updated every 10 seconds and is usually stale when a new node is added. Because of this, the first trial scheduled on this node is usually marked as failed. This PR adds a test confirming this behavior and gets rid of the unneeded code path.

Co-authored-by: Xiaowei Jiang <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
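
For reference, alive node IPs can also be read directly from the cluster's node table rather than from a periodically refreshed cache; below is a minimal sketch using the public ray.nodes() API (an illustration only, not necessarily what the commit above does):

```python
import ray


def alive_node_ips():
    # ray.nodes() returns the cluster's node table; keep only nodes that are
    # currently marked alive and take their node-manager IP addresses.
    return {
        node["NodeManagerAddress"]
        for node in ray.nodes()
        if node.get("Alive", False)
    }


if __name__ == "__main__":
    ray.init(address="auto")  # or ray.init() for a local test cluster
    print(alive_node_ips())
```
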
xwjiang2010 commented

Fixed by 8fd5b7a. Closing out.
