[tune] Trial executor crashes on node removal in xray #2851

Closed
ericl opened this issue Sep 9, 2018 · 6 comments
@ericl
Contributor

ericl commented Sep 9, 2018

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Ray installed from (source or binary): binary
  • Ray version: 0.5.2
  • Python version: 2.7
  • Exact command to reproduce:
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 99/256 CPUs, 3/4 GPUs
Result logdir: /home/ubuntu/ray_results/atari-impala
ERROR trials:
 - IMPALA_BeamRiderNoFrameskip-v4_1_env=BeamRiderNoFrameskip-v4:        ERROR, 1 failures: /home/ubuntu/ray_results/atari-impala/IMPALA_BeamRiderNoFrameskip-v4_1_env=BeamRiderNoFrameskip-v4_2018-09-09_23-41-15GCdo68/error_2018-09-09_23-44-49.txt [pid=26010], 193 s, 6 iter, 380750 ts, 401 rew
RUNNING trials:
 - IMPALA_BreakoutNoFrameskip-v4_0_env=BreakoutNoFrameskip-v4:  RUNNING [pid=26046], 193 s, 6 iter, 378750 ts, 8.15 rew
 - IMPALA_QbertNoFrameskip-v4_2_env=QbertNoFrameskip-v4:        RUNNING [pid=26033], 194 s, 6 iter, 391250 ts, 300 rew
 - IMPALA_SpaceInvadersNoFrameskip-v4_3_env=SpaceInvadersNoFrameskip-v4:        RUNNING [pid=26021], 193 s, 6 iter, 382500 ts, 212 rew
A worker died or was killed while executing task 0000000030003bc01e1113c07967cbbfffa54f7a.

A worker died or was killed while executing task 00000000278f319e0918aff7b8ba3c67ca7e4caf.
A worker died or was killed while executing task 000000002b068851a815f7af6f532e7868697a94.
A worker died or was killed while executing task 00000000602c103969d4f2cdea5a27759b576492.
A worker died or was killed while executing task 00000000d5e61fdd4fbfd6f33eda7a7fe8b9f06e.
A worker died or was killed while executing task 00000000a2c607f67fbfaf56a0868fa459895dfa.
A worker died or was killed while executing task 0000000016390605fb225586db66c37f4d88c0b2.
A worker died or was killed while executing task 00000000dae4eec56597b4790904c72ed6034529.
A worker died or was killed while executing task 000000004c0abfa50e320361265b7d3446038b1b.
A worker died or was killed while executing task 00000000e7b81db3eef71ce561edfbfc14133466.
A worker died or was killed while executing task 00000000a83b970ac64b2b050e7988b5a9f98d54.
A worker died or was killed while executing task 00000000f92be4b92ae8bfc43886ed97272fc1cc.
A worker died or was killed while executing task 00000000fddb28c7a3346a66fd203c0e52e30c7e.
A worker died or was killed while executing task 00000000edba4e591ebf8a89b35c26bec6d12474.
A worker died or was killed while executing task 00000000c3149efa851aea1b391fe181a7b3f419.
A worker died or was killed while executing task 00000000a2d156fc320d7557c8fa923d141caec7.
A worker died or was killed while executing task 000000004508d8f47c933f2aa81027b6660e8bf4.
Traceback (most recent call last):
  File "./train.py", line 118, in <module>
    run(args, parser)
  File "./train.py", line 112, in run
    queue_trials=args.queue_trials)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/tune.py", line 102, in run_experiments
    runner.step()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/trial_runner.py", line 101, in step
    self.trial_executor.on_step_begin()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/ray_trial_executor.py", line 252, in on_step_begin
    self._update_avail_resources()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/ray_trial_executor.py", line 197, in _update_avail_resources
    num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/ray_trial_executor.py", line 197, in <genexpr>
    num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
KeyError: 'CPU'
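
The crash comes from the direct dictionary lookup in _update_avail_resources: as the client tables later in this thread show, a removed node appears in the client table as an entry with an empty Resources dict, so cl['Resources']['CPU'] raises. A minimal, self-contained illustration with sample data (not the actual Tune code path):

clients = [
    {"ClientID": "aaa", "IsInsertion": True,
     "Resources": {"CPU": 4.0, "GPU": 1.0}},   # live node
    {"ClientID": "bbb", "IsInsertion": False,
     "Resources": {}},                          # deregistration entry for a removed node
]

num_cpus = sum(cl["Resources"]["CPU"] for cl in clients)  # raises KeyError: 'CPU'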
@ericl ericl added the xray label Sep 9, 2018
@richardliaw
Contributor

richardliaw commented Sep 10, 2018

@atumanov @robertnishihara

To try to reproduce this, I started Ray twice on the same node (once as the head, once as a worker connecting to the head). I then opened an interpreter, connected to Ray, and checked the client table:

In [4]: ray.global_state.client_table()
Out[4]:
[{'ClientID': 'e619bc437872c4ec9fa34b963bb3cf10e23b48f4',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 40347,
  'ObjectManagerPort': 43791,
  'ObjectStoreSocketName': '/tmp/plasma_store82307007',
  'RayletSocketName': '/tmp/raylet27088953',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 46637,
  'ObjectManagerPort': 38235,
  'ObjectStoreSocketName': '/tmp/plasma_store53490427',
  'RayletSocketName': '/tmp/raylet23718122',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}}]

I then killed the second raylet, which gave me this message:

In [5]: The node with client ID 2d596eab937a8cced74b72d904c1da578cdb7cdb has been marked dead because the monitor has missed too many heartbeats from it.

But when I checked the client table again, it had 3 entries. Is this intended? Note that the 2nd and 3rd entries have the same client ID.

In [7]: ray.global_state.client_table()
Out[7]:
[{'ClientID': 'e619bc437872c4ec9fa34b963bb3cf10e23b48f4',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 40347,
  'ObjectManagerPort': 43791,
  'ObjectStoreSocketName': '/tmp/plasma_store82307007',
  'RayletSocketName': '/tmp/raylet27088953',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 46637,
  'ObjectManagerPort': 38235,
  'ObjectStoreSocketName': '/tmp/plasma_store53490427',
  'RayletSocketName': '/tmp/raylet23718122',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': False,
  'NodeManagerAddress': '',
  'NodeManagerPort': 0,
  'ObjectManagerPort': 0,
  'ObjectStoreSocketName': '',
  'RayletSocketName': '',
  'Resources': {}}]

@richardliaw richardliaw self-assigned this Sep 10, 2018
@robertnishihara
Collaborator

@richardliaw That's how it currently works (but this is confusing and should probably be changed). In Xray, the client table is stored in the GCS as an append-only log, so node deletion is achieved simply by appending another entry with IsInsertion = False.

However, this should probably not be exposed to the user.
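
Given that append-only representation, a consumer that only wants currently-live nodes has to collapse the log itself. A rough sketch of the idea (ray.global_state.client_table() is the only Ray call here; the helper is hypothetical):

def live_clients(client_log):
    # The log is ordered oldest-to-newest, so the last entry per ClientID wins.
    latest = {}
    for entry in client_log:
        latest[entry["ClientID"]] = entry
    # Keep only clients whose most recent record is an insertion.
    return [e for e in latest.values() if e.get("IsInsertion")]

# e.g. live_clients(ray.global_state.client_table()) would drop the node whose
# latest entry has IsInsertion = False.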

@ericl
Contributor Author

ericl commented Sep 10, 2018

We could switch to #2501 once that's merged.

@ericl
Contributor Author

ericl commented Sep 11, 2018

@pschafhalter does https://github.com/ray-project/ray/pull/2501/files work correctly in the case above? I looked at the code briefly and didn't see any handling of IsInsertion.

@richardliaw
Contributor

@pschafhalter Looks like it doesn't work (#2875).

@richardliaw
Contributor

After #2582 is closed, we can change this part of the code to use .get(resource, 0).
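
That change would make removed nodes (whose deregistration entries carry an empty Resources dict) contribute zero instead of raising. A small sketch of the idea, using sample data rather than the real client table:

clients = [
    {"Resources": {"CPU": 4.0, "GPU": 1.0}},  # live node
    {"Resources": {}},                        # entry appended when a node is removed
]

num_cpus = sum(cl["Resources"].get("CPU", 0) for cl in clients)  # 4.0, no KeyError
num_gpus = sum(cl["Resources"].get("GPU", 0) for cl in clients)  # 1.0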

richardliaw added a commit that referenced this issue Nov 21, 2018
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors are restarted without checking whether resources are available, which can lead to problems if the cluster loses resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available); see the sketch after this list.
 - If the cluster is saturated, RUNNING trials on the failed node become PENDING and are queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don't wait/block for a trial that isn't running.
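
A simplified sketch of the recovery policy described above (the Trial class and helper names here are illustrative, not Tune's actual API):

class Trial(object):
    def __init__(self, name, cpus):
        self.name = name
        self.cpus = cpus
        self.status = "RUNNING"

def recover_trials(trials_on_dead_node, free_cpus, notify_stop):
    # notify_stop plays the role of trial_runner.stop_trial: it tells
    # TrialSchedulers/SearchAlgorithms that the trial is no longer running.
    for trial in trials_on_dead_node:
        notify_stop(trial)
        if free_cpus >= trial.cpus:
            trial.status = "RUNNING"   # best effort: resume on another node
            free_cpus -= trial.cpus
        else:
            trial.status = "PENDING"   # cluster saturated: re-queue the trial
    return free_cpus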


Remaining questions:
 -  Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.