[tune] Trial executor crashes on node removal in xray #2851

Closed
ericl opened this issue Sep 9, 2018 · 6 comments
@ericl
Contributor

ericl commented Sep 9, 2018

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Ray installed from (source or binary): binary
  • Ray version: 0.5.2
  • Python version: 2.7
  • Exact command to reproduce:
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 99/256 CPUs, 3/4 GPUs
Result logdir: /home/ubuntu/ray_results/atari-impala
ERROR trials:
 - IMPALA_BeamRiderNoFrameskip-v4_1_env=BeamRiderNoFrameskip-v4:        ERROR, 1 failures: /home/ubuntu/ray_results/atari-impala/IMPALA_BeamRiderNoFrameskip-v4_1_env=BeamRiderNoFrameskip-v4_2018-09-09_23-41-15GCdo68/error_2018-09-09_23-44-49.txt [pid=26010], 193 s, 6 iter, 380750 ts, 401 rew
RUNNING trials:
 - IMPALA_BreakoutNoFrameskip-v4_0_env=BreakoutNoFrameskip-v4:  RUNNING [pid=26046], 193 s, 6 iter, 378750 ts, 8.15 rew
 - IMPALA_QbertNoFrameskip-v4_2_env=QbertNoFrameskip-v4:        RUNNING [pid=26033], 194 s, 6 iter, 391250 ts, 300 rew
 - IMPALA_SpaceInvadersNoFrameskip-v4_3_env=SpaceInvadersNoFrameskip-v4:        RUNNING [pid=26021], 193 s, 6 iter, 382500 ts, 212 rew
A worker died or was killed while executing task 0000000030003bc01e1113c07967cbbfffa54f7a.

A worker died or was killed while executing task 00000000278f319e0918aff7b8ba3c67ca7e4caf.
A worker died or was killed while executing task 000000002b068851a815f7af6f532e7868697a94.
A worker died or was killed while executing task 00000000602c103969d4f2cdea5a27759b576492.
A worker died or was killed while executing task 00000000d5e61fdd4fbfd6f33eda7a7fe8b9f06e.
A worker died or was killed while executing task 00000000a2c607f67fbfaf56a0868fa459895dfa.
A worker died or was killed while executing task 0000000016390605fb225586db66c37f4d88c0b2.
A worker died or was killed while executing task 00000000dae4eec56597b4790904c72ed6034529.
A worker died or was killed while executing task 000000004c0abfa50e320361265b7d3446038b1b.
A worker died or was killed while executing task 00000000e7b81db3eef71ce561edfbfc14133466.
A worker died or was killed while executing task 00000000a83b970ac64b2b050e7988b5a9f98d54.
A worker died or was killed while executing task 00000000f92be4b92ae8bfc43886ed97272fc1cc.
A worker died or was killed while executing task 00000000fddb28c7a3346a66fd203c0e52e30c7e.
A worker died or was killed while executing task 00000000edba4e591ebf8a89b35c26bec6d12474.
A worker died or was killed while executing task 00000000c3149efa851aea1b391fe181a7b3f419.
A worker died or was killed while executing task 00000000a2d156fc320d7557c8fa923d141caec7.
A worker died or was killed while executing task 000000004508d8f47c933f2aa81027b6660e8bf4.
Traceback (most recent call last):
  File "./train.py", line 118, in <module>
    run(args, parser)
  File "./train.py", line 112, in run
    queue_trials=args.queue_trials)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/tune.py", line 102, in run_experiments
    runner.step()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/trial_runner.py", line 101, in step
    self.trial_executor.on_step_begin()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/ray_trial_executor.py", line 252, in on_step_begin
    self._update_avail_resources()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/ray_trial_executor.py", line 197, in _update_avail_resources
    num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/ray/tune/ray_trial_executor.py", line 197, in <genexpr>
    num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
KeyError: 'CPU'
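
The crash comes from the direct dictionary lookup in _update_avail_resources: as the client tables later in this thread show, a removed node appears in the client table as an entry with an empty Resources dict, so cl['Resources']['CPU'] raises. A minimal, self-contained illustration with sample data (not the actual Tune code path):

clients = [
    {"ClientID": "aaa", "IsInsertion": True,
     "Resources": {"CPU": 4.0, "GPU": 1.0}},   # live node
    {"ClientID": "bbb", "IsInsertion": False,
     "Resources": {}},                          # deregistration entry for a removed node
]

num_cpus = sum(cl["Resources"]["CPU"] for cl in clients)  # raises KeyError: 'CPU'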
@ericl ericl added the xray label Sep 9, 2018
@richardliaw
Contributor

richardliaw commented Sep 10, 2018

@atumanov @robertnishihara

To try to reproduce this, I started Ray twice on the same node (once as the head, once as a worker connecting to the head). I then opened an interpreter, connected to Ray, and checked the client table:

In [4]: ray.global_state.client_table()
Out[4]:
[{'ClientID': 'e619bc437872c4ec9fa34b963bb3cf10e23b48f4',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 40347,
  'ObjectManagerPort': 43791,
  'ObjectStoreSocketName': '/tmp/plasma_store82307007',
  'RayletSocketName': '/tmp/raylet27088953',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 46637,
  'ObjectManagerPort': 38235,
  'ObjectStoreSocketName': '/tmp/plasma_store53490427',
  'RayletSocketName': '/tmp/raylet23718122',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}}]

I then killed the second raylet, which gave me this message:

In [5]: The node with client ID 2d596eab937a8cced74b72d904c1da578cdb7cdb has been marked dead because the monitor has missed too many heartbeats from it.

But when I checked the client table again, it had 3 entries. Is this intended? Note that the 2nd and 3rd entries have the same client ID.

In [7]: ray.global_state.client_table()
Out[7]:
[{'ClientID': 'e619bc437872c4ec9fa34b963bb3cf10e23b48f4',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 40347,
  'ObjectManagerPort': 43791,
  'ObjectStoreSocketName': '/tmp/plasma_store82307007',
  'RayletSocketName': '/tmp/raylet27088953',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 46637,
  'ObjectManagerPort': 38235,
  'ObjectStoreSocketName': '/tmp/plasma_store53490427',
  'RayletSocketName': '/tmp/raylet23718122',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': False,
  'NodeManagerAddress': '',
  'NodeManagerPort': 0,
  'ObjectManagerPort': 0,
  'ObjectStoreSocketName': '',
  'RayletSocketName': '',
  'Resources': {}}]

@richardliaw richardliaw self-assigned this Sep 10, 2018
@robertnishihara
Collaborator

@richardliaw That's how it currently works (but this is confusing and should probably be changed). In Xray, the client table is stored in the GCS as an append-only log, so node deletion is achieved simply by appending another entry with IsInsertion = False.

However, this should probably not be exposed to the user.
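
Given that append-only representation, a consumer that only wants currently-live nodes has to collapse the log itself. A rough sketch of the idea (ray.global_state.client_table() is the only Ray call here; the helper is hypothetical):

def live_clients(client_log):
    # The log is ordered oldest-to-newest, so the last entry per ClientID wins.
    latest = {}
    for entry in client_log:
        latest[entry["ClientID"]] = entry
    # Keep only clients whose most recent record is an insertion.
    return [e for e in latest.values() if e.get("IsInsertion")]

# e.g. live_clients(ray.global_state.client_table()) would drop the node whose
# latest entry has IsInsertion = False.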

@ericl
Contributor Author

ericl commented Sep 10, 2018

We could switch to #2501 once that's merged.

@ericl
Contributor Author

ericl commented Sep 11, 2018

@pschafhalter does https://github.com/ray-project/ray/pull/2501/files work correctly in the case above? I looked at the code briefly and didn't see any handling of IsInsertion.

@richardliaw
Contributor

@pschafhalter Looks like it doesn't work (#2875).

@richardliaw
Contributor

After #2582 is closed, we can change this part of the code to use .get(resource, 0).
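
That change would make removed nodes (whose deregistration entries carry an empty Resources dict) contribute zero instead of raising. A small sketch of the idea, using sample data rather than the real client table:

clients = [
    {"Resources": {"CPU": 4.0, "GPU": 1.0}},  # live node
    {"Resources": {}},                        # entry appended when a node is removed
]

num_cpus = sum(cl["Resources"].get("CPU", 0) for cl in clients)  # 4.0, no KeyError
num_gpus = sum(cl["Resources"].get("GPU", 0) for cl in clients)  # 1.0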

richardliaw added a commit that referenced this issue Nov 21, 2018
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors are restarted without checking whether resources are available, which can lead to problems if the cluster loses resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available); see the sketch after this list.
 - If the cluster is saturated, RUNNING trials on the failed node become PENDING and are queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don't wait/block for a trial that isn't running.
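
A simplified sketch of the recovery policy described above (the Trial class and helper names here are illustrative, not Tune's actual API):

class Trial(object):
    def __init__(self, name, cpus):
        self.name = name
        self.cpus = cpus
        self.status = "RUNNING"

def recover_trials(trials_on_dead_node, free_cpus, notify_stop):
    # notify_stop plays the role of trial_runner.stop_trial: it tells
    # TrialSchedulers/SearchAlgorithms that the trial is no longer running.
    for trial in trials_on_dead_node:
        notify_stop(trial)
        if free_cpus >= trial.cpus:
            trial.status = "RUNNING"   # best effort: resume on another node
            free_cpus -= trial.cpus
        else:
            trial.status = "PENDING"   # cluster saturated: re-queue the trial
    return free_cpus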


Remaining questions:
 -  Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.