[tune] Trial executor crashes on node removal in xray #2851
To try to reproduce this, I started Ray twice on the same node (once as head, once as a worker connecting to the head). I opened an interpreter to connect to Ray and checked the client table:

```python
In [4]: ray.global_state.client_table()
Out[4]:
[{'ClientID': 'e619bc437872c4ec9fa34b963bb3cf10e23b48f4',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 40347,
  'ObjectManagerPort': 43791,
  'ObjectStoreSocketName': '/tmp/plasma_store82307007',
  'RayletSocketName': '/tmp/raylet27088953',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 46637,
  'ObjectManagerPort': 38235,
  'ObjectStoreSocketName': '/tmp/plasma_store53490427',
  'RayletSocketName': '/tmp/raylet23718122',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}}]
```

I then killed the second raylet, which printed this message:

```
In [5]: The node with client ID 2d596eab937a8cced74b72d904c1da578cdb7cdb has been marked dead because the monitor has missed too many heartbeats from it.
```

When I checked the client table again, it had three entries. Is this intended? Note that the second and third entries have the same client ID:

```python
In [7]: ray.global_state.client_table()
Out[7]:
[{'ClientID': 'e619bc437872c4ec9fa34b963bb3cf10e23b48f4',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 40347,
  'ObjectManagerPort': 43791,
  'ObjectStoreSocketName': '/tmp/plasma_store82307007',
  'RayletSocketName': '/tmp/raylet27088953',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 46637,
  'ObjectManagerPort': 38235,
  'ObjectStoreSocketName': '/tmp/plasma_store53490427',
  'RayletSocketName': '/tmp/raylet23718122',
  'Resources': {'GPU': 1.0, 'CPU': 4.0}},
 {'ClientID': '2d596eab937a8cced74b72d904c1da578cdb7cdb',
  'IsInsertion': False,
  'NodeManagerAddress': '',
  'NodeManagerPort': 0,
  'ObjectManagerPort': 0,
  'ObjectStoreSocketName': '',
  'RayletSocketName': '',
  'Resources': {}}]
```
@richardliaw That's how it currently works (though this is confusing and should probably be changed). In Xray, the client table is stored in the GCS as an append-only log, so node deletion is achieved simply by appending another entry with `'IsInsertion': False`. However, this should probably not be exposed to the user.
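For illustration, here is a minimal sketch of how a consumer of the client table could collapse the append-only log into the set of currently live nodes. The helper name and the oldest-first ordering assumption are mine, not part of Ray's API:

```python
def live_nodes(client_table_entries):
    """Collapse an append-only client log into the set of live nodes.

    Hypothetical helper: entries are assumed to be ordered oldest-first,
    so a later entry for the same ClientID supersedes an earlier one.
    """
    latest = {}
    for entry in client_table_entries:
        latest[entry['ClientID']] = entry
    # A node is live only if its most recent log entry is an insertion.
    return [e for e in latest.values() if e['IsInsertion']]

# With the output above, the dead raylet's ClientID appears twice, and
# its final 'IsInsertion': False entry filters it out.
alive = live_nodes(ray.global_state.client_table())
```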
We could switch to #2501 once that's merged.
@pschafhalter does https://github.com/ray-project/ray/pull/2501/files work correctly in the case above? I looked at the code briefly and didn't see any handling of entries with `'IsInsertion': False`.
@pschafhalter Looks like it doesn't work (#2875).
After #2582 is closed, we can change this part of the code.
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
- Actors will be restarted without checking whether resources are available. This can lead to problems if we lose resources.

## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on the failed node become PENDING and are queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don't wait/block for a trial that isn't running. A sketch of this flow follows below.

Remaining questions:
- Should `last_result` be consistent during restore? Yes, but not for earlier trials (trials that have yet to be checkpointed).
- Waiting for some PRs to merge first (#3239).

Closes #2851.
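A rough sketch of the recovery flow described above. Only `trial_runner.stop_trial` and the RUNNING/PENDING statuses come from the PR text; `has_resources`, `requeue`, and `restart_on_other_node` are hypothetical stand-ins, not Tune's actual internals:

```python
# Hypothetical sketch of single-node fault tolerance for Tune, under the
# assumptions stated above.

def recover_trials_on_node_failure(trial_runner, failed_node_trials):
    for trial in failed_node_trials:
        # Notify TrialSchedulers/SearchAlgorithms so they don't block on
        # a trial that is no longer running.
        trial_runner.stop_trial(trial)
        if trial_runner.has_resources(trial.resources):
            # Best effort: resume on another node if resources allow.
            restart_on_other_node(trial_runner, trial)
        else:
            # Cluster saturated: mark the trial PENDING and queue it.
            trial.status = "PENDING"
            trial_runner.requeue(trial)
```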