[core][2/2] Worker resubscribe when GCS failed #24813
Conversation
Is there a plan for Python subscribers?
Logging, error reporting, and the dashboard use pubsub, and failures there are not that critical.
A follow-up PR from #24628. The previous PR fixed the resubscribing issue for the raylet, but the core worker also needs to resubscribe.

There are two ways of doing the resubscribe:
1. The client side detects any failure and resubscribes on its own.
2. The server side asks the client to resubscribe.

Option 1 is the cleaner and better solution, but it is hard to implement for the following reasons:
- We are using long-polling, so in some extreme cases the client cannot detect the failure. For example, the client receives a message, and before it sends the next request, the server restarts; the client then misses the opportunity to detect the failure. This could happen if a standby GCS starts very fast while the client is under heavy load and running slowly.
- The current gRPC framework doesn't give the user a way to handle failures, so option 1 would need some refactoring. We can switch to it once we have gRPC streaming.

This PR implements option 2, which has three parts:
- raylet: #24628
- core worker: this PR
- python: a follow-up

Correctness: whenever a worker starts, it registers with the raylet immediately (a sync call) before connecting to GCS. So we just need to send the restart RPCs to all registered workers, and that is sufficient because:
- If the worker has just started and hasn't registered with the raylet yet: that's fine, because the worker hasn't connected to GCS either, so there is nothing to resubscribe.
- If the worker has registered with the raylet: it is covered by the code path here.
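The server-driven flow above can be sketched as follows. This is a minimal illustration, not Ray's actual API: `Raylet`, `CoreWorker`, `register_worker`, and `on_gcs_restart` are all hypothetical names standing in for the real RPC machinery.

```python
class CoreWorker:
    """Hypothetical worker that holds GCS subscriptions."""

    def __init__(self):
        self.resubscribed = False

    def resubscribe(self):
        # Re-issue all GCS subscriptions after a GCS restart (sketch).
        self.resubscribed = True


class Raylet:
    """Hypothetical raylet fanning out the GCS-restart notification."""

    def __init__(self):
        # Workers register synchronously with the raylet *before* they
        # connect to GCS, so this list covers every worker that could
        # possibly hold a GCS subscription.
        self.registered_workers = []

    def register_worker(self, worker):
        # Sync call made by each worker at startup.
        self.registered_workers.append(worker)

    def on_gcs_restart(self):
        # Broadcast the restart to every registered worker. A worker
        # that has not registered yet has also not connected to GCS,
        # so it cannot need a resubscribe.
        for worker in self.registered_workers:
            worker.resubscribe()
```

The key invariant is the registration order: because registration happens before the GCS connection, "send restart RPCs to registered workers" cannot miss a subscribed worker.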
This is a follow-up to #24813 and #24628. Unlike the change in the cpp layer, where GCS broadcasts a request to the raylet/core worker and the client side then resubscribes, in the python layer we detect the failure on the client side. On a failure, the protocol is:
1. Call subscribe.
2. If resubscribing times out, throw an exception, which will crash the process. This is OK: when GCS has been down longer than expected, we expect the whole Ray cluster to be down.
3. Once subscribe succeeds, continue to poll.

However, there is an extreme case where things might break: the client might miss a failure entirely. This can happen if a long-poll has returned and the python layer is doing its own work, and before it issues the next long-poll, GCS restarts and recovers. We are not handling this case because:
1. GCS usually takes several seconds to come back up, while the python layer's work is simply pushing data into a queue (sync version). The async version is only used by the dashboard, which is not a critical component.
2. Pubsub in the python layer does not do critical work: it handles logs/errors for Ray jobs.
3. The dashboard can simply restart to fix the issue.

A known issue is that we might lose logs during a GCS failure, for two reasons:
- The python pubsub publisher is best-effort: if publishing fails too many times, it skips the message (messages lost on the producer side).
- If a message is pushed to GCS before the worker has finished resubscribing, the pushed message is lost (messages lost on the consumer side).

We think this is reasonable and valid behavior, given that logs are not defined to be a critical component and we'd like to keep the design of pubsub in GCS simple.

Another thing to note is `run_functions_on_all_workers`. We plan to stop using it within Ray core and deprecate it in the longer term. It won't cause a problem for the current cases because:
1. It is only set in the driver, and we don't support creating a new driver while GCS is down.
2. While GCS is down, we don't support starting new Ray workers, and `run_functions_on_all_workers` is only used when initializing drivers/workers.
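The three-step client-side protocol can be sketched like this. It is an illustrative sketch only: `Subscriber`, the `gcs_client` interface, and `RESUBSCRIBE_TIMEOUT_S` are assumed names, not Ray's actual python pubsub API.

```python
import time


class ResubscribeTimeout(Exception):
    """Raised when resubscribing to GCS does not succeed in time."""


class Subscriber:
    """Hypothetical python-layer subscriber with client-side failure detection."""

    # Assumed deadline; if GCS is down longer than this, we expect the
    # whole cluster to be down, so crashing is acceptable (step 2).
    RESUBSCRIBE_TIMEOUT_S = 30.0

    def __init__(self, gcs_client):
        self._gcs = gcs_client

    def _resubscribe(self):
        # Step 1: keep calling subscribe until it succeeds or the
        # deadline passes.
        deadline = time.monotonic() + self.RESUBSCRIBE_TIMEOUT_S
        while time.monotonic() < deadline:
            try:
                self._gcs.subscribe()
                return
            except ConnectionError:
                time.sleep(1.0)
        # Step 2: GCS has been down too long; crash the process.
        raise ResubscribeTimeout("GCS unreachable; giving up")

    def poll_loop(self, handler):
        # Step 3: once subscribed, keep long-polling; on any failure,
        # go back through the resubscribe path.
        self._gcs.subscribe()
        while True:
            try:
                handler(self._gcs.poll())
            except ConnectionError:
                self._resubscribe()
```

Note the window described above still exists in this sketch: a failure that happens entirely between one `poll()` returning and the next being issued is invisible to the client.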
Checks
I've run scripts/format.sh to lint the changes in this PR.