
[serve] immediately send ping in router when receiving new replica set #47053

Merged: zcin merged 12 commits into ray-project:master from router-initial-ping-replica on Aug 14, 2024

Conversation

@zcin (Contributor) commented Aug 9, 2024

Why are these changes needed?

Context:
When a new set of `RunningReplicaInfo`s is broadcast to a router, the nested actor handles are "empty": they don't yet hold the actor info (e.g. the actor address) needed to send a request to the corresponding replica. Upon the first request, the handle fetches that info from the GCS.

This can cause fault tolerance issues: if the GCS goes down immediately after a replica set change is broadcast to a router, that router is unable to send requests to any replica; all requests are blocked until the GCS recovers.

Fix:

  • Upon receiving a new replica set, the router actively probes the queue length of each replica. This sends an initial "ping" through each actor handle, which populates the actor info from the GCS, and it also warms the queue length cache.
  • Since the proxy sends its "self" actor handle to the replica (so the replica can call receive_asgi_messages on it), also push this actor handle to replicas upon a replica set change; otherwise proxy requests to new replicas will hang when the GCS is down. (A sketch of both parts of the fix is below.)
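
A minimal sketch of both parts of the fix. The method and attribute names here (update_running_replicas, _probe_queue_lens, push_proxy_handle, _self_actor_handle) are assumptions based on the diff excerpts discussed below and may differ from the actual Ray Serve internals:

# Hypothetical sketch of the router-side fix; names are assumptions,
# not the actual Ray Serve code.
def update_running_replicas(self, running_replicas):
    self._replicas = {r.replica_id: r for r in running_replicas}

    # Eagerly probe queue lengths. The first call on a fresh actor handle
    # resolves the actor's address from the GCS, so doing it now (while
    # the GCS is up) populates the handle and warms the queue length cache.
    self._loop.create_task(
        self._probe_queue_lens(list(self._replicas.values()), 0)
    )

    # Proxies also push their own actor handle to each replica so that
    # later receive_asgi_messages calls don't need to touch the GCS.
    if self._handle_source == DeploymentHandleSource.PROXY:
        for r in running_replicas:
            r.push_proxy_handle(self._self_actor_handle)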

Related issue number

closes #47036

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@zcin marked this pull request as ready for review on August 13, 2024 00:03
@zcin requested review from GeneDer and edoakes on August 13, 2024 00:03
Comment on lines 288 to 289
# Populate cache for all replicas
self._loop.create_task(self._probe_queue_lens(list(self._replicas.values()), 0))
Contributor:

Hm, can we do this only for the replicas that were added instead of all of them?

Contributor Author (zcin):

Yes! For some reason I thought it would mess with the fault tolerance, but it seems the actor info is stored per process, not per actor handle. Changed to only ping new replicas; a sketch of that change is below.
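
A sketch of that refinement, under the same assumed names as before: compute the replicas added in this update and probe only those.

# Hypothetical sketch: only probe replicas that are new to this router;
# handles for already-known replicas have resolved their actor info.
def update_running_replicas(self, running_replicas):
    new_replicas = [
        r for r in running_replicas if r.replica_id not in self._replicas
    ]
    self._replicas = {r.replica_id: r for r in running_replicas}
    if new_replicas:
        self._loop.create_task(self._probe_queue_lens(new_replicas, 0))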

# `receive_asgi_messages` which can be blocked when GCS is down.
# To prevent that from happening, push proxy handle eagerly
if self._handle_source == DeploymentHandleSource.PROXY:
r._actor_handle.push_proxy_handle.remote(
Contributor:

Let's add a method to the interface; we shouldn't be accessing the private _actor_handle attribute.

Contributor:

That way it can be tested as well; for example, something like the sketch below.
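
A hedged sketch of such an interface method; the wrapper class name and its internals are assumptions, not the actual Ray Serve code:

from ray.actor import ActorHandle

class ActorReplicaWrapper:
    """Hypothetical wrapper around a replica's actor handle."""

    def __init__(self, actor_handle: ActorHandle):
        self._actor_handle = actor_handle

    def push_proxy_handle(self, handle: ActorHandle):
        # Fire-and-forget: registering the proxy's handle once is enough
        # for the replica to call back without consulting the GCS later.
        self._actor_handle.push_proxy_handle.remote(handle)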

@@ -321,6 +321,9 @@ def _configure_logger_and_profilers(
component_id=self._component_id,
)

def push_proxy_handle(self, handle):
Contributor:

Should we do something with the handle? Also maybe add a type hint if it's required 🙃

Contributor Author (zcin):

Doing something with the handle seems unnecessary for now. I think if you pass any actor handle as an argument in a Ray remote call, like:

x.remote(actor_handle)

then Ray core does some processing under the hood that requires a call to the GCS, so if this actor_handle was never "pushed" to the actor beforehand, the call hangs. "Pushing" it once is enough to unblock such calls even when the GCS goes down.
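
A standalone toy illustration of that behavior (not from this PR; the actor classes here are made up):

import ray

@ray.remote
class Replica:
    def __init__(self):
        self._proxy_handle = None

    def push_proxy_handle(self, handle):
        # Simply receiving the handle lets Ray core resolve and cache the
        # sender's actor info in this process.
        self._proxy_handle = handle

@ray.remote
class Proxy:
    pass

ray.init()
proxy = Proxy.remote()
replica = Replica.remote()
# The initial "push": afterwards, the replica process holds the proxy's
# actor info and can use the handle without a fresh GCS lookup.
ray.get(replica.push_proxy_handle.remote(proxy))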

@edoakes (Contributor) commented Aug 14, 2024

fill in the "good comment"s before merging please :)

Signed-off-by: Cindy Zhang <[email protected]>
@zcin force-pushed the router-initial-ping-replica branch from a98a4e7 to d1c41a1 on August 14, 2024 17:56
Signed-off-by: Cindy Zhang <[email protected]>
@zcin added the go (add ONLY when ready to merge, run all tests) label on Aug 14, 2024
Signed-off-by: Cindy Zhang <[email protected]>
@zcin merged commit 048190e into ray-project:master on Aug 14, 2024
5 checks passed
@zcin deleted the router-initial-ping-replica branch on August 14, 2024 21:16
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Aug 15, 2024
[serve] immediately send ping in router when receiving new replica set (ray-project#47053)

When a new set of `RunningReplicaInfos` is broadcast to a router, the
nested actor handles are "empty" and don't hold the actor info
(e.g. actor address) needed to send a request to that replica. Upon first
request, the handle fetches that info from the GCS. If the GCS goes down
immediately after a replica set change is broadcast to a router, all
requests will be blocked until the GCS recovers.

Fix:
- Upon receiving a new replica set, the router actively probes the queue
lengths for each replica.
- On proxies, also push the proxy's self actor handle to replicas upon a
replica set change; otherwise proxy requests to new replicas will hang
when the GCS is down.

Signed-off-by: Cindy Zhang <[email protected]>
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

[serve] proxy should ping replica immediately after receiving new actor handle
3 participants