
[Serve] Add hash function of RunningReplicaInfo #32772

Conversation

Contributor

@sihanwang41 sihanwang41 commented Feb 23, 2023

Why are these changes needed?

When a longpoll client timeout happens, all internal objects are cleaned up and the longpoll client polls again. When this happens, the longpoll client receives a new object from the controller that carries the same information but a different object id (ActorHandle).
After compute_iterable_delta is called, the new object replaces the existing replica in in_flight_queries. This causes the router to clean up the ongoing request_refs and keep assigning new requests to the replica, which breaks the max_concurrent_queries parameter.
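
A minimal, self-contained sketch of the failure mode and the fix. The ReplicaInfo dataclass and compute_delta helper below are illustrative stand-ins, not the actual Ray Serve code:

```python
# Hypothetical, simplified model of the bug: if equality/hash include the actor
# handle, a re-polled snapshot of the same replica looks like a brand-new one.
from dataclasses import dataclass, field
from typing import Any, Set, Tuple


@dataclass(frozen=True)
class ReplicaInfo:
    replica_tag: str
    # The handle object differs between long-poll responses even when the
    # replica is unchanged, so it is excluded from __eq__/__hash__.
    actor_handle: Any = field(compare=False)


def compute_delta(old: Set[ReplicaInfo], new: Set[ReplicaInfo]) -> Tuple[set, set]:
    """Return (added, removed) between two snapshots, keyed by equality/hash."""
    return new - old, old - new


old_snapshot = {ReplicaInfo("replica#1", actor_handle=object())}
new_snapshot = {ReplicaInfo("replica#1", actor_handle=object())}  # fresh handle

# Because the handle is excluded, the re-polled snapshot is recognized as the
# same replica: no spurious add/remove, so in-flight bookkeeping is preserved.
added, removed = compute_delta(old_snapshot, new_snapshot)
assert added == set() and removed == set()
```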

Related issue number

Closes #32652

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sihanwang41 sihanwang41 force-pushed the autoscaling_fix_with_RunningReplicaInfo branch from 0c64878 to 972a1ac on February 23, 2023 17:49
@sihanwang41 sihanwang41 changed the title [Serve] Use actor_id when compare RunningReplicaInfo [Serve] Add hash function of RunningReplicaInfo Feb 23, 2023
@sihanwang41 sihanwang41 force-pushed the autoscaling_fix_with_RunningReplicaInfo branch from 972a1ac to fae27a1 on February 23, 2023 18:15
@sihanwang41 sihanwang41 marked this pull request as ready for review February 23, 2023 18:53
@sihanwang41 sihanwang41 force-pushed the autoscaling_fix_with_RunningReplicaInfo branch 2 times, most recently from d54a3c1 to 78a872c on February 23, 2023 21:01
size = "small",
srcs = serve_tests_srcs,
tags = ["exclusive", "team:serve"],
deps = [":serve_lib"],
Contributor

This test file wasn't running in CI before? 😮

Contributor Author

exactly...

Contributor

@architkulkarni architkulkarni left a comment

Nice catch! Is it feasible to add a test which fails without this change?

]
)
)
object.__setattr__(self, "_hash", hash_val)
Contributor

For my education what's the benefit of this over self._hash = hash_val? If not obvious maybe add a code comment

Contributor Author

@sihanwang41 sihanwang41 Feb 23, 2023

Oh, it is a hacky way to set the attribute, since we use frozen for this dataclass (setting attributes directly isn't allowed, as you mentioned).

Contributor

Makes sense! Would be good to put this in a code comment
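
To illustrate the pattern discussed in this thread (a simplified sketch, not the actual RunningReplicaInfo definition): on a frozen dataclass, plain attribute assignment raises FrozenInstanceError, so object.__setattr__ inside __post_init__ is the usual escape hatch for caching a precomputed hash.

```python
# Simplified sketch of caching a hash on a frozen dataclass; the field names
# are illustrative, not the real RunningReplicaInfo fields.
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Info:
    replica_tag: str
    actor_handle: Any = field(compare=False, default=None)
    _hash: int = field(init=False, repr=False, compare=False, default=0)

    def __post_init__(self):
        hash_val = hash(self.replica_tag)
        # `self._hash = hash_val` would raise dataclasses.FrozenInstanceError
        # because the dataclass is frozen; object.__setattr__ bypasses that.
        object.__setattr__(self, "_hash", hash_val)

    def __hash__(self):
        # Reuse the value cached in __post_init__ instead of recomputing it.
        return self._hash


a = Info("replica#1", actor_handle=object())
b = Info("replica#1", actor_handle=object())
assert a == b and hash(a) == hash(b)  # handle identity no longer matters
```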

python/ray/serve/tests/test_common.py (outdated)
python/ray/serve/_private/common.py (outdated)
@sihanwang41 sihanwang41 force-pushed the autoscaling_fix_with_RunningReplicaInfo branch from 78a872c to d9902cf on February 23, 2023 22:57
Signed-off-by: Sihan Wang <[email protected]>
Contributor

@zcin zcin left a comment

Nice work! Learned some new things 😀

python/ray/serve/_private/common.py (outdated)
Co-authored-by: Cindy Zhang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
@architkulkarni (Contributor)

Linkcheck failure unrelated

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 24, 2023
@architkulkarni architkulkarni merged commit c23c0db into ray-project:master Feb 24, 2023
sihanwang41 added a commit to sihanwang41/ray that referenced this pull request Mar 7, 2023
sihanwang41 added a commit that referenced this pull request Mar 7, 2023
* [Serve] Add hash function of RunningReplicaInfo (#32772)


* [Serve] Fix the max_concurrent_queries issue (#33022)

For a `hashable` object, __eq__ and __hash__ both need to be provided for correctness: https://docs.python.org/3.9/glossary.html#term-hashable

Also adds tests to make sure the long poll timeout issue won't happen again.

---------

Signed-off-by: Sihan Wang <[email protected]>
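
A small, generic demonstration of the hashable contract referenced in the commit message above (plain Python, unrelated to Ray's actual classes): instances that compare equal must also hash equal, and defining __eq__ without __hash__ makes a class unhashable.

```python
# Defining __eq__ alone sets __hash__ to None, so instances cannot go in sets
# or be used as dict keys; pairing it with a consistent __hash__ fixes that.
class OnlyEq:
    def __init__(self, tag: str):
        self.tag = tag

    def __eq__(self, other):
        return isinstance(other, OnlyEq) and self.tag == other.tag


class EqAndHash(OnlyEq):
    def __hash__(self):
        # Must agree with __eq__: equal tags produce equal hashes.
        return hash(self.tag)


try:
    {OnlyEq("replica#1")}
except TypeError as exc:
    print(exc)  # unhashable type: 'OnlyEq'

# With both methods defined consistently, re-created objects dedupe correctly.
assert len({EqAndHash("replica#1"), EqAndHash("replica#1")}) == 1
```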
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[Serve] max_concurrent_queries=1 is ignored when autoscaling
5 participants