Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve][Cherry-Pick] Fix the max_concurrent_queries issue #32974

Conversation

sihanwang41
Copy link
Contributor

@sihanwang41 sihanwang41 commented Mar 2, 2023

cherry pick: #32772
cherry pick: #33022

Related issue number
Closes #32652

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sihanwang41 sihanwang41 marked this pull request as ready for review March 2, 2023 17:24
@zhe-thoughts
Copy link
Collaborator

@edoakes Could you help review this one as well since you didn't review the original PR #32772

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sihanwang41 one nit and another high level comment: we don't seem to have a test for the actual behavior reported in the github issue, can we add that? The unit test for the hash value here is useful but not comprehensive.

Please address this and my nit below on master, otherwise the change LGTM.

python/ray/serve/_private/common.py Show resolved Hide resolved
@sihanwang41
Copy link
Contributor Author

@sihanwang41 one nit and another high level comment: we don't seem to have a test for the actual behavior reported in the github issue, can we add that? The unit test for the hash value here is useful but not comprehensive.

Please address this and my nit below on master, otherwise the change LGTM.

I am going to add a test to cover it on the master.

@sihanwang41
Copy link
Contributor Author

sihanwang41 commented Mar 3, 2023

Sorry
please don't merge this pr now
The issue is not fixed by this pr.

@sihanwang41
Copy link
Contributor Author

Okay, the cherry pick is completed. Ready to be reviewed and merged

When the longpoll client timeout happens, the all internal objects will be cleaned up. The longpoll client will try to poll again. When this happens, longpoll client will receive another object which having same information but different object id (ActorHandle) from the controller.
Then after compute_iterable_delta is called, it will replace the same replica in in_flight_queries, and cause router clean up ongoing request_ref, and keep assigning new request to the replica, which will break the max_concurrent_queries parameter.

Related issue number
Closes ray-project#32652

Signed-off-by: Sihan Wang <[email protected]>
For the `hashable` object, __eq__ and __hash__ both need to be provided for correctness.  https://docs.python.org/3.9/glossary.html#term-hashable

And add tests to make sure the long poll timeout issue won't happen.
@sihanwang41 sihanwang41 changed the title [Serve] Add hash function of RunningReplicaInfo (#32772) [Serve] Fix the max_concurrent_queries issue (#32772) Mar 7, 2023
@sihanwang41 sihanwang41 changed the title [Serve] Fix the max_concurrent_queries issue (#32772) [Serve][Cherry-Pick] Fix the max_concurrent_queries issue Mar 7, 2023
@sihanwang41 sihanwang41 merged commit eedc667 into ray-project:releases/2.3.1 Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants