[Serve] Optimize DeploymentStateManager.get_deployment_statuses #45872

@JoshKarpel JoshKarpel commented Jun 11, 2024

Why are these changes needed?

We've been looking at various scaling issues in the Serve Controller and found that there was a large amount of time spent in DeploymentStateManager.get_deployment_statuses(). This PR optimizes that method.

Before, with ~1800 apps, each with 1 deployment, the Serve Controller was consistently sitting at >100% CPU usage, and the total time spent in DeploymentStateManager.get_deployment_statuses was ~7% in run_control_loop() and ~13% in get_serve_instance_details():

[Flamegraph image: controller-high-load-no-handles]

After, with the same number of apps/deployments, the Serve Controller often has <100% CPU usage, and DeploymentStateManager.get_deployment_statuses is now less than 1% of total time:

[Flamegraph image: controller-high-load-no-handles-after-fix]

(Note that we're testing with zero DeploymentHandles in the system, so if you're tracking our other threads like #45777, there are relatively fewer RPCs being served in these flamegraphs.)

Looking at the flamegraph, I suspect there are more improvements to be made in this area, but I'm going to work in small batches to keep the diffs focused.

Related issue number

Closes #45792

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests (existing)
    • Release tests
    • This PR is not tested :(

@JoshKarpel JoshKarpel changed the title from "Optimize DeploymentStateManager.get_deployment_statuses" to "[Serve] Optimize DeploymentStateManager.get_deployment_statuses" Jun 11, 2024
python/ray/serve/_private/deployment_state.py
Comment on lines 2574 to 2575
states = (self._deployment_states.get(id) for id in ids)
return [state.curr_status_info for state in states if state]
Contributor Author

Loop over the ids instead of over the deployments. This actually goes from O(number of deployments * number of ids) down to O(number of ids), because the existing code was checking id in ids for each existing deployment, which is O(number of ids) since ids is a list.
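The complexity argument above can be illustrated with a minimal sketch. The "scan all" function is reconstructed from the description in this comment, not copied from the repository, and `FakeState`, `statuses_scan_all`, and `statuses_by_id` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeState:
    # Hypothetical stand-in for a deployment state object.
    curr_status_info: str


def statuses_scan_all(states: Dict[str, FakeState], ids: List[str]) -> List[str]:
    # Assumed shape of the old code: visit every deployment and test `id_ in ids`.
    # `ids` is a list, so each membership test is a linear scan over `ids`,
    # giving O(number of deployments * number of ids) overall.
    return [s.curr_status_info for id_, s in states.items() if id_ in ids]


def statuses_by_id(states: Dict[str, FakeState], ids: List[str]) -> List[str]:
    # Shape of the new code: one dict lookup per requested id, O(number of ids).
    found = (states.get(id_) for id_ in ids)
    return [s.curr_status_info for s in found if s]


states = {f"app{i}": FakeState(f"status{i}") for i in range(5)}
ids = ["app3", "missing", "app1"]

old = statuses_scan_all(states, ids)
new = statuses_by_id(states, ids)
```

In this sketch both functions return the same statuses and both skip the unknown id. The only difference is ordering: the scan follows the dict's insertion order, while the per-id lookup follows the order of `ids`.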

Is this changing the behavior?

Oh, nvm, it's the same.

@JoshKarpel JoshKarpel marked this pull request as ready for review June 11, 2024 20:18
@zcin zcin left a comment

Overall LGTM, just one small suggestion.

Comment on lines 2574 to 2575
states = (self._deployment_states.get(id) for id in ids)
return [state.curr_status_info for state in states if state]
@zcin zcin Jun 18, 2024

Is there any reason not to change it to something like the following so we don't have to do 2 for loops? It would also improve readability I think.

Suggested change
- states = (self._deployment_states.get(id) for id in ids)
- return [state.curr_status_info for state in states if state]
+ return [self._deployment_states[id].curr_status_info for id in ids if id in self._deployment_states]

In fact, the following might be even better, since it is easier to read:

Suggested change
- states = (self._deployment_states.get(id) for id in ids)
- return [state.curr_status_info for state in states if state]
+ statuses = []
+ for id in ids:
+     if id in self._deployment_states:
+         statuses.append(self._deployment_states[id].curr_status_info)
+ return statuses

Contributor Author

if id in self._deployment_states means we'd be doing a double-lookup/look-before-you-leap. In Python it's typically faster to just try it and handle the failure (especially if failures are rare), which is what the .get() pattern does.
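To make the double-lookup point concrete, here is a small sketch that counts dictionary accesses for the two patterns. `CountingDict` is a hypothetical helper written for this illustration only. It counts calls rather than measuring time, so it shows the number of lookups, not the actual speed difference.

```python
class CountingDict(dict):
    """dict subclass that counts accesses made via `in`, `[]`, and `.get()`."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def __contains__(self, key):
        self.lookups += 1
        return super().__contains__(key)

    def __getitem__(self, key):
        self.lookups += 1
        return super().__getitem__(key)

    def get(self, key, default=None):
        self.lookups += 1
        return super().get(key, default)


ids = ["a", "b", "missing"]

# Look-before-you-leap: a membership test per id, then an indexing
# operation for every id that is present.
lbyl = CountingDict(a=1, b=2)
out_lbyl = [lbyl[i] for i in ids if i in lbyl]

# Single lookup: .get() returns None for a missing key.
single = CountingDict(a=1, b=2)
out_get = []
for i in ids:
    value = single.get(i)
    if value is not None:
        out_get.append(value)
```

With three requested ids, two of which exist, the look-before-you-leap version performs five accesses (three membership tests plus two indexing operations), while the `.get()` version performs three.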

@zcin zcin Jun 18, 2024

Got it, I'm not very familiar with the best way to hyper-optimize Python. If the lookup time is your concern, how about:

            statuses = []
            for id in ids:
                state = self._deployment_states.get(id)
                if state is not None:
                    statuses.append(state.curr_status_info)
            return statuses

Contributor Author

Sure!

@shrekris-anyscale shrekris-anyscale added the go add ONLY when ready to merge, run all tests label Jun 20, 2024
@shrekris-anyscale shrekris-anyscale merged commit 223233c into ray-project:master Jun 20, 2024
7 checks passed
@JoshKarpel JoshKarpel deleted the issue-45792-optimize-get-deployment-details branch June 21, 2024 00:31
Successfully merging this pull request may close these issues.

[Serve] Optimize the get_deployment_statuses function