[Serve] Optimize `DeploymentStateManager.get_deployment_statuses` #45872

JoshKarpel · 2024-06-11T18:48:47Z

Why are these changes needed?

We've been looking at various scaling issues in the Serve Controller and found that there was a large amount of time spent in DeploymentStateManager.get_deployment_statuses(). This PR optimizes that method.

Before: with ~1800 apps, each with 1 deployment, the Serve Controller consistently sits at >100% CPU usage. Total time spent in DeploymentStateManager.get_deployment_statuses is ~7% in run_control_loop() and ~13% in get_serve_instance_details() (flamegraph not shown).

After: with the same number of apps/deployments, the Serve Controller is often at <100% CPU usage, and DeploymentStateManager.get_deployment_statuses is now less than 1% of total time (flamegraph not shown).

(Note that we're testing with zero DeploymentHandles in the system, so if you're tracking our other threads like #45777, there are relatively fewer RPCs being served in these flamegraphs.)

Looking at the flamegraph I suspect there are some more improvements to be made in this area but I'm going to try to work in small batches to keep the diffs focused.

Related issue number

Closes #45792

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests (existing)
  - Release tests
  - This PR is not tested :(

Signed-off-by: Josh Karpel <[email protected]>

python/ray/serve/_private/deployment_state.py

JoshKarpel · 2024-06-11T20:09:07Z

+            states = (self._deployment_states.get(id) for id in ids)
+            return [state.curr_status_info for state in states if state]


Loop over the ids instead of over the deployments. This actually goes from O(number of deployments * number of ids) down to O(number of ids), because the existing code was checking id in ids for each existing deployment, which is O(number of ids) since ids is a list.

hamdi-jenzri · 2024-06-11

Is this changing the behavior?

hamdi-jenzri · 2024-06-11

Oh, nvm, it's the same.
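
To make the complexity argument and the "same behavior" conclusion in this thread concrete, here is a minimal, self-contained sketch. `FakeState`, `FakeManager`, and both method names are hypothetical stand-ins, not Ray code. In particular, the "old" loop is only reconstructed from the description above (iterate over every deployment and test `id in ids` against a list); it is not copied from the original source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeState:
    """Hypothetical stand-in for a deployment state object."""

    curr_status_info: str


class FakeManager:
    """Hypothetical stand-in for DeploymentStateManager (not Ray code)."""

    def __init__(self, states: Dict[str, FakeState]) -> None:
        self._deployment_states = states

    def get_statuses_old(self, ids: List[str]) -> List[str]:
        # Assumed shape of the previous code: scan every deployment and test
        # membership in a list, which costs O(len(ids)) per deployment.
        return [
            state.curr_status_info
            for id, state in self._deployment_states.items()
            if id in ids
        ]

    def get_statuses_new(self, ids: List[str]) -> List[str]:
        # Shape of the code in this PR: one average-O(1) dict lookup per id.
        states = (self._deployment_states.get(id) for id in ids)
        return [state.curr_status_info for state in states if state]


# ~1800 single-deployment apps, mirroring the scale quoted in the description.
manager = FakeManager({f"app{i}": FakeState(f"status{i}") for i in range(1800)})
ids = ["app3", "app42", "missing", "app1799"]

old = manager.get_statuses_old(ids)
new = manager.get_statuses_new(ids)
print(old == new)
```

The two versions agree here because `ids` has no duplicates and happens to follow the dict's insertion order. In general the new version returns statuses in the order of `ids`, whereas the assumed old version returns them in dict order, so callers that depend on ordering would want to check that separately.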

zcin

Overall LGTM, just one small suggestion.

zcin · 2024-06-18T23:32:25Z

python/ray/serve/_private/deployment_state.py

+            states = (self._deployment_states.get(id) for id in ids)
+            return [state.curr_status_info for state in states if state]


Is there any reason not to change it to something like the following, so we don't have to do 2 for loops? I think it would also improve readability.

Suggested change

-            states = (self._deployment_states.get(id) for id in ids)
-            return [state.curr_status_info for state in states if state]
+            return [self._deployment_states[id].curr_status_info for id in ids if id in self._deployment_states]

In fact, the following might be even better, since it is easier to read:

Suggested change

-            states = (self._deployment_states.get(id) for id in ids)
-            return [state.curr_status_info for state in states if state]
+            statuses = []
+            for id in ids:
+                if id in self._deployment_states:
+                    statuses.append(self._deployment_states[id].curr_status_info)
+            return statuses

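JoshKarpel · 2024-06-18

if id in self._deployment_states means we'd be doing a double lookup (look-before-you-leap): one for the membership test and another for the indexing. In Python it's typically faster to just try it and handle the failure (especially if failures are rare), which is what the .get() pattern does: a single lookup that returns None on a miss.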

zcin · 2024-06-18

Got it, I'm not very familiar with the best way to hyper-optimize Python. If the lookup time is your concern, how about:

    statuses = []
    for id in ids:
        state = self._deployment_states.get(id)
        if state is not None:
            statuses.append(state.curr_status_info)
    return statuses
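
The double-lookup point in this thread can be illustrated by counting dictionary accesses instead of timing them. The sketch below is an illustration only: `CountingDict`, `statuses_lbyl`, and `statuses_get` are hypothetical names, and plain strings stand in for the `curr_status_info` values.

```python
from typing import Any, List


class CountingDict(dict):
    """Dict subclass that counts key lookups (illustration only)."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def __contains__(self, key: object) -> bool:
        self.lookups += 1
        return super().__contains__(key)

    def __getitem__(self, key: Any) -> Any:
        self.lookups += 1
        return super().__getitem__(key)

    def get(self, key: Any, default: Any = None) -> Any:
        self.lookups += 1
        return super().get(key, default)


def statuses_lbyl(states: CountingDict, ids: List[str]) -> List[str]:
    # Look before you leap: a membership test, then a second lookup on a hit.
    statuses = []
    for id in ids:
        if id in states:
            statuses.append(states[id])
    return statuses


def statuses_get(states: CountingDict, ids: List[str]) -> List[str]:
    # One lookup per id; a missing key comes back as None and is skipped.
    statuses = []
    for id in ids:
        state = states.get(id)
        if state is not None:
            statuses.append(state)
    return statuses


states = CountingDict({"a": "RUNNING", "b": "UPDATING"})
ids = ["a", "b", "c"]  # "c" is deliberately missing

lbyl_result = statuses_lbyl(states, ids)
lbyl_lookups = states.lookups

states.lookups = 0
get_result = statuses_get(states, ids)
get_lookups = states.lookups

print(lbyl_lookups, get_lookups)
```

Both functions return the same list; the first performs two lookups for every id that is present, the second performs one per id. Whether that difference is measurable in practice depends on the workload. The explicit `is not None` check also means a state object that happened to be falsy would still be included, unlike a bare `if state` test.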

JoshKarpel added 2 commits June 11, 2024 11:45

optimize DeploymentStateManager.get_deployment_statuses

4ec9aba

Signed-off-by: Josh Karpel <[email protected]>

call only once

2c28b5b

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel changed the title ~~Optimize DeploymentStateManager.get_deployment_statuses~~ [Serve] Optimize DeploymentStateManager.get_deployment_statuses Jun 11, 2024

JoshKarpel commented Jun 11, 2024

Merge branch 'master' into issue-45792-optimize-get-deployment-details

9cdeef8

JoshKarpel marked this pull request as ready for review June 11, 2024 20:18

hamdi-jenzri approved these changes Jun 11, 2024

JoshKarpel mentioned this pull request Jun 11, 2024

[Serve] Optimize ServeController.get_app_config() #45878

Merged

8 tasks

Merge branch 'master' into issue-45792-optimize-get-deployment-details

c59d76e

JoshKarpel mentioned this pull request Jun 12, 2024

[Serve] Avoid looping over all snapshot ids for each long poll request #45881

Merged

8 tasks

zcin approved these changes Jun 18, 2024

JoshKarpel added 5 commits June 18, 2024 18:48

comments/docstring

cf9bd6d

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'refs/heads/master' into issue-45792-optimize-get-deployment-details

6d08ea2

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'refs/heads/master' into issue-45792-optimize-get-deployment-details

d2189d0

Signed-off-by: Josh Karpel <[email protected]>

just one loop

8a16047

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45792-optimize-get-deployment-details

cd77c44

zcin approved these changes Jun 20, 2024

shrekris-anyscale approved these changes Jun 20, 2024

shrekris-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Jun 20, 2024

shrekris-anyscale merged commit 223233c into ray-project:master Jun 20, 2024
7 checks passed

JoshKarpel deleted the issue-45792-optimize-get-deployment-details branch June 21, 2024 00:31
