[Serve] Optimize DeploymentStateManager.get_deployment_statuses
#45872
Signed-off-by: Josh Karpel <[email protected]>
states = (self._deployment_states.get(id) for id in ids)
return [state.curr_status_info for state in states if state]
Loop over the `ids` instead of over the deployments. This actually goes from O(number of deployments * number of ids) down to O(number of ids), because the existing code was checking `id in ids` for each existing deployment, which is O(number of ids) since `ids` is a list.
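To make the complexity argument concrete, here is a minimal sketch of the two shapes. The old code is not shown in this thread, so `statuses_scan_deployments` is only an assumed reconstruction from the description above, and `FakeState` is a hypothetical stand-in for the real deployment state class.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeState:
    # Hypothetical stand-in for a deployment state object.
    curr_status_info: str


def statuses_scan_deployments(states: Dict[str, FakeState], ids: List[str]) -> List[str]:
    # Assumed shape of the old code: walk every deployment and test
    # membership in the list `ids`, which is itself a linear scan.
    # Cost: O(number of deployments * number of ids).
    return [s.curr_status_info for id_, s in states.items() if id_ in ids]


def statuses_scan_ids(states: Dict[str, FakeState], ids: List[str]) -> List[str]:
    # Shape described in this PR: walk `ids` once and do one dict lookup
    # per id (average O(1) each). Cost: O(number of ids).
    found = (states.get(id_) for id_ in ids)
    return [s.curr_status_info for s in found if s]


states = {f"d{i}": FakeState(f"status-{i}") for i in range(5)}
ids = ["d1", "d3", "missing"]  # "missing" has no deployment and is skipped
old_result = statuses_scan_deployments(states, ids)
new_result = statuses_scan_ids(states, ids)
```

In this sketch both functions skip ids that have no matching deployment, and they return the same list when `ids` is given in the same order as the dict's keys.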
Is this changing the behavior?
Oh, nvm, it's the same.
Overall LGTM, just one small suggestion.
states = (self._deployment_states.get(id) for id in ids)
return [state.curr_status_info for state in states if state]
Is there any reason not to change it to something like the following so we don't have to do 2 for loops? It would also improve readability I think.
- states = (self._deployment_states.get(id) for id in ids)
- return [state.curr_status_info for state in states if state]
+ return [self._deployment_states[id].curr_status_info for id in ids if id in self._deployment_states]
In fact, the following might be even better, since it is easier to read:
- states = (self._deployment_states.get(id) for id in ids)
- return [state.curr_status_info for state in states if state]
+ statuses = []
+ for id in ids:
+     if id in self._deployment_states:
+         statuses.append(self._deployment_states[id].curr_status_info)
+ return statuses
`if id in self._deployment_states` means we'd be doing a double-lookup/look-before-you-leap. In Python it's typically faster to just try it and handle the failure (especially if failures are rare), which is what the `.get()` pattern does.
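The double-lookup point can be illustrated by counting dictionary accesses rather than timing them. This is only a sketch: `CountingDict` is a hypothetical helper written for this example, not part of Ray, and it counts calls to the three lookup methods used by the two patterns.

```python
class CountingDict(dict):
    """Hypothetical dict subclass that counts lookup calls (illustration only)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def __contains__(self, key):
        self.lookups += 1
        return super().__contains__(key)

    def __getitem__(self, key):
        self.lookups += 1
        return super().__getitem__(key)

    def get(self, key, default=None):
        self.lookups += 1
        return super().get(key, default)


def collect_check_then_index(d, ids):
    # Look-before-you-leap: a membership test, then a second lookup on a hit.
    return [d[i] for i in ids if i in d]


def collect_with_get(d, ids):
    # One `.get()` call per id; missing ids come back as None and are skipped.
    out = []
    for i in ids:
        value = d.get(i)
        if value is not None:
            out.append(value)
    return out


ids = ["a", "b", "zzz"]  # "zzz" is absent from the dict

d1 = CountingDict(a=1, b=2)
result_check = collect_check_then_index(d1, ids)

d2 = CountingDict(a=1, b=2)
result_get = collect_with_get(d2, ids)
```

With two ids present and one missing, the check-then-index version performs five lookups (two per hit, one per miss) while the `.get()` version performs three. Whether that difference matters in practice depends on the workload.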
Got it, I'm not very familiar with the best way to hyper-optimize Python. If the lookup time is your concern, how about:
statuses = []
for id in ids:
    state = self._deployment_states.get(id)
    if state is not None:
        statuses.append(state.curr_status_info)
return statuses
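For anyone who wants to run the loop above outside of Ray, here is a small self-contained sketch. `StubState` and `StubManager` are hypothetical stand-ins, the method signature is an assumption, and the loop variable is renamed to `id_` only to avoid shadowing the built-in `id`.

```python
from typing import Dict, List


class StubState:
    # Hypothetical stand-in for a deployment state; only the one
    # attribute used in this thread is modeled.
    def __init__(self, curr_status_info: str) -> None:
        self.curr_status_info = curr_status_info


class StubManager:
    # Hypothetical stand-in for the manager; not the real Ray class.
    def __init__(self, states: Dict[str, StubState]) -> None:
        self._deployment_states = states

    def get_deployment_statuses(self, ids: List[str]) -> List[str]:
        statuses = []
        for id_ in ids:
            state = self._deployment_states.get(id_)
            if state is not None:
                statuses.append(state.curr_status_info)
        return statuses


manager = StubManager({"app1": StubState("HEALTHY"), "app2": StubState("UPDATING")})
result = manager.get_deployment_statuses(["app2", "unknown", "app1"])
```

In this sketch the result follows the order of `ids`, and ids without a matching state are skipped.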
Sure!
Signed-off-by: Josh Karpel <[email protected]>
…ment-details Signed-off-by: Josh Karpel <[email protected]>
…ment-details Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Why are these changes needed?
We've been looking at various scaling issues in the Serve Controller and found that a large amount of time was spent in `DeploymentStateManager.get_deployment_statuses()`. This PR optimizes that method.

Before, with ~1800 apps each with 1 deployment, the Serve Controller was consistently sitting at >100% CPU usage, and total time spent in `DeploymentStateManager.get_deployment_statuses` was ~7% in `run_control_loop()` and ~13% in `get_serve_instance_details()`.

After, with the same number of apps/deployments, the Serve Controller often has <100% CPU usage, and `DeploymentStateManager.get_deployment_statuses` is now less than 1% of total time.

(Note that we're testing with zero `DeploymentHandle`s in the system, so if you're tracking our other threads like #45777, note that there are relatively fewer RPCs being served in these flamegraphs.)

Looking at the flamegraph, I suspect there are more improvements to be made in this area, but I'm going to try to work in small batches to keep the diffs focused.
Related issue number
Closes #45792
Checks
… `git commit -s`) in this PR.
… `scripts/format.sh` to lint the changes in this PR.
… method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.