[Serve] Amortize handle metrics pushing by grouping metrics by process #45777
Labels
enhancement
Request for new feature and/or capability
P1
Issue that should be fixed within a few weeks
serve
Ray Serve Related Issue
Description
This needs some prototyping, which I'll be trying soon!
Similar to #45776, we're seeing a lot of pressure on the Serve controller from metrics push tasks. Presumably some of this pressure is purely from the overhead of lots of RPC connections incoming to the controller. We might be able to amortize this overhead (and presumably similar overhead in the handles too) by having the metrics push happen at the per-process level instead of the per-handle level.
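To make the idea concrete, here is a minimal sketch of per-process aggregation. It is not Ray Serve's actual implementation: `ProcessMetricsPusher`, `record`, `push_once`, and the `push_fn` callback (a stand-in for a single RPC to the controller) are all hypothetical names, and the single "queued requests" value per handle is an assumed metric shape.

```python
import threading
from typing import Callable, Dict

# All names here are hypothetical; this is not Ray Serve's API.
PushFn = Callable[[Dict[str, float]], None]


class ProcessMetricsPusher:
    """Collects metrics from every handle in a process and pushes them together."""

    def __init__(self, push_fn: PushFn) -> None:
        # push_fn stands in for one RPC to the controller.
        self._push_fn = push_fn
        self._lock = threading.Lock()
        self._latest: Dict[str, float] = {}

    def record(self, handle_id: str, queued_requests: float) -> None:
        # Handles only update local state; no RPC is issued here.
        with self._lock:
            self._latest[handle_id] = queued_requests

    def unregister(self, handle_id: str) -> None:
        # Drop a handle's entry when the handle is shut down.
        with self._lock:
            self._latest.pop(handle_id, None)

    def push_once(self) -> int:
        # Snapshot under the lock, then send a single batched push.
        with self._lock:
            batch = dict(self._latest)
        if batch:
            self._push_fn(batch)
        return len(batch)


# Usage: 1000 handles in one process result in one push instead of 1000.
received = []
pusher = ProcessMetricsPusher(received.append)
for i in range(1000):
    pusher.record(f"handle-{i}", float(i % 3))
pusher.push_once()
```

In a real prototype, `push_once` would presumably be driven by a single periodic task per process, so the controller would see one incoming call per process per interval rather than one per handle.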
Use case
Our system is running a very large number of `DeploymentHandle`s (see #44784 for more details). We've noticed that the Serve controller gets overloaded (>100% CPU usage) trying to accept all of the metrics pushes, which leads to an ever-increasing number of increasingly stale `record_handle_metrics` tasks sitting idle on the controller, which then eventually runs out of memory and crashes.