[Serve] Amortize handle metrics pushing by grouping metrics by process #45777

JoshKarpel · 2024-06-06T18:58:56Z

Description

This needs some prototyping, which I'll be trying soon!

Similar to #45776, we're seeing a lot of pressure on the Serve controller from metrics push tasks. Presumably some of this pressure is purely from the overhead of lots of RPC connections incoming to the controller. We might be able to amortize this overhead (and presumably similar overhead in the handles too) by having the metrics push happen at the per-process level instead of the per-handle level.
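To make the idea concrete, here is a minimal sketch of per-process grouping. It is not how Ray Serve implements metrics pushing: `ProcessMetricsPusher`, `record`, `flush`, and the `push_batch` callback are hypothetical names, and `push_batch` stands in for whatever single RPC would carry a whole process's metrics to the controller. The sketch only illustrates that N handles in one process can share one push per interval instead of making N separate pushes.

```python
import threading
from typing import Callable, Dict

# Hypothetical names throughout; none of these are Ray Serve APIs.
HandleMetrics = Dict[str, float]


class ProcessMetricsPusher:
    """Collects the latest metrics from every handle in this process and
    sends them to the controller in a single batched call."""

    def __init__(self, push_batch: Callable[[Dict[str, HandleMetrics]], None]):
        # Assumption: push_batch performs one RPC to the controller.
        self._push_batch = push_batch
        self._lock = threading.Lock()
        self._latest: Dict[str, HandleMetrics] = {}

    def record(self, handle_id: str, metrics: HandleMetrics) -> None:
        # Keep only the newest snapshot per handle, so stale values
        # are overwritten locally instead of queueing on the controller.
        with self._lock:
            self._latest[handle_id] = metrics

    def flush(self) -> int:
        # Swap the buffer out under the lock, then push outside of it.
        with self._lock:
            batch, self._latest = self._latest, {}
        if batch:
            self._push_batch(batch)
        return len(batch)


# Usage: 1000 handles in one process produce a single push.
received = []
pusher = ProcessMetricsPusher(push_batch=received.append)
for i in range(1000):
    pusher.record(f"handle-{i}", {"queued_requests": float(i)})
flushed = pusher.flush()
```

In a real process, `flush` would presumably be driven by a periodic timer or background task. Whether this actually reduces controller load is exactly what the prototyping mentioned above would need to confirm.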

Use case

Our system is running a very large number of DeploymentHandles (see #44784 for more details). We've noticed that the Serve controller gets overloaded (>100% CPU usage) trying to accept all of the metrics pushes. This leads to an ever-growing backlog of increasingly stale record_handle_metrics tasks sitting idle on the controller, which eventually runs out of memory and crashes.


JoshKarpel added the enhancement (Request for new feature and/or capability) and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels Jun 6, 2024

anyscalesam added the serve (Ray Serve Related Issue) label Jun 12, 2024

JoshKarpel linked a pull request Jun 27, 2024 that will close this issue

[Serve] Group DeploymentHandle autoscaling metrics pushes by process #45957

Draft

8 tasks

zcin assigned JoshKarpel and zcin Jun 27, 2024

zcin added the P1 (Issue that should be fixed within a few weeks) label and removed the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) label Jun 27, 2024
