[Serve] performance bottlenecked by the ProxyActor #42565

Open
ShenJiahuan opened this issue Jan 22, 2024 · 11 comments
Assignees
Labels
enhancement Request for new feature and/or capability P2 Important issue, but not time-critical serve Ray Serve Related Issue

Comments

@ShenJiahuan

What happened + What you expected to happen

Issue Summary

We've identified a potential performance bottleneck within Ray Serve due to the limitation of having a single ProxyActor per node. This architecture may be limiting scalability and the maximum request-handling capacity.

Performance Test Details

To evaluate the performance implications, we conducted a test using the following setup:

  • Server Specification: 2 * 12 core Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz.
  • Test Code: see below.

The test results indicated that the system is currently capped at approximately 70 requests per second.

Source Code Review and Concerns

Upon reviewing the Ray source code, it has become evident that, by design, a single ProxyActor per node is responsible for handling all incoming requests, replica selection, and other associated tasks. This centralized approach is likely what prevents Ray Serve from scaling effectively with the available hardware resources.
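
For reference, the one-proxy-per-node layout can be confirmed on a running cluster via the Ray state API. A minimal sketch (the actor class name is "ProxyActor" in recent Ray versions and may differ in older ones):

# Sketch: list the Serve proxy actors in a running cluster; with the default
# configuration there should be exactly one per node.
import ray
from ray.util.state import list_actors

ray.init(address="auto")  # attach to the existing cluster

proxies = list_actors(filters=[("class_name", "=", "ProxyActor")])
for actor in proxies:
    print(actor.node_id, actor.state)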

Proposed Discussion Points

  • Is there a particular reason for the current design with a single ProxyActor?
  • Would it be feasible to consider a design that allows for multiple ProxyActors per node to distribute the load?
  • Has this issue been observed by others, and are there any known workarounds or plans for future updates to address this?

I believe addressing this bottleneck could significantly improve Ray Serve's performance and scalability. I look forward to the community's input and potential solutions.

Versions / Dependencies

Ray 2.9.1

Reproduction script

Test Code Snippet

To reproduce the performance issue, please use the following code, which sets up a Ray Serve deployment with 48 replicas:

import ray
from ray import serve
from starlette.requests import Request


# 48 trivial replicas; every HTTP request still flows through the single
# per-node ProxyActor, which is the suspected bottleneck.
@serve.deployment(
    num_replicas=48,
    route_prefix="/",
)
class TestDeployment:
    async def __call__(self, request: Request):
        return "hi"


# Application builder used by `serve run test:model`.
def model(args):
    return TestDeployment.bind()

Starting Ray Serve

Ray Serve is initiated with the following command:

serve run test:model

Load Testing

For stressing the service and measuring the throughput, we are using ApacheBench with the following command:

ab -n 1000 -c 100 http://127.0.0.1:8000/

Issue Severity

High: It blocks me from completing my task.

@ShenJiahuan ShenJiahuan added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 22, 2024
@anyscalesam anyscalesam added the serve Ray Serve Related Issue label Jan 22, 2024
@sihanwang41 sihanwang41 added enhancement Request for new feature and/or capability and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) bug Something that is supposed to be working; but isn't labels Jan 23, 2024
@edoakes
Contributor

edoakes commented Jan 23, 2024

@ShenJiahuan thanks for filing the issue! We're aware of the proxy as the bottleneck for each node. The reason for this design decision is that the proxy includes quite a bit of intelligence to handle dynamic service discovery/routing, features like model multiplexing, and edge cases like properly draining requests upon node removal (e.g., spot instance interruption).

We've done some work to ensure performance is within a reasonable bound, but we're aware that this may be a dealbreaker for some use cases. Likely our chosen path to improve this bottleneck would be to optimize the proxy performance (reduce overhead from Ray actor calls, consider implementing it in a faster language like C++) rather than rearchitect the system to remove it.

However, the 70 QPS number you've cited here is dramatically lower than we see in our microbenchmarks. I re-ran this benchmark on my laptop (2021 MacBook Pro with M1 Max) on the master branch and saw ~800 QPS:

(ray) ➜  ray git:(master) ✗ ab -n 10000 -c 100 http://127.0.0.1:8000/

This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        uvicorn
Server Hostname:        127.0.0.1
Server Port:            8000

Document Path:          /
Document Length:        12 bytes

Concurrency Level:      100
Time taken for tests:   12.200 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      2060000 bytes
HTML transferred:       120000 bytes
Requests per second:    819.64 [#/sec] (mean)
Time per request:       122.005 [ms] (mean)
Time per request:       1.220 [ms] (mean, across all concurrent requests)
Transfer rate:          164.89 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.6      0       4
Processing:    17  121  24.2    115     214
Waiting:       15  117  24.3    112     210
Total:         18  121  24.2    116     215
WARNING: The median and mean for the initial connection time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%    116
  66%    123
  75%    130
  80%    137
  90%    157
  95%    174
  98%    184
  99%    192
 100%    215 (longest request)

What instance type did you run the benchmark on? Are you able to re-run it on one that is publicly available such as an AWS instance type?

@ShenJiahuan
Author

@edoakes, thank you for the quick and detailed response!

Upon reevaluating our setup, I discovered a misconfiguration on our end that constrained the performance. 🥲 After addressing this, I am now observing a throughput of approximately 280 QPS on our corrected setup.

Furthermore, I conducted a test on an AWS t3.2xlarge instance (8-core Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz), which yielded around 175 QPS. While this is a marked improvement over the initial 70 QPS, it seems that we could still encounter scalability limitations under high-load scenarios.

@xjhust

xjhust commented Feb 1, 2024

@edoakes, thank you for the quick and detailed response!

Upon reevaluating our setup, I discovered a misconfiguration on our end that constrained the performance. 🥲 After addressing this, I am now observing a throughput of approximately 280 QPS on our corrected setup.

Furthermore, I conducted a test on an AWS t3.2xlarge instance (8-core Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz), which yielded around 175 QPS. While this is a marked improvement over the initial 70 QPS, it seems that we could still encounter scalability limitations under high-load scenarios.

I'm hitting this issue in my task too. I re-ran the script you provided and the throughput is approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks!

@FlyTOmeLight

@edoakes Hi! I would like to know what factors might affect the performance of the ProxyActor, such as the configuration of the server or the complexity of the application. If there are multiple nodes, is the improvement linear? We are hoping to use Ray Serve in a production environment, and given our current observations this performance issue is very important to us. Do you know when there might be significant progress on this?

@ShenJiahuan
Author

I'm hitting this issue in my task too. I re-ran the script you provided and the throughput is approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks!

@xjhust, a throughput of 220 QPS appears to be reasonable and does not necessarily indicate a misconfiguration, especially when compared with my subsequent experiments. The performance is closely tied to the single-core performance of the CPU; the 800 QPS achieved by @edoakes is likely due to the superior single-core performance of the M1 Max. In fact, our server's CPU was previously set to the 'powersave' governor, which is what reduced our throughput to approximately 70 QPS.
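
For anyone checking for the same thing, here is a quick way to inspect the active governor on Linux (a sketch; paths assume the standard cpufreq sysfs layout, and switching to 'performance' can be done with tools such as cpupower):

# Sketch: print the active CPU frequency governor for each core on Linux.
# A 'powersave' governor was what capped our setup at ~70 QPS.
from pathlib import Path

for gov in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor")):
    print(gov.parent.parent.name, gov.read_text().strip())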

@edoakes
Contributor

edoakes commented Feb 2, 2024

I'm hitting this issue in my task too. I re-ran the script you provided and the throughput is approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks!

@xjhust, a throughput of 220 QPS appears to be reasonable and does not necessarily indicate a misconfiguration, especially when compared with my subsequent experiments. The performance is closely tied to the single-core performance of the CPU; the 800 QPS achieved by @edoakes is likely due to the superior single-core performance of the M1 Max. In fact, our server's CPU was previously set to the 'powersave' governor, which is what reduced our throughput to approximately 70 QPS.

Yes, this makes sense given that the proxy is currently a Python process that is largely single-threaded.

@edoakes
Contributor

edoakes commented Feb 2, 2024

@edoakes Hi! I would like to know what factors might affect the performance of the ProxyActor, such as the configuration of the server or the complexity of the application. If there are multiple nodes, is the improvement linear? We are hoping to use Ray Serve in a production environment, and given our current observations this performance issue is very important to us. Do you know when there might be significant progress on this?

The overall throughput scales ~linearly as you add more nodes to the cluster, since each node has its own proxy (and they scale independently).

We are making incremental progress on this all the time (example), but I wouldn't expect orders-of-magnitude improvement in the near future (i.e., the next few releases). Most of our users' workloads involve fairly heavyweight ML inference requests, so supporting very high QPS per node is not often the primary goal. Instead, we are focusing our efforts on other areas such as improved autoscaling, stability, and observability.
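
For reference, callers that run inside the Ray cluster can sidestep the HTTP proxy entirely by using a DeploymentHandle. A minimal sketch, assuming Ray >= 2.8 and the TestDeployment application from the reproduction script above (application name "default" when started via serve run test:model):

# Sketch: call the Serve application through a DeploymentHandle from inside
# the cluster, bypassing the per-node HTTP proxy.
import ray
from ray import serve

ray.init(address="auto")                   # attach to the running cluster
handle = serve.get_app_handle("default")   # default app name for `serve run`

response = handle.remote(None)             # TestDeployment.__call__ ignores its argument
print(response.result())                   # -> "hi"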

@edoakes edoakes removed their assignment Mar 11, 2024
@decadance-dance

decadance-dance commented May 18, 2024

Any updates?
I went with Ray Serve, but my model infers in 0.5s while the __call__ function takes 1s, so the end-to-end time is 2x the model time.
I saw @edoakes mention that the team is focusing on other things, but I think this is also a very important aspect, because there aren't many alternatives for running models in production (BentoML is very slow, Triton is quite buggy).
So it would be great if developers like me could get performance improvements.

@edoakes
Contributor

edoakes commented May 20, 2024

Any updates? I went with Ray Serve, but my model infers in 0.5s while the __call__ function takes 1s, so the end-to-end time is 2x the model time. I saw @edoakes mention that the team is focusing on other things, but I think this is also a very important aspect, because there aren't many alternatives for running models in production (BentoML is very slow, Triton is quite buggy). So it would be great if developers like me could get performance improvements.

@decadance-dance given the limited information you have provided, I doubt that what you're seeing is the same issue described in this ticket. If you get in touch on the Ray Slack or file a separate issue and provide more details about your setup, we may be able to point out some improvements.
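
For anyone trying to narrow down where such extra latency comes from, a minimal sketch (the deployment and model below are hypothetical stand-ins) is to time the work inside __call__ and compare it with the client-observed latency; a large gap points at queuing/proxy overhead rather than the model itself:

# Sketch: time the inference inside the replica to separate model time from
# the overhead observed by the client. The "model" here is a 0.5s sleep.
import time
from ray import serve
from starlette.requests import Request

@serve.deployment
class TimedDeployment:
    async def __call__(self, request: Request):
        start = time.perf_counter()
        result = self._infer()                 # replace with the real model call
        print(f"in-replica inference time: {time.perf_counter() - start:.3f}s")
        return result

    def _infer(self):
        time.sleep(0.5)                        # stand-in for a ~0.5s model
        return "done"

app = TimedDeployment.bind()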

@Superskyyy
Contributor

Working on an investigation; will update here with our benchmark results.

@Superskyyy
Contributor

Superskyyy commented Jul 19, 2024

I tested on a 16-core Linux machine with an i9-11900K; it might be a bit overpowered, so I also moved the test to a typical cloud bare-metal machine at 2.6 GHz. (See results further down; both tests are based on Ray 2.32.0.)

[screenshot: local benchmark results]

11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
    CPU family:          6
    Model:               167
    Thread(s) per core:  2
    Core(s) per socket:  8

FastAPI with 1/48 workers:
[screenshot: FastAPI benchmark results]

Cloud results: cloud bare-metal machine with 128 cores, CPU @ 2.6 GHz:

I get ~180 QPS with 500-1000 ms latency, which is pretty low.

Serve:
[screenshot: Serve benchmark results]

FastAPI:
[screenshot: FastAPI benchmark results]
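
For completeness, here is a plausible version of the FastAPI baseline used for comparison (an assumed sketch, not the exact script behind the screenshots):

# Sketch of a FastAPI/uvicorn baseline comparable to the Serve repro: a trivial
# endpoint returning "hi", run with 1 or 48 worker processes.
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
async def root():
    return "hi"

if __name__ == "__main__":
    # The module name "fastapi_baseline" is assumed; adjust to your file name.
    uvicorn.run("fastapi_baseline:app", host="127.0.0.1", port=8000, workers=48)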

@anyscalesam anyscalesam added the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Jul 25, 2024
@GeneDer GeneDer self-assigned this Aug 21, 2024
@GeneDer GeneDer added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 21, 2024