[Serve] performance bottlenecked by the ProxyActor #42565

Open
ShenJiahuan opened this issue Jan 22, 2024 · 11 comments
Assignees
Labels
enhancement Request for new feature and/or capability P2 Important issue, but not time-critical serve Ray Serve Related Issue

Comments

@ShenJiahuan

What happened + What you expected to happen

Issue Summary

We've identified a potential performance bottleneck within Ray Serve due to the limitation of having a single ProxyActor per node. This architecture may be limiting scalability and the maximum request-handling capacity.

Performance Test Details

To evaluate the performance implications, we conducted a test using the following setup:

  • Server Specification: 2 * 12 core Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz.
  • Test Code: see below.

The test results indicated that the system is currently capped at approximately 70 requests per second.

Source Code Review and Concerns

Upon reviewing the Ray source code, it has become evident that, by design, a single ProxyActor per node is responsible for handling all incoming requests, replica selection, and other associated tasks. This centralized approach is likely what prevents Ray Serve from scaling effectively with the available hardware resources.
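
For reference, the one-proxy-per-node layout can be confirmed on a running cluster via the Ray state API. A minimal sketch (the actor class name is "ProxyActor" in recent Ray versions and may differ in older ones):

# Sketch: list the Serve proxy actors in a running cluster; with the default
# configuration there should be exactly one per node.
import ray
from ray.util.state import list_actors

ray.init(address="auto")  # attach to the existing cluster

proxies = list_actors(filters=[("class_name", "=", "ProxyActor")])
for actor in proxies:
    print(actor.node_id, actor.state)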

Proposed Discussion Points

  • Is there a particular reason for the current design with a single ProxyActor?
  • Would it be feasible to consider a design that allows for multiple ProxyActors per node to distribute the load?
  • Has this issue been observed by others, and are there any known workarounds or plans for future updates to address this?

I believe addressing this bottleneck could significantly improve Ray Serve's performance and scalability. I look forward to the community's input and potential solutions.

Versions / Dependencies

Ray 2.9.1

Reproduction script

Test Code Snippet

To reproduce the performance issue, please use the following code, which sets up a Ray Serve deployment with 48 replicas:

import ray
from ray import serve
from starlette.requests import Request


# 48 trivial replicas; every HTTP request still flows through the single
# per-node ProxyActor, which is the suspected bottleneck.
@serve.deployment(
    num_replicas=48,
    route_prefix="/",
)
class TestDeployment:
    async def __call__(self, request: Request):
        return "hi"


# Application builder used by `serve run test:model`.
def model(args):
    return TestDeployment.bind()

Starting Ray Serve

Ray Serve is initiated with the following command:

serve run test:model

Load Testing

For stressing the service and measuring the throughput, we are using ApacheBench with the following command:

ab -n 1000 -c 100 http://127.0.0.1:8000/

Issue Severity

High: It blocks me from completing my task.

@ShenJiahuan ShenJiahuan added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 22, 2024
@anyscalesam anyscalesam added the serve Ray Serve Related Issue label Jan 22, 2024
@sihanwang41 sihanwang41 added enhancement Request for new feature and/or capability and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) bug Something that is supposed to be working; but isn't labels Jan 23, 2024
@edoakes
Contributor

edoakes commented Jan 23, 2024

@ShenJiahuan thanks for filing the issue! We're aware of the proxy as the bottleneck for each node. The reason for this design decision is that the proxy includes quite a bit of intelligence to handle dynamic service discovery/routing, features like model multiplexing, and edge cases like properly draining requests upon node removal (e.g., spot instance interruption).

We've done some work to ensure performance is within a reasonable bound, but we're aware that this may be a dealbreaker for some use cases. Likely our chosen path to improve this bottleneck would be to optimize the proxy performance (reduce overhead from Ray actor calls, consider implementing it in a faster language like C++) rather than rearchitect the system to remove it.

However, the 70 QPS number you've cited here is dramatically lower than we see in our microbenchmarks. I re-ran this benchmark on my laptop (2021 MacBook Pro with M1 Max) on the master branch and saw ~800 QPS:

(ray) ➜  ray git:(master) ✗ ab -n 10000 -c 100 http://127.0.0.1:8000/

This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        uvicorn
Server Hostname:        127.0.0.1
Server Port:            8000

Document Path:          /
Document Length:        12 bytes

Concurrency Level:      100
Time taken for tests:   12.200 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      2060000 bytes
HTML transferred:       120000 bytes
Requests per second:    819.64 [#/sec] (mean)
Time per request:       122.005 [ms] (mean)
Time per request:       1.220 [ms] (mean, across all concurrent requests)
Transfer rate:          164.89 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.6      0       4
Processing:    17  121  24.2    115     214
Waiting:       15  117  24.3    112     210
Total:         18  121  24.2    116     215
WARNING: The median and mean for the initial connection time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%    116
  66%    123
  75%    130
  80%    137
  90%    157
  95%    174
  98%    184
  99%    192
 100%    215 (longest request)

What instance type did you run the benchmark on? Are you able to re-run it on one that is publicly available such as an AWS instance type?

@ShenJiahuan
Author

@edoakes, thank you for the quick and detailed response!

Upon reevaluating our setup, I discovered a misconfiguration on our end that constrained the performance. 🥲 After addressing this, I am now observing a throughput of approximately 280 QPS on our corrected setup.

Furthermore, I conducted a test on an AWS t3.2xlarge instance (8-core Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz), which yielded around 175 QPS. While this is a marked improvement over the initial 70 QPS, it seems that we could still encounter scalability limitations under high-load scenarios.

@xjhust

xjhust commented Feb 1, 2024

@edoakes, thank you for the quick and detailed response!

Upon reevaluating our setup, I discovered a misconfiguration on our end that constrained the performance. 🥲 After addressing this, I am now observing a throughput of approximately 280 QPS on our corrected setup.

Furthermore, I conducted a test on an AWS t3.2xlarge instance (8-core Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz), which yielded around 175 QPS. While this is a marked improvement over the initial 70 QPS, it seems that we could still encounter scalability limitations under high-load scenarios.

I'm hitting this issue in my task too. I re-ran the script you provided and the throughput is approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks!

@FlyTOmeLight

@edoakes Hi! I would like to know what factors might affect the performance of the ProxyActor, such as the configuration of the server or the complexity of the application. If there are multiple nodes, is the improvement linear? We are hoping to use Ray Serve in a production environment, and given our current observations this performance issue is very important to us. Do you know when there might be significant progress on this?

@ShenJiahuan
Author

I'm hitting this issue in my task too. I re-ran the script you provided and the throughput is approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks!

@xjhust, a throughput of 220 QPS appears to be reasonable and does not necessarily indicate a misconfiguration, especially when compared with my subsequent experiments. The performance is closely tied to the single-core performance of the CPU; the 800 QPS achieved by @edoakes is likely due to the superior single-core performance of the M1 Max. In fact, our server's CPU was previously set to the 'powersave' governor, which is what reduced our throughput to approximately 70 QPS.
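
For anyone checking for the same thing, here is a quick way to inspect the active governor on Linux (a sketch; paths assume the standard cpufreq sysfs layout, and switching to 'performance' can be done with tools such as cpupower):

# Sketch: print the active CPU frequency governor for each core on Linux.
# A 'powersave' governor was what capped our setup at ~70 QPS.
from pathlib import Path

for gov in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor")):
    print(gov.parent.parent.name, gov.read_text().strip())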

@edoakes
Contributor

edoakes commented Feb 2, 2024

I'm hitting this issue in my task too. I re-ran the script you provided and the throughput is approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks!

@xjhust, a throughput of 220 QPS appears to be reasonable and does not necessarily indicate a misconfiguration, especially when compared with my subsequent experiments. The performance is closely tied to the single-core performance of the CPU; the 800 QPS achieved by @edoakes is likely due to the superior single-core performance of the M1 Max. In fact, our server's CPU was previously set to the 'powersave' governor, which is what reduced our throughput to approximately 70 QPS.

Yes, this makes sense given that the proxy is currently a Python process that is largely single-threaded.

@edoakes
Contributor

edoakes commented Feb 2, 2024

@edoakes Hi! I would like to know what factors might affect the performance of the ProxyActor, such as the configuration of the server or the complexity of the application. If there are multiple nodes, is the improvement linear? We are hoping to use Ray Serve in a production environment, and given our current observations this performance issue is very important to us. Do you know when there might be significant progress on this?

The overall throughput scales ~linearly as you add more nodes to the cluster, since each node has its own proxy (and they scale independently).

We are making incremental progress on this all the time (example), but I wouldn't expect orders-of-magnitude improvement in the near future (i.e., the next few releases). Most of our users' workloads involve fairly heavyweight ML inference requests, so supporting very high QPS per node is not often the primary goal. Instead, we are focusing our efforts on other areas such as improved autoscaling, stability, and observability.
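
For reference, callers that run inside the Ray cluster can sidestep the HTTP proxy entirely by using a DeploymentHandle. A minimal sketch, assuming Ray >= 2.8 and the TestDeployment application from the reproduction script above (application name "default" when started via serve run test:model):

# Sketch: call the Serve application through a DeploymentHandle from inside
# the cluster, bypassing the per-node HTTP proxy.
import ray
from ray import serve

ray.init(address="auto")                   # attach to the running cluster
handle = serve.get_app_handle("default")   # default app name for `serve run`

response = handle.remote(None)             # TestDeployment.__call__ ignores its argument
print(response.result())                   # -> "hi"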

@edoakes edoakes removed their assignment Mar 11, 2024
@decadance-dance

decadance-dance commented May 18, 2024

Any updates?
I went with Ray Serve, but my model infers in 0.5s while the __call__ function takes 1s, so the end-to-end time is 2x the model time.
I saw @edoakes mention that the team is focusing on other things, but I think this is also a very important aspect, because there aren't many alternatives for running models in production (BentoML is very slow, Triton is quite buggy).
So it would be great if developers like me could get performance improvements.

@edoakes
Contributor

edoakes commented May 20, 2024

Any updates? I went with Ray Serve, but my model infers in 0.5s while the __call__ function takes 1s, so the end-to-end time is 2x the model time. I saw @edoakes mention that the team is focusing on other things, but I think this is also a very important aspect, because there aren't many alternatives for running models in production (BentoML is very slow, Triton is quite buggy). So it would be great if developers like me could get performance improvements.

@decadance-dance given the limited information you have provided, I doubt that what you're seeing is the same issue described in this ticket. If you get in touch on the Ray Slack or file a separate issue and provide more details about your setup, we may be able to point out some improvements.
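
For anyone trying to narrow down where such extra latency comes from, a minimal sketch (the deployment and model below are hypothetical stand-ins) is to time the work inside __call__ and compare it with the client-observed latency; a large gap points at queuing/proxy overhead rather than the model itself:

# Sketch: time the inference inside the replica to separate model time from
# the overhead observed by the client. The "model" here is a 0.5s sleep.
import time
from ray import serve
from starlette.requests import Request

@serve.deployment
class TimedDeployment:
    async def __call__(self, request: Request):
        start = time.perf_counter()
        result = self._infer()                 # replace with the real model call
        print(f"in-replica inference time: {time.perf_counter() - start:.3f}s")
        return result

    def _infer(self):
        time.sleep(0.5)                        # stand-in for a ~0.5s model
        return "done"

app = TimedDeployment.bind()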

@Superskyyy
Contributor

Working on an investigation; will update here with our benchmark results.

@Superskyyy
Contributor

Superskyyy commented Jul 19, 2024

I tested on a 16-core Linux machine with an i9-11900K; it might be a bit overpowered, so I also moved the test to a typical cloud bare-metal machine at 2.6 GHz. (See results further down; both tests are based on Ray 2.32.0.)

[screenshot: local benchmark results]

11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
    CPU family:          6
    Model:               167
    Thread(s) per core:  2
    Core(s) per socket:  8

FastAPI with 1/48 workers:
[screenshot: FastAPI benchmark results]

Cloud results: cloud bare-metal machine with 128 cores, CPU @ 2.6 GHz:

I get ~180 QPS with 500-1000 ms latency, which is pretty low.

Serve:
[screenshot: Serve benchmark results]

FastAPI:
[screenshot: FastAPI benchmark results]
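
For completeness, here is a plausible version of the FastAPI baseline used for comparison (an assumed sketch, not the exact script behind the screenshots):

# Sketch of a FastAPI/uvicorn baseline comparable to the Serve repro: a trivial
# endpoint returning "hi", run with 1 or 48 worker processes.
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
async def root():
    return "hi"

if __name__ == "__main__":
    # The module name "fastapi_baseline" is assumed; adjust to your file name.
    uvicorn.run("fastapi_baseline:app", host="127.0.0.1", port=8000, workers=48)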

@anyscalesam anyscalesam added the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Jul 25, 2024
@GeneDer GeneDer self-assigned this Aug 21, 2024
@GeneDer GeneDer added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 21, 2024