
Router::nest has poor performance on 0.6 #1624

Closed
nicolaspernoud opened this issue Dec 7, 2022 · 35 comments · Fixed by #1711
Labels
A-axum, C-performance (Category: Performance related issues and PRs)

Comments

@nicolaspernoud

Bug Report (more of a question, actually)

I know this is a bit of a shot in the dark, but maybe someone can give some insight into this...

Version

├── axum v0.6.1
│   ├── axum-core v0.3.0
├── axum-extra v0.4.2
│   ├── axum v0.6.1 (*)
├── axum-server v0.4.2
│   ├── axum-server v0.4.2 (*)

Platform

Linux *** 5.19.0-26-generic #27-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov 23 20:44:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Crates

axum = { version="0.6.1", features = ["query", "json", "http2", "tokio", "headers"], default-features = false }
axum-extra = { version = "0.4.2", features = ["cookie-private"], default-features = false }

Description

I migrated my app from axum 0.5 to axum 0.6, as detailed here: nicolaspernoud/atrium@d002700#diff-28504f0a0ed99657d5f5bee11b72eb6ef2db4faf68b0f6e94bfe28c77e34e35a.
The old code manages approximately 24,500 req/sec in a reverse proxy use case.
The new code manages approximately 16,500 req/sec, which is a huge drop.
The new code also uses state instead of extensions, but I am not sure that is the problem; it seems to be related to axum itself: a temporary branch that kept extensions for configuration showed roughly the same performance.

Any idea or advice on how to improve that?

Test code

The test code is here: https://github.com/nicolaspernoud/go_vs_rust_reverse_proxy_benchmark/blob/main/test_atrium.sh.

By the way, in a super simple use case, as shown here: https://github.com/nicolaspernoud/go_vs_rust_reverse_proxy_benchmark/blob/main/test.sh, that is not the case; axum 0.6 is actually slightly faster.

I suppose it is related to the way axum extracts something from the request (?).

@davidpdrsn
Member

That is hard to say with such a large amount of code. Are you able to make a smaller repro?

@nicolaspernoud
Author

Well, that will take some work (and time), but I will try...

@davidpdrsn
Member

We do have some benchmarks here that you can try poking at to see if you notice anything.

@nicolaspernoud
Author

Ok, I will try to work something out from that...

@jplatte
Member

jplatte commented Dec 7, 2022

One thing that could be relevant: are you nesting routers? If yes, can you try adding the prefix to each sub-route individually and .merge-ing the previously nested routers into the main one instead?
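
For illustration, a minimal sketch of the suggested restructuring (the get_users handler is hypothetical, not from this thread):

use axum::{routing::get, Router};

async fn get_users() -> &'static str {
    "users"
}

// Nested: the outer router matches the "/api" prefix and forwards the
// tail of the path to the inner router on every request.
fn nested() -> Router {
    let api = Router::new().route("/users", get(get_users));
    Router::new().nest("/api", api)
}

// Merged: the prefix is written into each route up front, so the final
// router resolves "/api/users" in a single routing step.
fn merged() -> Router {
    let api = Router::new().route("/api/users", get(get_users));
    Router::new().merge(api)
}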

@jplatte
Member

jplatte commented Dec 7, 2022

Also @davidpdrsn, could we change the implementation of nest (not nest_service) to use merge instead of nest_service? Regardless of performance, I think this could actually simplify things, and notably it should mean that the "sometimes can't extract matched path in middleware" issue only affects non-Router services.

@nicolaspernoud
Author

@jplatte: I just tried it, and although the merged routes were not routed to in the benchmark, I DID get back the original performance! Amazing! So that solves my problem! Thanks a lot!
@davidpdrsn: I'll leave it to you whether to close the issue right now or wait for a change in the nest implementation... but as far as I'm concerned, it's resolved.

@davidpdrsn
Member

I'll leave it to you whether to close the issue right now or wait for a change in the nest implementation... but as far as I'm concerned, it's resolved.

If nest is slow then we should try and fix that. So I'll leave this open for now.

Also @davidpdrsn could we change the implementation of nest (not nest_service) to use merge instead of nest_service? Regardless of performance, I think this could actually simplify things and notably, it should mean that the "sometimes can't extract matched path in middleware" thing only affects non-Router services.

That's basically what we did in 0.5, but in a problematic way where we only had one nest function with two code paths. Maybe it's fine now since we have different nest functions 🤔 As that seems to fix the issue, it's definitely worth considering.

The first thing should be to add a benchmark for it. It would also be good to see if we can improve nest performance in general.

@FSMaxB
Contributor

FSMaxB commented Dec 14, 2022

We've seen an even worse drop, from 1,500 req/s to 75 req/s (for an endpoint that only returns some static data). I will check if removing nesting fixes it.
Admittedly, this slowdown includes some other changes in how the router is built, because just switching from 0.5 to 0.6 somehow broke the merging of routes, showing conflicts that weren't there before. If it helps, I can also try to reduce the code to get a reproducer for that other issue; I just don't know when I'll find the time to do that.

@davidpdrsn
Member

from 1,500 req/s to 75 req/s

That sounds really bad 😬 I'm curious to hear if it is just because of nest, though that sounds quite extreme.

switching from 0.5 to 0.6 broke the merging of routes somehow, showing conflicts that weren't there before.

A repro of that would be great! Off the top of my head I don't think there were any intentional changes. merge is equivalent to just calling route yourself for each route, so it shouldn't change which routes overlap with each other.

@pythoneer

pythoneer commented Dec 14, 2022

We're running into the same problem: nesting heavily reduces performance. I made a little example that demonstrates the problem. Changing from 0.5 to 0.6 reduces the throughput from ~20,000 req/s to ~1,500 req/s, and further nesting makes it worse.

EDIT: forgot to mention that those numbers were in debug mode. I've now also tested in release mode; the numbers are different but the problem is the same: 27,000 req/s down to 2,500 req/s.

use axum::{
    routing::get,
    http::StatusCode,
    response::IntoResponse,
    Json, Router,
};
use serde::Serialize;
use std::net::SocketAddr;

#[tokio::main]
async fn main() {

    // ab -c 15 -n 100 http://localhost:3001/v1/v2/v3/v4/v5/users  -> 1500 req/s on 0.6 | 20'000 req/s on 0.5
    /******************************************************************/
    let users = Router::new().route("/users", get(get_user));
    let v5 = Router::new().nest("/v5", users);
    let v4 = Router::new().nest("/v4", v5);
    let v3 = Router::new().nest("/v3", v4);
    let v2 = Router::new().nest("/v2", v3);

    let app = Router::new().nest("/v1", v2);
    /******************************************************************/

    // ab -c 15 -n 100 http://localhost:3001/v1/v2/v3/v4/v5/users  -> 20'000 req/s for both 0.5 + 0.6
    /******************************************************************/
    // let app = Router::new().route("/v1/v2/v3/v4/v5/users", get(get_user));
    /******************************************************************/

    let addr = SocketAddr::from(([127, 0, 0, 1], 3001));
    tracing::debug!("listening on {}", addr); // requires the tracing crate as a dependency; silent without a subscriber
    axum::Server::bind(&addr)
        .serve(app.into_make_service())
        .await
        .unwrap();
}

async fn get_user() -> impl IntoResponse {
    let user = User {
        id: 1337,
        username: "payload.username".to_owned(),
    };

    (StatusCode::OK, Json(user))
}

#[derive(Serialize)]
struct User {
    id: u64,
    username: String,
}

@davidpdrsn davidpdrsn changed the title Question : why Axum 0.6 is (a lot) slower than axum 0.5 (in my use case) Router::nest has poor performance on 0.6 Dec 14, 2022
@davidpdrsn davidpdrsn added A-axum C-performance Category: Performance related issues and PRs labels Dec 14, 2022
@FSMaxB
Contributor

FSMaxB commented Dec 14, 2022

I'm curious to hear if it is just because of nest, though that sounds quite extreme.

Final numbers are in (release builds this time 😅):

  • 0.6, nesting removed: ~950 req/s
  • 0.6 with nesting: ~100 req/s (that was 3 or 4 levels of nesting)
  • 0.5 with nesting (and routes arranged slightly differently): ~950 req/s

Note that I got very inconsistent results, with outliers of ~30% in some test runs, so take the above with a very big grain of salt.

But I would say those results are consistent with this being only a nesting issue, without any other regression.

@davidpdrsn
Member

@FSMaxB thanks for providing those numbers! Definitely something going on that we should look into.

@romanstingler

romanstingler commented Dec 14, 2022

@davidpdrsn

With ab -c 15 -n 2000 http://localhost:3001/v1/v2/v3/v4/v5/users and the script from @pythoneer, I get more than an order of magnitude more calls on 0.6 than on 0.5.
I have profiled these two commits (callgrind & cachegrind output attached):

0.6 = 3279d64
0.5 = 2c00e61

0.6 on the left and 0.5 on the right side:
[profiler screenshot]

0.5 on the left and 0.6 on the right side (sorry):
[profiler screenshot]

cachegrind.out.zip

callgrind.out.zip

On a single call of the URL, with 0.6 we get:

[profiler screenshot]

and with 0.5:

[profiler screenshot]

@alexpyattaev
Contributor

A silly question here - given how nest is designed, would it not always be slower than merge? Since nested routers essentially need to be called after the "outer" router has worked out the path, would it not need to allocate a new String to hold the "tail" portion of the path? This seems to be the allocation call that happens, so calling a nested router would cause two allocations per request instead of one:

let path = req.uri().path().to_owned();

With that in mind, would it not be reasonable to have some better way to handle that? An extra allocation per request is not such a great idea for performance, AFAIK...
PS: apologies if I am talking nonsense here...

@davidpdrsn
Member

@alexpyattaev yes, it would. The plan is to flatten the routes like we did in 0.5. See #1711. That'll give the same performance as regular routes, because that's what they'll be internally.
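
Conceptually, flattening resolves the nesting once when the router is built instead of on every request. A rough sketch of the idea (illustrative only, not axum's actual internals):

fn flatten(prefix: &str, inner_paths: &[&str]) -> Vec<String> {
    // The prefix is concatenated into each inner path once, at
    // construction time, so request-time routing sees a single flat
    // route table and needs no extra allocation to strip the prefix.
    inner_paths
        .iter()
        .map(|path| format!("{prefix}{path}"))
        .collect()
}

fn main() {
    assert_eq!(
        flatten("/v1", &["/users", "/users/:id"]),
        vec!["/v1/users", "/v1/users/:id"]
    );
}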

@nicolaspernoud
Author

Not sure this is actually fixed.
With 0.6.12 I get ~25,000 req/s with merge and ~15,000 req/s with nest.
With 0.6.15 (changing only that) I get ~15,000 req/s with merge and ~15,000 req/s with nest.
So performance is now the same... but for the worse.

@davidpdrsn davidpdrsn reopened this Apr 12, 2023
@davidpdrsn
Member

davidpdrsn commented Apr 12, 2023

@nicolaspernoud I can't reproduce those numbers with the benchmarks we have in this repo. Can you provide a benchmark?

Edit: I've added the benchmarks I use in 377d73a

@nicolaspernoud
Author

nicolaspernoud commented Apr 12, 2023

@davidpdrsn, well, my use case, as noted before, is quite complicated, atrium being an axum-based reverse proxy. Still, the benchmark I use is here: https://github.com/nicolaspernoud/atrium/blob/development/scripts/benchmark/benchmark.sh. The file to alter to switch between nest/merge is here: https://github.com/nicolaspernoud/atrium/blob/development/backend/src/server.rs. When reverting to merge, the performance impact is clear just by altering Cargo.toml and pinning axum to 0.6.12 or 0.6.15.

Maybe the performance hit is not due to the change to nest and comes from somewhere else altogether…

@davidpdrsn
Member

That's quite a lot of setup indeed. Are you able to try and make a smaller reproduction script? I'm afraid I can't really debug that. No other changes in 0.6.15 come to mind.

@FSMaxB
Contributor

FSMaxB commented Apr 13, 2023

As for my case, there doesn't seem to be a regression in performance, or if so, only a small one.

The codebase has evolved a bit since then, but let's redo the numbers:

  • 0.6.12 no nesting: ~650 req/s
  • 0.6.14 no nesting: ~680 req/s
  • 0.6.15 no nesting: ~620 req/s

I also don't have the nested code anymore, so can't test with that.

Given that the values varied between 620 and 780 req/s, I would say the above is not statistically significant (I haven't done any statistical analysis on the data, but I've seen very large outliers in all cases, and they tended to settle differently between runs).

@nicolaspernoud
Author

@davidpdrsn I will try to set up a minimum test project when I have some time.

@davidpdrsn
Member

@FSMaxB thanks for testing! I'm a bit less worried now 😅

I also tried making a micro benchmark that doesn't use hyper (and therefore not the network) and just calls the router directly. Code for that is here: https://gist.github.com/davidpdrsn/4d60c2a7a16047f4eff76b6e71fe055d

To send 10k requests I get these numbers:

axum = "=0.6.15"
no-nesting: 6.53075ms
10 levels of nesting: 50.194459ms

axum = "=0.6.12"
no-nesting: 6.445834ms
10 levels of nesting: very slow, not gonna wait for it. More than 10 seconds

axum = "=0.5.17"
no-nesting: 7.3635ms
nesting: 45.244ms

So the performance does seem comparable between 0.6.15 and 0.5, which is good. I'm curious whether there is anything we can do to make nesting faster in general. It does additional processing, such as stripping the matched prefix, so it makes sense that it'd be slower.
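
For reference, a minimal sketch of that kind of in-process benchmark, assuming axum 0.6 and tower as dependencies (the route and handler are illustrative, not copied from the gist):

use axum::{body::Body, http::Request, routing::get, Router};
use tower::ServiceExt; // for `oneshot`

#[tokio::main]
async fn main() {
    // Call the router directly as a tower Service, bypassing hyper and
    // the network entirely.
    let app = Router::new().route("/users", get(|| async { "hello" }));

    let start = std::time::Instant::now();
    for _ in 0..10_000 {
        let request = Request::builder()
            .uri("/users")
            .body(Body::empty())
            .unwrap();
        // `oneshot` consumes the service, so clone the router each time.
        let response = app.clone().oneshot(request).await.unwrap();
        assert!(response.status().is_success());
    }
    println!("10k requests in {:?}", start.elapsed());
}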

@davidpdrsn I will try to set up a minimum test project when I have some time.

Thanks!

@pythoneer

pythoneer commented Apr 13, 2023

I ran the tests (in release mode) from my comment

0.6.12 nest:     4291 req/s
0.6.15 nest:    28097 req/s

0.6.12 no nest: 31113 req/s
0.6.15 no nest: 28296 req/s

The ~10% difference between 0.6.12 and 0.6.15 in the "no nest" case looks kinda stable on my machine.

@nicolaspernoud
Author

@davidpdrsn Hello, I tried to put together a simplified version of my app here: https://github.com/nicolaspernoud/axum_benchmark. It does not proxy anything and only answers hello world; the webdav handler is gone, and the complex TLS/Let's Encrypt configuration as well. I kept some of the custom extractors, middlewares, and global state to keep it realistic. It could be simplified some more if needed, but that might lose some of the specificities that cause the performance hit.

You normally just need to clone the repo and run benchmark.sh to see the results. There are some included in the reports directory. The switching between 0.6.12 and 0.6.15 is brutally done with a sed on Cargo.toml...

The nested performance between 0.6.12 and 0.6.15 is quite similar, but the merged performance is a lot better in 0.6.12.

Version  Type   Reqs/s
0.6.12   Nest   39469
0.6.15   Nest   32408
0.6.12   Merge  64265
0.6.15   Merge  37365

@davidpdrsn
Member

davidpdrsn commented Apr 15, 2023

There's gotta be something else going on with your setup if merge is that much slower. All merge does is move the routes from one router to another, as if you had called .route() yourself. It doesn't add any middleware or any additional processing. I also cannot reproduce the difference in a micro benchmark.

Are you able to reproduce a similar difference without doing any networking and instead just calling the axum router directly, like my micro benchmark does?

axum doesn't handle any networking, such as parsing the incoming stream or dealing with connections; hyper does all of that. So it's easier to evaluate axum's performance by bypassing it.

@nicolaspernoud
Author

My guess is that it is not a merge vs. nest problem anymore, but a 0.6.12 vs. 0.6.15 problem with something else, which is easier to see now that merge and nest have the same performance. It happens in an app that has some extractors, middlewares, etc., which could be why isolated or micro benchmarks do not show it. I will try (next week) to narrow down the version that causes the performance gap between 0.6.12 and 0.6.15, to see what was changed other than the nest improvement.

@davidpdrsn
Member

The overhead of middleware or extractors should be constant regardless of how you build your router.

I changed my benchmark to instead print requests per second and compare 0.5.17, 0.6.12, and 0.6.15:

0.5.17
 flat: 1759735 rps
merge: 1746821 rps
 nest:  253203 rps

0.6.12
 flat: 1969787 rps
merge: 2005635 rps
 nest:      66 rps

0.6.15
 flat: 1895037 rps
merge: 1909538 rps
 nest:  222305 rps

This shows that 0.6.15 is roughly in line with 0.5, so I do believe this issue has been fixed. I'll close this again, but please do comment if you find more issues!

@nicolaspernoud
Author

I ran several more tests, which you can find here: https://github.com/nicolaspernoud/axum_benchmark/tree/main/reports
There is some variability, but it is clear that performance decreased a lot between 0.6.12 and 0.6.13 (~60k req/s down to ~40k req/s).
It is not a nest issue anymore, so this issue can effectively be closed, but shouldn't another one be opened?

@davidpdrsn
Member

davidpdrsn commented Apr 17, 2023

0.6.13 contained a println that was committed by accident. 0.6.15 doesn't have that, so don't test 0.6.13 😊

I updated my numbers to include 12, 13, 14, and 15 (and a test with a middleware):

0.6.12: flat: 2003882 rps
0.6.13: flat:  708557 rps
0.6.14: flat: 1867643 rps
0.6.15: flat: 1873534 rps

0.6.12: with middleware: 1473827 rps
0.6.13: with middleware:  595125 rps
0.6.14: with middleware: 1376340 rps
0.6.15: with middleware: 1408086 rps

0.6.12: nest:     62 rps
0.6.13: nest: 156008 rps
0.6.14: nest: 215389 rps
0.6.15: nest: 217711 rps

0.6.12: nest with middleware:     58 rps
0.6.13: nest with middleware: 142710 rps
0.6.14: nest with middleware: 203133 rps
0.6.15: nest with middleware: 203579 rps

Doesn't seem like there is an issue to me.

@nicolaspernoud
Author

Well, these are the results I get:

Version  Reqs/s
0.6.12   ~60k
0.6.13   ~40k
0.6.14   ~40k
0.6.15   ~40k

So I am pretty positive that "something" makes it run slower. It may or may not be related to the merge/nest change; it could be a side effect of something else (?).
Could you just clone my repo https://github.com/nicolaspernoud/axum_benchmark.git and run benchmark.sh to see if it happens on your side?
If you look at benchmark.sh, you'll see that the only thing it changes is the axum version. I test 0.6.12 twice (at the beginning and at the end, to account for CPU warm-up). Every time, 0.6.13+ is significantly slower. I tried on two different computers as well.

@davidpdrsn
Member

I can't run it:

❯ ./benchmark.sh
mkdir: /Users/david.pedersen/Desktop/axum_benchmark/reports: File exists
sed: 1: "Cargo.toml": invalid command code C
Error: Changing Axum version in Cargo.toml failed !

This is after I comment out the apt install commands. I'm on a Mac.

This is why I made a script that you can just cargo run, with no additional dependencies needed.

@davidpdrsn
Member

@nicolaspernoud are you in the tokio discord? Might be quicker to chat there 😅 and we avoid spamming people with notifications while we figure stuff out.

@nicolaspernoud
Author

@davidpdrsn: yes, joining you there!

@davidpdrsn
Member

It turns out this was a decrease in the performance of Router::fallback. To support fallback inheritance, it needs to perform some additional processing, so I don't consider this a bug.

@nicolaspernoud used Router::new().fallback(...) a lot as a way to accept all requests, which meant every request would go through the fallback. It is faster to use Handler::into_service or axum::routing::any instead. The change can be found here.

I don't think there is anything we need to do in axum, but I documented that Router::new().fallback(...) isn't the optimal way to accept all requests in #1940.
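
A minimal sketch of the difference, assuming axum 0.6 (the catch_all handler is illustrative, not the code from atrium):

use axum::handler::HandlerWithoutStateExt;
use std::net::SocketAddr;

async fn catch_all() -> &'static str {
    "hello"
}

#[tokio::main]
async fn main() {
    // Slower: an empty router whose fallback accepts everything; every
    // request misses the route table and then pays the extra cost of
    // fallback-inheritance handling.
    // let app = axum::Router::new().fallback(catch_all);

    // Faster: serve the handler directly, with no router in between.
    let addr = SocketAddr::from(([127, 0, 0, 1], 3001));
    axum::Server::bind(&addr)
        .serve(catch_all.into_make_service())
        .await
        .unwrap();
}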
