
Yet another comparison between io_uring and epoll on network performance #536

Closed
beef9999 opened this issue Feb 22, 2022 · 29 comments

@beef9999

beef9999 commented Feb 22, 2022

Background: io_uring vs epoll

Nowadays there are many issues and projects focused on io_uring network performance, and the competitor is always epoll.

However, most of those tests are merely demos and lack verification in a production scenario. So I integrated io_uring sockets into our C++ coroutine library and did a full evaluation. Note that all the coroutines run in a single OS thread, which fits the io_uring event model quite well.

Network workloads

In my opinion, there are basically two types of network workloads. Although they are generated by two different kinds of clients, a typical echo server can handle both.

  • Ping-Pong mode client

This is what most echo clients look like. The client continuously sends and receives requests in a loop.

// client demo code
while (true) {
    send();
    recv();
}
  • Streaming mode client

Streaming clients are not rare. They multiplex multiple channels over a single connection, for instance RPC and HTTP/2. Usually there are not many clients, but the throughput can be high.
Below is one approach to simulating streaming workloads: a send coroutine and a recv coroutine run their loops separately.

// client demo code: coroutine 1
while (true) {
    send();
}

// client demo code: coroutine 2
while (true) {
    recv();
}

This example might be a little extreme, but it is simple. In a real scenario, multiple coroutines would each do ping-pong send/recv in their own loops. Because the coroutine execution contexts keep switching, if you observe either side of the full-duplex socket you will see the channel filled with packets, so the scenario is essentially the same as the example code above.

Implementations

  • In an epoll program, the typical pattern is non-blocking fd + epoll_wait + psync read/write
  • In an io_uring program, we can simply use its async APIs, as long as we already have an event engine driven by io_uring. This part is provided by the coroutine lib.

Quick conclusion

  • io_uring is faster than epoll if the workloads are ping-pong mode
  • io_uring is slower than epoll if the workloads are streaming mode

There are two ways to narrow the performance gap.

  • Increase buffer size
  • Increase connections number

And an alternative that bypasses the problem:

  • Use io_uring to poll, but not to process socket IO. Namely, non-blocking fd + io_uring poll + psync read/write
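
Below is a minimal sketch of that alternative, assuming a non-blocking socket fd and an already-created ring (handle_data-style processing and error handling are placeholders, not code from the library):

#include <errno.h>
#include <liburing.h>
#include <poll.h>
#include <unistd.h>

// Try a plain read first; if the socket is empty, arm an io_uring poll
// and retry once readiness is signalled.
ssize_t poll_then_read(struct io_uring *ring, int fd, char *buf, size_t len) {
    for (;;) {
        ssize_t n = read(fd, buf, len);            // psync read, usually succeeds
        if (n >= 0 || errno != EAGAIN)
            return n;                              // data read, or a real error

        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_poll_add(sqe, fd, POLLIN);   // poll only, no IO through io_uring
        io_uring_submit(ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);             // in the real engine, the coroutine sleeps here
        io_uring_cqe_seen(ring, cqe);
    }
}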

Note that this article will NOT discuss the ping-pong mode, because io_uring always surpasses epoll in that situation. I just want to raise the question of why io_uring is sometimes slower in the streaming mode.

Environment

Two VMs in a cloud environment, Intel Xeon 8369B 2.70GHz, 96 cores 128GB, 40Gb network bandwidth.
CentOS 8, Kernel 6.0.7-1.el8, IORING_FEAT_FAST_POLL is enabled by default.

Test 1, Echo server performance (streaming client, single connection)

Note that I only set up one client, and there is only one connection within it.

The QPS is shown in the terminal. The throughput is observed by iftop.

server type | buf size | client num | server qps | server throughput
----------- | -------- | ---------- | ---------- | -----------------
epoll       | 64       | 1          | 1565K      | 780Mb/s
io_uring    | 64       | 1          | 506K       | 260Mb/s
epoll       | 512      | 1          | 1250K      | 4.79Gb/s
io_uring    | 512      | 1          | 447K       | 1.70Gb/s
epoll       | 4096     | 1          | 669K       | 20.4Gb/s
io_uring    | 4096     | 1          | 343K       | 10.3Gb/s
epoll       | 16384    | 1          | 224K       | 27.3Gb/s
io_uring    | 16384    | 1          | 183K       | 22.5Gb/s

Conclusions:

  1. io_uring is slower than epoll in the streaming mode
  2. As the buf size increases, the performance gap narrows

Test 2, Echo server performance (streaming client, multiple connections)

Note that I set up multiple client processes this time. One connection per client, as before.

outdated data

Conclusions

  • Increasing the number of connections helps io_uring

Test 3, io_uring IO vs psync IO (with memory backend, and IO depth = 1)

In this test, I just want to verify the idea that when the IO backend is in memory, the psync stack is more efficient than the io_uring stack.

I'm not providing source code here, but you can create a normal file under /dev/shm/ (tmpfs) and use io_uring to write it (with a concurrency of 1). Don't do reads, because I'm not sure whether the page cache would affect performance. You will find that psync is 3-4 times faster than io_uring.
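
A minimal sketch of such a comparison, assuming liburing is installed (the file path, buffer size, and iteration count below are arbitrary choices):

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUF_SIZE 4096
#define ITERS    100000

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    char buf[BUF_SIZE];
    memset(buf, 'x', sizeof(buf));
    int fd = open("/dev/shm/uring_vs_psync.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    // psync path: plain pwrite() at IO depth 1
    double t0 = now_sec();
    for (int i = 0; i < ITERS; i++)
        pwrite(fd, buf, BUF_SIZE, 0);
    printf("psync:    %.3f s\n", now_sec() - t0);

    // io_uring path: one SQE submitted and reaped at a time (IO depth = 1)
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);
    t0 = now_sec();
    for (int i = 0; i < ITERS; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, BUF_SIZE, 0);
        io_uring_submit(&ring);
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }
    printf("io_uring: %.3f s\n", now_sec() - t0);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}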

The result is easy to understand. When your data is all in memory, the psync IO stack is almost like doing a memcpy. And with a concurrency of only 1 (IO depth = 1), io_uring's async event system cannot leverage its full power.

The network buffer is a similar situation, and for a specific fd/connection the IO depth is always 1. So perhaps when there is still free network buffer to write to, or data left to read, we should consider using the psync stack.

Final Conclusions

  • A socket is not like file IO. Reading/writing a network fd/connection is sequential (IO depth = 1).
  • When memory is the backend, e.g., the network buffer or tmpfs, the psync stack is more efficient than io_uring.
  • io_uring is an async event system. It performs better with more fds and larger buffers.

How to solve this problem?

From a user's perspective, my idea for solving this io_uring performance issue looks like the code below:

int fd = socket();
set_non_blocking(fd);
...
while (not_read_enough_size()) {
    int ret = read(fd, buf, size);
    if (ret < 0 && errno == EAGAIN) {
        new_io_uring_read(fd);
    }
}

new_io_uring_read means that the kernel would still execute a FAST_POLL for this non-blocking fd, and return a cqe after the next read finishes.

Since the network buffer is readable most of the time, this would leverage psync efficiency while still utilizing the io_uring FAST_POLL read.

Unfortunately no kernel provides this behavior so far. I'll ask some kernel folks for help and re-test it later.

Appendix

Architecture of coroutine based net server

Both the io_uring server and the epoll server have a frontend and a backend. The frontend is responsible for submitting async IO (or starting a poll) and then putting the current coroutine to sleep. The backend runs an event engine and wakes the sleeping coroutine when the IO finishes.

io_uring server

// io_uring frontend
io_uring_get_sqe();
io_uring_prep_send(); // or io_uring_prep_recv()
io_uring_submit();
coroutine_sleep();

// io_uring backend
while (should_wait_for_event()) {
    io_uring_wait_cqes();
    struct io_uring_cqe* cqe;
    unsigned head, i = 0;
    io_uring_for_each_cqe(m_ring, head, cqe) {
        ++i;
        coroutine_interrupt();  // wake the sleeping coroutine
    }
    io_uring_cq_advance(m_ring, i);
}

epoll server

// epoll frontend 
int fd = socket();
set_non_blocking(fd);
while (not_sent_enough_size()) {
    int ret = write(fd, buf, size);
    if (ret < 0 && errno == EAGAIN)
        wait_fd_writable();    // coroutine sleep here
}

// epoll backend
while (should_wait_for_event()) {
    epoll_wait();                  // fd is writable
    coroutine_interrupt();   // wake sleeping coroutine
}

How to reproduce

Test code

The full test code is here. You are welcome to run it in your own environment.

Build

# centos
dnf install gcc-c++ epel-release cmake
dnf install openssl-devel libcurl-devel libaio-devel
dnf config-manager --set-enabled powertools
dnf install gtest-devel gmock-devel gflags-devel fuse-devel libgsasl-devel

# ubuntu
apt install cmake
apt install libssl-dev libcurl4-openssl-dev libaio-dev
apt install libgtest-dev libgmock-dev libgflags-dev libfuse-dev libgsasl7-dev

git clone https://github.com/alibaba/PhotonLibOS.git
cd PhotonLibOS
git fetch && git pull origin main     # some test code was updated on Nov. 7
cmake -D BUILD_TESTING=1 -D ENABLE_SASL=1 -D ENABLE_FUSE=1 -D ENABLE_URING=1 -D CMAKE_BUILD_TYPE=Release -B build
cmake --build build -t net-perf -j

Run epoll server

./build/output/net-perf -buf_size 512 -port 9527

Run epoll client

./build/output/net-perf -client -buf_size 512 -client_mode streaming -ip <server_ip> -port 9527 

Run io_uring server

You will need to modify some code to switch to the io_uring server.

  1. Change photon::INIT_EVENT_EPOLL to photon::INIT_EVENT_IOURING

https://github.com/alibaba/PhotonLibOS/blob/f858f0a8d7e507c4d3667f0cc7da023600f46e8f/examples/perf/net-perf.cpp#L245

  2. Comment out new_tcp_socket_server and use new_iouring_tcp_server in the next line.

https://github.com/alibaba/PhotonLibOS/blob/f858f0a8d7e507c4d3667f0cc7da023600f46e8f/examples/perf/net-perf.cpp#L177-L178

Run io_uring client

You will need to modify some code to switch to the io_uring client. Of course, you may still use the epoll client to test against the io_uring server, in order to reduce variables.

  1. Change photon::INIT_EVENT_EPOLL to photon::INIT_EVENT_IOURING

https://github.com/alibaba/PhotonLibOS/blob/f858f0a8d7e507c4d3667f0cc7da023600f46e8f/examples/perf/net-perf.cpp#L245

  2. Change new_tcp_socket_client to new_iouring_tcp_client

https://github.com/alibaba/PhotonLibOS/blob/e07ce42648864528f0724b6c339d17317a4003c9/examples/perf/net-perf.cpp#L119

How to setup multiple clients

I just wrote a shell script to run them in the background.

for i in `seq 1 100`; do
    sleep 0.01
    ./build/output/net-perf -client -buf_size 512 -client_mode streaming -ip <server_ip> -port 9527  > /dev/null &
done
@v3ss0n

v3ss0n commented Oct 16, 2022

Great work, this test needs proper discussion. I think you should post it on HN.

@beef9999
Author

Great work, this test needs proper discussion. I think you should post it on HN.

What is HN?

@ammarfaizi2
Contributor

Great work, this test needs proper discussion. I think you should post it on HN.

What is HN?

https://news.ycombinator.com/

@GavinRay97

GavinRay97 commented Oct 27, 2022

Did this ever get posted there? I also agree someone should post it (ideally @beef9999 if they want)

I can also post it, but I don't want that to come off as "Sure, I'll take all those upvotes for your hard work." since it's like two seconds to submit a post.

@beef9999
Author

@GavinRay97 I don’t have an account on that forum. It’s OK if you post it for me. But please wait until this weekend so I can make some modifications to the performance data and upload the full test code as well.

@v3ss0n

v3ss0n commented Oct 27, 2022

I could post it for you, but since I don't want to take credit, you should do it @beef9999. It's easy to register there (user and pass only, no email needed), and it is arguably the best community of tech people. That forum is backed by the top startup accelerator Y Combinator, and a lot of tech people from Google, FAANG, unicorn startups, and other big companies are there. io_uring is of big interest there too.

@axboe
Owner

axboe commented Oct 27, 2022

Not sure why it's so interesting to post on HN, honestly most of the commentary there is vitriol and not very useful. What are we trying to accomplish?

For the performance side, try and set IORING_SETUP_DEFER_TASKRUN when the ring is created. That has shown nice results for this kind of workload recently.
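
For reference, a rough sketch of creating the ring with that flag (DEFER_TASKRUN requires SINGLE_ISSUER and a recent kernel/liburing; the fallback to a plain ring is just an illustrative choice):

#include <string.h>
#include <liburing.h>

// Create a ring with deferred task running, falling back if unsupported.
static int setup_ring(struct io_uring *ring, unsigned entries) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
    int ret = io_uring_queue_init_params(entries, ring, &p);
    if (ret < 0)
        ret = io_uring_queue_init(entries, ring, 0);   // older kernel: plain ring
    return ret;
}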

@axboe
Owner

axboe commented Oct 27, 2022

Here's one from this week: https://lore.kernel.org/io-uring/[email protected]/

@GavinRay97

GavinRay97 commented Oct 27, 2022

Not sure why it's so interesting to post on HN, honestly most of the commentary there is vitriol and not very useful. What are we trying to accomplish?

(Personally) I like to share/evangelize stuff by people I think is interesting and deserves attention, or that other people might find interesting.

They seem to be pretty keen on performance stuff and io_uring in general, though there's a (rightfully so) certain rigor expected if you're going to post benchmarks.

Even if a particular topic doesn't trend well or some people post negative comments, it's nice for the folks browsing that are interested in that thing that otherwise wouldn't have known about it IMO.

Sometimes I find posts where I have a highly positive opinion of the thing/think it's neat and nobody else does. Oh well, their loss.

That's my $0.02 at least

@axboe
Owner

axboe commented Oct 27, 2022

I'm just not a fan, most of the commentary (to me) are from folks looking to look smart and not knowing a lot of the details. In many ways, not that different from reddit. Not useful imho, from the cases I've seen. Arguably, I haven't spent a lot of time on the site, this is just my experience from the couple of times when I have.

@v3ss0n

v3ss0n commented Oct 27, 2022

Not sure why it's so interesting to post on HN, honestly most of the commentary there is vitriol and not very useful. What are we trying to accomplish?

(Personally) I like to share/evangelize stuff by people I think is interesting and deserves attention, or that other people might find interesting.

They seem to be pretty keen on performance stuff and io_uring in general, though there's a (rightfully so) certain rigor expected if you're going to post benchmarks.

Even if a particular topic doesn't trend well or some people post negative comments, it's nice for the folks browsing that are interested in that thing that otherwise wouldn't have known about it IMO.

Sometimes I find posts where I have a highly positive opinion of the thing/think it's neat and nobody else does. Oh well, their loss.

That's my $0.02 at least

Yeah, same reason. That community is quite interested in io_uring, sharing and discussing @axboe's tweets almost every week, and that's how I found out about io_uring too.
Reddit used to be good and intellectual; now it is quite the opposite, with no real discussion going on there.

@beef9999
Author

beef9999 commented Nov 5, 2022

@GavinRay97 I have simplified the tests and rephrased some explanations. Please help post it if it is convenient for you.

@GavinRay97

GavinRay97 commented Nov 6, 2022

@beef9999 I have posted it at A performance review of io_uring vs. epoll for standard/streamed socket traffic 👍

Hopefully some people find it interesting

@ghost

ghost commented Nov 6, 2022

This is interesting. Thank you for this.
I wrote an epoll echo server which multiplexes multiple clients over each thread. The idea is that each core can scale the number of clients it serves.
I want to add io_uring; maybe I can learn it from this repository.

I wonder how the performance is when multiple cores are used.

It's kind of similar to libuv. I use IO threads to handle the IO. It's incomplete, but it is a proof of concept.

https://github.com/samsquire/epoll-server

It is based on a multiconsumer multiproducer RingBuffer by Alexander Krizhanovsky.

https://www.linuxjournal.com/content/lock-free-multi-produce...

I also wrote a userspace 1:M:N lightweight thread scheduler which should be integrated with the epoll server. This is an alternative to coroutines. I multiplex multiple lightweight threads on a kernel thread and switch between them fast. The scheduler thread preempts hot for and while loops by setting the looping variable to the limit. This allows preemption to occur when the code finishes the current iteration. This is why I call it userspace preemption.

https://github.com/samsquire/preemptible-thread

One idea I have for even higher performance is to split sending and receiving to their own threads and multiplex sending and receiving across threads. This means you can scale sending and receiving.

@axboe
Owner

axboe commented Nov 6, 2022

Tried to compile this as I'm pretty convinced something is amiss with the single thread performance, but it fails for me:

Consolidate compiler generated dependencies of target photon_obj
[  1%] Building CXX object CMakeFiles/photon_obj.dir/io/signal.cpp.o
/home/axboe/git/PhotonLibOS/io/signal.cpp:259:9: error: use of undeclared identifier 'pthread_atfork'
        pthread_atfork(nullptr, nullptr, &fork_hook_child);
        ^
1 error generated.
make[2]: *** [CMakeFiles/photon_obj.dir/build.make:440: CMakeFiles/photon_obj.dir/io/signal.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:104: CMakeFiles/photon_obj.dir/all] Error 2
make: *** [Makefile:111: all] Error 2

I'm on debian testing. Outside of that, I failed to find examples of how to run it? Maybe I'm just blind, but hints would be appreciated.

@axboe
Owner

axboe commented Nov 6, 2022

OK, got it going, and the examples built. signal.cpp is missing a pthread.h include.

@victorstewart

juxtaposing the performance variance between epoll and io_uring for 512 + 1 client in test 1... vs equivalent performance in test 2 with that usleep... my intuition is all the test 2 data are poisoned.

@axboe
Owner

axboe commented Nov 6, 2022

juxtaposing the performance variance between epoll and io_uring for 512 + 1 client in test 1... vs equivalent performance in test 2 with that usleep... my intuition is all the test 2 data are poisoned.

I agree, it all looks very odd to me.

@axboe
Owner

axboe commented Nov 6, 2022

Got it built and running, but there are no docs on how to run with the various backends on either the client or server side. The interval thing doesn't seem to work either, it always keeps running without dumping stats until the client is killed/interrupted.

Will be happy to take a look at the perf differences, but I don't want to spend ages figuring out how to run this thing. Please provide examples, I can't find any.

@GavinRay97

GavinRay97 commented Nov 6, 2022

The interval thing doesn't seem to work either, it always keeps running without dumping stats until the client is killed/interrupted.

It appears to be just an NGINX-like static server, which defaults to an epoll backend:

[user@MSI PhotonLibOS]$ ./build/output/server_perf
2022/11/07 05:54:32|INFO |th=0000000000B76050|/home/user/projects/PhotonLibOS/io/epoll.cpp:289|new_epoll_engine:Init event engine: epoll
2022/11/07 05:54:33|INFO |th=00007FBC8FFCEB00|/home/user/projects/PhotonLibOS/net/http/test/server_perf.cpp:44|show_qps_loop:qps: 0
2022/11/07 05:54:34|INFO |th=00007FBC8FFCEB00|/home/user/projects/PhotonLibOS/net/http/test/server_perf.cpp:44|show_qps_loop:qps: 0
2022/11/07 05:54:35|INFO |th=00007FBC8FFCEB00|/home/user/projects/PhotonLibOS/net/http/test/server_perf.cpp:44|show_qps_loop:qps: 0

I think you're meant to use something like k6/wrk2 to send an HTTP load test to the URL it's running at, which seems to be http://localhost:19876 by default. I thought it would have generated some load/throughput by itself.

It seems you are meant to run the client-perf.cpp binary alongside the server-perf.cpp one, and it will generate the HTTP requests.

I see in the docs that you can switch the epoll engine out for io_uring, but I don't seem to be able to do that.
What I've done is:

  // I think this tries to initialize some global event engine?
  int ret = photon::init(photon::INIT_EVENT_IOURING, photon::INIT_IO_LIBAIO);
  if (ret != 0) {
    LOG_ERRNO_RETURN(0, -1, "photon init failed");
  }

  // Replaced this with io_uring specific method
  auto tcpserv = net::new_iouring_tcp_server();

  // Specified `io_uring` engine for FS
  auto fs = fs::new_localfs_adaptor(".", photon::fs::ioengine_iouring);

This still logs as using the epoll engine though 🙁


I also had to modify a few things to get it to build:

  • Like Jens mentioned, there was an #include <pthread.h> needed in one of the headers
  • The CMake variable for including the GTest/GMock headers was incorrectly named; it was singular instead of plural

@beef9999
Author

beef9999 commented Nov 7, 2022

Yes, C++ programs are very sensitive to the environment and platform… We only tested compiling on CentOS and Ubuntu before, and didn't hit the pthread header problem.

I’ll add some instructions about how to run the program with appropriate parameters.

@beef9999
Author

beef9999 commented Nov 7, 2022

Hi everyone, I have updated this issue and added the how-to-reproduce instructions.

About test 2, I deleted this line

In order to ease the server's pressure (for it only enabled one core), I added a 10 μs sleep in the client's send/recv loop.

It's not a MUST DO; I just added it in my own code.

@beef9999
Author

beef9999 commented Nov 7, 2022

juxtaposing the performance variance between epoll and io_uring for 512 + 1 client in test 1... vs equivalent performance in test 2 with that usleep... my intuition is all the test 2 data are poisoned.

That's because the stress is high in streaming mode; a single client can almost fully occupy the server CPU (one core). So I used this method to reduce the server stress.

@GavinRay97

GavinRay97 commented Nov 7, 2022

Yes, C++ programs are very sensitive to the environment and platform… We only tested compiling on CentOS and Ubuntu before, and didn't hit the pthread header problem.

If it's any help, I am running on Fedora 37, compiling with Clang 15, and GCC 12 toolchain (/usr/include/c++/12/)

@beef9999
Author

beef9999 commented Nov 7, 2022

@GavinRay97 Are you able to reproduce my data for test 1 ?

@axboe
Owner

axboe commented Nov 7, 2022

With the actual instructions, I gave it a test spin. From a quick look, you're doing a lot more on the io_uring side than you are on the epoll side. I made the following 2 minute tweaks:

  • Don't arm a timer, use the appropriate io_uring_submit_and_wait_timeout()
  • Register the ring fd
  • Update liburing to something that isn't 1+ years old

and got a 50% increase from that alone. I'm sure there's a lot more that could be done, but I'm pretty skeptical that this is an apples-to-apples epoll vs io_uring test case as it is. Other notes:

  • Use fixed files?
  • Why read on a socket? recv would be more efficient, at least on the io_uring side
  • How are buffers managed? Is it the same on epoll vs io_uring?
  • What are the linked timeouts doing?
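
For illustration, a rough sketch of what the first two tweaks look like with liburing 2.2+ (the 10ms timeout and the coroutine wake-up are placeholders, not the project's actual code):

#include <errno.h>
#include <liburing.h>

void backend_loop(struct io_uring *ring) {
    io_uring_register_ring_fd(ring);   // avoids fdget/fdput on every syscall

    while (1) {
        struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
        struct io_uring_cqe *cqe;
        // Submit pending SQEs and wait for one CQE or the timeout,
        // without arming a separate IORING_OP_TIMEOUT request.
        int ret = io_uring_submit_and_wait_timeout(ring, &cqe, 1, &ts, NULL);
        if (ret < 0 && ret != -ETIME)
            break;

        unsigned head, count = 0;
        io_uring_for_each_cqe(ring, head, cqe) {
            count++;
            // wake the coroutine associated with cqe->user_data here
        }
        io_uring_cq_advance(ring, count);
    }
}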

@axboe
Owner

axboe commented Nov 7, 2022

Another note - lots of receives will have cflags == 0x04 == IORING_CQE_F_SOCK_NONEMPTY, meaning that the socket still had more data after this receive? Is this really a ping-pong test, or is it just blasting data in both directions?
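
For reference, a small sketch of checking that flag on a completed receive (illustrative only, assuming a kernel that reports IORING_CQE_F_SOCK_NONEMPTY):

#include <liburing.h>
#include <stdio.h>

// After a recv CQE completes, cqe->flags tells whether the socket still has queued data.
void inspect_recv_cqe(const struct io_uring_cqe *cqe) {
    if (cqe->res > 0 && (cqe->flags & IORING_CQE_F_SOCK_NONEMPTY))
        printf("recv of %d bytes left more data queued on the socket\n", cqe->res);
}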

@axboe
Owner

axboe commented Nov 7, 2022

We're also spending a ton of time in __vdso_gettimeofday() when run with io_uring, and I see nothing if using epoll. This is about ~10% of the time spent! It's coming off resume_threads().

I'm not going to spend more time on this, there are vast differences between what is being run here and I think some debugging and checking+optimizing of the io_uring side would go a long way toward improving the single thread / single connection disparity.

@beef9999
Author

beef9999 commented Nov 8, 2022

@axboe Thanks for your time. Let me try to answer some of your questions:

  1. Why read on a socket?
    Because the Linux manual says read is identical to recv for sockets. I didn't know io_uring treats them differently (a recv sketch follows this list).

  2. How are buffers managed? Is it the same on epoll vs io_uring?
    They are the same. Both are allocated on the stack. They are not registered with io_uring (and not with epoll either).

  3. What are the linked timeouts doing?
    To replace io_uring_submit_and_wait_timeout, because the fix for the bug I reported before (Why does io_uring_wait_cqe_timeout always have a minimum overhead of 2 milliseconds? #531) was only merged into the latest kernel, and I hadn't had a chance to upgrade my kernel yet.

io_uring_submit_and_wait_timeout used to be invoked in the coroutine scheduling; I wrote this code to replace it:

struct __kernel_timespec ts = get_timeout_ts();
struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
io_uring_prep_timeout(sqe, &ts, 1, 0);  // fires after ts, or after 1 completion
io_uring_submit_and_wait(ring, 1);

  Why is there a performance disparity between these two approaches?

  4. IORING_CQE_F_SOCK_NONEMPTY, meaning that the socket still had more data after this receive?
    Yes, this test is all about the streaming mode client. The socket is filled with continuously arriving data.
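
On the read-vs-recv point from answer 1, here is a minimal sketch of the socket-specific opcode on the io_uring side (buffer management and flags are placeholders):

#include <liburing.h>

// Queue a socket receive with IORING_OP_RECV instead of IORING_OP_READ.
void queue_socket_recv(struct io_uring *ring, int fd, void *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, fd, buf, len, 0);
    io_uring_submit(ring);
}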

I'd like to say something about why this test exists at all. Unlike the traditional usage, if you need to pipeline socket IO (technically speaking, concurrent read/write), then an event engine is necessary. You can hardly find a mature async event engine driven by io_uring in the open source world today, except ours, and I think that's why people haven't hit the streaming client performance issue before.

I believe our old epoll event engine has been optimized quite well; otherwise it wouldn't be able to surpass other IO engines in performance. According to our tests, in streaming mode boost::asio only gets 50% of our throughput. What I mean is that the upper limit is high.

Another interesting thing to mention is that if I use non-blocking fd + io_uring poll + psync read/write, the performance rises to epoll's level as well. That shows my io_uring event engine is capable.

Anyway, I'll keep on optimizing the io_uring code based on your notes. Thank you.


Updated on Nov. 8: I upgraded my kernel to 6.0.7.

  • I can confirm that the 50% performance increase (current QPS is 660K) came from io_uring_submit_and_wait_timeout. The timer is slow, indeed. But there is still a huge gap from 660K to epoll's 1200K, and I don't think any trivial optimization would cover it.
  • Registering the ring fd didn't bring any benefit. This is a single-threaded program.
