Poor performance on file operations #830

Closed
SUPERCILEX opened this issue Mar 22, 2023 · 66 comments

@SUPERCILEX (Author)

I'm hoping to take advantage of io_uring to minimize syscall count, but I can't get anywhere close to non-io_uring performance. My application is a fast version of rm: every directory is independently deleted on its own thread.

Here is my io_uring patch: SUPERCILEX/fuc@a01a22b?w=1. The gist is that I queue up a bunch of unlinks and then submit + wait for them to complete. I have to periodically drain the ring because I'm using getdents64 and passing the unlink path pointers from that directory buffer (so the unlinks must be done before I can make the next getdents call).
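
For reference, a minimal sketch of that pattern with liburing (dirfd and names are placeholders here; per-CQE error handling is omitted):

#include <liburing.h>

/* Batch unlinks for one directory, then drain the ring before the
 * getdents64 buffer backing the name pointers is reused. Assumes n
 * does not exceed the ring's SQ size. */
static int unlink_batch(struct io_uring *ring, int dirfd,
                        const char **names, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_unlinkat(sqe, dirfd, names[i], 0);
    }

    /* One syscall submits all n requests and waits for their completions. */
    int ret = io_uring_submit_and_wait(ring, n);
    if (ret < 0)
        return ret;

    /* Reap the CQEs; real code should check each cqe->res. */
    io_uring_cq_advance(ring, n);
    return 0;
}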

I've experimented with a bunch of options:

  • IORING_SETUP_COOP_TASKRUN and IORING_SETUP_SINGLE_ISSUER don't seem to make a difference.
  • IORING_SETUP_DEFER_TASKRUN is ~10% worse which is surprising.
  • 128 entries seems optimal. With more, things get a little slower. Very surprising is that a low entry count is extremely slow: with only 1 entry we're talking a 3x slowdown, which I don't understand. I'd expect performance on par with non-io_uring since the syscall count should be equivalent.
  • Using either iowq_max_workers = [1, 0] or IO_LINK prevents a performance cliff from every file in a directory being deleted on its own thread.

Ideally, I'd like to be able to tell io_uring to execute the submission queue on the calling thread since that's what will be most efficient. I would have hoped that IORING_SETUP_DEFER_TASKRUN did that, but it does not appear to be the case.


Linux 6.2.6-76060206-generic

Benchmark with hyperfine --warmup 3 -N --prepare "ftzz g -n 100K /tmp/ftzz" "./old /tmp/ftzz" "./new /tmp/ftzz" where old is master and new is the branched binary, both built with cargo b --release (copy from target/release/rmz).

Benchmark 1: ./old /tmp/ftzz
  Time (mean ± σ):      23.9 ms ±   1.6 ms    [User: 10.2 ms, System: 285.4 ms]
  Range (min … max):    21.0 ms …  28.4 ms    59 runs
 
Benchmark 2: ./new /tmp/ftzz
  Time (mean ± σ):      31.8 ms ±   1.8 ms    [User: 4.4 ms, System: 361.6 ms]
  Range (min … max):    28.3 ms …  36.1 ms    49 runs
 
Summary
  './old /tmp/ftzz' ran
    1.33 ± 0.12 times faster than './new /tmp/ftzz'
@redbaron

redbaron commented Mar 23, 2023

You use IORING_SETUP_SINGLE_ISSUER , but share io_uring across threads. Try creating io_uring in each thread

@SUPERCILEX (Author)

Oh yep, I forgot to mention that I already have a uring per thread. IORING_SETUP_ATTACH_WQ didn't seem to make a difference.

@axboe (Owner)

axboe commented Mar 23, 2023

Some opcodes are just never going to be super fast, until they get converted to being able to do nonblocking issue. UNLINK is one of them - it'll always return -EAGAIN on inline issue, which means it gets punted to the internal io-wq pool for handling. This is why it isn't always faster than a sync unlink(), as even if we could perform it without blocking, we're still offloading it to a thread.

The situation might change if you do:

echo 3 > /proc/sys/vm/drop_caches

before the run, as it'd be cache cold at that point. But depending on what you unlink, you would probably still only have a few of the sync unlinks blocking before the metadata is cached.

@SUPERCILEX (Author)

SUPERCILEX commented Mar 23, 2023

Is there documentation that explains the io_uring architecture? I don't know what it would mean for a syscall to be non-blocking (though I'm guessing it means start the work and set up some interrupt to be notified when it's done).

If anything, I think I want blocking behavior because that should remove pointless overhead waiting on queues. It'd be neat if there was a "do it inline anyway" flag, or perhaps that could be inferred from an iowq_max_workers = [1, 0] + submit_and_wait(queue_size) call (though a flag is probably clearer).
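
For context, that cap is set with io_uring_register_iowq_max_workers; a minimal sketch of the [1, 0] setting mentioned above:

/* values[0] caps bounded (disk/file) io-wq workers, values[1] caps
 * unbounded ones; a 0 leaves that limit unchanged and reports the
 * previous value back in the array. */
unsigned int values[2] = { 1, 0 };
int ret = io_uring_register_iowq_max_workers(&ring, values);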

@axboe (Owner)

axboe commented Mar 23, 2023

It'd be neat if there was a "do it inline anyway" flag, or perhaps that could be inferred from an iowq_max_workers = [1, 0] + submit_and_wait(queue_size) call (though a flag is probably clearer).

It can't be inferred from that, but yes I agree, I have considered that. It'd be trivial to add as an IORING_ENTER_INLINE flag or similar.

@axboe (Owner)

axboe commented Mar 23, 2023

I can do a quick hack of that if you want to test it...

@SUPERCILEX (Author)

I can do a quick hack of that if you want to test it...

If you'll let me know how to test it 😅, then sure!

@axboe (Owner)

axboe commented Mar 24, 2023

Something like the below, totally untested...

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1d59c816a5b8..539b1e3ac695 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -442,6 +442,7 @@ struct io_cqring_offsets {
 #define IORING_ENTER_SQ_WAIT		(1U << 2)
 #define IORING_ENTER_EXT_ARG		(1U << 3)
 #define IORING_ENTER_REGISTERED_RING	(1U << 4)
+#define IORING_ENTER_INLINE		(1U << 5)
 
 /*
  * Passed in for io_uring_setup(2). Copied back with updated info on success
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 24be4992821b..dfb52497e49e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2078,12 +2078,11 @@ static void io_queue_async(struct io_kiocb *req, int ret)
 		io_queue_linked_timeout(linked_timeout);
 }
 
-static inline void io_queue_sqe(struct io_kiocb *req)
-	__must_hold(&req->ctx->uring_lock)
+static inline void __io_queue_sqe(struct io_kiocb *req, unsigned issue_flags)
 {
 	int ret;
 
-	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
+	ret = io_issue_sqe(req, issue_flags);
 
 	/*
 	 * We async punt it if the file wasn't marked NOWAIT, or if the file
@@ -2095,6 +2094,12 @@ static inline void io_queue_sqe(struct io_kiocb *req)
 		io_queue_async(req, ret);
 }
 
+static inline void io_queue_sqe(struct io_kiocb *req)
+	__must_hold(&req->ctx->uring_lock)
+{
+	__io_queue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
+}
+
 static void io_queue_sqe_fallback(struct io_kiocb *req)
 	__must_hold(&req->ctx->uring_lock)
 {
@@ -2295,10 +2300,11 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
 }
 
 static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			 const struct io_uring_sqe *sqe)
+			 const struct io_uring_sqe *sqe, bool force_inline)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_link *link = &ctx->submit_state.link;
+	unsigned issue_flags = IO_URING_F_COMPLETE_DEFER;
 	int ret;
 
 	ret = io_init_req(ctx, req, sqe);
@@ -2344,6 +2350,8 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		return 0;
 	}
 
-	io_queue_sqe(req);
+	if (!force_inline)
+		issue_flags |= IO_URING_F_NONBLOCK;
+	__io_queue_sqe(req, issue_flags);
 	return 0;
 }
@@ -2425,7 +2434,7 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 	return false;
 }
 
-int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
+int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, bool force_inline)
 	__must_hold(&ctx->uring_lock)
 {
 	unsigned int entries = io_sqring_entries(ctx);
@@ -2454,7 +2463,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 		 * Continue submitting even for sqe failure if the
 		 * ring was setup with IORING_SETUP_SUBMIT_ALL
 		 */
-		if (unlikely(io_submit_sqe(ctx, req, sqe)) &&
+		if (unlikely(io_submit_sqe(ctx, req, sqe, force_inline)) &&
 		    !(ctx->flags & IORING_SETUP_SUBMIT_ALL)) {
 			left--;
 			break;
@@ -3465,7 +3474,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 
 	if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
 			       IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG |
-			       IORING_ENTER_REGISTERED_RING)))
+			       IORING_ENTER_REGISTERED_RING |
+			       IORING_ENTER_INLINE)))
 		return -EINVAL;
 
 	/*
@@ -3521,7 +3531,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			goto out;
 
 		mutex_lock(&ctx->uring_lock);
-		ret = io_submit_sqes(ctx, to_submit);
+		ret = io_submit_sqes(ctx, to_submit, flags & IORING_ENTER_INLINE);
 		if (ret != to_submit) {
 			mutex_unlock(&ctx->uring_lock);
 			goto out;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2711865f1e19..ff90ebb33850 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -62,7 +62,7 @@ int io_uring_alloc_task_context(struct task_struct *task,
 				struct io_ring_ctx *ctx);
 
 int io_poll_issue(struct io_kiocb *req, bool *locked);
-int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
+int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, bool force_inline);
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin);
 void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node);
 int io_req_prep_async(struct io_kiocb *req);
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 0119d3f1a556..fcf1f1753fde 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -190,7 +190,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		 */
 		if (to_submit && likely(!percpu_ref_is_dying(&ctx->refs)) &&
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
-			ret = io_submit_sqes(ctx, to_submit);
+			ret = io_submit_sqes(ctx, to_submit, false);
 		mutex_unlock(&ctx->uring_lock);
 
 		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))

@axboe (Owner)

axboe commented Mar 24, 2023

Easiest way to test would be to just enable it unconditionally in liburing and recompile the library and your test app. Something like this, again utterly untested:

diff --git a/src/queue.c b/src/queue.c
index b784b10c0996..c1525eb89ce1 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -69,7 +69,7 @@ static int _io_uring_get_cqe(struct io_uring *ring,
 
 	do {
 		bool need_enter = false;
-		unsigned flags = 0;
+		unsigned flags = (1U << 5);
 		unsigned nr_available;
 		int ret;
 
@@ -93,7 +93,7 @@ static int _io_uring_get_cqe(struct io_uring *ring,
 			need_enter = true;
 		}
 		if (data->wait_nr > nr_available || need_enter) {
-			flags = IORING_ENTER_GETEVENTS | data->get_flags;
+			flags |= IORING_ENTER_GETEVENTS | data->get_flags;
 			need_enter = true;
 		}
 		if (sq_ring_needs_enter(ring, data->submit, &flags))
@@ -378,6 +378,7 @@ static int __io_uring_submit(struct io_uring *ring, unsigned submitted,
 			flags |= IORING_ENTER_GETEVENTS;
 		if (ring->int_flags & INT_FLAG_REG_RING)
 			flags |= IORING_ENTER_REGISTERED_RING;
+		flags |= (1U << 5);
 
 		ret = __sys_io_uring_enter(ring->enter_ring_fd, submitted,
 					   wait_nr, flags, NULL);

We'd probably want this to be a specific io_uring_submit_foo() helper instead, but this will do for testing. Would be interesting to re-run your unlink test with that.

@SUPERCILEX (Author)

For that first patch, is there a way to include it in a live kernel? Or do I need to build and boot my own kernel? Any pointers would be appreciated.

@axboe (Owner)

axboe commented Mar 24, 2023

You need to patch and build the kernel, I'm afraid. If you want, send me your test app and I can run it here and we can compare. That might be easier.

@SUPERCILEX (Author)

SUPERCILEX commented Mar 24, 2023

If you want, send me your test app and I can run it here and we can compare. That might be easier.

Yes please :). If you'd like, I can build binaries for you provided a target architecture (I assume x86_64?). Otherwise:

# Install rust if necessary: https://doc.rust-lang.org/cargo/getting-started/installation.html
git clone [email protected]:SUPERCILEX/fuc.git
cd fuc
git checkout io_uring_rmz

cargo +nightly build --release --workspace
cp target/release/rmz force

git checkout a01a22
cargo +nightly build --release --workspace
cp target/release/rmz patchless

# Benchmark
cargo install hyperfine
hyperfine --warmup 3 -N --prepare "ftzz g -n 100K /tmp/ftzz" "./patchless /tmp/ftzz" "./force /tmp/ftzz"

Note that I disabled IOSQE_IO_LINK chaining since I assume it's not necessary if everything is run inline?

@axboe (Owner)

axboe commented Mar 24, 2023

axboe@r7525 ~/gi/fuc (io_uring_rmz)> hyperfine --warmup 3 -N --prepare "ftzz g -n 1000K /dev/shm/ftzz" "taskset -c 0 ./patchless /dev/shm/ftzz" "taskset -c 0 ./force /dev/shm/ftzz"
Benchmark 1: taskset -c 0 ./patchless /dev/shm/ftzz
  Time (mean ± σ):      2.078 s ±  0.038 s    [User: 0.008 s, System: 2.066 s]
  Range (min … max):    2.019 s …  2.127 s    10 runs
 
Benchmark 2: taskset -c 0 ./force /dev/shm/ftzz
  Time (mean ± σ):      1.808 s ±  0.033 s    [User: 0.012 s, System: 1.785 s]
  Range (min … max):    1.771 s …  1.882 s    10 runs
 
Summary
  'taskset -c 0 ./force /dev/shm/ftzz' ran
    1.15 ± 0.03 times faster than 'taskset -c 0 ./patchless /dev/shm/ftzz'

@axboe (Owner)

axboe commented Mar 24, 2023

Not sure how representative this is on tmpfs, as profiles show we're spending most of the time on grabbing a lock for eviction:

+   85.36%  force    [kernel.vmlinux]  [k] queued_spin_lock_slowpath

@SUPERCILEX (Author)

Going to assume that's why you pinned the benches to a core.

Your patch does fix the slowdown! I ran the same benchmark but between uring and non-uring (with /tmp as a tmpfs) and found that uring was 15% slower than non-uring. Since your results show a 15% speedup with force, I'm pretty sure that puts uring at parity with normal syscalls.

Now whether or not uring would actually be faster for rm isn't clear. I think the cp case has a lot more potential since it has way more syscalls: open-stat-open-copy_file_range-close-close. With fixed files, the two closes can be deleted entirely, open-stat can be submitted in one go, and so can open-copy_file_range. That should give uring a pretty big efficiency advantage that I would hope translates to better performance.

$ hyperfine --warmup 3 -N --prepare "ftzz g -n 1M /tmp/ftzz" "taskset -c 0 ./normal /tmp/ftzz" "taskset -c 0 ./uring /tmp/ftzz" 
Benchmark 1: taskset -c 0 ./normal /tmp/ftzz
  Time (mean ± σ):      1.675 s ±  0.098 s    [User: 0.053 s, System: 1.521 s]
  Range (min … max):    1.606 s …  1.834 s    10 runs
 
Benchmark 2: taskset -c 0 ./uring /tmp/ftzz
  Time (mean ± σ):      1.928 s ±  0.012 s    [User: 0.010 s, System: 1.853 s]
  Range (min … max):    1.919 s …  1.959 s    10 runs
 
Summary
  'taskset -c 0 ./normal /tmp/ftzz' ran
    1.15 ± 0.07 times faster than 'taskset -c 0 ./uring /tmp/ftzz'
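
On the cp sequence described above, the open + stat could plausibly go out as one linked submission using direct (fixed) descriptors; a sketch under those assumptions (slot management via io_uring_register_files_sparse is elided, and note copy_file_range has no io_uring opcode at the time of writing, so that step would still be a plain syscall or need a new opcode):

struct io_uring_sqe *sqe;
struct statx stx;

/* Open the source into fixed file slot 0, linked to a statx on it. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_openat_direct(sqe, AT_FDCWD, "src", O_RDONLY, 0, 0);
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_statx(sqe, AT_FDCWD, "src", 0, STATX_BASIC_STATS, &stx);

io_uring_submit_and_wait(&ring, 2);

The close can then be retired ring-side with io_uring_prep_close_direct instead of a close(2) syscall.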

@axboe (Owner)

axboe commented Mar 24, 2023

Also note that my testing was done with all kinds of security mitigations turned off; obviously the batching done would greatly benefit with mitigations turned ON, as the syscalls and switches become more expensive there. IOW, it's the worst case for the comparison.

I think the main questions to ponder here are:

  1. Should this be an io_uring_enter(2) flag, or should it be a ring setup flag? If the former, we probably want to use a FEAT flag for it too.
  2. What's a good name? INLINE kind of describes it, but only if you know the internals. Suggestions welcome!

I changed various things in the patch I tested; once we get a final design nailed down, I'll post them as well.

@axboe (Owner)

axboe commented Mar 24, 2023

I think the cp case has a lot more potential since it has way more syscalls: open-stat-open-copy_file_range-close-close. With fixed files, the two closes can be deleted entirely, open-stat can be submitted in one go, and so can open-copy_file_range. That should give uring a pretty big efficiency advantage that I would hope translates to better performance.

Definitely agree.

@SUPERCILEX (Author)

Should this be an io_uring_enter(2) flag, or should it be a ring setup flag? If the former, we probably want to use a FEAT flag for it too.

Is it possible to put the flag in io_uring_sqe entries? So each entry can choose where it wants to go? Otherwise, io_uring_enter seems more flexible since you could decide to change it up without having to recreate the uring.

What's a good name? INLINE kind of describes it, but only if you know the internals. Suggestions welcome!

Maybe NO_ASYNC? Since we already have ASYNC, it kinda makes sense to have a don't-do-that option. But I'm not sure if it's an accurate name.

@axboe (Owner)

axboe commented Mar 25, 2023

NO_ASYNC isn't great, as it may very well be async. Eg if you can queue it up for normal async execution, then you'd still do that. The point of the flag is just to do those opcodes that would otherwise always punt to io-wq inline, instead of punting them to io-wq. NO_OFFLOAD? NO_IOWQ? NO_IO_WORKER?

@SUPERCILEX (Author)

Gotya, INLINE, NO_OFFLOAD, and NO_IOWQ all seem pretty reasonable. No real preference unfortunately. :)

@axboe (Owner)

axboe commented Mar 25, 2023

It could go in the sqe, but there's only one flag left for use. And I don't think people will generally mix and match these - either whatever is using the ring is fine with potential submission stalls due to blocking, or it's not. Hence I do think a setup flag would be just fine, but the enter flag is a bit more flexible. Per-sqe flag probably going too far.

I'm partial to NO_IOWQ so I'll spin it like that.

@SUPERCILEX (Author)

Makes sense about the sqes. I still think this flag would be more useful with enter (for example maybe the copy_file_ranges would be faster if executed in parallel), but I'm not sure how this feature would interact with IORING_SETUP_SQPOLL as they seem incompatible.

@axboe (Owner)

axboe commented Mar 25, 2023

Agree, enter is more useful for sure. But yes, for SQPOLL, it'd have to be a setup flag.

@SUPERCILEX (Author)

SUPERCILEX commented Mar 25, 2023

But yes, for SQPOLL, it'd have to be a setup flag.

As in putting the NO_IOWQ flag on enter is a problem? If not, I don't see why you'd use SQPOLL with NO_IOWQ, so putting NO_IOWQ on enter should be fine.

@axboe (Owner)

axboe commented Mar 25, 2023

Not sure that SQPOLL and NO_IOWQ would be mutually exclusive, though.

@SUPERCILEX (Author)

Hmmm, this is kinda gross, but could enter be stateful for SQPOLL? As in it says "everything processed after now will be NO_IOWQ." The semantics of when "now" is seem unpleasant to pin down. That's where making the flag part of each entry simplifies things, but using up the last flag does seem a little excessive. So I guess being part of setup is simplest, but forcing the uring to stay NO_IOWQ is a bit of a bummer.

@axboe (Owner)

axboe commented Mar 26, 2023

You can't really synchronize with SQPOLL like that. By the time you've filled in the SQEs you want to submit, SQPOLL could already be processing them. Or vice versa, you set it and it's still processing entries from previously. I think that'd be a mess to try and coordinate, and you really can't unless you have the thread idle on either side of that sequence. Or use one of the SQ ring flags for it, which would still be racy. For SQPOLL, if we don't have SQE flags, then it'd have to be a per-instance kind of thing. In practice, a ring setup flag would be fine. If you have a need for both kinds of behavior, you could always set up two rings and have them share the SQPOLL thread. Then while processing ring A it could be in NO_IOWQ mode, and while processing ring B it would be in the normal mode. Without an SQE flag, SQPOLL would have to be set up a bit differently.

For the non-SQPOLL case, I don't like the ring setup flag at all, I do think the enter flag is a lot cleaner.
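
For what it's worth, the two-ring arrangement described above would look roughly like this (a sketch; the per-ring NO_IOWQ mode itself is hypothetical):

/* Ring A owns the SQPOLL thread (and would hypothetically run NO_IOWQ). */
struct io_uring ring_a, ring_b;
struct io_uring_params pa = { .flags = IORING_SETUP_SQPOLL };
io_uring_queue_init_params(128, &ring_a, &pa);

/* Ring B attaches to ring A, so both share the same SQPOLL thread. */
struct io_uring_params pb = {
    .flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ,
    .wq_fd = (__u32)ring_a.ring_fd,
};
io_uring_queue_init_params(128, &ring_b, &pb);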

@FrankReh (Contributor)

The reasons some would use this per enter call and others per ring seem clear enough.
The reason for a name like NO_IOWQ also seemed clear: it causes the operations found in the submission queue to run in an immediate mode, without punting to the worker queue if the operation has to wait.

But the comment

if you can queue it up for normal async execution, then you'd still do that. The point of the flag is just to do those opcodes that would otherwise always punt to io-wq inline, instead of punting them to io-wq.

left me wondering what was meant. The man pages generally don't talk about which operations can be queued up for normal async execution, do they? When this feature lands, will it be easy to describe for which operations this is useful?

I think here, as a first use case, the emphasis has been on how to make unlink more efficient when many unlinks can be batched. What other operations or operation types might one consider this new feature for? Answers can wait for the man page(s). Maybe how the new feature affects things like linked operations will also be clear then.

Thank you.
-- Interested Bystander

@SUPERCILEX (Author)

@axboe so are you suggesting we have the same flag in both setup and enter? If so, that seems pretty reasonable to me.

@FrankReh I like the idea of documenting which ops are pollable, that'd be helpful.

@SUPERCILEX (Author)

🤩 This is exciting, thank you!

@axboe (Owner)

axboe commented Apr 19, 2023

@SUPERCILEX Can you test it? As from the posting, you can either just setup the ring with IORING_SETUP_NO_OFFLOAD or you can use IORING_ENTER_NO_OFFLOAD with io_uring_enter(2). That should cover all bases, SQPOLL or not.
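
(For anyone testing along: usage would presumably look like the below. The constant comes from the posted patches, not a released uapi header, so the bit value here is a placeholder.)

/* Hypothetical: take the real value from the patched kernel headers. */
#ifndef IORING_SETUP_NO_OFFLOAD
#define IORING_SETUP_NO_OFFLOAD (1U << 15) /* placeholder bit */
#endif

struct io_uring ring;
struct io_uring_params p = { .flags = IORING_SETUP_NO_OFFLOAD };
io_uring_queue_init_params(128, &ring, &p);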

@SUPERCILEX (Author)

Does giving you a commit to benchmark on your machine work again? If so, I can get that done tomorrow-ish. Otherwise, we can punt this to June when I'll have the spare capacity to reinstall my system if things go wrong. :)

@axboe (Owner)

axboe commented Apr 21, 2023

That works

@SUPERCILEX (Author)

Done!

# Install rust if necessary: https://doc.rust-lang.org/cargo/getting-started/installation.html
cargo install ftzz hyperfine

# Clone
git clone [email protected]:SUPERCILEX/fuc.git
cd fuc

# Build targets
cargo +nightly build --release --workspace
cp target/release/rmz no_uring

git checkout io_uring_rmz
cargo +nightly build --release --workspace
cp target/release/rmz no_offload

git checkout HEAD^
cargo +nightly build --release --workspace
cp target/release/rmz offload

# Benchmark
hyperfine --warmup 10 -N --prepare "ftzz g -n 100K /dev/shm/ftzz" "./no_uring /dev/shm/ftzz" "./no_offload /dev/shm/ftzz" "./offload /dev/shm/ftzz"

@SUPERCILEX (Author)

PS: forgot to mention I force pushed a bunch of stuff, so if you still have fuc hanging around probs easiest to delete it and start fresh.

@axboe (Owner)

axboe commented Apr 21, 2023

Side note - you'll want to use non-signal based notifications for io_uring, or you will slow things down. The below is with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER set for ring creation. Didn't get to IORING_SETUP_COOP_TASKRUN | IORING_SETUP_SINGLE_ISSUER, but I would expect similar performance from that. To make things a bit more realistic, I used an actual drive with XFS, and we drop caches after the directory creations.

axboe@7950x ~/g/fuc ((37638409))> taskset -c 0 hyperfine --warmup 10 -N --prepare "./prep.sh" "./no_uring /data1/ftzz" "./no_offload /data1/ftzz" "./offload /data1/ftzz"
Benchmark 1: ./no_uring /data1/ftzz
  Time (mean ± σ):     348.3 ms ±   2.7 ms    [User: 2.7 ms, System: 198.2 ms]
  Range (min … max):   343.8 ms … 352.5 ms    10 runs
 
Benchmark 2: ./no_offload /data1/ftzz
  Time (mean ± σ):     337.3 ms ±   2.6 ms    [User: 3.1 ms, System: 152.1 ms]
  Range (min … max):   334.2 ms … 342.4 ms    10 runs
 
Benchmark 3: ./offload /data1/ftzz
  Time (mean ± σ):     434.1 ms ±   2.9 ms    [User: 2.2 ms, System: 283.9 ms]
  Range (min … max):   430.6 ms … 438.4 ms    10 runs
 
Summary
  './no_offload /data1/ftzz' ran
    1.03 ± 0.01 times faster than './no_uring /data1/ftzz'
    1.29 ± 0.01 times faster than './offload /data1/ftzz'
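
For reference, creating a ring with those flags is just a params change (sketch; DEFER_TASKRUN requires SINGLE_ISSUER and a 6.1+ kernel, while COOP_TASKRUN is the 5.19+ alternative):

struct io_uring ring;
struct io_uring_params p = {
    .flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN,
};
/* Fails with -EINVAL on kernels without these flags; a fallback could
 * retry with IORING_SETUP_COOP_TASKRUN instead of DEFER_TASKRUN. */
int ret = io_uring_queue_init_params(128, &ring, &p);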

@SUPERCILEX (Author)

Damn, that no_offload vs offload performance is insane! And we finally beat plain syscalls!

Regarding IORING_SETUP_DEFER_TASKRUN, I think it's somehow broken. I get 3x (you read that right) worse performance when I add that flag. If you want to play with it, you can add the flag here and here (it's called setup_defer_taskrun).

@axboe (Owner)

axboe commented Apr 21, 2023

You might want to debug that separately; as mentioned, my testing was done with IORING_SETUP_DEFER_TASKRUN. I didn't modify your code as I could not quickly find where it even sets up the ring, so I just hacked the kernel to set those two flags mentioned by default.

@SUPERCILEX (Author)

so I just hacked the kernel to set those two flags mentioned by default.

Ah, gotya. I'll investigate sometime before shipping the io_uring implementation.

@axboe (Owner)

axboe commented Apr 21, 2023

Also possible I messed it up somehow, but seems to be the right kernel and the change was pretty trivial:

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 28144cce1fe5..dd0ccb93707d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3984,6 +3984,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			return -EINVAL;
 	}
 
+	p.flags |= IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER;
+
 	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
 			IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
 			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |

and I didn't see any issues with that running the binaries from earlier.

@SUPERCILEX (Author)

SUPERCILEX commented Apr 21, 2023

Hmmm, that looks fine. My stuff also looks right though.

Without defer:

108:[pid 28656] io_uring_setup(128, {flags=0x1100 /* IORING_SETUP_??? */, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=128, cq_entries=256, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS|0x1800, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=4416}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=280}}) = 3
114:[pid 28656] io_uring_enter(3, 1, 1, IORING_ENTER_GETEVENTS, NULL, 128) = 1

With defer:

106:[pid 28641] io_uring_setup(128, {flags=0x3100 /* IORING_SETUP_??? */, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=128, cq_entries=256, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS|0x1800, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=4416}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=280}}) = 3
112:[pid 28641] io_uring_enter(3, 1, 1, IORING_ENTER_GETEVENTS, NULL, 128) = 1

The flag bits match up.

Perf without defer:

+   12.16%     0.25%  old              [kernel.kallsyms]                        [k] io_submit_sqes
+   11.86%     0.00%  swapper          [kernel.kallsyms]                        [k] secondary_startup_64_no_verify
+   11.86%     0.05%  swapper          [kernel.kallsyms]                        [k] cpu_startup_entry
+   11.82%     0.00%  swapper          [kernel.kallsyms]                        [k] do_idle
+   11.30%     0.00%  swapper          [kernel.kallsyms]                        [k] start_secondary
+   10.39%     0.05%  swapper          [kernel.kallsyms]                        [k] cpuidle_idle_call
+   10.05%     0.00%  swapper          [kernel.kallsyms]                        [k] cpuidle_enter
+   10.05%     0.00%  swapper          [kernel.kallsyms]                        [k] cpuidle_enter_state
+   10.00%    10.00%  swapper          [kernel.kallsyms]                        [k] mwait_idle_with_hints.constprop.0
+    9.95%     0.20%  old              [kernel.kallsyms]                        [k] io_init_req
+    9.85%     0.20%  old              [kernel.kallsyms]                        [k] getname_flags.part.0
+    9.80%     0.15%  old              [kernel.kallsyms]                        [k] getname
+    9.75%     0.15%  old              [kernel.kallsyms]                        [k] io_unlinkat_prep

With defer:

+   26.12%     0.20%  new              [kernel.kallsyms]                          [k] io_cqring_wait
+   21.50%     0.00%  swapper          [kernel.kallsyms]                          [k] secondary_startup_64_no_verify
+   21.45%     0.06%  swapper          [kernel.kallsyms]                          [k] cpu_startup_entry
+   21.26%     0.29%  swapper          [kernel.kallsyms]                          [k] do_idle
+   20.24%     0.00%  swapper          [kernel.kallsyms]                          [k] start_secondary
+   15.08%     0.18%  new              [kernel.kallsyms]                          [k] io_run_local_work
+   14.72%     0.17%  new              [kernel.kallsyms]                          [k] io_run_task_work_sig
+   14.65%     0.63%  new              [kernel.kallsyms]                          [k] __io_run_local_work
+   12.99%     0.10%  new              [kernel.kallsyms]                          [k] io_req_task_submit
+   12.93%     0.15%  new              [kernel.kallsyms]                          [k] io_queue_iowq
+   12.80%     0.10%  new              [kernel.kallsyms]                          [k] io_queue_async
+   11.11%     0.24%  swapper          [kernel.kallsyms]                          [k] cpuidle_idle_call
+    9.95%     0.12%  new              [kernel.kallsyms]                          [k] schedule_hrtimeout
+    9.82%     0.14%  new              [kernel.kallsyms]                          [k] schedule_hrtimeout_range_clock
+    9.74%     0.00%  swapper          [kernel.kallsyms]                          [k] cpuidle_enter
+    9.63%     0.11%  swapper          [kernel.kallsyms]                          [k] cpuidle_enter_state
+    8.73%     0.10%  new              [kernel.kallsyms]                          [k] io_wq_enqueue
+    8.46%     0.19%  new              [kernel.kallsyms]                          [k] io_wqe_enqueue
+    8.20%     0.51%  new              [kernel.kallsyms]                          [k] __schedule
+    8.04%     0.04%  new              [kernel.kallsyms]                          [k] schedule
+    7.92%     0.40%  new              [kernel.kallsyms]                          [k] io_wqe_activate_free_worker
+    7.40%     0.00%  new              [kernel.kallsyms]                          [k] wake_up_process
+    7.38%     0.40%  new              [kernel.kallsyms]                          [k] try_to_wake_up
+    7.22%     7.22%  swapper          [kernel.kallsyms]                          [k] mwait_idle_with_hints.constprop.0
+    5.18%     0.25%  swapper          [kernel.kallsyms]                          [k] __flush_smp_call_function_queue

So it looks like defer includes a ton of scheduling contention that isn't there normally.

@SUPERCILEX (Author)

Yeah, the path to io_uring is different and immediately hits a wait.

Without defer:

-   21.67%     0.00%  old              [kernel.kallsyms]                        [k] entry_SYSCALL_64_after_hwframe
   - entry_SYSCALL_64_after_hwframe
      - do_syscall_64
         - 13.35% __x64_sys_io_uring_enter
            - __do_sys_io_uring_enter
               + 12.16% io_submit_sqes
               + 1.14% io_cqring_wait
         + 4.32% __x64_sys_getdents64
         + 1.55% __x64_sys_openat
         + 0.90% __x64_sys_unlinkat

With defer:

-   32.38%     0.00%  new              [kernel.kallsyms]                          [k] entry_SYSCALL_64_after_hwframe
     entry_SYSCALL_64_after_hwframe
   - do_syscall_64
      - 29.93% __x64_sys_io_uring_enter
         - __do_sys_io_uring_enter
            - 26.12% io_cqring_wait
               + 14.68% io_run_task_work_sig
               + 9.82% schedule_hrtimeout
               + 0.79% io_run_local_work
                 0.53% prepare_to_wait_exclusive
            + 3.64% io_submit_sqes
      + 1.23% __x64_sys_getdents64

@axboe (Owner)

axboe commented Apr 21, 2023

Yep, that doesn't look good. I am running the patches on top of what is pending for the 6.4 release, which does include a set of patches making the wait + task_work run handling more efficient. I'll give it a whirl with the 6.3 kernel instead.

What kernel are you running?

@SUPERCILEX (Author)

SUPERCILEX commented Apr 21, 2023

Linux pop-os 6.2.6-76060206-generic #202303130630~1681329778~22.04~d824cd4 SMP PREEMPT_DYNAMIC Wed A x86_64 x86_64 x86_64 GNU/Linux

@SUPERCILEX (Author)

Hey, actually I wonder if it's because you have taskset -c 0? I'm testing this with all cores gunning, but if you run things on a single thread there shouldn't be any contention, right? So maybe taskset -c 0-8 would surface the bad behavior?

@axboe (Owner)

axboe commented Apr 21, 2023

Could indeed be that. What is your threading setup like? Do you have multiple threads accessing the same ring?

@axboe (Owner)

axboe commented Apr 21, 2023

Without affinitizing, ran a quick test and here's the profile (for "no_offload"):

+   80.42%  no_offload  [kernel.vmlinux]  [k] queued_spin_lock_slowpath              
+    1.44%  no_offload  [kernel.vmlinux]  [k] shmem_undo_range                       
+    1.38%  no_offload  [kernel.vmlinux]  [k] destroy_inode                          
+    1.05%  no_offload  [kernel.vmlinux]  [k] shmem_evict_inode                      
+    0.77%  no_offload  [kernel.vmlinux]  [k] generic_permission                     
+    0.72%  no_offload  [kernel.vmlinux]  [k] shmem_destroy_inode                    
+    0.64%  no_offload  [kernel.vmlinux]  [k] drop_nlink                             
+    0.62%  no_offload  [kernel.vmlinux]  [k] try_to_unlazy                          
     0.49%  no_offload  [kernel.vmlinux]  [k] make_vfsgid                            

which is all just inode eviction in shmfs locking. If I affinitize to cpu 0, it looks much better:

+   11.28%  no_offload  [kernel.vmlinux]  [k] generic_permission
+    7.19%  no_offload  [kernel.vmlinux]  [k] __filemap_get_folio
+    5.99%  no_offload  [kernel.vmlinux]  [k] shmem_free_in_core_inode
+    5.60%  no_offload  [kernel.vmlinux]  [k] filldir64
+    4.74%  no_offload  [kernel.vmlinux]  [k] i_callback
+    3.09%  no_offload  [kernel.vmlinux]  [k] kmem_cache_free
+    3.00%  no_offload  [kernel.vmlinux]  [k] iput
+    2.94%  no_offload  [kernel.vmlinux]  [k] strncpy_from_user
+    2.27%  no_offload  [kernel.vmlinux]  [k] dput
+    2.23%  no_offload  [kernel.vmlinux]  [k] shmem_unlink
     2.08%  no_offload  [kernel.vmlinux]  [k] destroy_inode

@axboe (Owner)

axboe commented Apr 21, 2023

A better test for the efficiency gains of NO_OFFLOAD might be to run IORING_OP_STATX or similar instead; for various unlink cases we're mostly bottlenecked on the fs anyway.

You could test IORING_SETUP_COOP_TASKRUN | IORING_SETUP_SINGLE_ISSUER as well; that'd avoid the local task work but still get rid of the rude signaling.
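
For illustration, the same batching pattern with statx instead of unlink might look like this (sketch; dirfd and names as in the earlier unlink example, n assumed <= 64):

struct statx stx[64];

for (unsigned i = 0; i < n; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_statx(sqe, dirfd, names[i], 0, STATX_BASIC_STATS, &stx[i]);
}
io_uring_submit_and_wait(ring, n);
io_uring_cq_advance(ring, n); /* check each cqe->res in real code */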

@axboe (Owner)

axboe commented Apr 21, 2023

OK, looks like you do have many threads, trying to parallelize the unlink. That will certainly run into lots of resource contention on the filesystem.

Edit: this is regardless of whether you're using io_uring in those threads or just doing unlink(2) from them, the issue method won't change that much.

@SUPERCILEX (Author)

Do you have multiple threads accessing the same ring?

Yes and no. I have one uring per thread, but since my application already takes full advantage of the machine's available parallelism, I share the wq with IORING_SETUP_ATTACH_WQ. So it would make sense that there's contention since all threads are going to the same wq, but what doesn't make sense is why adding IORING_SETUP_DEFER_TASKRUN would make it surface differently.

Note that I've tried various combinations of this including not sharing the wq, limiting its parallelism with io_uring_register_iowq_max_workers, and using IOSQE_IO_LINK to prevent wq parallelism.

The implementation we've been benchmarking submits all requests as one long IOSQE_IO_LINK chain on each per-thread uring, with the io-wq shared across those rings via IORING_SETUP_ATTACH_WQ.

IORING_SETUP_COOP_TASKRUN | IORING_SETUP_SINGLE_ISSUER

Yup, already using these.

might be to run IORING_OP_STATX

Do you want me to tweak the benchmark to run that instead?
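
For context on the wq sharing described above, the per-thread setup looks roughly like this (sketch; first_ring_fd stands in for the fd of the first thread's ring):

/* Each worker thread creates its own ring but attaches to one shared io-wq. */
struct io_uring ring;
struct io_uring_params p = {
    .flags = IORING_SETUP_ATTACH_WQ | IORING_SETUP_COOP_TASKRUN |
             IORING_SETUP_SINGLE_ISSUER,
    .wq_fd = (__u32)first_ring_fd,
};
io_uring_queue_init_params(128, &ring, &p);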

@axboe (Owner)

axboe commented Apr 21, 2023

but what doesn't make sense is why adding IORING_SETUP_DEFER_TASKRUN would make it surface differently.

Agree, and unless I'm mistaken, you should have zero task_work activity with this test anyway for the inline submission. That'd only show up for io-wq related stuff. Which leads me to now think that your profile was for the offloaded case, and not no_offload?

In general, io-wq is not a fast path; this is why we're even discussing doing this inline thing. For most opcodes, io-wq is just a fallback, and we should never really hit it for any pure disk or networked IO. However, some requests depend on it because there's no way to do an async issue and then post the completion. NO_OFFLOAD is mostly a way for those requests to use io_uring for syscall reduction, as you can bundle a ton of requests into a single syscall.

might be to run IORING_OP_STATX

Do you want me to tweak the benchmark to run that instead?

I think it'd be useful as it'd provide a much better way to measure the wins of bundling inline submissions, by eliminating a lot of external contention.

@SUPERCILEX (Author)

Which leads me to now think that your profile was for the offloaded case, and not no_offload?

Oh sorry, yes: in case it wasn't clear, I'm not running your patches, so all of this is equivalent to offload.

I think it'd be useful as it'd provide a much better way to measure the wins of bundling inline submissions, by eliminating a lot of external contention.

Cool beans. Away from my desktop rn, but should have this ready in an hour or two.

@SUPERCILEX (Author)

Oh sorry, yes: in case it wasn't clear, I'm not running your patches, so all of this is equivalent to offload.

Hmmm, but still why does defer tank performance so drastically when we're offloading?

@axboe (Owner)

axboe commented Apr 21, 2023

Hmmm, but still why does defer tank performance so drastically when we're offloading?

Not sure, but honestly it's also not that important, as the use case isn't very interesting to begin with. My quick guess would be a bunch of different io-wq threads each wanting to add task_work to the one original issuing task, which would imply locking for DEFER_TASKRUN.

@SUPERCILEX (Author)

Gotya, sg.

@axboe (Owner)

axboe commented Apr 21, 2023

FWIW, not seeing that same contention here with offload and DEFER_TASKRUN, but that may be the newer kernel helping out. Or it may be something else entirely...

@SUPERCILEX (Author)

Ok, I force pushed the io_uring_rmz branch. Commit SUPERCILEX/fuc@3fdccb8 contains a setup_attach_wq,setup_coop_taskrun,setup_single_issuer statx workload while the next commit (SUPERCILEX/fuc@532b76a) contains the same workload but with setup_coop_taskrun,setup_single_issuer,setup_defer_taskrun,setup_no_offload instead. I didn't include a non-uring statx workload, but can if you'd like to see that.

@the8472

the8472 commented May 8, 2023

for various unlink cases we're mostly bottlenecked on the fs anyway.

Since the goal here is batching, how much potential would there be in coalescing multiple unlinks into one operation?

@axboe axboe closed this as completed Oct 2, 2024