Poor performance on file operations #830
You use …
Oh yep, I forgot to mention that I already have a uring per thread. IORING_SETUP_ATTACH_WQ didn't seem to make a difference.
Some opcodes are just never going to be super fast, until they get converted to being able to do nonblocking issue. UNLINK is one of them - it'll always return -EAGAIN on inline issue, which means it gets punted to the internal io-wq pool for handling. This is why it isn't always faster than a sync unlink(), as even if we could perform it without blocking, we're still offloading it to a thread. The situation might change if you do `echo 3 > /proc/sys/vm/drop_caches` before the run, as it'd be cache cold at that point. But depending on what you unlink, you would probably still only have a few of the sync unlinks blocking before the metadata is cached.
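For reference, a minimal liburing sketch of the operation under discussion: a single IORING_OP_UNLINKAT submitted through the ring, which per the above gets offloaded to io-wq today. The path is a placeholder and error handling is trimmed:

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	/* UNLINKAT has no nonblocking issue path, so the kernel
	 * punts this request to an io-wq worker thread. */
	io_uring_prep_unlinkat(sqe, AT_FDCWD, "/tmp/some-file", 0);

	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	printf("unlink result: %d\n", cqe->res);	/* 0 or -errno */
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}
```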
Is there documentation that explains the io_uring architecture? I don't know what it would mean for a syscall to be non-blocking (though I'm guessing it means start the work and set up some interrupt to be notified when it's done). If anything, I think I want blocking behavior because that should remove pointless overhead waiting on queues. It'd be neat if there was a "do it inline anyway" flag or perhaps that can be inferred from a …
It can't be inferred from that, but yes I agree, I have considered that. It'd be trivial to add as an IORING_ENTER_INLINE flag or similar.
I can do a quick hack of that if you want to test it...
If you'll let me know how to test it 😅, then sure!
Something like the below, totally untested...
Easiest way to test would be to just enable it unconditionally in liburing and recompile the library and your test app. Something like this, again utterly untested:
We'd probably want this to be a specific io_uring_submit_foo() helper instead, but this will do for testing. Would be interesting to re-run your unlink test with that.
For that first patch, is there a way to include it in a live kernel? Or do I need to build and boot my own kernel? Any pointers would be appreciated.
You need to patch and build the kernel, I'm afraid. If you want, send me your test app and I can run it here and we can compare. That might be easier.
Yes please :). If you'd like, I can build binaries for you provided a target architecture (I assume x86_64?). Otherwise:
Note that I disabled …
Not sure how representative this is on tmpfs, as profiles show we're spending most of the time on grabbing a lock for eviction:
Going to assume that's why you pinned the benches to a core. Your patch does fix the slowdown! I ran the same benchmark but between uring and non-uring (with /tmp as a tmpfs) and found that uring was 15% slower than non-uring. Since your results show a 15% speedup with …

Now whether or not uring would actually be faster for rm isn't clear. I think the cp case has a lot more potential since it has way more syscalls: open-stat-open-copy_file_range-close-close. With fixed files, the two closes can be deleted entirely, open-stat can be submitted in one go, and so can open-copy_file_range. That should give uring a pretty big efficiency advantage that I would hope translates to better performance.
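copy_file_range has no io_uring opcode at the time of writing, so as an illustration of the chaining idea here is just the open-read half of such a pipeline: an open into a fixed-file slot linked to a read from that slot, both handed to the kernel with one syscall. The slot number and flags are arbitrary; this is a sketch of the technique, not the patch under test:

```c
#include <liburing.h>
#include <fcntl.h>

/* Chain: openat into fixed-file slot 0, then read from that slot.
 * Both SQEs are submitted in a single io_uring_enter(). */
static int open_then_read(struct io_uring *ring, const char *path,
			  char *buf, unsigned len)
{
	struct io_uring_sqe *sqe;

	/* one-time setup elsewhere: io_uring_register_files_sparse(ring, 8); */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_openat_direct(sqe, AT_FDCWD, path, O_RDONLY, 0, 0);
	sqe->flags |= IOSQE_IO_LINK;	/* read won't start until the open completes */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, 0 /* fixed slot, not a real fd */, buf, len, 0);
	sqe->flags |= IOSQE_FIXED_FILE;

	return io_uring_submit_and_wait(ring, 2);
}
```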
Also note that my testing was done with all kinds of security mitigations turned off; obviously the batching done would benefit greatly with mitigations turned ON, as the syscalls and switches become more expensive there. IOW, it's the worst case for the comparison. I think the main questions to ponder here are:
I changed various things in the patch I tested; once we get a final design nailed down, I'll post them as well.
Definitely agree.
Is it possible to put the flag in …
Maybe …
Gotcha, …
It could go in the sqe, but there's only one flag left for use. And I don't think people will generally mix and match these - either whatever is using the ring is fine with potential submission stalls due to blocking, or it's not. Hence I do think a setup flag would be just fine, but the enter flag is a bit more flexible. A per-sqe flag is probably going too far. I'm partial to …
Makes sense about the sqes. I still think this flag would be more useful with enter (for example, maybe the copy_file_ranges would be faster if executed in parallel), but I'm not sure how this feature would interact with IORING_SETUP_SQPOLL as they seem incompatible.
Agree, enter is more useful for sure. But yes, for SQPOLL, it'd have to be a setup flag.
As in putting the NO_IOWQ flag on enter is a problem? If not, I don't see why you'd use SQPOLL with NO_IOWQ, so putting NO_IOWQ on enter should be fine.
Not sure that SQPOLL and NO_IOWQ would be mutually exclusive, though.
Hmmm, this is kinda gross, but could enter be stateful for SQPOLL? As in it says "everything processed after now will be NO_IOWQ." The semantics of when "now" is seem unpleasant to pin down. That's where making the flag part of each entry simplifies things, but using up the last flag does seem a little excessive. So I guess being part of setup is simplest, but forcing the uring to stay NO_IOWQ is a bit of a bummer.
You can't really synchronize with SQPOLL like that. By the time you've filled in the SQEs you want to submit, SQPOLL could already be processing them. Or vice versa, you set it and it's still processing entries from previously. I think that'd be a mess to try and coordinate, and you really can't unless you have the thread idle on either side of that sequence. Or use one of the SQ ring flags for it, which would still be racy. For SQPOLL, if we don't have SQE flags, then it'd have to be a per-instance kind of thing. In practice, a ring setup flag would be fine. If you have a need for both kinds of behavior, you could always set up two rings and have them share the SQPOLL thread. Then while processing ring A it could be in NO_IOWQ mode, and while processing ring B it would be in the normal mode. Without an SQE flag, SQPOLL would have to be set up a bit differently. For the non-SQPOLL case, I don't like the ring setup flag at all, I do think the enter flag is a lot cleaner.
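For the archives, the two-ring arrangement described above is plain liburing: ring B attaches to ring A's SQPOLL thread via IORING_SETUP_ATTACH_WQ, so one poller services both rings, and the (hypothetical) inline behavior could then be a per-ring property. A sketch:

```c
#include <liburing.h>
#include <string.h>

/* Two SQPOLL rings sharing one poll thread: ring B attaches to ring A. */
int setup_shared_sqpoll(struct io_uring *a, struct io_uring *b)
{
	struct io_uring_params pa, pb;
	int ret;

	memset(&pa, 0, sizeof(pa));
	pa.flags = IORING_SETUP_SQPOLL;
	pa.sq_thread_idle = 2000;		/* ms before the poller sleeps */
	ret = io_uring_queue_init_params(64, a, &pa);
	if (ret < 0)
		return ret;

	memset(&pb, 0, sizeof(pb));
	pb.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;
	pb.wq_fd = a->ring_fd;			/* share A's SQPOLL thread */
	return io_uring_queue_init_params(64, b, &pb);
}
```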
The reasons for some using this per … But the comment … left me wondering what was meant. The man pages generally don't talk about which operations can be queued up for normal async execution, do they? When this feature lands, will it be easy to describe for which operations this is useful? I think here, as a first use case, the emphasis has been on how to make unlink more efficient when many unlinks can be batched. What other operations or operation types might one consider this new feature for? Answers can wait for the man page(s). Maybe how the new feature affects things like linked operations will also be clear then. Thank you.
🤩 This is exciting, thank you!
@SUPERCILEX Can you test it? As from the posting, you can either just setup the ring with …
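The flag name was elided above, so purely as a sketch of what the setup-flag variant looks like from userspace: the name and bit value below are assumptions taken from this thread's discussion, the feature is not in mainline as far as I can tell, and an unpatched kernel will reject the flag with EINVAL:

```c
#include <liburing.h>
#include <string.h>

/* Hypothetical flag from the proposed patches; the value is chosen only
 * for illustration and will not match any released kernel. */
#define IORING_SETUP_NO_OFFLOAD	(1U << 15)

int setup_no_offload_ring(struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_NO_OFFLOAD;	/* issue inline, never punt to io-wq */
	return io_uring_queue_init_params(64, ring, &p);
}
```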
Does giving you a commit to benchmark on your machine work again? If so, I can get that done tomorrow-ish. Otherwise, we can punt this to June when I'll have the spare capacity to reinstall my system if things go wrong. :)
That works
Done!
PS: forgot to mention I force pushed a bunch of stuff, so if you still have …
Side note - you'll want to use non-signal based notifications for io_uring, or you will slow things down. The below is with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER set for ring creation. Didn't get to IORING_SETUP_COOP_TASKRUN | IORING_SETUP_SINGLE_ISSUER, but I'd expect similar performance from that. To make things a bit more realistic, I used an actual drive with XFS, and we drop caches after the directory creations.
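Creating the ring with those flags is standard liburing; a minimal sketch for anyone reproducing this (IORING_SETUP_DEFER_TASKRUN needs kernel 6.1+):

```c
#include <liburing.h>
#include <string.h>

int setup_ring(struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	/* Completions only run when the task explicitly reaps them
	 * (get/wait on CQEs), avoiding signal-style interruptions;
	 * requires a single submitting task. */
	p.flags = IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER;
	return io_uring_queue_init_params(256, ring, &p);
}
```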
Damn, that no_offload vs offload performance is insane! And we finally beat plain syscalls! Regarding …
You might want to debug that separately; as mentioned, my testing was done with IORING_SETUP_DEFER_TASKRUN. I didn't modify your code as I could not quickly find where it even sets up the ring, so I just hacked the kernel to set those two flags mentioned by default.
Ah, gotcha. I'll investigate sometime before shipping the io_uring implementation.
Also possible I messed it up somehow, but seems to be the right kernel and the change was pretty trivial:
and I didn't see any issues with that running the binaries from earlier.
Hmmm, that looks fine. My stuff also looks right though. Without defer:
With defer:
The flag bits match up. Perf without defer:
With defer:
So it looks like defer introduces a ton of scheduling contention that isn't there normally.
Yeah, the path to io_uring is different and immediately hits a wait. Without defer:
With defer:
Yep, that doesn't look good. I am running the patches on top of what is pending for the 6.4 release, which does include a set of patches making the wait + task_work run handling more efficient. I'll give it a whirl with the 6.3 kernel instead. What kernel are you running?
Hey, actually I wonder if it's because you have …
Could indeed be that. What is your threading setup like? Do you have multiple threads accessing the same ring?
Without affinitizing, ran a quick test and here's the profile (for "no_offload"):
which is all just inode eviction in shmfs locking. If I affinitize to cpu 0, it looks much better:
A better test for the efficiency gains of NO_OFFLOAD might be to run IORING_OP_STATX or similar instead, as for various unlink cases we're mostly bottlenecked on the fs anyway. You could also test IORING_SETUP_COOP_TASKRUN | IORING_SETUP_SINGLE_ISSUER as well; that'd avoid the local task work but still get rid of the rude signaling.
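A statx variant drops into the same harness easily; a sketch of batching IORING_OP_STATX with liburing, where the path list and mask are placeholders:

```c
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <sys/stat.h>

/* Queue one STATX per path, then issue the whole batch with one syscall. */
static int batch_statx(struct io_uring *ring, const char **paths, unsigned n,
		       struct statx *out)
{
	unsigned queued = 0;

	for (unsigned i = 0; i < n; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			break;	/* SQ full; caller submits and retries */
		io_uring_prep_statx(sqe, AT_FDCWD, paths[i], 0,
				    STATX_BASIC_STATS, &out[i]);
		queued++;
	}
	return io_uring_submit_and_wait(ring, queued);
}
```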
OK, looks like you do have many threads, trying to parallelize the unlink. That will certainly run into lots of resource contention on the filesystem. Edit: this is regardless of whether you're using io_uring in those threads or just doing unlink(2) from them, the issue method won't change that much.
Yes and no. I have one uring per thread, but since my application already takes full advantage of the machine's available parallelism, I share the wq with … Note that I've tried various combinations of this, including not sharing the wq, limiting its parallelism with … The implementation we've been benchmarking submits all requests as one long …
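The worker capping mentioned here and in the original report is done with io_uring_register_iowq_max_workers(); a sketch of the [1, 0] setting from the issue text, where a 0 entry leaves that worker class unchanged:

```c
#include <liburing.h>

/* Cap io-wq at one bounded worker; 0 means "leave the unbounded
 * class as-is" (and the array is filled with the previous values). */
static int cap_iowq_workers(struct io_uring *ring)
{
	unsigned int values[2] = { 1, 0 };

	return io_uring_register_iowq_max_workers(ring, values);
}
```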
Yup, already using these.
Do you want me to tweak the benchmark to run that instead?
Agree, and unless I'm mistaken, you should have zero task_work activity with this test anyway for the inline submission. That'd only show up for io-wq related stuff. Which leads me to now think that your profile was for the offloaded case, and not no_offload? In general, io-wq is not a fast path; this is why we're even discussing doing this inline thing. For most opcodes, io-wq is just a fallback, and we should never really hit it for any pure disk or networked IO. However, some requests depend on it because there's no way to do an async issue and then completion post. The NO_OFFLOAD is mostly a way for those requests to use io_uring as a way to do syscall reductions, as you could bundle a ton of requests into a single syscall.
I think it'd be useful as it'd provide a much better way to measure the wins of bundling inline submissions, by eliminating a lot of external contention.
Oh sorry, yes, in case it wasn't clear: I'm not running your patches, so all of this is equivalent to offload.
Cool beans. Away from my desktop rn, but should have this ready in an hour or two.
Hmmm, but still, why does defer tank performance so drastically when we're offloading?
Not sure, but honestly also not that interesting, as the use case isn't very interesting to begin with. My quick guess would be a bunch of different io-wq threads each wanting to add task_work to the one original issuing task, which would imply locking for DEFER_TASKRUN.
Gotcha, sg.
FWIW, not seeing that same contention here with offload and DEFER_TASKRUN, but that may be the newer kernel helping out. Or it may be something else entirely...
Ok, I force pushed the …
Since the goal here is batching, how much potential would there be in coalescing multiple unlinks into one operation?
I'm hoping to take advantage of io_uring to minimize syscall count, but I can't get anywhere close to non-io_uring performance. My application is a fast version of rm: every directory is independently deleted on its own thread.

Here is my io_uring patch: SUPERCILEX/fuc@a01a22b?w=1. The gist is that I queue up a bunch of unlinks and then submit + wait for them to complete. I have to periodically reap the submission queue b/c I'm using getdents64 and passing the unlink path pointers from that directory buffer (so the unlinks must be done before I can make the next getdents call).

I've experimented with a bunch of options:

- IORING_SETUP_COOP_TASKRUN and IORING_SETUP_SINGLE_ISSUER don't seem to make a difference.
- IORING_SETUP_DEFER_TASKRUN is ~10% worse, which is surprising.
- `iowq_max_workers = [1, 0]` or IO_LINK prevents a performance cliff from every file in a directory being deleted on its own thread.

Ideally, I'd like to be able to tell io_uring to execute the submission queue on the calling thread since that's what will be most efficient. I would have hoped that IORING_SETUP_DEFER_TASKRUN did that, but it does not appear to be the case.

Linux 6.2.6-76060206-generic

Benchmark with `hyperfine --warmup 3 -N --prepare "ftzz g -n 100K /tmp/ftzz" "./old /tmp/ftzz" "./new /tmp/ftzz"`, where old is master and new is the branched binary, both built with `cargo b --release` (copy from `target/release/rmz`).
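For reference, the "queue up a bunch of unlinks, then submit + wait" gist above looks roughly like this in plain liburing C. In the actual patch the paths come from a getdents64 buffer (hence the need to drain before reusing it); here they're a plain array and error handling is trimmed:

```c
#include <liburing.h>
#include <fcntl.h>

/* Queue a batch of unlinks, then issue and reap them with one syscall.
 * Must fully drain before the caller reuses the path buffer. */
static int unlink_batch(struct io_uring *ring, const char **paths, unsigned n)
{
	struct io_uring_cqe *cqe;
	unsigned head, queued = 0, seen = 0;
	int ret;

	for (unsigned i = 0; i < n; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			break;	/* SQ full; real code would submit and retry */
		io_uring_prep_unlinkat(sqe, AT_FDCWD, paths[i], 0);
		queued++;
	}

	ret = io_uring_submit_and_wait(ring, queued);
	if (ret < 0)
		return ret;

	io_uring_for_each_cqe(ring, head, cqe) {
		if (cqe->res < 0)
			ret = cqe->res;	/* remember the last failure */
		seen++;
	}
	io_uring_cq_advance(ring, seen);
	return ret < 0 ? ret : 0;
}
```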