
WARNING: possible circular locking dependency detected (io_rsrc_put_work+0x13d/0x1a0) #404

Closed
ammarfaizi2 opened this issue Aug 9, 2021 · 3 comments

Comments

@ammarfaizi2
Contributor

Hi Jens,
I got a test failure with a potential deadlock in the rsrc_tags test.

Kernel: 5.14.0-rc3-bluetea-test-00249-gc7d102232649

[ 9257.317473] Running test 232c93d07b74-test:
[ 9261.828744] Running test 35fa71a030ca-test:
[ 9321.858788] Running test 500f9fbadef8-test:
[ 9323.308544] Running test 7ad0e4b2f83c-test:
[ 9324.419221] Running test 8a9973408177-test:
[ 9325.635902] Running test 917257daa0fe-test:
[ 9325.755814] Running test a0908ae19763-test:
[ 9325.871406] Running test a4c0b3decb33-test:
[ 9340.684624] Running test accept:
[ 9342.070441] Running test accept-link:
[ 9342.389549] Running test accept-reuse:
[ 9342.511045] Running test accept-test:
[ 9342.627728] Running test across-fork:
[ 9342.745884] Running test splice:
[ 9342.872690] Running test b19062a56726-test:
[ 9342.981963] Running test b5837bd5311d-test:
[ 9343.100657] Running test ce593a6c480a-test:
[ 9344.229048] Running test close-opath:
[ 9344.339398] Running test connect:
[ 9344.363080] TCP: request_sock_TCP: Possible SYN flooding on port 29878. Sending cookies.  Check SNMP counters.
[ 9344.453743] Running test cq-full:
[ 9344.563967] Running test cq-overflow:
[ 9355.410900] Running test cq-peek-batch:
[ 9355.522211] Running test cq-ready:
[ 9355.635496] Running test cq-size:
[ 9355.747607] Running test d4ae271dfaae-test:
[ 9356.041987] Running test d77a67ed5f27-test:
[ 9356.303955] Running test defer:
[ 9359.413848] Running test double-poll-crash:
[ 9359.528898] Running test eeed8b54e0df-test:
[ 9359.683170] Running test eventfd:
[ 9359.814646] Running test eventfd-disable:
[ 9359.945202] Running test eventfd-ring:
[ 9360.092271] Running test fadvise:
[ 9360.413969] Running test fallocate:
[ 9360.561731] Running test fc2a85cb02ef-test:
[ 9360.677307] Running test file-register:
[ 9378.660480] Running test file-update:
[ 9381.894220] Running test files-exit-hang-poll:
[ 9382.996801] Running test files-exit-hang-timeout:
[ 9384.095303] Running test fixed-link:
[ 9384.194950] Running test fsync:
[ 9384.422513] Running test hardlink:
[ 9384.528068] Running test io-cancel:
[ 9387.144482] Running test io_uring_enter:
[ 9387.506670] Running test io_uring_register:
[ 9389.784601] Running test io_uring_setup:
[ 9389.892289] Running test iopoll:
[ 9402.967112] Running test lfs-openat:
[ 9403.070135] Running test lfs-openat-write:
[ 9403.173498] Running test link:
[ 9403.313011] Running test link-timeout:
[ 9404.916786] Running test link_drain:
[ 9405.841861] Running test madvise:
[ 9406.152754] Running test mkdir:
[ 9406.256519] Running test multicqes_drain:
[ 9451.376548] Running test nop:
[ 9451.477609] Running test nop-all-sizes:
[ 9451.622920] Running test open-close:
[ 9451.868377] Running test openat2:
[ 9452.092030] Running test personality:
[ 9452.197029] Running test pipe-eof:
[ 9452.301000] Running test pipe-reuse:
[ 9452.409544] Running test poll:
[ 9452.512898] Running test poll-cancel:
[ 9452.613946] Running test poll-cancel-ton:
[ 9453.008100] Running test poll-link:
[ 9453.316323] Running test poll-many:
[ 9465.760035] Running test poll-mshot-update:
[ 9474.184772] Running test poll-ring:
[ 9474.294344] Running test poll-v-poll:
[ 9474.815263] Running test probe:
[ 9474.925634] Running test read-write:
[ 9519.635522] Running test register-restrictions:
[ 9519.995146] Running test rename:
[ 9520.108175] Running test ring-leak:
[ 9520.246420] Running test ring-leak2:
[ 9521.360344] Running test rw_merge_test:
[ 9521.470459] Running test self:
[ 9521.582026] Running test send_recv:
[ 9521.702652] Running test send_recvmsg:
[ 9521.820930] Running test shared-wq:
[ 9521.934517] Running test short-read:
[ 9522.136810] Running test shutdown:
[ 9522.253045] Running test sigfd-deadlock:
[ 9522.372489] Running test socket-rw:
[ 9522.485295] Running test socket-rw-eagain:
[ 9522.597577] Running test sq-full:
[ 9522.707849] Running test sq-poll-dup:
[ 9526.798436] Running test sq-poll-kthread:
[ 9529.006681] Running test sq-poll-share:
[ 9536.551806] Running test sqpoll-disable-exit:
[ 9537.131130] Running test sqpoll-exit-hang:
[ 9538.244750] Running test sqpoll-cancel-hang:
[ 9539.352283] Running test sqpoll-sleep:
[ 9539.466045] Running test sq-space_left:
[ 9539.576885] Running test stdout:
[ 9541.760714] Running test submit-reuse:
[ 9545.929462] Running test symlink:
[ 9546.033826] Running test teardowns:
[ 9546.236248] Running test thread-exit:
[ 9546.423347] Running test timeout:
[ 9552.550575] Running test timeout-new:
[ 9555.062722] Running test timeout-overflow:
[ 9555.168925] Running test unlink:
[ 9555.268847] Running test wakeup-hang:
[ 9557.369264] Running test sendmsg_fs_cve:
[ 9557.476769] Running test rsrc_tags:

[ 9568.748103] ======================================================
[ 9568.748111] WARNING: possible circular locking dependency detected
[ 9568.748120] 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G           OE    
[ 9568.748130] ------------------------------------------------------
[ 9568.748136] kworker/2:4/2684 is trying to acquire lock:
[ 9568.748145] ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0
[ 9568.748183] 
               but task is already holding lock:
[ 9568.748190] ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
[ 9568.748219] 
               which lock already depends on the new lock.

[ 9568.748226] 
               the existing dependency chain (in reverse order) is:
[ 9568.748232] 
               -> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}:
[ 9568.748249]        __flush_work+0x31b/0x490
[ 9568.748263]        io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0
[ 9568.748276]        __do_sys_io_uring_register+0x45b/0x1060
[ 9568.748289]        do_syscall_64+0x35/0xb0
[ 9568.748301]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 9568.748316] 
               -> #0 (&ctx->uring_lock){+.+.}-{3:3}:
[ 9568.748332]        __lock_acquire+0x119a/0x1e10
[ 9568.748344]        lock_acquire+0xc8/0x2f0
[ 9568.748355]        __mutex_lock+0x86/0x740
[ 9568.748365]        io_rsrc_put_work+0x13d/0x1a0
[ 9568.748376]        process_one_work+0x236/0x530
[ 9568.748388]        worker_thread+0x52/0x3b0
[ 9568.748400]        kthread+0x135/0x160
[ 9568.748411]        ret_from_fork+0x1f/0x30
[ 9568.748426] 
               other info that might help us debug this:

[ 9568.748432]  Possible unsafe locking scenario:

[ 9568.748438]        CPU0                    CPU1
[ 9568.748443]        ----                    ----
[ 9568.748448]   lock((work_completion)(&(&ctx->rsrc_put_work)->work));
[ 9568.748459]                                lock(&ctx->uring_lock);
[ 9568.748469]                                lock((work_completion)(&(&ctx->rsrc_put_work)->work));
[ 9568.748479]   lock(&ctx->uring_lock);
[ 9568.748488] 
                *** DEADLOCK ***

[ 9568.748494] 2 locks held by kworker/2:4/2684:
[ 9568.748503]  #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
[ 9568.748531]  #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
[ 9568.748558] 
               stack backtrace:
[ 9568.748566] CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G           OE     5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5
[ 9568.748580] Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015
[ 9568.748589] Workqueue: events io_rsrc_put_work
[ 9568.748605] Call Trace:
[ 9568.748620]  dump_stack_lvl+0x6a/0x9a
[ 9568.748639]  check_noncircular+0xfe/0x110
[ 9568.748660]  __lock_acquire+0x119a/0x1e10
[ 9568.748680]  lock_acquire+0xc8/0x2f0
[ 9568.748692]  ? io_rsrc_put_work+0x13d/0x1a0
[ 9568.748712]  __mutex_lock+0x86/0x740
[ 9568.748724]  ? io_rsrc_put_work+0x13d/0x1a0
[ 9568.748738]  ? io_rsrc_put_work+0x13d/0x1a0
[ 9568.748756]  ? io_rsrc_put_work+0x13d/0x1a0
[ 9568.748767]  ? process_one_work+0x1ce/0x530
[ 9568.748779]  io_rsrc_put_work+0x13d/0x1a0
[ 9568.748796]  process_one_work+0x236/0x530
[ 9568.748816]  worker_thread+0x52/0x3b0
[ 9568.748830]  ? process_one_work+0x530/0x530
[ 9568.748845]  kthread+0x135/0x160
[ 9568.748857]  ? set_kthread_struct+0x40/0x40
[ 9568.748872]  ret_from_fork+0x1f/0x30
[ 9579.199425] Running test statx:
[ 9579.337521] Running test sq-full-cpp:

Cc: @isilence
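
For what it's worth, the inversion lockdep is flagging can be reproduced in userspace with a trivial analogue: one thread holds a lock and then waits for a worker to finish, while the worker itself needs that same lock. This is only an illustration (pthreads instead of the kernel workqueue, made-up names, not the actual io_uring code); it deadlocks by design:

/* Userspace analogue of the reported inversion, for illustration only.
 * "uring_lock" here is just a pthread mutex, not the kernel's ctx->uring_lock. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t uring_lock = PTHREAD_MUTEX_INITIALIZER;

/* stands in for io_rsrc_put_work(), which may need ctx->uring_lock (e.g. IOPOLL) */
static void *rsrc_put_work(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&uring_lock);
	pthread_mutex_unlock(&uring_lock);
	return NULL;
}

int main(void)
{
	pthread_t worker;

	pthread_mutex_lock(&uring_lock);            /* registration path holds the lock */
	pthread_create(&worker, NULL, rsrc_put_work, NULL);
	pthread_join(&worker, NULL);                /* like flushing the work: blocks forever */
	pthread_mutex_unlock(&uring_lock);          /* never reached */
	printf("done\n");
	return 0;
}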

@axboe
Owner

axboe commented Aug 9, 2021

Can you try and add this on top of your kernel?

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6a8257233061..0c3e90f974e9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7195,17 +7195,19 @@ static int io_rsrc_ref_quiesce(struct io_rsrc_data *data, struct io_ring_ctx *ct
 		/* kill initial ref, already quiesced if zero */
 		if (atomic_dec_and_test(&data->refs))
 			break;
+		mutex_unlock(&ctx->uring_lock);
 		flush_delayed_work(&ctx->rsrc_put_work);
 		ret = wait_for_completion_interruptible(&data->done);
-		if (!ret)
+		if (!ret) {
+			mutex_lock(&ctx->uring_lock);
 			break;
+		}
 
 		atomic_inc(&data->refs);
 		/* wait for all works potentially completing data->done */
 		flush_delayed_work(&ctx->rsrc_put_work);
 		reinit_completion(&data->done);
 
-		mutex_unlock(&ctx->uring_lock);
 		ret = io_run_task_work_sig();
 		mutex_lock(&ctx->uring_lock);
 	} while (ret >= 0);
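
If it helps, with that hunk applied the quiesce loop ends up with this shape (a sketch reconstructed from the diff above; the rest of io_rsrc_ref_quiesce() is omitted):

	do {
		/* kill initial ref, already quiesced if zero */
		if (atomic_dec_and_test(&data->refs))
			break;
		/* drop uring_lock before flushing, so the work item can take it */
		mutex_unlock(&ctx->uring_lock);
		flush_delayed_work(&ctx->rsrc_put_work);
		ret = wait_for_completion_interruptible(&data->done);
		if (!ret) {
			mutex_lock(&ctx->uring_lock);
			break;
		}

		atomic_inc(&data->refs);
		/* wait for all works potentially completing data->done */
		flush_delayed_work(&ctx->rsrc_put_work);
		reinit_completion(&data->done);

		ret = io_run_task_work_sig();
		mutex_lock(&ctx->uring_lock);
	} while (ret >= 0);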

@ammarfaizi2
Contributor Author

ammarfaizi2 commented Aug 10, 2021

> Can you try and add this on top of your kernel?

Alright, this patch fixes it. All tests passed. Thanks. Does this affect the stable kernel?

Tested-by: Ammar Faizi <[email protected]>

Test output

ammarfaizi2@integral:~/project/now/liburing$ git log -n1
commit 0ea4ccd1c0e4bb05cb4471d64e55d2e7d4e8a24c (HEAD -> master, origin/master, origin/HEAD)
Author: Jens Axboe <[email protected]>
Date:   Sun Aug 8 21:32:29 2021 -0600

    src/queue: don't flush SQ ring for new wait interface
    
    If we have IORING_FEAT_EXT_ARG, then timeouts are done through the
    syscall instead of by posting an internal timeout. This was done
    to be both more efficient, but also to enable multi-threaded use
    the wait side. If we touch the SQ state by flushing it, that isn't
    safe without synchronization.
    
    Fixes: https://github.com/axboe/liburing/issues/402
    Signed-off-by: Jens Axboe <[email protected]>
ammarfaizi2@integral:~/project/now/liburing$ sudo taskset -c 0,1,2,3 make -j4 runtests
make[1]: Entering directory '/home/ammarfaizi2/project/now/liburing/src'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/ammarfaizi2/project/now/liburing/src'
make[1]: Entering directory '/home/ammarfaizi2/project/now/liburing/test'
     CC 232c93d07b74-test
     CC 35fa71a030ca-test
     CC 500f9fbadef8-test
     CC 7ad0e4b2f83c-test
     CC 8a9973408177-test
     CC 917257daa0fe-test
     CC a0908ae19763-test
     CC a4c0b3decb33-test
     CC accept
     CC accept-link
     CC accept-reuse
     CC accept-test
     CC across-fork
     CC splice
     CC b19062a56726-test
     CC b5837bd5311d-test
     CC ce593a6c480a-test
     CC close-opath
     CC connect
     CC cq-full
     CC cq-overflow
     CC cq-peek-batch
     CC cq-ready
     CC cq-size
     CC d4ae271dfaae-test
     CC d77a67ed5f27-test
     CC defer
     CC double-poll-crash
     CC eeed8b54e0df-test
     CC eventfd
     CC eventfd-disable
     CC eventfd-ring
     CC fadvise
     CC fallocate
     CC fc2a85cb02ef-test
     CC file-register
     CC file-update
     CC files-exit-hang-poll
     CC files-exit-hang-timeout
     CC fixed-link
     CC fsync
     CC hardlink
     CC io-cancel
     CC io_uring_enter
     CC io_uring_register
     CC io_uring_setup
     CC iopoll
     CC lfs-openat
     CC lfs-openat-write
     CC link
     CC link-timeout
     CC link_drain
     CC madvise
     CC mkdir
     CC multicqes_drain
     CC nop
     CC nop-all-sizes
     CC open-close
     CC openat2
     CC personality
     CC pipe-eof
     CC pipe-reuse
     CC poll
     CC poll-cancel
     CC poll-cancel-ton
     CC poll-link
     CC poll-many
     CC poll-mshot-update
     CC poll-ring
     CC poll-v-poll
     CC probe
     CC read-write
     CC register-restrictions
     CC rename
     CC ring-leak
     CC ring-leak2
     CC rw_merge_test
     CC self
     CC send_recv
     CC send_recvmsg
     CC shared-wq
     CC short-read
     CC shutdown
     CC sigfd-deadlock
     CC socket-rw
     CC socket-rw-eagain
     CC sq-full
     CC sq-poll-dup
     CC sq-poll-kthread
     CC sq-poll-share
     CC sqpoll-disable-exit
     CC sqpoll-exit-hang
     CC sqpoll-cancel-hang
     CC sqpoll-sleep
     CC sq-space_left
     CC stdout
     CC submit-reuse
     CC symlink
     CC teardowns
     CC thread-exit
     CC timeout
     CC timeout-new
     CC timeout-overflow
     CC unlink
     CC wakeup-hang
     CC sendmsg_fs_cve
     CC rsrc_tags
     CC statx
     CXX sq-full-cpp
make[1]: Leaving directory '/home/ammarfaizi2/project/now/liburing/test'
make[1]: Entering directory '/home/ammarfaizi2/project/now/liburing/examples'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/ammarfaizi2/project/now/liburing/examples'
make[1]: Entering directory '/home/ammarfaizi2/project/now/liburing/test'
Running test 232c93d07b74-test:
Running test 35fa71a030ca-test:
Test 35fa71a030ca-test timed out (may not be a failure)
Running test 500f9fbadef8-test:
Running test 7ad0e4b2f83c-test:
Running test 8a9973408177-test:
Running test 917257daa0fe-test:
Running test a0908ae19763-test:
Running test a4c0b3decb33-test:
Running test accept:
Running test accept-link:
Running test accept-reuse:
Running test accept-test:
Running test across-fork:
Running test splice:
Running test b19062a56726-test:
Running test b5837bd5311d-test:
Running test ce593a6c480a-test:
Running test close-opath:
Running test connect:
Running test cq-full:
Running test cq-overflow:
Running test cq-peek-batch:
Running test cq-ready:
Running test cq-size:
Running test d4ae271dfaae-test:
Running test d77a67ed5f27-test:
Running test defer:
Running test double-poll-crash:
Running test eeed8b54e0df-test:
Running test eventfd:
Running test eventfd-disable:
Running test eventfd-ring:
Running test fadvise:
Running test fallocate:
Running test fc2a85cb02ef-test:
Test needs failslab/fail_futex/fail_page_alloc enabled, skipped
Running test file-register:
Running test file-update:
Running test files-exit-hang-poll:
Running test files-exit-hang-timeout:
Running test fixed-link:
Running test fsync:
Running test hardlink:
linkat not supported, skipping
Running test io-cancel:
Running test io_uring_enter:
io_uring_enter(3, 1, 0, 4294967295, (nil))
io_uring_enter(-1, 0, 0, 0, (nil))
io_uring_enter(0, 0, 0, 0, (nil))
io_uring_enter(3, 0, 0, 0, (nil))
Allocating 4096 sqes
Submitting 4096 I/Os
Done
Waiting for 4096 events
Reaping 4096 I/Os
Submitting invalid sqe index.
PASS
Running test io_uring_register:
RELIMIT_MEMLOCK: 1210114048 (1210114048)
io_uring_register(-1, 0, (nil), 0)
io_uring_register(3, 0, (nil), 0)
io_uring_register(4, 4294967295, (nil), 0)
io_uring_register(4, 0, 0x7fff62104b50, 1)
io_uring_register(4, 0, 0x7fff62104b50, 1)
io_uring_register(4, 0, 0x7fff62104b50, 1)
Unable to map a huge page.  Try increasing /proc/sys/vm/nr_hugepages by at least 1.
Skipping the hugepage test
reserve file-backed buffers
io_uring_register(4, 0, 0x7fff62104b50, 1)
io_uring_register(4, 0, 0x7f0ddab0d010, 1000000)
io_uring_register(4, 0, 0x7f0ddab0d010, 1024)
Not enough memory for this test, skipping
io_uring_submit:
	opcode: 6
	flags:  0x00000000
	fd:     4
	poll_events: 0x00000005
io_uring_register(4, 2, 0x7fff62104a9c, 1)
PASS
Running test io_uring_setup:
io_uring_setup(0, 0x7ffda11ffbb0), flags: none, feat: none, resv: 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 0
io_uring_setup(1, (nil)), flags: none, feat: none, resv: , sq_thread_cpu: 0
io_uring_setup(1, 0x7ffda11ffbb0), flags: none, feat: none, resv: 0x00000001 0x00000001 0x00000001, sq_thread_cpu: 0
io_uring_setup(1, 0x7ffda11ffbb0), flags: 0xffffffff, feat: none, resv: 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 0
io_uring_setup(1, 0x7ffda11ffbb0), flags: IORING_SETUP_SQ_AFF, feat: none, resv: 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 0
io_uring_setup(1, 0x7ffda11ffbb0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, feat: none, resv: 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4
PASS
Running test iopoll:
Running test lfs-openat:
Running test lfs-openat-write:
Running test link:
Running test link-timeout:
Running test link_drain:
Running test madvise:
Suspicious timings (62 > 38)
Running test mkdir:
mkdirat not supported, skipping
Running test multicqes_drain:
Running test nop:
Running test nop-all-sizes:
Running test open-close:
Running test openat2:
Running test personality:
Running test pipe-eof:
Running test pipe-reuse:
Running test poll:
Running test poll-cancel:
Running test poll-cancel-ton:
Running test poll-link:
Running test poll-many:
Running test poll-mshot-update:
Running test poll-ring:
Running test poll-v-poll:
Running test probe:
Running test read-write:
Running test register-restrictions:
Running test rename:
Running test ring-leak:
Running test ring-leak2:
Running test rw_merge_test:
Running test self:
Running test send_recv:
Running test send_recvmsg:
Running test shared-wq:
Running test short-read:
Running test shutdown:
Running test sigfd-deadlock:
Running test socket-rw:
Running test socket-rw-eagain:
Running test sq-full:
Running test sq-poll-dup:
Running test sq-poll-kthread:
Running test sq-poll-share:
Running test sqpoll-disable-exit:
Running test sqpoll-exit-hang:
Running test sqpoll-cancel-hang:
Running test sqpoll-sleep:
Running test sq-space_left:
Running test stdout:
This is a pipe test
This is a fixed pipe test
Running test submit-reuse:
Running test symlink:
symlinkat not supported, skipping
Running test teardowns:
Running test thread-exit:
Running test timeout:
Running test timeout-new:
Running test timeout-overflow:
Skipping
Running test unlink:
Running test wakeup-hang:
Running test sendmsg_fs_cve:
Running test rsrc_tags:
Running test statx:
Running test sq-full-cpp:
All tests passed
make[1]: Leaving directory '/home/ammarfaizi2/project/now/liburing/test'
ammarfaizi2@integral:~/project/now/liburing$ 

Kernel Log

<3>[  309.627538][    T5] Bluetooth: hci0: command 0x1003 tx timeout
<3>[  309.638176][ T2475] Bluetooth: hci0: unexpected event for opcode 0x1003
<6>[  311.483479][ T2759] IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready
<6>[  314.627403][ T3050] io scheduler bfq registered
<6>[  318.646697][ T3247] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
<6>[  346.203477][ T3563] Bluetooth: RFCOMM TTY layer initialized
<6>[  346.203634][ T3563] Bluetooth: RFCOMM socket layer initialized
<6>[  346.203751][ T3563] Bluetooth: RFCOMM ver 1.11
<4>[  347.701283][   T44] kauditd_printk_skb: 29 callbacks suppressed
<5>[  347.701296][   T44] audit: type=1400 audit(1628557064.993:41): apparmor="DENIED" operation="capable" profile="libvirtd" pid=2908 comm="daemon-init" capability=17  capname="sys_rawio"
<5>[  359.290314][   T44] audit: type=1400 audit(1628557076.582:42): apparmor="DENIED" operation="capable" profile="libvirtd" pid=2908 comm="daemon-init" capability=17  capname="sys_rawio"
<6>[  381.441045][ T3636] loop8: detected capacity change from 0 to 8
<12>[  420.100770][ T3786] Running test 232c93d07b74-test:
<12>[  424.869089][ T3805] Running test 35fa71a030ca-test:
<12>[  484.808620][T14563] Running test 500f9fbadef8-test:
<12>[  490.415422][T14975] Running test 7ad0e4b2f83c-test:
<12>[  491.705790][T14986] Running test 8a9973408177-test:
<12>[  493.163207][T14998] Running test 917257daa0fe-test:
<12>[  493.461249][T15010] Running test a0908ae19763-test:
<12>[  493.740508][T15021] Running test a4c0b3decb33-test:
<12>[  536.701528][T25041] Running test accept:
<12>[  540.261432][T25053] Running test accept-link:
<12>[  540.769375][T25067] Running test accept-reuse:
<12>[  541.069828][T25078] Running test accept-test:
<12>[  541.368776][T25089] Running test across-fork:
<12>[  541.775466][T25103] Running test splice:
<12>[  542.095216][T25117] Running test b19062a56726-test:
<12>[  542.396804][T25128] Running test b5837bd5311d-test:
<12>[  542.700743][T25139] Running test ce593a6c480a-test:
<12>[  544.106692][T25151] Running test close-opath:
<12>[  544.473766][T25162] Running test connect:
<6>[  544.606850][    C3] TCP: request_sock_TCP: Possible SYN flooding on port 32022. Sending cookies.  Check SNMP counters.
<12>[  544.820126][T25173] Running test cq-full:
<12>[  545.108846][T25184] Running test cq-overflow:
<12>[  556.738449][T25196] Running test cq-peek-batch:
<12>[  557.077184][T25207] Running test cq-ready:
<12>[  557.387326][T25218] Running test cq-size:
<12>[  557.706220][T25229] Running test d4ae271dfaae-test:
<12>[  558.078277][T25241] Running test d77a67ed5f27-test:
<12>[  558.514842][T25253] Running test defer:
<12>[  561.822918][T25265] Running test double-poll-crash:
<12>[  562.201123][T25282] Running test eeed8b54e0df-test:
<12>[  562.596434][T25293] Running test eventfd:
<12>[  562.923280][T25304] Running test eventfd-disable:
<12>[  563.218718][T25315] Running test eventfd-ring:
<12>[  563.689836][T25327] Running test fadvise:
<12>[  564.216050][T25340] Running test fallocate:
<12>[  564.524538][T25352] Running test fc2a85cb02ef-test:
<12>[  564.812246][T25363] Running test file-register:
<12>[  584.439917][T25379] Running test file-update:
<12>[  587.998315][T25391] Running test files-exit-hang-poll:
<12>[  589.287710][T25402] Running test files-exit-hang-timeout:
<12>[  590.585481][T25413] Running test fixed-link:
<12>[  590.884950][T25425] Running test fsync:
<12>[  591.346689][T25438] Running test hardlink:
<12>[  591.667178][T25449] Running test io-cancel:
<12>[  594.542376][T25468] Running test io_uring_enter:
<12>[  595.220557][T25480] Running test io_uring_register:
<12>[  598.023906][T25493] Running test io_uring_setup:
<12>[  598.311982][T25504] Running test iopoll:
<12>[  612.402572][T25521] Running test lfs-openat:
<12>[  612.707410][T25535] Running test lfs-openat-write:
<12>[  613.017022][T25547] Running test link:
<12>[  613.339886][T25558] Running test link-timeout:
<12>[  615.141145][T25572] Running test link_drain:
<12>[  618.781080][T25587] Running test madvise:
<12>[  619.517360][T25623] Running test mkdir:
<12>[  619.895954][T25634] Running test multicqes_drain:
<12>[  665.327107][T25649] Running test nop:
<12>[  667.299958][T25660] Running test nop-all-sizes:
<12>[  667.762178][T25671] Running test open-close:
<12>[  668.199852][T25682] Running test openat2:
<12>[  668.626481][T25693] Running test personality:
<12>[  668.924627][T25704] Running test pipe-eof:
<12>[  669.231073][T25717] Running test pipe-reuse:
<12>[  669.527250][T25729] Running test poll:
<12>[  669.840106][T25741] Running test poll-cancel:
<12>[  670.118957][T25752] Running test poll-cancel-ton:
<12>[  672.665516][T25763] Running test poll-link:
<12>[  673.166575][T25777] Running test poll-many:
<12>[  721.799874][T25788] Running test poll-mshot-update:
<12>[  756.551995][T26799] Running test poll-ring:
<12>[  756.878452][T26810] Running test poll-v-poll:
<12>[  757.591800][T26835] Running test probe:
<12>[  757.932940][T26846] Running test read-write:
<12>[  806.131239][T26997] Running test register-restrictions:
<12>[  806.709986][T27010] Running test rename:
<12>[  807.064036][T27022] Running test ring-leak:
<12>[  807.369705][T27036] Running test ring-leak2:
<12>[  808.667244][T27051] Running test rw_merge_test:
<12>[  808.954251][T27064] Running test self:
<12>[  809.260712][T27076] Running test send_recv:
<12>[  809.575584][T27092] Running test send_recvmsg:
<12>[  809.925982][T27107] Running test shared-wq:
<12>[  810.254864][T27118] Running test short-read:
<12>[  810.609793][T27129] Running test shutdown:
<12>[  810.911959][T27141] Running test sigfd-deadlock:
<12>[  811.201698][T27152] Running test socket-rw:
<12>[  811.514685][T27163] Running test socket-rw-eagain:
<12>[  811.795975][T27174] Running test sq-full:
<12>[  812.101141][T27185] Running test sq-poll-dup:
<12>[  822.053145][T27199] Running test sq-poll-kthread:
<12>[  824.849719][T27222] Running test sq-poll-share:
<12>[  846.004185][T27234] Running test sqpoll-disable-exit:
<12>[  847.467230][T27448] Running test sqpoll-exit-hang:
<12>[  848.759199][T27460] Running test sqpoll-cancel-hang:
<12>[  850.059074][T27473] Running test sqpoll-sleep:
<12>[  850.345403][T27485] Running test sq-space_left:
<12>[  850.668587][T27496] Running test stdout:
<12>[  853.228854][T27509] Running test submit-reuse:
<12>[  871.202048][T27588] Running test symlink:
<12>[  871.516765][T27599] Running test teardowns:
<12>[  872.263552][T27652] Running test thread-exit:
<12>[  872.605036][T27679] Running test timeout:
<12>[  878.922230][T27693] Running test timeout-new:
<12>[  881.638133][T27707] Running test timeout-overflow:
<12>[  881.924161][T27718] Running test unlink:
<12>[  882.236647][T27730] Running test wakeup-hang:
<12>[  884.539288][T27745] Running test sendmsg_fs_cve:
<12>[  884.851683][T27758] Running test rsrc_tags:
<12>[  907.323541][T27773] Running test statx:
<12>[  907.695528][T27785] Running test sq-full-cpp:
ammarfaizi2@integral:~/project/now/liburing$ 

@axboe
Owner

axboe commented Aug 10, 2021

It does affect stable, at least recent stable. I'll mark it for backport. Thanks for reporting, and testing!

@axboe axboe closed this as completed Aug 10, 2021
fengguang pushed a commit to 0day-ci/linux that referenced this issue Aug 10, 2021
Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags
from the regression suite:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G           OE
------------------------------------------------------
kworker/2:4/2684 is trying to acquire lock:
ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0

but task is already holding lock:
ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}:
       __flush_work+0x31b/0x490
       io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0
       __do_sys_io_uring_register+0x45b/0x1060
       do_syscall_64+0x35/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #0 (&ctx->uring_lock){+.+.}-{3:3}:
       __lock_acquire+0x119a/0x1e10
       lock_acquire+0xc8/0x2f0
       __mutex_lock+0x86/0x740
       io_rsrc_put_work+0x13d/0x1a0
       process_one_work+0x236/0x530
       worker_thread+0x52/0x3b0
       kthread+0x135/0x160
       ret_from_fork+0x1f/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&(&ctx->rsrc_put_work)->work));
                               lock(&ctx->uring_lock);
                               lock((work_completion)(&(&ctx->rsrc_put_work)->work));
  lock(&ctx->uring_lock);

 *** DEADLOCK ***

2 locks held by kworker/2:4/2684:
 #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
 #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530

stack backtrace:
CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G           OE     5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5
Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015
Workqueue: events io_rsrc_put_work
Call Trace:
 dump_stack_lvl+0x6a/0x9a
 check_noncircular+0xfe/0x110
 __lock_acquire+0x119a/0x1e10
 lock_acquire+0xc8/0x2f0
 ? io_rsrc_put_work+0x13d/0x1a0
 __mutex_lock+0x86/0x740
 ? io_rsrc_put_work+0x13d/0x1a0
 ? io_rsrc_put_work+0x13d/0x1a0
 ? io_rsrc_put_work+0x13d/0x1a0
 ? process_one_work+0x1ce/0x530
 io_rsrc_put_work+0x13d/0x1a0
 process_one_work+0x236/0x530
 worker_thread+0x52/0x3b0
 ? process_one_work+0x530/0x530
 kthread+0x135/0x160
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x1f/0x30

which is due to holding the ctx->uring_lock when flushing existing
pending work, while the pending work flushing may need to grab the uring
lock if we're using IOPOLL.

Fix this by dropping the uring_lock a bit earlier as part of the flush.

Link: axboe/liburing#404
Signed-off-by: Jens Axboe <[email protected]>
ammarfaizi2 pushed a commit to ammarfaizi2/linux-block that referenced this issue Aug 10, 2021
skruven96 pushed a commit to skruven96/linux that referenced this issue Oct 31, 2021
vlsunil pushed a commit to ventana-micro-systems/RISC-V-Linux that referenced this issue Nov 23, 2021