
Fix dnode_hold_impl() soft lockup #8433

Merged: 1 commit into openzfs:master on Feb 22, 2019

Conversation

lidongyang
Contributor

Soft lockups could happen when multiple threads try to
get the zrl on the same dnode handle in order to allocate
and initialize a dnode marked as DN_SLOT_ALLOCATED.

Don't loop from the beginning when we can't get the zrl;
otherwise we keep increasing the zrl refcount and nobody
can actually lock it.

Signed-off-by: Li Dongyang [email protected]
Closes #8426
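
For context, the zrlock semantics behind this livelock can be sketched as follows. This is a simplified model, loosely based on module/zfs/zrlock.c (the real implementation parks exclusive waiters on a mutex and condvar rather than spinning): zrl_add() takes a shared hold by bumping a refcount, while zrl_tryenter() takes the exclusive lock only when the refcount is exactly zero.

#include <stdatomic.h>
#include <stdbool.h>

#define ZRL_LOCKED (-1)

typedef struct { atomic_int zr_refcount; } zrl_t;

/* Shared hold: wait out an exclusive holder, then bump the count. */
static void zrl_add(zrl_t *zrl)
{
    for (;;) {
        int n = atomic_load(&zrl->zr_refcount);
        if (n != ZRL_LOCKED &&
            atomic_compare_exchange_weak(&zrl->zr_refcount, &n, n + 1))
            return;
    }
}

/* Exclusive entry: succeeds only when there are no holders at all. */
static bool zrl_tryenter(zrl_t *zrl)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&zrl->zr_refcount,
        &expected, ZRL_LOCKED);
}

Restarting the hold loop from the top re-takes a shared hold before each zrl_tryenter() attempt, so with several threads doing the same thing the refcount rarely drops to zero and no thread ever wins the exclusive lock.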


@behlendorf self-requested a review · February 19, 2019 17:24
Contributor

@behlendorf left a comment


Nice analysis and root cause. The proposed fix looks good, though it's not ideal that we may spin in the unlikely case where the dnode is highly contended. I think you should additionally add a call to cond_resched() in both of the tryenter while loops, to provide an explicit place where the scheduler can be invoked.
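
Concretely, the suggestion amounts to a retry loop of roughly this shape; this is a sketch only, reusing the dnode_slots_* helpers from module/zfs/dnode.c, not a verbatim excerpt of the final patch:

/* Drop the shared hold, then spin for the exclusive lock on the
 * slot, yielding to the scheduler on each failed attempt so a
 * highly contended dnode cannot trip the soft-lockup watchdog. */
dnode_slots_rele(dnc, idx, slots);
while (!dnode_slots_tryenter(dnc, idx, slots)) {
    DNODE_STAT_BUMP(dnode_hold_alloc_lock_retry);
    cond_resched();
}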

@behlendorf added the "Status: Code Review Needed" (ready for review and testing) label · Feb 19, 2019
@lidongyang
Contributor Author

@behlendorf agreed. Shall we do the same to the while loop in dnode_free_interior_slots()?
FYI, this issue was discovered on a Lustre MDT server with a ZFS backend. Here is the full trace:

[ 946.702862] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [mdt00_023:36936]
[ 946.703859] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mdt00_021:32775]
[ 946.703901] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) dm_mirror dm_region_hash dm_log zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) dm_round_robin sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm ipmi_si irqbypass crc32_pclmul ghash_clmulni_intel ipmi_devintf iTCO_wdt iTCO_vendor_support nfsd aesni_intel ioatdma lrw dm_multipath gf128mul glue_helper sg hpwdt hpilo ablk_helper auth_rpcgss i2c_i801 lpc_ich pcspkr wmi cryptd nfs_acl acpi_power_meter
[ 946.703920] dm_mod dca shpchp ipmi_msghandler acpi_cpufreq lockd grace sunrpc knem(OE) ip_tables xfs libcrc32c mgag200 i2c_algo_bit sd_mod crc_t10dif drm_kms_helper crct10dif_generic syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx5_core(OE) mlxfw(OE) mlx_compat(OE) crct10dif_pclmul tg3 devlink crct10dif_common ptp crc32c_intel serio_raw i2c_core pps_core hpsa(OE) scsi_transport_sas
[ 946.703923] CPU: 0 PID: 32775 Comm: mdt00_021 Tainted: P OE ------------ 3.10.0-693.21.1.el7.x86_64 #1
[ 946.703924] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
[ 946.703925] task: ffff885eef3c0fd0 ti: ffff885ec13a0000 task.ti: ffff885ec13a0000
[ 946.703983] RIP: 0010:[] [] zrl_is_locked+0x9/0x20 [zfs]
[ 946.703984] RSP: 0018:ffff885ec13a3818 EFLAGS: 00000297
[ 946.703985] RAX: 000000000000000a RBX: ffff885ec13a37a0 RCX: 0000000000000011
[ 946.703985] RDX: 0000000000000009 RSI: 0000000000000001 RDI: ffff88bc4fbd8160
[ 946.703986] RBP: ffff885ec13a3848 R08: ffff885ec13a0000 R09: ffff885eff417a90
[ 946.703987] R10: 7fffffffffffffff R11: 0000000000000000 R12: ffff885eabd69d40
[ 946.703987] R13: ffff88befcbff1c0 R14: ffff885eabd69d98 R15: ffff885eabd69de8
[ 946.703989] FS: 0000000000000000(0000) GS:ffff885eff400000(0000) knlGS:0000000000000000
[ 946.703989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 946.703990] CR2: 0000562c1ac93098 CR3: 000000bec728c000 CR4: 00000000003607f0
[ 946.703991] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 946.703992] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 946.703992] Call Trace:
[ 946.704013] [] ? dnode_slots_rele+0x48/0x70 [zfs]
[ 946.704032] [] dnode_hold_impl+0x195/0xc30 [zfs]
[ 946.704046] [] ? dmu_buf_rele+0xe/0x10 [zfs]
[ 946.704071] [] ? zap_unlockdir+0x3f/0x60 [zfs]
[ 946.704097] [] ? zap_lookup_norm_by_dnode+0x97/0xc0 [zfs]
[ 946.704115] [] dnode_hold+0x1b/0x20 [zfs]
[ 946.704130] [] dmu_bonus_hold+0x32/0x1d0 [zfs]
[ 946.704140] [] __osd_obj2dnode+0x34/0xc0 [osd_zfs]
[ 946.704144] [] osd_object_init+0x11e/0x850 [osd_zfs]
[ 946.704187] [] ? lu_object_add+0x2d/0x40 [obdclass]
[ 946.704201] [] ? lod_object_init+0x20c/0x3c0 [lod]
[ 946.704218] [] lu_object_alloc+0xe5/0x320 [obdclass]
[ 946.704233] [] lu_object_find_at+0x180/0x2b0 [obdclass]
[ 946.704247] [] lu_object_find+0x16/0x20 [obdclass]
[ 946.704267] [] mdt_object_find+0x4b/0x170 [mdt]
[ 946.704274] [] mdt_getattr_name_lock+0x746/0x1910 [mdt]
[ 946.704281] [] mdt_intent_getattr+0x2b0/0x480 [mdt]
[ 946.704288] [] mdt_intent_policy+0x441/0xc70 [mdt]
[ 946.704341] [] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[ 946.704369] [] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[ 946.704398] [] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[ 946.704437] [] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 946.704468] [] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 946.704494] [] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 946.704520] [] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 946.704525] [] ? wake_up_state+0x20/0x20
[ 946.704551] [] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 946.704578] [] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 946.704581] [] kthread+0xd1/0xe0
[ 946.704583] [] ? insert_kthread_work+0x40/0x40
[ 946.704586] [] ret_from_fork+0x77/0xb0
[ 946.704587] [] ? insert_kthread_work+0x40/0x40

@behlendorf
Contributor

> Shall we do the same to the while loop in dnode_free_interior_slots()?

Yes, good idea. Please do so.

@lidongyang
Contributor Author

Looks like dnode.c is used to build libzpool.so, and unlike zfs_znode.c, we don't have #ifdef _KERNEL to distinguish the functions needed by userland and the kernel. That makes it hard to add cond_resched() and include linux/sched.h.
Maybe that's why we don't have it in dnode_free_interior_slots()?

@PaulZ-98
Contributor

I think it is ok to selectively #ifdef _KERNEL sections of code in dnode.c.

@behlendorf
Contributor

We're going to want to add a user-space wrapper for this which is a no-op, to fix the build. I'd suggest adding the following to include/sys/zfs_context.h, which already provides similar wrappers for user space.

diff --git a/include/sys/zfs_context.h b/include/sys/zfs_context.h
index d186c40..52b8f5b 100644
--- a/include/sys/zfs_context.h
+++ b/include/sys/zfs_context.h
@@ -236,6 +236,7 @@ extern kthread_t *zk_thread_create(void (*func)(void *), void *arg,
 
 #define        kpreempt_disable()      ((void)0)
 #define        kpreempt_enable()       ((void)0)
+#define        cond_resched()          ((void)0)
 
 /*
  * Mutexes

@lidongyang
Contributor Author

Cool, thanks for pointing out zfs_context.h; we don't need open-coded #ifdef _KERNEL in dnode.c.
Before I update the commit, shall we use sched_yield() here for userland?

@behlendorf
Contributor

That makes sense. Yes, let's define cond_resched() as sched_yield() in user space.
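
In user space that reduces to a one-line mapping in include/sys/zfs_context.h, along these lines (a sketch of the agreed approach):

/* Kernel builds use the native cond_resched(); user-space builds
 * (libzpool) yield the CPU instead of doing nothing. */
#include <sched.h>

#define cond_resched() sched_yield()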

@codecov

codecov bot commented Feb 21, 2019

Codecov Report

Merging #8433 into master will decrease coverage by 0.05%.
The diff coverage is 80.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8433      +/-   ##
==========================================
- Coverage   78.55%   78.49%   -0.06%     
==========================================
  Files         380      380              
  Lines      116004   115900     -104     
==========================================
- Hits        91125    90977     -148     
- Misses      24879    24923      +44
Flag      Coverage Δ
#kernel   79.02% <83.87%> (+0.06%) ⬆️
#user     67.03% <77.77%> (-0.30%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@tcaputi left a comment


This looks good to me. I never really liked the way we looped there anyway, so this is a double win for me.

@behlendorf added the "Status: Accepted" (ready to integrate; reviewed, tested) label and removed the "Status: Code Review Needed" label · Feb 21, 2019
@behlendorf merged commit 8d9e51c into openzfs:master · Feb 22, 2019
@jasimmons1973

We have deployed ZFS 0.7.13 with this patch applied, and it reduced the number of times this bug happens, but we did hit it again on our production system. Should this ticket be reopened? Similar backtrace:

[1465132.552779] [] ? dnode_slots_rele+0x57/0x70 [zfs]
[1465132.552779] Call Trace:
[1465132.552781] [] ? __schedule+0x3ff/0x890
[1465132.552800] [] ? dnode_hold_impl+0x185/0xc40 [zfs]
[1465132.552802] [] ? __mutex_lock_slowpath+0x1b8/0x1d0
[1465132.552820] [] ? dnode_hold+0x1b/0x20 [zfs]
[1465132.552822] [] ? mutex_lock+0x1f/0x2f
[1465132.552840] [] ? dmu_bonus_hold+0x32/0x1d0 [zfs]
[1465132.552844] [] ? cv_wait_common+0xf2/0x150 [spl]
[1465132.552850] [] ? __osd_obj2dnode+0x34/0xc0 [osd_zfs]
[1465132.552852] [] ? wake_up_atomic_t+0x30/0x30
[1465132.552859] [] ? osd_object_init+0x1de/0xc90 [osd_zfs]
[1465132.552882] [] ? zrl_tryenter+0x9/0x30 [zfs]
[1465132.552900] [] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[1465132.552920] [] ? dnode_slots_tryenter+0x48/0xb0 [zfs]
[1465132.552938] [] ? lu_object_add+0x2d/0x40 [obdclass]
[1465132.552957] [] ? dnode_hold_impl+0x195/0xc40 [zfs]
[1465132.552975] [] ? lu_object_alloc+0xe5/0x320 [obdclass]
[1465132.552999] [] ? zap_unlockdir+0x3f/0x60 [zfs]
[1465132.553016] [] ? lu_object_find_at+0x180/0x2b0 [obdclass]
[1465132.553039] [] ? zap_lookup_norm+0xa0/0xd0 [zfs]
[1465132.553057] [] ? lu_object_find+0x16/0x20 [obdclass]
[1465132.553076] [] ? dnode_hold+0x1b/0x20 [zfs]
[1465132.553082] [] ? ofd_object_find+0x35/0x100 [ofd]

@lidongyang
Contributor Author

@jasimmons1973 Can we have a complete trace? Is it still a soft lockup?

@jasimmons1973

Do you need just the complete dmesg output or do you need kdump info as well?

@jasimmons1973

Here is the crash dump backtrace:

#25 [ffff8dcd0805b6f8] apic_timer_interrupt at ffffffffa1775df2
[exception RIP: zrl_add_impl+82]
RIP: ffffffffc0a7dd82 RSP: ffff8dcd0805b7a0 RFLAGS: 00000202
RAX: 000000000000000b RBX: 00000000000003dd RCX: 0000000000000010
RDX: 000000000000000f RSI: ffffffffc0aee3f0 RDI: ffff8dcbc311edd8
RBP: ffff8dcd0805b7c8 R8: 0000000000000101 R9: 0000000000000001
R10: 00000000000003dd R11: 0000000000000000 R12: ffff8dcd0805b720
R13: ffff8dcd0805b7c8 R14: ffff8dc373925d98 R15: 0000000000000000
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#26 [ffff8dcd0805b7d0] dnode_hold_impl at ffffffffc09d2124 [zfs]
#27 [ffff8dcd0805b878] dnode_hold at ffffffffc09d2c2b [zfs]
#28 [ffff8dcd0805b888] dmu_bonus_hold at ffffffffc09bb682 [zfs]
#29 [ffff8dcd0805b8b8] __osd_obj2dnode at ffffffffc10ec874 [osd_zfs]
#30 [ffff8dcd0805b8f0] osd_object_init at ffffffffc10ecade [osd_zfs]
#31 [ffff8dcd0805b990] lu_object_alloc at ffffffffc11c6dd5 [obdclass]
#32 [ffff8dcd0805b9e8] lu_object_find_at at ffffffffc11c71f0 [obdclass]
#33 [ffff8dcd0805ba50] lu_object_find at ffffffffc11c7336 [obdclass]
#34 [ffff8dcd0805ba60] ofd_object_find at ffffffffc169c445 [ofd]
#35 [ffff8dcd0805ba88] ofd_preprw at ffffffffc16a5ea0 [ofd]
#36 [ffff8dcd0805bb28] tgt_brw_read at ffffffffc1417799 [ptlrpc]
#37 [ffff8dcd0805bd00] tgt_request_handle at ffffffffc141605a [ptlrpc]
#38 [ffff8dcd0805bd48] ptlrpc_server_handle_request at ffffffffc13bb843 [ptlrpc]
#39 [ffff8dcd0805bde8] ptlrpc_main at ffffffffc13beff2 [ptlrpc]
#40 [ffff8dcd0805bec8] kthread at ffffffffa10c1c71
#41 [ffff8dcd0805bf50] ret_from_fork_nospec_begin at ffffffffa1774c1d

@lidongyang
Contributor Author

A complete dmesg would also help.
Can you have a look at dnode_hold_alloc_lock_retry and zrl->zr_refcount from the kdump?

If you can get the kdump uploaded somewhere, that would be even better.
Since the patch was cherry-picked on top of 0.7.13, we need your packages (including the debuginfo packages) as well.

@jasimmons1973

jasimmons1973 commented Apr 24, 2019

Sadly we are not allowed to release vmcores. You can get the dmesg from:

http://www.infradead.org/~jsimmons/vmcore-dmesg.txt

It's too big to post here.

@jasimmons1973

Sorry, but how does one examine data like zrl->zr_refcount with kdump? I'm attempting to use bt -ff, but I'm not seeing a clear way to find the zr_refcount value.

@lidongyang
Contributor Author

James,
for dnode_hold_alloc_lock_retry, you can get it in crash by looking at dnode_stats.

For zrl->zr_refcount, you need to figure out the address of db from the stack, which is a pointer to dmu_buf_impl_t, then look at db->db_user: it's a pointer to dnc->dnc_dbu, which lets you locate dnc, a struct dnode_children. From there, dnc->dnc_children is a dynamically allocated array of dnode_handle_t; each dnode_handle_t has a dnh_zrlock, which is a zrlock_t, and that finally holds the zr_refcount.
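
In a crash(8) session (with the zfs module debuginfo loaded), that walk might look roughly like this; all addresses below are placeholders:

crash> p dnode_stats
crash> struct dmu_buf_impl.db_user <db-address>
crash> struct dnode_children <value-of-db_user>
crash> struct dnode_handle.dnh_zrlock <address-of-dnc_children[idx]>
crash> struct zrlock.zr_refcount <address-of-dnh_zrlock>

Note that db_user points at dnc_dbu, the first member of struct dnode_children, so its value is also the address of dnc itself.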

I'll see if I can come up with a step-by-step guide without the dump; it could be difficult.
Dongyang

@bzzz77
Contributor

bzzz77 commented Jul 9, 2019

We've got another report (https://jira.whamcloud.com/browse/EDU-119) which seems to be at least related. And frankly, I don't quite understand how the original patch is supposed to work, given that multiple threads might be spinning trying to grab the slots starting from different idx.

@lidongyang
Contributor Author

Alex, if a thread can't grab all the slots down the way, it will release the zrls it already holds and start from the beginning, so the one with the largest idx will win.
However, whether we should use a zrl and spin on write is another question...

@bzzz77
Contributor

bzzz77 commented Jul 10, 2019

Let's consider a case:
thread1 wants to grab 4 slots starting from idx=0
thread2 wants to grab 2 slots starting from idx=2
T1 got slot0 and slot1 and was going to take slot2, but then T2 got slot2. Do both restart, again and again?

@lidongyang
Contributor Author

Correct me if I'm wrong, but I think after T2 gets slot2, T1 will release slot0 and slot1 and then restart. T2 will continue with slot3, and once it's done with the locks it will release slot2 and slot3; after that, T1 will be able to continue.
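
In code, the all-or-nothing grab being described looks roughly like this (loosely following dnode_slots_tryenter() in module/zfs/dnode.c):

static boolean_t
dnode_slots_tryenter(dnode_children_t *children, int idx, int slots)
{
    for (int i = idx; i < idx + slots; i++) {
        dnode_handle_t *dnh = &children->dnc_children[i];

        if (!zrl_tryenter(&dnh->dnh_zrlock)) {
            /* Hit a held slot: back out every slot taken so
             * far, so the competing holder can finish. */
            for (int j = idx; j < i; j++) {
                dnh = &children->dnc_children[j];
                zrl_exit(&dnh->dnh_zrlock);
            }
            return (B_FALSE);
        }
    }
    return (B_TRUE);
}

So in the scenario above, T1 backs out of slot0 and slot1 as soon as it finds slot2 held; T2 completes with slot2 and slot3 and releases them; and T1's next attempt can succeed.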

Successfully merging this pull request may close these issues:

livelock in dnode_hold_impl() (#8426)