Fix dnode_hold_impl() soft lockup #8433
Conversation
Nice analysis and root cause. The proposed fix looks good, though it's not ideal that we may spin in the unlikely case where the dnode is highly contended. I think you should additionally add a call to cond_resched() in both of the tryenter while loops, to give the scheduler an explicit place where it can be called.
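For reference, a minimal sketch of what that suggestion would look like, assuming the loop keeps the shape it has in this PR (names as in dnode.c; illustrative, not a final diff):

	/* Spin on the tryenter, but give the scheduler an explicit
	 * preemption point so a contended handle can't soft-lock the CPU. */
	while (!dnode_slots_tryenter(dnc, idx, slots)) {
		DNODE_STAT_BUMP(dnode_hold_alloc_lock_retry);
		cond_resched();
	}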
@behlendorf agreed, shall we do the same to the while loop in dnode_free_interior_slots()?

[ 946.702862] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [mdt00_023:36936]
Yes, good idea. Please do so.
Looks like dnode.c is used to build libzpool.so, and unlike zfs_znode.c, we don't have #ifdef _KERNEL to distinguish the functions needed by userland from those needed by the kernel. That makes it hard to add cond_resched() and include linux/sched.h.
I think it is OK to #ifdef _KERNEL selected code in dnode.c.
We're going to want to add a user space wrapper for this which is a no-op, to fix the build. I'd suggest adding the following to include/sys/zfs_context.h:

diff --git a/include/sys/zfs_context.h b/include/sys/zfs_context.h
index d186c40..52b8f5b 100644
--- a/include/sys/zfs_context.h
+++ b/include/sys/zfs_context.h
@@ -236,6 +236,7 @@ extern kthread_t *zk_thread_create(void (*func)(void *), void *arg,
 #define kpreempt_disable() ((void)0)
 #define kpreempt_enable() ((void)0)
+#define cond_resched() ((void)0)
 
 /*
  * Mutexes
Cool, thanks for pointing out zfs_context.h; then we don't need open-coded #ifdef _KERNEL in dnode.c.
That makes sense, yes let's define cond_resched() there.
Soft lockups could happen when multiple threads try to get the zrl on the same dnode handle in order to allocate and initialize the dnode marked as DN_SLOT_ALLOCATED. Don't loop from the beginning when we can't get the zrl; otherwise we would keep increasing the zrl refcount and nobody could actually lock it.

Signed-off-by: Li Dongyang <[email protected]>
Closes openzfs#8426
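To make the livelock concrete, here is a simplified sketch of the change's shape (function names as in dnode.c; illustrative, not the exact diff). zrl_tryenter() only succeeds when zr_refcount is 0, so if every waiter restarts its scan with dnode_slots_hold(), which does a zrl_add() on each handle, the waiters keep each other's refcounts nonzero and nobody ever gets the lock:

	/* before: on tryenter failure, restart the whole scan */
	dnode_slots_hold(dnc, idx, slots);	/* zrl_add() on each handle */
	...
	dnode_slots_rele(dnc, idx, slots);
	if (!dnode_slots_tryenter(dnc, idx, slots))
		continue;	/* back to dnode_slots_hold() above */

	/* after: drop the holds once, then retry only the tryenter */
	dnode_slots_rele(dnc, idx, slots);
	while (!dnode_slots_tryenter(dnc, idx, slots))
		cond_resched();	/* per the review comments above */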
Codecov Report
@@            Coverage Diff             @@
##           master    #8433      +/-   ##
==========================================
- Coverage   78.55%   78.49%     -0.06%
==========================================
  Files         380      380
  Lines      116004   115900       -104
==========================================
- Hits        91125    90977       -148
- Misses      24879    24923        +44
Continue to review full report at Codecov.
This looks good to me. I never really liked the way we looped there anyway, so this is a double win for me.
We have deployed ZFS 0.7.13 with this patch applied and it reduced the number of times this bug happens, but we did hit it again on our production system. Should this ticket be reopened? Similar backtrace:

[1465132.552779] [] ? dnode_slots_rele+0x57/0x70 [zfs]
@jasimmons1973 can we have a complete trace? Is it still a soft lockup?
Do you need just the complete dmesg output, or do you need kdump info as well?
Here is the crash dump bt:

#25 [ffff8dcd0805b6f8] apic_timer_interrupt at ffffffffa1775df2
A complete dmesg would also help. If you can get the kdump uploaded somewhere, it would be even better.
Sadly we are not allowed to release vmcores. You can get the dmesg from http://www.infradead.org/~jsimmons/vmcore-dmesg.txt; it's too big to post here.
Sorry, but how does one examine data like zrl->zr_refcount with kdump? I'm attempting to use bt -ff but I don't see a clear way to find the zr_refcount value.
James, for zrl->zr_refcount you need to figure out the address of db from the stack, which is a pointer to dmu_buf_impl_t, then look at db->db_user, which is a pointer to dnc->dnc_dbu. From that you can locate dnc, which is a struct dnode_children, and then look at dnc->dnc_children, a dynamically allocated array of dnode_handle_t. For each dnode_handle_t you can see dnh_zrlock, which is a zrlock_t, and finally the zr_refcount. I'll see if I can come up with a step-by-step guide without the dump; it could be difficult.
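For anyone following along, here is that chain written out as C (struct layouts as in the 0.7.x dbuf.h/dnode.h/zrlock.h; the db address and the slot index i come from your own crash session, so treat this as a map of the walk rather than runnable code):

	dmu_buf_impl_t *db = /* address recovered from bt -ff */;
	/* dnc_dbu is the first member of dnode_children_t, so the
	 * db_user pointer is also the address of dnc itself. */
	dnode_children_t *dnc = (dnode_children_t *)db->db_user;
	dnode_handle_t *dnh = &dnc->dnc_children[i];	/* i = slot of interest */
	int32_t refs = dnh->dnh_zrlock.zr_refcount;	/* the value in question */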
We've got another report (https://jira.whamcloud.com/browse/EDU-119) which seems to be related, at least. And frankly, I don't quite understand how the original patch is supposed to work, given that multiple threads might be spinning trying to grab the slots starting from different idx.
Alex, if the threads can't grab all the slots down the way, they will release the zrls they hold and start from the beginning, so the one with the largest idx will win.
let's consider a case:
Correct me if I'm wrong, but I think after T2 got slot2, T1 will release slot0 and slot1 and then restart; T2 will continue with slot3, and after it's done with the locks it will release slot2 and slot3.
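If I read the protocol right, the case plays out something like this (assuming T1 wants slots 0-3 and T2 wants slots 2-3, with tryenter proceeding in ascending slot order and releasing everything on a conflict):

	T1: enters slot0, slot1; tryenter on slot2 fails (T2 holds it)
	    -> releases slot0, slot1 and restarts its scan
	T2: enters slot2, slot3 -> owns its whole range and proceeds
	T2: finishes -> releases slot2, slot3
	T1: retries, enters slot0..slot3 -> proceeds

Because T2's range starts at the higher idx, it never has to wait on a lower slot held by T1, so the contention drains instead of livelocking.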