Reduce the stack usage of dsl_dataset_remove_clones_key #1730
Conversation
This looks good, we'll get it in. One question though: do you know what the new stack frame size is, and does it resolve your test case?
IOW, it should bring down the stack frame size from 392 to 72.
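(For scale: the trace in the commit message below shows 22 recursive dsl_dataset_remove_clones_key frames at 392 bytes each, i.e. 22 × 392 = 8624 bytes, which on its own already exceeds the default 8 KB kernel stack; at 72 bytes per frame the same recursion depth takes 22 × 72 = 1584 bytes.)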
The actual observed result from
I noticed that I'm seeing lots of kernel stack traces after this change:
But I'm puzzled as to why. The "SPL: Showing stack for process" output comes from
The stack must come from
Due to the allocation flags used, it's warning that a deadlock is possible because this was called in the sync context. Use KM_PUSHPAGE instead of KM_SLEEP.
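For anyone reading this thread later, a minimal sketch of the distinction, assuming the SPL kmem interface (kmem_alloc/kmem_free and the KM_* flags from include/sys/kmem.h); the helper names below are made up for illustration and are not part of this patch:

```c
#include <sys/kmem.h>

/*
 * Illustrative only.
 *
 * KM_SLEEP lets the allocator block and, under memory pressure,
 * re-enter the filesystem to reclaim pages.  If the caller is already
 * running in the txg sync context (spa_sync -> dsl_pool_sync -> ...),
 * that re-entry can deadlock on the very transaction group being
 * synced, which is what the SPL stack-dump warning above is flagging.
 *
 * KM_PUSHPAGE also sleeps if needed, but tells the allocator not to
 * re-enter the filesystem, so it is the flag to use on sync paths.
 */
static void *
sync_path_alloc(size_t size)
{
	return (kmem_alloc(size, KM_PUSHPAGE));	/* safe in sync context */
}

static void
sync_path_free(void *buf, size_t size)
{
	kmem_free(buf, size);
}
```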
dsl_dataset_remove_clones_key recurses, so if the recursion goes deep it can overrun the 8KB Linux kernel stack. I have seen this happen in an actual deployment, and subsequently confirmed it by running a test workload on a custom-built kernel that uses a 32KB stack. See the following stack trace as an example of a case that would have overrun an 8KB stack:

  Depth    Size   Location    (42 entries)
  -----    ----   --------
    0)    11192      72   __kmalloc+0x2e/0x240
    1)    11120     144   kmem_alloc_debug+0x20e/0x500 [spl]
    2)    10976      72   dbuf_hold_impl+0x4a/0xa0 [zfs]
    3)    10904     120   dbuf_prefetch+0xd3/0x280 [zfs]
    4)    10784      80   dmu_zfetch_dofetch.isra.5+0x10f/0x180 [zfs]
    5)    10704     240   dmu_zfetch+0x5f7/0x10e0 [zfs]
    6)    10464     168   dbuf_read+0x71e/0x8f0 [zfs]
    7)    10296     104   dnode_hold_impl+0x1ee/0x620 [zfs]
    8)    10192      16   dnode_hold+0x19/0x20 [zfs]
    9)    10176      88   dmu_buf_hold+0x42/0x1b0 [zfs]
   10)    10088     144   zap_lockdir+0x48/0x730 [zfs]
   11)     9944     128   zap_cursor_retrieve+0x1c4/0x2f0 [zfs]
   12)     9816     392   dsl_dataset_remove_clones_key.isra.14+0xab/0x190 [zfs]
   13)     9424     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   14)     9032     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   15)     8640     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   16)     8248     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   17)     7856     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   18)     7464     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   19)     7072     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   20)     6680     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   21)     6288     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   22)     5896     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   23)     5504     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   24)     5112     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   25)     4720     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   26)     4328     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   27)     3936     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   28)     3544     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   29)     3152     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   30)     2760     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   31)     2368     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   32)     1976     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   33)     1584     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
   34)     1192     232   dsl_dataset_destroy_sync+0x311/0xf60 [zfs]
   35)      960      72   dsl_sync_task_group_sync+0x12f/0x230 [zfs]
   36)      888     168   dsl_pool_sync+0x48b/0x5c0 [zfs]
   37)      720     184   spa_sync+0x417/0xb00 [zfs]
   38)      536     184   txg_sync_thread+0x325/0x5b0 [zfs]
   39)      352      48   thread_generic_wrapper+0x7a/0x90 [spl]
   40)      304     128   kthread+0xc0/0xd0
   41)      176     176   ret_from_fork+0x7c/0xb0

As suggested in openzfs#1726, this change is a quick fix that reduces the stack usage of dsl_dataset_remove_clones_key by allocating structures on the heap rather than on the stack. This is not a fundamental fix, as one can create an arbitrarily large data set that overruns any fixed-size stack, but it makes the problem far less likely.

Reference: openzfs#1726
Signed-off-by: Kohsuke Kawaguchi <[email protected]>
Closes openzfs#1726
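To make the shape of the fix concrete, here is a hedged sketch of the general pattern (simplified, not the literal diff): the large locals that used to sit in each recursive stack frame are allocated from the heap instead, so each level of recursion carries only pointers. The function name below is invented for illustration.

```c
#include <sys/kmem.h>
#include <sys/zap.h>
#include <sys/dmu_tx.h>
#include <sys/dsl_dataset.h>

/*
 * Sketch only: simplified from dsl_dataset_remove_clones_key() to show
 * the stack-to-heap pattern; see the actual diff for the real change.
 */
static void
remove_clones_key_sketch(dsl_dataset_t *ds, uint64_t mintxg, dmu_tx_t *tx)
{
	zap_cursor_t *zc;
	zap_attribute_t *za;

	/*
	 * Before: "zap_cursor_t zc; zap_attribute_t za;" lived on the
	 * stack, costing a few hundred bytes per recursion level.
	 * After: heap allocations, using KM_PUSHPAGE because this runs
	 * in the txg sync context (see the discussion above).
	 */
	zc = kmem_alloc(sizeof (zap_cursor_t), KM_PUSHPAGE);
	za = kmem_alloc(sizeof (zap_attribute_t), KM_PUSHPAGE);

	/* ... iterate the ZAP of clones and recurse into each child ... */

	kmem_free(za, sizeof (zap_attribute_t));
	kmem_free(zc, sizeof (zap_cursor_t));
}
```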
Patch revised based on the feedback. For my own education, where can I learn the difference between KM_PUSHPAGE and KM_SLEEP, and what "the sync context" is?
@kohsuke briefly: These flags are defined and briefly described in include/sys/kmem.h in the SPL. You can find many other discussions on the topic if you search the SPL and ZFS repositories and issue trackers. For example, see 8630650a, openzfs/spl@eb0f407a2, and #924 (comment).
Thanks for the details. I really appreciate that!
I made a comment on the diff about using KM_SLEEP instead of KM_PUSHPAGE before reading the rest of the discussion. Please disregard it.
This fixes issue #1726