
Reduce the stack usage of dsl_dataset_remove_clones_key #1730

Closed

Conversation

@kohsuke (Contributor) commented Sep 17, 2013

This fixes issue #1726

@behlendorf (Contributor)

This looks good, we'll get it in. One question though: do you know what the new stack frame size is, and does it resolve your test case?

@kohsuke (Contributor, Author) commented Sep 19, 2013

sizeof(zap_attribute_t) is 280 and sizeof(zap_cursor_t) is 56 on a 64-bit system, so replacing them with pointers saves 280 + 56 - 2*8 = 320 bytes.

In other words, it should bring the stack frame size down from 392 to 72.
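
For context, a minimal sketch of where that saving comes from (the sizes in the comments are the ones measured above on a 64-bit build; the actual patch may name things differently):

/* Before: both structures live in every recursive stack frame. */
zap_cursor_t    zc;     /* sizeof(zap_cursor_t)    == 56  */
zap_attribute_t za;     /* sizeof(zap_attribute_t) == 280 */

/* After: only two 8-byte pointers remain on the stack; the
 * structures themselves are allocated on the heap. */
zap_cursor_t    *zc;
zap_attribute_t *za;

/* Saving per frame: 280 + 56 - 2*8 = 320 bytes. */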

@kohsuke (Contributor, Author) commented Sep 19, 2013

The actual observed result from /sys/kernel/debug/tracing/stack_trace after loading the modified module was 96 bytes. I'm not sure where I made the calculation error.

@kohsuke (Contributor, Author) commented Sep 19, 2013

I noticed that I'm seeing lots of kernel stack traces after this change:

[ 2002.228803] SPL: Showing stack for process 559
[ 2002.229935] Pid: 559, comm: txg_sync Tainted: P            3.0.84-1-lts #1
[ 2002.231202] Call Trace:
[ 2002.232323]  [<ffffffffa0456867>] spl_debug_dumpstack+0x27/0x40 [spl]
[ 2002.233468]  [<ffffffffa045a98e>] kmem_alloc_debug+0x46e/0x500 [spl]
[ 2002.234619]  [<ffffffffa07aa7d9>] dsl_dataset_remove_clones_key.isra.14+0x99/0x1c0 [zfs]
[ 2002.235783]  [<ffffffffa07aa871>] dsl_dataset_remove_clones_key.isra.14+0x131/0x1c0 [zfs]
[ 2002.236932]  [<ffffffffa07aa871>] dsl_dataset_remove_clones_key.isra.14+0x131/0x1c0 [zfs]
[ 2002.238059]  [<ffffffffa07aa871>] dsl_dataset_remove_clones_key.isra.14+0x131/0x1c0 [zfs]
[ 2002.239170]  [<ffffffffa07aa871>] dsl_dataset_remove_clones_key.isra.14+0x131/0x1c0 [zfs]
[ 2002.240283]  [<ffffffffa07aa871>] dsl_dataset_remove_clones_key.isra.14+0x131/0x1c0 [zfs]
[ 2002.241500]  [<ffffffffa07aa871>] dsl_dataset_remove_clones_key.isra.14+0x131/0x1c0 [zfs]
[ 2002.242566]  [<ffffffffa07ac48d>] dsl_dataset_destroy_sync+0x31d/0xf80 [zfs]
[ 2002.243621]  [<ffffffff81421196>] ? ftrace_call+0x5/0x2b
[ 2002.244672]  [<ffffffff8141837e>] ? mutex_unlock+0xe/0x10
[ 2002.245731]  [<ffffffffa07a7255>] ? dsl_dataset_destroy_check+0x115/0x1d0 [zfs]
[ 2002.246793]  [<ffffffff81421196>] ? ftrace_call+0x5/0x2b
[ 2002.247867]  [<ffffffffa07bf46f>] dsl_sync_task_group_sync+0x12f/0x230 [zfs]
[ 2002.248951]  [<ffffffffa07b745b>] dsl_pool_sync+0x48b/0x5c0 [zfs]
[ 2002.250058]  [<ffffffffa07cf84d>] spa_sync+0x42d/0xb10 [zfs]
[ 2002.251262]  [<ffffffff81421196>] ? ftrace_call+0x5/0x2b
[ 2002.252330]  [<ffffffffa07df23d>] txg_sync_thread+0x33d/0x5c0 [zfs]
[ 2002.253396]  [<ffffffffa07def00>] ? txg_delay+0x160/0x160 [zfs]
[ 2002.254451]  [<ffffffffa045ec0a>] thread_generic_wrapper+0x7a/0x90 [spl]
[ 2002.255508]  [<ffffffffa045eb90>] ? __thread_exit+0xa0/0xa0 [spl]
[ 2002.256566]  [<ffffffff8107a0ec>] kthread+0x8c/0xa0
[ 2002.257637]  [<ffffffff81422524>] kernel_thread_helper+0x4/0x10
[ 2002.258725]  [<ffffffff8107a060>] ? kthread_flush_work_fn+0x20/0x20
[ 2002.259839]  [<ffffffff81422520>] ? gs_change+0x13/0x13
[ 2002.262344] SPL: Showing stack for process 559

But I'm puzzled as to why. The "SPL: Showing stack for process" output comes from spl_debug_dumpstack, but kmem_alloc_debug doesn't call that function directly; the call is in kmem_alloc_track. Either way, this should only happen for a large kmem_alloc request, which I'm not making, and I don't see the "large kmem_alloc(...)" message. What's going on?

@nedbass (Contributor) commented Sep 19, 2013

The stack must come from kmalloc_nofail->sanitize_flags. I think you need to use KM_PUSHPAGE instead of KM_SLEEP. Maybe your spl_debug_* module options are masking out the console message.

@behlendorf (Contributor)

Because of the allocation flags used, it's warning that a deadlock is possible, since this is called in the sync context. Use KM_PUSHPAGE instead of KM_SLEEP.

dsl_dataset_remove_clones_key is recursive, so if the recursion goes deep
it can overrun the Linux kernel stack size of 8KB. I have seen this
happen in an actual deployment, and subsequently confirmed it by
running a test workload on a custom-built kernel that uses a 32KB stack.

See the following stack trace as an example of a case that would have
overrun an 8KB kernel stack:

        Depth    Size   Location    (42 entries)
        -----    ----   --------
  0)    11192      72   __kmalloc+0x2e/0x240
  1)    11120     144   kmem_alloc_debug+0x20e/0x500 [spl]
  2)    10976      72   dbuf_hold_impl+0x4a/0xa0 [zfs]
  3)    10904     120   dbuf_prefetch+0xd3/0x280 [zfs]
  4)    10784      80   dmu_zfetch_dofetch.isra.5+0x10f/0x180 [zfs]
  5)    10704     240   dmu_zfetch+0x5f7/0x10e0 [zfs]
  6)    10464     168   dbuf_read+0x71e/0x8f0 [zfs]
  7)    10296     104   dnode_hold_impl+0x1ee/0x620 [zfs]
  8)    10192      16   dnode_hold+0x19/0x20 [zfs]
  9)    10176      88   dmu_buf_hold+0x42/0x1b0 [zfs]
 10)    10088     144   zap_lockdir+0x48/0x730 [zfs]
 11)     9944     128   zap_cursor_retrieve+0x1c4/0x2f0 [zfs]
 12)     9816     392   dsl_dataset_remove_clones_key.isra.14+0xab/0x190 [zfs]
 13)     9424     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 14)     9032     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 15)     8640     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 16)     8248     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 17)     7856     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 18)     7464     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 19)     7072     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 20)     6680     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 21)     6288     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 22)     5896     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 23)     5504     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 24)     5112     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 25)     4720     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 26)     4328     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 27)     3936     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 28)     3544     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 29)     3152     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 30)     2760     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 31)     2368     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 32)     1976     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 33)     1584     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 [zfs]
 34)     1192     232   dsl_dataset_destroy_sync+0x311/0xf60 [zfs]
 35)      960      72   dsl_sync_task_group_sync+0x12f/0x230 [zfs]
 36)      888     168   dsl_pool_sync+0x48b/0x5c0 [zfs]
 37)      720     184   spa_sync+0x417/0xb00 [zfs]
 38)      536     184   txg_sync_thread+0x325/0x5b0 [zfs]
 39)      352      48   thread_generic_wrapper+0x7a/0x90 [spl]
 40)      304     128   kthread+0xc0/0xd0
 41)      176     176   ret_from_fork+0x7c/0xb0

As suggested in openzfs#1726, this
change is a quick fix that reduces the stack usage of
dsl_dataset_remove_clones_key by allocating the structures on the heap
instead of on the stack.

This is not a fundamental fix, as one can create an arbitrarily large
dataset that overruns any fixed-size stack, but it makes the problem
far less likely.

Reference: openzfs#1726
Signed-off-by: Kohsuke Kawaguchi <[email protected]>
Closes openzfs#1726
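
A minimal sketch of the pattern the commit message describes, combined with the KM_PUSHPAGE flag requested above (illustrative only; the exact code in the patch may differ):

/* Inside dsl_dataset_remove_clones_key(): allocate the ZAP cursor and
 * attribute on the heap rather than keeping ~336 bytes of them in every
 * recursive stack frame. KM_PUSHPAGE is used because this runs in the
 * sync context, where a KM_SLEEP allocation could issue IO back into
 * ZFS and deadlock. */
zap_cursor_t *zc = kmem_alloc(sizeof (zap_cursor_t), KM_PUSHPAGE);
zap_attribute_t *za = kmem_alloc(sizeof (zap_attribute_t), KM_PUSHPAGE);

/* ... walk the clones ZAP with zap_cursor_retrieve(zc, za),
 * recursing into each clone as before ... */

kmem_free(za, sizeof (zap_attribute_t));
kmem_free(zc, sizeof (zap_cursor_t));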
@kohsuke (Contributor, Author) commented Sep 19, 2013

Patch revised based on the feedback.

For my own education, where can I learn about the difference between KM_PUSHPAGE and KM_SLEEP, and what "the sync context" is?

@nedbass (Contributor) commented Sep 19, 2013

@kohsuke briefly:
KM_SLEEP means the process may reschedule and the allocation cannot fail. KM_PUSHPAGE means the allocation must not internally issue any IO. Sync context means a thread that may issue synchronous IO to disk. I believe this includes the txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard() call paths.

These flags are defined and briefly described in include/sys/kmem.h in the SPL. You can find many other discussions of the topic if you search the SPL and ZFS repositories and issue trackers. For example, see 8630650a, openzfs/spl@eb0f407a2, and #924 (comment).
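
As a rough illustration of that rule of thumb (this shows how the flags are used at a call site, not the SPL's internal flag definitions):

size_t size = sizeof (zap_attribute_t);

/* Ordinary process context: it is safe for the allocator to sleep and,
 * if needed, issue IO to reclaim memory. */
void *a = kmem_alloc(size, KM_SLEEP);
kmem_free(a, size);

/* Sync context (e.g. the txg_sync, zfs_putpage, zvol_write, or
 * zvol_discard paths): the allocation must not issue IO back into ZFS,
 * or the thread doing the IO could end up waiting on itself. */
void *b = kmem_alloc(size, KM_PUSHPAGE);
kmem_free(b, size);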

@kohsuke (Contributor, Author) commented Sep 21, 2013

Thanks for the details. I really appreciate that!

@ryao (Contributor) commented Sep 21, 2013

I made a comment on the diff about using KM_SLEEP instead of KM_PUSHPAGE before reading the rest of the discussion. Please disregard it.

@behlendorf closed this Sep 26, 2013