CPU stall / hung - only able to clear via hard power cycle #557
Comments
G'day, "Me too". linux-3.1.10, openzfs/spl@3c6ed54, b4b599d I just got these for the first time today, after 8 days of uptime with a constant inbound rsync load:
...and the machine required a hard reset. The above is a direct cut-n-paste from kern.log, i.e. there weren't any stack traces associated with these stalls that might help diagnose where they came from (Documentation/RCU/stallwarn.txt). The kernel seems to have stack tracing enabled:
Is there something I need to enable to get a stack trace with the stall warning? Paul, are there any stack traces associated with your stalls? In contrast to Paul, I don't have any hung task messages since the machine booted. Chris |
G'day Chris, Sorry to hear you are affected also. I have not been able to pin down the relevant stack trace data, but I am not an expert in that regard, so for me it has been the proverbial "needle in the haystack". I am hoping there is a definitive reply soon on what precisely we need to do to give the devs the data they need to debug and nail this issue down. Cheers Paul |
Hi Paul, If you have stack traces they should be in your log just after the related "detected stall" line, something like... oh, let me see, I think I might have one or two around... :-) :-( (note: this one's unrelated to the "detected stall" stuff):
...all the stuff between the "BUG" line and the "end trace" line should be included. It would be interesting to get a bunch of traces for your "detected stall" instances if you have them. Chris |
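For anyone collecting traces to post here, the following is a minimal sketch of pulling out everything between a "BUG" line and the following "end trace" line, as described above. The function name `extract_traces` is made up for illustration, and the `BUG:` pattern is an assumption about how the marker line looks in your log; stall warnings that start with `INFO:` would need a different start pattern.

```shell
#!/bin/sh
# extract_traces (hypothetical helper): read a kernel log on stdin and print
# each block that starts at a line containing "BUG:" and ends at the next
# line containing "end trace", inclusive.
extract_traces() {
    awk '/BUG:/      { in_trace = 1 }
         in_trace    { print }
         /end trace/ { in_trace = 0 }'
}
```

It could be used as `extract_traces < /var/log/kern.log`, adjusting the path to wherever your distribution writes kernel messages.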
OK, looks like I can reproduce the stalls at will by doing a Hmmm... Paul, are you using xattrs? I had an open window to the system which was still responsive (for a while) and I was able to get some stack traces using
Also:
And, after a reboot and triggering the problem again (note: the "9 users" were all me with multiple windows open):
|
Well done Chris, and no, I am not using xattrs. |
OK, I got a stall on linux-3.2.5 (was previously on 3.1.10), openzfs/spl@3c6ed54, b4b599d, and this time it gave me a trace (below). Once again it's related to xattrs, but this time on a newly created pool and zfs, with
|
I can contribute with a trace from ubuntu 11.10, kernel 3.0.0, running 0.6.0.48-0ubuntu1~oneiric1 from ppa. [958419.164001] INFO: rcu_sched_state detected stall on CPU 1 (t=15000 jiffies) [958560.780112] INFO: task z_fr_iss/0:3713 blocked for more than 120 seconds. [958560.780230] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [958560.780352] z_fr_iss/0 D 0000000000000000 0 3713 2 0x00000000 [958560.780363] ffff8801a6b07d00 0000000000000046 ffffffff81606cce ffff8801a6b07d80 [958560.780373] ffff8801a6b07fd8 ffff8801a6b07fd8 ffff8801a6b07fd8 0000000000012a40 [958560.780382] ffff8802028fdc80 ffff8801d9bb4560 ffff8801a6b07fd8 ffff8801817c7288 [958560.780390] Call Trace: [958560.780407] [] ? common_interrupt+0xe/0x13 [958560.780416] [] schedule+0x3f/0x60 [958560.780424] [] __mutex_lock_slowpath+0xd7/0x150 [958560.780433] [] mutex_lock+0x22/0x40 [958560.780531] [] zio_ready+0x1e0/0x3b0 [zfs] [958560.780602] [] zio_execute+0x9f/0xf0 [zfs] [958560.780626] [] taskq_thread+0x1b1/0x430 [spl] [958560.780635] [] ? try_to_wake_up+0x200/0x200 [958560.780654] [] ? task_alloc+0x160/0x160 [spl] [958560.780663] [] kthread+0x8c/0xa0 [958560.780671] [] kernel_thread_helper+0x4/0x10 [958560.780679] [] ? flush_kthread_worker+0xa0/0xa0 [958560.780687] [] ? gs_change+0x13/0x13 [958560.780693] INFO: task z_fr_iss/2:3715 blocked for more than 120 seconds. [958560.780799] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [958560.780920] z_fr_iss/2 D 0000000000000002 0 3715 2 0x00000000 [958560.780929] ffff8801a1a87d00 0000000000000046 ffffffff8160f54e ffff8801a1a87d80 [958560.780938] ffff8801a1a87fd8 ffff8801a1a87fd8 ffff8801a1a87fd8 0000000000012a40 [958560.780947] ffff8801fc720000 ffff8801dc4c8000 ffff8801817c7288 ffff8801817c7288 [958560.780955] Call Trace: [958560.780964] [] ? 
apic_timer_interrupt+0xe/0x20 [958560.780973] [] schedule+0x3f/0x60 [958560.780981] [] __mutex_lock_slowpath+0xd7/0x150 [958560.780990] [] mutex_lock+0x22/0x40 [958560.781056] [] zio_ready+0x1e0/0x3b0 [zfs] [958560.781123] [] zio_execute+0x9f/0xf0 [zfs] [958560.781143] [] taskq_thread+0x1b1/0x430 [spl] [958560.781151] [] ? try_to_wake_up+0x200/0x200 [958560.781170] [] ? task_alloc+0x160/0x160 [spl] [958560.781178] [] kthread+0x8c/0xa0 [958560.781185] [] kernel_thread_helper+0x4/0x10 [958560.781194] [] ? flush_kthread_worker+0xa0/0xa0 [958560.781201] [] ? gs_change+0x13/0x13 [958560.781208] INFO: task txg_sync:3730 blocked for more than 120 seconds. [958560.784608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [958560.787970] txg_sync D 0000000000000002 0 3730 2 0x00000000 [958560.787979] ffff880197fdfaf0 0000000000000046 ffffffff8160f54e ffff880197fdfb70 [958560.787988] ffff880197fdffd8 ffff880197fdffd8 ffff880197fdffd8 0000000000012a40 [958560.788015] ffff8801dc4c8000 ffff8801c059c560 0000000000000000 ffff8801817c7288 [958560.788024] Call Trace: [958560.788033] [] ? apic_timer_interrupt+0xe/0x20 [958560.788042] [] schedule+0x3f/0x60 [958560.788050] [] __mutex_lock_slowpath+0xd7/0x150 [958560.788059] [] ? __kmalloc+0x31/0x160 [958560.788067] [] mutex_lock+0x22/0x40 [958560.788137] [] zio_add_child+0x61/0x120 [zfs] [958560.788210] [] zio_create+0x426/0x520 [zfs] [958560.788283] [] zio_free_sync+0x76/0x80 [zfs] [958560.788357] [] spa_free_sync_cb+0x43/0x60 [zfs] [958560.788433] [] ? bpobj_enqueue_cb+0x20/0x20 [zfs] [958560.788491] [] bplist_iterate+0x7a/0xb0 [zfs] [958560.788566] [] spa_sync+0x3c3/0xa00 [zfs] [958560.788579] [] ? default_wake_function+0x12/0x20 [958560.788594] [] ? autoremove_wake_function+0x16/0x40 [958560.788608] [] ? __wake_up+0x53/0x70 [958560.788683] [] txg_sync_thread+0x216/0x390 [zfs] [958560.788760] [] ? txg_init+0x260/0x260 [zfs] [958560.788835] [] ? 
txg_init+0x260/0x260 [zfs] [958560.788862] [] thread_generic_wrapper+0x78/0x90 [spl] [958560.788886] [] ? __thread_create+0x160/0x160 [spl] [958560.788899] [] kthread+0x8c/0xa0 [958560.788913] [] kernel_thread_helper+0x4/0x10 [958560.788927] [] ? flush_kthread_worker+0xa0/0xa0 [958560.788940] [] ? gs_change+0x13/0x13 [958560.788953] INFO: task rsync:19346 blocked for more than 120 seconds. [958560.792342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [958560.795766] rsync D ffffffff81805120 0 19346 19345 0x00000000 [958560.795776] ffff88018f711b88 0000000000000082 0000000000000001 ffff8801f00302e0 [958560.795785] ffff88018f711fd8 ffff88018f711fd8 ffff88018f711fd8 0000000000012a40 [958560.795794] ffff880202935c80 ffff8801febc0000 ffff88018f711b98 ffff8801f0030330 [958560.795802] Call Trace: [958560.795810] [] schedule+0x3f/0x60 [958560.795833] [] cv_wait_common+0x77/0xd0 [spl] [958560.795842] [] ? add_wait_queue+0x60/0x60 [958560.795862] [] __cv_wait+0x13/0x20 [spl] [958560.795934] [] txg_wait_open+0x73/0xa0 [zfs] [958560.795993] [] dmu_tx_wait+0xed/0xf0 [zfs] [958560.796076] [] zfs_write+0x377/0xc50 [zfs] [958560.796093] [] ? perf_event_task_sched_out+0x2e/0xa0 [958560.796164] [] zpl_write_common+0x52/0x80 [zfs] [958560.796236] [] zpl_write+0x68/0xa0 [zfs] [958560.796251] [] vfs_write+0xb3/0x180 [958560.796265] [] sys_write+0x4a/0x90 [958560.796279] [] system_call_fastpath+0x16/0x1b [958599.283997] INFO: rcu_sched_state detected stall on CPU 1 (t=60030 jiffies) [958680.796112] INFO: task z_fr_iss/0:3713 blocked for more than 120 seconds. [958680.799596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[958680.803078] z_fr_iss/0 D 0000000000000000 0 3713 2 0x00000000 [958680.803090] ffff8801a6b07d00 0000000000000046 ffffffff81606cce ffff8801a6b07d80 [958680.803100] ffff8801a6b07fd8 ffff8801a6b07fd8 ffff8801a6b07fd8 0000000000012a40 [958680.803109] ffff8802028fdc80 ffff8801d9bb4560 ffff8801a6b07fd8 ffff8801817c7288 [958680.803119] Call Trace: [958680.803136] [] ? common_interrupt+0xe/0x13 [958680.803151] [] schedule+0x3f/0x60 [958680.803165] [] __mutex_lock_slowpath+0xd7/0x150 [958680.803180] [] mutex_lock+0x22/0x40 [958680.803286] [] zio_ready+0x1e0/0x3b0 [zfs] [958680.803363] [] zio_execute+0x9f/0xf0 [zfs] [958680.803394] [] taskq_thread+0x1b1/0x430 [spl] [958680.803408] [] ? try_to_wake_up+0x200/0x200 [958680.803432] [] ? task_alloc+0x160/0x160 [spl] [958680.803447] [] kthread+0x8c/0xa0 [958680.803460] [] kernel_thread_helper+0x4/0x10 [958680.803473] [] ? flush_kthread_worker+0xa0/0xa0 [958680.803487] [] ? gs_change+0x13/0x13 [958680.803497] INFO: task z_fr_iss/2:3715 blocked for more than 120 seconds. [958680.806982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [958680.810472] z_fr_iss/2 D 0000000000000002 0 3715 2 0x00000000 [958680.810482] ffff8801a1a87d00 0000000000000046 ffffffff8160f54e ffff8801a1a87d80 [958680.810492] ffff8801a1a87fd8 ffff8801a1a87fd8 ffff8801a1a87fd8 0000000000012a40 [958680.810501] ffff8801fc720000 ffff8801dc4c8000 ffff8801817c7288 ffff8801817c7288 [958680.810510] Call Trace: [958680.810519] [] ? apic_timer_interrupt+0xe/0x20 [958680.810528] [] schedule+0x3f/0x60 [958680.810543] [] __mutex_lock_slowpath+0xd7/0x150 [958680.810557] [] mutex_lock+0x22/0x40 [958680.810632] [] zio_ready+0x1e0/0x3b0 [zfs] [958680.810705] [] zio_execute+0x9f/0xf0 [zfs] [958680.810731] [] taskq_thread+0x1b1/0x430 [spl] [958680.810744] [] ? try_to_wake_up+0x200/0x200 [958680.810768] [] ? task_alloc+0x160/0x160 [spl] [958680.810781] [] kthread+0x8c/0xa0 [958680.810794] [] kernel_thread_helper+0x4/0x10 [958680.810808] [] ? 
flush_kthread_worker+0xa0/0xa0 [958680.810821] [] ? gs_change+0x13/0x13 [958680.810832] INFO: task txg_sync:3730 blocked for more than 120 seconds. [958680.814340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [958680.817912] txg_sync D 0000000000000002 0 3730 2 0x00000000 [958680.817922] ffff880197fdfaf0 0000000000000046 ffffffff8160f54e ffff880197fdfb70 [958680.817931] ffff880197fdffd8 ffff880197fdffd8 ffff880197fdffd8 0000000000012a40 [958680.817940] ffff8801dc4c8000 ffff8801c059c560 0000000000000000 ffff8801817c7288 [958680.817948] Call Trace: [958680.817957] [] ? apic_timer_interrupt+0xe/0x20 [958680.817966] [] schedule+0x3f/0x60 [958680.817979] [] __mutex_lock_slowpath+0xd7/0x150 [958680.817994] [] ? __kmalloc+0x31/0x160 [958680.818007] [] mutex_lock+0x22/0x40 [958680.818082] [] zio_add_child+0x61/0x120 [zfs] [958680.818155] [] zio_create+0x426/0x520 [zfs] [958680.818227] [] zio_free_sync+0x76/0x80 [zfs] [958680.818302] [] spa_free_sync_cb+0x43/0x60 [zfs] [958680.818377] [] ? bpobj_enqueue_cb+0x20/0x20 [zfs] [958680.818434] [] bplist_iterate+0x7a/0xb0 [zfs] [958680.818508] [] spa_sync+0x3c3/0xa00 [zfs] [958680.818522] [] ? default_wake_function+0x12/0x20 [958680.818536] [] ? autoremove_wake_function+0x16/0x40 [958680.818549] [] ? __wake_up+0x53/0x70 [958680.818624] [] txg_sync_thread+0x216/0x390 [zfs] [958680.818700] [] ? txg_init+0x260/0x260 [zfs] [958680.818775] [] ? txg_init+0x260/0x260 [zfs] [958680.818801] [] thread_generic_wrapper+0x78/0x90 [spl] [958680.818825] [] ? __thread_create+0x160/0x160 [spl] [958680.818839] [] kthread+0x8c/0xa0 [958680.818852] [] kernel_thread_helper+0x4/0x10 [958680.818865] [] ? flush_kthread_worker+0xa0/0xa0 [958680.818878] [] ? gs_change+0x13/0x13 [958680.818890] INFO: task rsync:19346 blocked for more than 120 seconds. [958680.822496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[958680.826156] rsync D ffffffff81805120 0 19346 19345 0x00000000 [958680.826166] ffff88018f711b88 0000000000000082 0000000000000001 ffff8801f00302e0 [958680.826174] ffff88018f711fd8 ffff88018f711fd8 ffff88018f711fd8 0000000000012a40 [958680.826183] ffff880202935c80 ffff8801febc0000 ffff88018f711b98 ffff8801f0030330 [958680.826191] Call Trace: [958680.826200] [] schedule+0x3f/0x60 [958680.826223] [] cv_wait_common+0x77/0xd0 [spl] [958680.826237] [] ? add_wait_queue+0x60/0x60 [958680.826263] [] __cv_wait+0x13/0x20 [spl] [958680.826339] [] txg_wait_open+0x73/0xa0 [zfs] [958680.826403] [] dmu_tx_wait+0xed/0xf0 [zfs] [958680.826477] [] zfs_write+0x377/0xc50 [zfs] [958680.826493] [] ? perf_event_task_sched_out+0x2e/0xa0 [958680.826564] [] zpl_write_common+0x52/0x80 [zfs] [958680.826635] [] zpl_write+0x68/0xa0 [zfs] [958680.826650] [] vfs_write+0xb3/0x180 [958680.826664] [] sys_write+0x4a/0x90 [958680.826678] [] system_call_fastpath+0x16/0x1b [958779.403997] INFO: rcu_sched_state detected stall on CPU 1 (t=105060 jiffies) [958800.824113] INFO: task z_fr_iss/0:3713 blocked for more than 120 seconds. [958800.827877] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [958800.831644] z_fr_iss/0 D 0000000000000000 0 3713 2 0x00000000 [958800.831655] ffff8801a6b07d00 0000000000000046 ffffffff81606cce ffff8801a6b07d80 [958800.831665] ffff8801a6b07fd8 ffff8801a6b07fd8 ffff8801a6b07fd8 0000000000012a40 [958800.831675] ffff8802028fdc80 ffff8801d9bb4560 ffff8801a6b07fd8 ffff8801817c7288 [958800.831684] Call Trace: [958800.831701] [] ? common_interrupt+0xe/0x13 [958800.831716] [] schedule+0x3f/0x60 [958800.831730] [] __mutex_lock_slowpath+0xd7/0x150 [958800.831744] [] mutex_lock+0x22/0x40 [958800.831848] [] zio_ready+0x1e0/0x3b0 [zfs] [958800.831925] [] zio_execute+0x9f/0xf0 [zfs] [958800.831955] [] taskq_thread+0x1b1/0x430 [spl] [958800.831969] [] ? try_to_wake_up+0x200/0x200 [958800.831994] [] ? 
task_alloc+0x160/0x160 [spl] [958800.832012] [] kthread+0x8c/0xa0 [958800.832021] [] kernel_thread_helper+0x4/0x10 [958800.832035] [] ? flush_kthread_worker+0xa0/0xa0 [958800.832048] [] ? gs_change+0x13/0x13 [958800.832059] INFO: task z_fr_iss/2:3715 blocked for more than 120 seconds. [958800.835841] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [958800.839694] z_fr_iss/2 D 0000000000000002 0 3715 2 0x00000000 [958800.839704] ffff8801a1a87d00 0000000000000046 ffffffff8160f54e ffff8801a1a87d80 [958800.839714] ffff8801a1a87fd8 ffff8801a1a87fd8 ffff8801a1a87fd8 0000000000012a40 [958800.839723] ffff8801fc720000 ffff8801dc4c8000 ffff8801817c7288 ffff8801817c7288 [958800.839732] Call Trace: [958800.839742] [] ? apic_timer_interrupt+0xe/0x20 [958800.839751] [] schedule+0x3f/0x60 [958800.839765] [] __mutex_lock_slowpath+0xd7/0x150 [958800.839778] [] mutex_lock+0x22/0x40 [958800.839852] [] zio_ready+0x1e0/0x3b0 [zfs] [958800.839925] [] zio_execute+0x9f/0xf0 [zfs] [958800.839951] [] taskq_thread+0x1b1/0x430 [spl] [958800.839963] [] ? try_to_wake_up+0x200/0x200 [958800.839987] [] ? task_alloc+0x160/0x160 [spl] [958800.839999] [] kthread+0x8c/0xa0 [958800.840015] [] kernel_thread_helper+0x4/0x10 [958800.840024] [] ? flush_kthread_worker+0xa0/0xa0 [958800.840037] [] ? gs_change+0x13/0x13 [958959.523996] INFO: rcu_sched_state detected stall on CPU 1 (t=150090 jiffies) [959139.643996] INFO: rcu_sched_state detected stall on CPU 1 (t=195120 jiffies) [959319.763996] INFO: rcu_sched_state detected stall on CPU 1 (t=240150 jiffies) |
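As a side note on reading the stall lines in the trace above: each one carries both a kernel timestamp (in seconds) and a jiffies count, so two consecutive reports are enough to estimate the tick rate and turn `t=... jiffies` into seconds. This is only a sketch; the helper names are invented here, and the resulting rate is an inference from the log, not a value read from the kernel config.

```shell
#!/bin/sh
# estimate_hz T1 J1 T2 J2 (hypothetical helper): jiffies per second between
# two stall reports, given their timestamps (seconds) and jiffies counts.
estimate_hz() {
    awk -v t1="$1" -v j1="$2" -v t2="$3" -v j2="$4" \
        'BEGIN { printf "%.0f\n", (j2 - j1) / (t2 - t1) }'
}

# jiffies_to_seconds JIFFIES HZ (hypothetical helper): convert a jiffies
# count to seconds at the given tick rate.
jiffies_to_seconds() {
    awk -v j="$1" -v hz="$2" 'BEGIN { printf "%.1f\n", j / hz }'
}
```

With the first two stall lines from the log above, `estimate_hz 958419.164001 15000 958599.283997 60030` works out to about 250 jiffies per second, so the initial `t=15000 jiffies` report corresponds to roughly 60 seconds of stall.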
In this thread it was suggested to raise /proc/sys/vm/min_free_kbytes:
echo 135168 > /proc/sys/vm/min_free_kbytes |
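Before changing the value it may be worth checking what the machine currently has. A rough sketch follows; the helper names are made up for illustration, and the optional file argument exists only so the functions can be tried against a copy instead of the live /proc entry.

```shell
#!/bin/sh
# current_min_free [FILE] (hypothetical helper): print the current reserve
# in kB. FILE defaults to the live /proc entry.
current_min_free() {
    cat "${1:-/proc/sys/vm/min_free_kbytes}"
}

# needs_raise TARGET_KB [FILE] (hypothetical helper): succeed only if the
# current value is below TARGET_KB.
needs_raise() {
    [ "$(current_min_free "$2")" -lt "$1" ]
}

# Example (as root), using the value suggested above:
#   needs_raise 135168 && sysctl -w vm.min_free_kbytes=135168
```

The `echo` form and `sysctl -w` both last only until the next reboot. To keep the setting, a `vm.min_free_kbytes = 135168` line in /etc/sysctl.conf is the usual approach.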
Thanks for the additional debugging on this. Many of the stacks suggest this is related to ZFS's heavy usage of vmalloc(), so the suggestion of increasing min_free_kbytes to take some pressure off the VM may be helpful. Related to this here's a link to the kernel docs for this warning. |
For those suffering from this problem I'd appreciate it if you could give the following patch a try. It's a significant step towards integrating more tightly with Linux's memory reclaim mechanisms and shedding the Solaris VM baggage. Thus far it has performed well for me but I'd like to see it get more use with a wide variety of workloads before I consider merging it. I believe it may help with the memory issues described here. |
Excuse my git / github ignorance, but how do I get to commit behlendorf/zfs@062b89c via a local clone of I found an almost-identical commit behlendorf/zfs@2a349bd in the
The commit behlendorf/zfs@062b89c definitely exists in github because clicking on the link takes you to it, but when I try to see it in a local clone it behaves as if that commit doesn't exist:
|
Sorry, my fault. I meant to update the link, I force updated the branch so the commit id changed. Just go ahead and use the https://github.com/behlendorf/zfs/tree/vm branch or cherry pick the behlendorf/zfs@2a349bd commit.
Using a new repo:
git clone https://github.com/behlendorf/zfs.git zfs-behlendorf
cd zfs-behlendorf
git checkout -b vm origin/vm
or adding my repo as a remote and cherry-picking:
git remote add behlendorf https://github.com/behlendorf/zfs.git
git fetch behlendorf
git cherry-pick 2a349bd9380b18efa71c504baa3b1103c48e7205 |
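A force-updated branch leaves the old commit unreachable from any branch, which would explain why the web link still resolves while a fresh clone or fetch never receives the object. A small sketch for checking whether a given commit id is actually present in a local clone (the function name `has_commit` is invented for this example):

```shell
#!/bin/sh
# has_commit REPO_DIR COMMIT (hypothetical helper): succeed if COMMIT exists
# as a commit object in the repository at REPO_DIR, fail otherwise.
has_commit() {
    git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}

# Example, run after "git fetch behlendorf":
#   has_commit . 2a349bd9380b18efa71c504baa3b1103c48e7205 && echo present
```

If the check fails right after a fetch, the commit is most likely no longer reachable from any branch on the remote.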
OK, got it, will start testing and report back... |
Just got these soft lockups below, with spl@a3a69b7, zfs@42cb381 + behlendorf/zfs@2a349bd, linux-3.3.0-rc7
|
The soft lockups are due to issue #457. |
FYI, I've been running with behlendorf/zfs@2a349bd for the last 2 weeks with moderate to severe load and haven't experienced any problems apart from the #457 soft lockups. The load has been mostly parallel rsyncs and tar-to-tar copies into ZFS (i.e. data writes with lots of metadata including dir-based xattr read/write), with some of the data pulled from separate machines on the same local network and some from ext4 on md/lvm on the same machine, and occasionally running I.e. for my load, pull request #618 is looking good! ...hmmm, except that #618 has changed the reclaimable pages calculation in I'll switch to that code base and continue testing, and report over in #618. |
Great, please keep me posted and update to the latest patch. We found a few issues in the initial version which is why we refreshed it. |
Has this patch been rolled in to the nightly builds yet? I'm experiencing this problem right now... |
Yes, these patches have been merged into master. |
Any idea when? I was running a nightly from probably Thursday last week (0.6.0.63 from looking at my apt logs), though I just updated to this morning's version. There had been several days of very heavy I/O before the problem was triggered. |
These changes have been in the ubuntu daily release since 0.6.0.62. |
OK - after reviewing my logs I still had a 0.6.0.58 release of SPL running at the time even though my ZFS module was up to 0.6.0.63. Everything is at 0.6.0.64 now with no repeats thus far, will update if anything reoccurs. |
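A mismatch like this is easy to miss, so here is a rough sketch of comparing the versions of the loaded modules. It assumes both modules expose a version string under /sys/module/<name>/version and that those strings are directly comparable, which may not hold for every build (the strings need not match the package versions quoted above). The helper names and the optional root argument are invented for illustration.

```shell
#!/bin/sh
# module_version NAME [SYSFS_ROOT] (hypothetical helper): print the version
# string of a loaded module. SYSFS_ROOT defaults to /sys/module.
module_version() {
    cat "${2:-/sys/module}/$1/version" 2>/dev/null
}

# versions_match [SYSFS_ROOT] (hypothetical helper): succeed only if spl and
# zfs both report a version and the two strings are identical.
versions_match() {
    spl_v=$(module_version spl "$1")
    zfs_v=$(module_version zfs "$1")
    [ -n "$spl_v" ] && [ "$spl_v" = "$zfs_v" ]
}

# Example:
#   versions_match || echo "spl/zfs module versions differ or are missing"
```

Note that this looks at the modules currently loaded, which can lag behind the installed packages until the modules are reloaded or the machine is rebooted.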
Since I haven't heard anything for a few months now I'm going to assume that everything's OK with the updated code, so I'm closing the issue. We can reopen it if that's not the case, or just go ahead and file a new bug with the latest symptoms. |
This splat_vprint is using tq_arg->name after tq_arg is freed. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#557
G'day ZFS on Linux community.
Running ZFS rc6 from PPA on Ubuntu 11.10 x64. Really hoping this can be solved when rc7 is achieved please. Any debugging or assistance we can provide to the devs/project with our 8x SANs will be provided freely.........
Moderate activity on file services presented via ZFS causes a "rcu_sched_state detected stall on CPU" error to be logged by the kernel, and / or a "hung_task_timeout_secs" error. See the bottom of this post for relevant logging extracts. Despite all the great work being done by the devs/project, the issue has remained outstanding for the past 2+ months. It is routinely experienced across a diverse range of physical hardware, from $2k through to $100k kit.
The net effect is a continually escalating load average, which we have seen climb as high as 1000 if left to run for hours. If the "reboot" command is entered, the server never actually reboots - the only action that clears the condition is a hard power cycle.
We have been able to reduce the frequency of these attacks from a few times a day to a few times a month, by limiting the size of "zfs_arc_max" to 1/4 of available RAM.
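For anyone wanting to try the same workaround, a minimal sketch of working out a quarter of RAM in bytes and formatting it as a module option line. The helper names are invented here, and /etc/modprobe.d/zfs.conf is only the conventional place for such a line; check what your packaging uses.

```shell
#!/bin/sh
# arc_max_bytes [MEMINFO] (hypothetical helper): print one quarter of
# MemTotal in bytes. MEMINFO defaults to /proc/meminfo, where MemTotal is
# reported in kB.
arc_max_bytes() {
    awk '/^MemTotal:/ { printf "%.0f\n", $2 * 1024 / 4 }' "${1:-/proc/meminfo}"
}

# arc_max_modprobe_line [MEMINFO] (hypothetical helper): print an options
# line that sets zfs_arc_max to that value when the zfs module is loaded.
arc_max_modprobe_line() {
    printf 'options zfs zfs_arc_max=%s\n' "$(arc_max_bytes "$1")"
}
```

The line printed by `arc_max_modprobe_line` takes effect the next time the module is loaded. On builds that allow it, writing the byte value to /sys/module/zfs/parameters/zfs_arc_max changes it at runtime instead.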
This is a very serious issue for us - as it is a deal-breaking stability issue that is preventing more wide scale and more serious production rollout.
I will take a moment to celebrate ZFS on Linux's ability to maintain data integrity in the face of this issue - there are very few SAN systems that can routinely have the power literally hard cycled as much as we have done, and NEVER lose a single file or corrupt a single file. Utterly impressive :-)
Very much looking forward to solving this stability issue, as after months of trying every possible suggestion in the newsgroups, we have not found a viable workaround.
Cheers
Paul
root@gsan1-coy:~# cat /var/log/syslog | grep -e hung
Nov 24 17:15:23 gsan1-coy kernel: [232737.512137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 17:17:24 gsan1-coy kernel: [232857.504204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 17:19:24 gsan1-coy kernel: [232977.495599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 17:21:24 gsan1-coy kernel: [233097.487354] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 17:27:24 gsan1-coy kernel: [233457.413291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 17:43:22 gsan1-coy kernel: [ 361.018833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 18:08:18 gsan1-coy kernel: [ 361.024042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 18:08:18 gsan1-coy kernel: [ 361.051342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 18:42:29 gsan1-coy kernel: [ 241.052531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 18:44:29 gsan1-coy kernel: [ 361.045481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 18:56:33 gsan1-coy kernel: [ 1077.001800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 18:58:28 gsan1-coy kernel: [ 1196.981191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 18:58:28 gsan1-coy kernel: [ 1196.996135] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 19:00:25 gsan1-coy kernel: [ 1316.976739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 19:00:26 gsan1-coy kernel: [ 1316.994442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 19:02:25 gsan1-coy kernel: [ 1436.982179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 19:16:25 gsan1-coy kernel: [ 2276.771722] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 19:29:28 gsan1-coy kernel: [ 3059.941819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
root@gsan1-coy:~# cat /var/log/syslog | grep stall
Nov 24 18:54:25 gsan1-coy kernel: [ 432.316135] INFO: rcu_sched_state detected stall on CPU 2 (t=15000 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 432.324129] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 2} (detected by 7, t=15002 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 575.229472] INFO: rcu_sched_state detected stall on CPU 6 (t=15000 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 575.229476] INFO: rcu_sched_state detected stall on CPU 2 (t=15000 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 575.229480] INFO: rcu_sched_state detected stall on CPU 7 (t=15000 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 575.229485] INFO: rcu_sched_state detected stall on CPU 5 (t=15000 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 575.229489] INFO: rcu_sched_state detected stall on CPU 1 (t=15000 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 575.229494] INFO: rcu_sched_state detected stall on CPU 4 (t=15000 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 575.229497] INFO: rcu_sched_state detected stall on CPU 3 (t=15000 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 755.300583] INFO: rcu_sched_state detected stall on CPU 6 (t=60030 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 755.300589] INFO: rcu_sched_state detected stall on CPU 5 (t=60030 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 755.300593] INFO: rcu_sched_state detected stall on CPU 7 (t=60030 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 755.300597] INFO: rcu_sched_state detected stall on CPU 3 (t=60030 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 755.300601] INFO: rcu_sched_state detected stall on CPU 2 (t=60030 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 755.300604] INFO: rcu_sched_state detected stall on CPU 1 (t=60030 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 755.300609] INFO: rcu_sched_state detected stall on CPU 4 (t=60030 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 935.371591] INFO: rcu_sched_state detected stall on CPU 5 (t=105060 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 935.371596] INFO: rcu_sched_state detected stall on CPU 6 (t=105060 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 935.371601] INFO: rcu_sched_state detected stall on CPU 7 (t=105060 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 935.371605] INFO: rcu_sched_state detected stall on CPU 1 (t=105060 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 935.371608] INFO: rcu_sched_state detected stall on CPU 2 (t=105060 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 935.371612] INFO: rcu_sched_state detected stall on CPU 3 (t=105060 jiffies)
Nov 24 18:54:25 gsan1-coy kernel: [ 935.371616] INFO: rcu_sched_state detected stall on CPU 4 (t=105060 jiffies)
Nov 24 19:25:28 gsan1-coy kernel: [ 2562.893218] INFO: rcu_sched_state detected stall on CPU 6 (t=15000 jiffies)
Nov 24 19:25:28 gsan1-coy kernel: [ 2562.901208] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 6} (detected by 5, t=15002 jiffies)
Nov 24 19:25:28 gsan1-coy kernel: [ 2742.970645] INFO: rcu_sched_state detected stall on CPU 6 (t=60032 jiffies)
Nov 24 19:25:28 gsan1-coy kernel: [ 2742.978638] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 6} (detected by 5, t=60034 jiffies)
Nov 24 19:38:48 gsan1-coy kernel: [ 3196.369340] INFO: rcu_sched_state detected stall on CPU 7 (t=15000 jiffies)
Nov 24 19:38:48 gsan1-coy kernel: [ 3196.369345] INFO: rcu_sched_state detected stall on CPU 5 (t=15000 jiffies)
Nov 24 19:38:48 gsan1-coy kernel: [ 3196.369349] INFO: rcu_sched_state detected stall on CPU 6 (t=15000 jiffies)
Nov 24 19:38:48 gsan1-coy kernel: [ 3196.369353] INFO: rcu_sched_state detected stall on CPU 1 (t=15000 jiffies)
Nov 24 19:38:48 gsan1-coy kernel: [ 3196.369357] INFO: rcu_sched_state detected stall on CPU 2 (t=15000 jiffies)
Nov 24 19:38:48 gsan1-coy kernel: [ 3196.369362] INFO: rcu_sched_state detected stall on CPU 4 (t=15000 jiffies)
Nov 24 19:38:48 gsan1-coy kernel: [ 3196.369365] INFO: rcu_sched_state detected stall on CPU 3 (t=15000 jiffies)
Nov 24 19:38:48 gsan1-coy kernel: [ 3196.369368] INFO: rcu_sched_state detected stall on CPU 0 (t=15000 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3376.440249] INFO: rcu_sched_state detected stall on CPU 6 (t=60030 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3376.440255] INFO: rcu_sched_state detected stall on CPU 7 (t=60030 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3376.440259] INFO: rcu_sched_state detected stall on CPU 5 (t=60030 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3376.440263] INFO: rcu_sched_state detected stall on CPU 1 (t=60030 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3376.440268] INFO: rcu_sched_state detected stall on CPU 4 (t=60030 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3376.440271] INFO: rcu_sched_state detected stall on CPU 3 (t=60030 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3376.440275] INFO: rcu_sched_state detected stall on CPU 2 (t=60030 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3376.440278] INFO: rcu_sched_state detected stall on CPU 0 (t=60030 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3556.510874] INFO: rcu_sched_state detected stall on CPU 7 (t=105060 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3556.510879] INFO: rcu_sched_state detected stall on CPU 6 (t=105060 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3556.510884] INFO: rcu_sched_state detected stall on CPU 5 (t=105060 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3556.510888] INFO: rcu_sched_state detected stall on CPU 3 (t=105060 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3556.510892] INFO: rcu_sched_state detected stall on CPU 1 (t=105060 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3556.510896] INFO: rcu_sched_state detected stall on CPU 4 (t=105060 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3556.510899] INFO: rcu_sched_state detected stall on CPU 2 (t=105060 jiffies)
Nov 24 19:39:04 gsan1-coy kernel: [ 3556.510903] INFO: rcu_sched_state detected stall on CPU 0 (t=105060 jiffies)
Nov 24 20:21:30 gsan1-coy kernel: [ 5914.420208] INFO: rcu_sched_state detected stall on CPU 1 (t=15000 jiffies)
Nov 24 20:21:30 gsan1-coy kernel: [ 6094.494578] INFO: rcu_sched_state detected stall on CPU 1 (t=60031 jiffies)
Nov 24 22:13:27 gsan1-coy kernel: [12757.532111] INFO: rcu_sched_state detected stall on CPU 2 (t=15000 jiffies)
Nov 24 22:13:27 gsan1-coy kernel: [12757.540104] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 2} (detected by 7, t=15002 jiffies)
Nov 24 22:47:10 gsan1-coy kernel: [14827.405601] INFO: rcu_sched_state detected stall on CPU 7 (t=15000 jiffies)
Nov 24 22:47:10 gsan1-coy kernel: [14827.405606] INFO: rcu_sched_state detected stall on CPU 6 (t=15000 jiffies)
Nov 24 22:47:10 gsan1-coy kernel: [14827.405611] INFO: rcu_sched_state detected stall on CPU 5 (t=15000 jiffies)
Nov 24 22:47:10 gsan1-coy kernel: [14827.405614] INFO: rcu_sched_state detected stall on CPU 1 (t=15000 jiffies)
Nov 24 22:47:10 gsan1-coy kernel: [14827.405618] INFO: rcu_sched_state detected stall on CPU 2 (t=15000 jiffies)
Nov 24 22:47:10 gsan1-coy kernel: [14827.405622] INFO: rcu_sched_state detected stall on CPU 3 (t=15000 jiffies)
Nov 24 22:47:10 gsan1-coy kernel: [14827.405625] INFO: rcu_sched_state detected stall on CPU 0 (t=15000 jiffies)
Nov 24 22:47:10 gsan1-coy kernel: [14827.405629] INFO: rcu_sched_state detected stall on CPU 4 (t=15000 jiffies)
Nov 25 00:19:28 gsan1-coy kernel: [20240.935394] INFO: rcu_sched_state detected stall on CPU 5 (t=15000 jiffies)
Nov 25 00:19:28 gsan1-coy kernel: [20421.005987] INFO: rcu_sched_state detected stall on CPU 5 (t=60030 jiffies)
Nov 25 00:26:41 gsan1-coy kernel: [20888.081850] INFO: rcu_sched_state detected stall on CPU 5 (t=15000 jiffies)
Nov 25 00:26:41 gsan1-coy kernel: [20888.081855] INFO: rcu_sched_state detected stall on CPU 6 (t=15000 jiffies)
Nov 25 00:26:41 gsan1-coy kernel: [20888.081860] INFO: rcu_sched_state detected stall on CPU 7 (t=15000 jiffies)
Nov 25 00:26:41 gsan1-coy kernel: [20888.081864] INFO: rcu_sched_state detected stall on CPU 2 (t=15000 jiffies)
Nov 25 00:26:41 gsan1-coy kernel: [20888.081868] INFO: rcu_sched_state detected stall on CPU 3 (t=15000 jiffies)
Nov 25 00:26:41 gsan1-coy kernel: [20888.081872] INFO: rcu_sched_state detected stall on CPU 4 (t=15000 jiffies)
Nov 25 00:26:41 gsan1-coy kernel: [20888.081875] INFO: rcu_sched_state detected stall on CPU 1 (t=15000 jiffies)
Nov 25 00:29:42 gsan1-coy kernel: [21068.152456] INFO: rcu_sched_state detected stall on CPU 6 (t=60030 jiffies)
Nov 25 00:29:42 gsan1-coy kernel: [21068.152461] INFO: rcu_sched_state detected stall on CPU 5 (t=60030 jiffies)
Nov 25 00:29:42 gsan1-coy kernel: [21068.152466] INFO: rcu_sched_state detected stall on CPU 7 (t=60030 jiffies)
Nov 25 00:29:42 gsan1-coy kernel: [21068.152470] INFO: rcu_sched_state detected stall on CPU 4 (t=60030 jiffies)
Nov 25 00:29:42 gsan1-coy kernel: [21068.152474] INFO: rcu_sched_state detected stall on CPU 2 (t=60030 jiffies)
Nov 25 00:29:42 gsan1-coy kernel: [21068.152477] INFO: rcu_sched_state detected stall on CPU 1 (t=60030 jiffies)
Nov 25 00:29:42 gsan1-coy kernel: [21068.152481] INFO: rcu_sched_state detected stall on CPU 3 (t=60030 jiffies)
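A long grep listing like the one above is hard to eyeball, so here is a small sketch that tallies the stall reports per CPU. The function name `stalls_per_cpu` is invented for this example, and it counts only the "detected stall on CPU N" form, not the "detected stalls on CPUs/tasks" summary lines.

```shell
#!/bin/sh
# stalls_per_cpu (hypothetical helper): read syslog on stdin and print one
# "CPU <n>: <count>" line per CPU, sorted by CPU number.
stalls_per_cpu() {
    sed -n 's/.*detected stall on CPU \([0-9][0-9]*\) .*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "CPU %s: %s\n", $2, $1 }'
}
```

Usage would be along the lines of `stalls_per_cpu < /var/log/syslog`.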