
zfs list sometimes hangs with SPL panic in zfs_ioc_pool_stats #3405

Closed

dechamps opened this issue May 12, 2015 · 10 comments
@dechamps
Contributor

A week ago, I upgraded from zfs-0.6.3 to zfs-0.6.4.1. I have some crontab entry that runs the following every minute as part of a longer script:

zfs list -H -o name -t filesystem

At first everything went just fine, but after ~48 hours of uptime (so after ~3000 invocations), the command hung with the following in dmesg:

VERIFY3(nvlist_pack(nvl, &packed, sizep, 0, 0x0000) == 0) failed (14 == 0)
PANIC at fnvpair.c:81:fnvlist_pack()
Showing stack for process 10070
CPU: 2 PID: 10070 Comm: zfs Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.7-ckt9-3
Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
ffff880291c6fe18 ffffffff8150ac96 ffffffffa017e057 ffffffffa00f990f
ffff8801376f8700 ffffffff00000030 ffff880291c6fe28 ffff880291c6fdc8
2833594649524556 705f7473696c766e 2c6c766e286b6361 64656b6361702620
Call Trace:
 [<ffffffff8150ac96>] ? dump_stack+0x41/0x51
 [<ffffffffa00f990f>] ? spl_panic+0xbf/0xf0 [spl]
 [<ffffffffa01784f4>] ? nvlist_common.part.102+0xe4/0x200 [znvpair]
 [<ffffffffa01787f5>] ? nvlist_xpack+0x115/0x120 [znvpair]
 [<ffffffffa0178ed7>] ? fnvlist_pack+0x67/0x80 [znvpair]
 [<ffffffffa032301d>] ? put_nvlist+0x5d/0xa0 [zfs]
 [<ffffffffa032452b>] ? zfs_ioc_pool_stats+0x3b/0x60 [zfs]
 [<ffffffffa0327309>] ? zfsdev_ioctl+0x489/0x4c0 [zfs]
 [<ffffffff811ba2ff>] ? do_vfs_ioctl+0x2cf/0x4b0
 [<ffffffff811ba561>] ? SyS_ioctl+0x81/0xa0
 [<ffffffff81512e68>] ? page_fault+0x28/0x30
 [<ffffffff81510e4d>] ? system_call_fast_compare_end+0x10/0x15

The zfs list process then became stuck and unkillable; I had to reboot the system to make it go away. It's worth noting that this didn't seem to affect anything else, though - in fact, I was able to run the same command just fine even while the other zfs list process was stuck.

I suspect this is a regression from 0.6.3 to 0.6.4.1, because this absolutely never happened before I upgraded.

@dechamps dechamps changed the title from "zfs list sometimes results in SPL panic in zfs_ioc_pool_stats" to "zfs list sometimes hangs with SPL panic in zfs_ioc_pool_stats" on May 12, 2015
@dechamps
Contributor Author

I should note that I have other crontab entries that make snapshots every 5 minutes, so it might be some kind of race condition between zfs list and zfs snapshot.
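
For illustration, the two jobs are scheduled roughly like this (the script names are placeholders; the real scripts are longer):

* * * * * /usr/local/sbin/zfs-list-report.sh        # runs the zfs list command above every minute
*/5 * * * * /usr/local/sbin/zfs-make-snapshots.sh   # runs zfs snapshot every 5 minutes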

@nedbass
Contributor

nedbass commented May 12, 2015

Possibly related to #3335.

@behlendorf
Contributor

EFAULT from nvlist_pack() sure does suggest that something was concurrently messing with the config nvlist. At this point in the code, though, we should be working on a private copy, and it looks like things were locked properly, so I must be missing something.
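
For reference, the reason a transient packing error shows up as a panic rather than an ioctl failure is that fnvlist_pack() asserts that nvlist_pack() succeeds. A rough sketch (paraphrased, not the verbatim fnvpair.c source):

char *
fnvlist_pack(nvlist_t *nvl, size_t *sizep)
{
        char *packed = NULL;

        /* This assertion is what fires the "failed (14 == 0)" message above */
        VERIFY0(nvlist_pack(nvl, &packed, sizep, NV_ENCODE_NATIVE, KM_SLEEP));
        return (packed);
}

Presumably, if the nvlist changes between the internal size calculation and the actual packing (i.e. something else is modifying it concurrently), the pack overruns its precomputed buffer and errors out with EFAULT, which then trips the VERIFY.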

@dechamps
Contributor Author

Got it again after 64 hours of uptime. This time it got blocked in zfs snapshot (which apparently also calls zfs_ioc_pool_stats at some point). Same error, same stack trace.

@dechamps
Contributor Author

I think I managed to make it reproducible. Running the following script in a Debian Jessie VM (4 virtual CPUs) with zfs-0.6.4.1 manages to trigger the panic within seconds:

#!/bin/bash

set -e

dd if=/dev/zero of=/tmp/disk bs=1 count=1 seek="$((3 * 1024 * 1024 * 1024))"
zpool create racetest /tmp/disk

spawn_list_thread() {
    local ID="$1"
    while :
    do
        echo LIST $ID
        zfs list -H -o name -t filesystem >/dev/null
    done &
}

spawn_snapshot_thread() {
    local ID="$1"
    zfs create "racetest/$ID"
    while :
    do
        echo SNAPSHOT $ID
        zfs snapshot "racetest/${ID}@$(date '+%s')_$RANDOM"
    done &
}

# Spawn 8 zfs list threads, and 64 zfs snapshot threads
for I in $(seq 1 8)
do  
    spawn_list_thread "$I"
done
for I in $(seq 1 64)
do
    spawn_snapshot_thread "$I"
done

wait

Now I can start bisecting the thing.
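
For anyone re-running this: after interrupting the script (Ctrl-C), cleanup is roughly the following, assuming no zfs process got stuck (if one did, a reboot is needed anyway):

zpool destroy racetest
rm -f /tmp/disk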

@dechamps
Contributor Author

Okay, this is interesting. Starting from zfs-0.6.4.1, reverting the offending commit from #3335 makes the issue impossible to reproduce using the above script. By that I mean, the following fixes the issue:

git revert 417104bdd3c7ce07ec58674dd078f9891c3bc780

However, @nedbass's fix in #3339 doesn't seem to work in my case. By that I mean, the following does NOT fix the issue:

git cherry-pick 22095809d18851ecfb75dacfcadf18b8cde326f5

The error and stack trace are still exactly the same with 2209580.

@nigoroll

I got this with 0.6.4-1-2:

Message from syslogd@haggis at May 17 14:17:48 ...
 kernel:[ 1575.105017] VERIFY3(nvlist_pack(nvl, &packed, sizep, 0, 0x0000) == 0) failed (14 == 0)

Message from syslogd@haggis at May 17 14:17:48 ...
 kernel:[ 1575.105021] PANIC at fnvpair.c:81:fnvlist_pack()
May 17 14:17:48 haggis kernel: [ 1575.105022] Showing stack for process 6606
May 17 14:17:48 haggis kernel: [ 1575.105025] CPU: 0 PID: 6606 Comm: zfs Tainted: P           O 3.13-1-amd64 #1 Debian 3.13.10-1
May 17 14:17:48 haggis kernel: [ 1575.105026] Hardware name: LENOVO 20BGCTO1WW/20BGCTO1WW, BIOS GNET32WW (1.14 ) 12/09/2013
May 17 14:17:48 haggis kernel: [ 1575.105028]  ffff880296d41e10 ffffffff814a1997 ffffffffa083a9e8 ffffffffa080d6da
May 17 14:17:48 haggis kernel: [ 1575.105031]  ffff88028d4f0f00 ffffffff00000030 ffff880296d41e20 ffff880296d41dc0
May 17 14:17:48 haggis kernel: [ 1575.105034]  2833594649524556 705f7473696c766e 2c6c766e286b6361 64656b6361702620
May 17 14:17:48 haggis kernel: [ 1575.105037] Call Trace:
May 17 14:17:48 haggis kernel: [ 1575.105042]  [<ffffffff814a1997>] ? dump_stack+0x41/0x51
May 17 14:17:48 haggis kernel: [ 1575.105052]  [<ffffffffa080d6da>] ? spl_panic+0xba/0xf0 [spl]
May 17 14:17:48 haggis kernel: [ 1575.105057]  [<ffffffffa0835434>] ? nvlist_common.part.102+0xe4/0x200 [znvpair]
May 17 14:17:48 haggis kernel: [ 1575.105061]  [<ffffffffa0835725>] ? nvlist_xpack+0x115/0x120 [znvpair]
May 17 14:17:48 haggis kernel: [ 1575.105065]  [<ffffffffa0835de2>] ? fnvlist_pack+0x62/0x70 [znvpair]
May 17 14:17:48 haggis kernel: [ 1575.105079]  [<ffffffffa0977075>] ? put_nvlist+0x55/0xa0 [zfs]
May 17 14:17:48 haggis kernel: [ 1575.105089]  [<ffffffffa0978546>] ? zfs_ioc_pool_stats+0x36/0x60 [zfs]
May 17 14:17:48 haggis kernel: [ 1575.105099]  [<ffffffffa097b249>] ? zfsdev_ioctl+0x479/0x4b0 [zfs]
May 17 14:17:48 haggis kernel: [ 1575.105102]  [<ffffffff8118b94f>] ? do_vfs_ioctl+0x2cf/0x4a0
May 17 14:17:48 haggis kernel: [ 1575.105104]  [<ffffffff8118bba0>] ? SyS_ioctl+0x80/0xa0
May 17 14:17:48 haggis kernel: [ 1575.105108]  [<ffffffff814aeb79>] ? system_call_fastpath+0x16/0x1b

This happened when running zfs destroy and zfs list concurrently, using the following snippet:

    # ${snapprefix} and ${keep} are set earlier in the (longer) script.
    for baseds in "$@"; do
        for ds in $(zfs list -rt filesystem,volume -H -o name "${baseds}"); do
            # Collect the snapshots of $ds whose names match the prefix.
            sn=($(zfs list -H -d 1 -t snapshot -o name "$ds" | egrep "@${snapprefix}"))
            if [[ ${#sn[@]} -le ${keep} ]]; then
                continue
            fi
            # Destroy all but the ${keep} most recent, up to 20 at a time in the background.
            l=$((${#sn[@]} - ${keep}))
            for ((i = 0; i < l; i++)); do
                zfs destroy "${sn[$i]}" &
                if [[ $(($i % 20)) -eq 0 ]]; then
                    wait
                fi
            done
        done
    done
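
(The snippet is invoked with one or more base dataset names as arguments, e.g. snapclean.sh tank/home tank/vm, where the script name and dataset names are just examples.)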

@behlendorf
Contributor

@dechamps thanks for confirming where this was accidentally introduced. Looks like we overlooked something and we'll definitely want to get this resolved in the next point release. Thanks for the reproducer.

@nedbass
Contributor

nedbass commented May 18, 2015

@dechamps the patch you cherry-picked still has a race that was fixed before merging to master. Please try 4eb30c6.
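
That is, on top of zfs-0.6.4.1, something along the lines of:

git cherry-pick 4eb30c6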

@behlendorf
Contributor

Closing; this was fixed in 0.6.4.2.
