
zfs list sometimes hangs with SPL panic in zfs_ioc_pool_stats #3405

Closed

dechamps opened this issue May 12, 2015 · 10 comments
@dechamps
Contributor

A week ago, I upgraded from zfs-0.6.3 to zfs-0.6.4.1. I have some crontab entry that runs the following every minute as part of a longer script:

zfs list -H -o name -t filesystem

At first everything went just fine, but after ~48 hours of uptime (so after ~3000 invocations), the command hung with the following in dmesg:

VERIFY3(nvlist_pack(nvl, &packed, sizep, 0, 0x0000) == 0) failed (14 == 0)
PANIC at fnvpair.c:81:fnvlist_pack()
Showing stack for process 10070
CPU: 2 PID: 10070 Comm: zfs Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.7-ckt9-3
Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
ffff880291c6fe18 ffffffff8150ac96 ffffffffa017e057 ffffffffa00f990f
ffff8801376f8700 ffffffff00000030 ffff880291c6fe28 ffff880291c6fdc8
2833594649524556 705f7473696c766e 2c6c766e286b6361 64656b6361702620
Call Trace:
 [<ffffffff8150ac96>] ? dump_stack+0x41/0x51
 [<ffffffffa00f990f>] ? spl_panic+0xbf/0xf0 [spl]
 [<ffffffffa01784f4>] ? nvlist_common.part.102+0xe4/0x200 [znvpair]
 [<ffffffffa01787f5>] ? nvlist_xpack+0x115/0x120 [znvpair]
 [<ffffffffa0178ed7>] ? fnvlist_pack+0x67/0x80 [znvpair]
 [<ffffffffa032301d>] ? put_nvlist+0x5d/0xa0 [zfs]
 [<ffffffffa032452b>] ? zfs_ioc_pool_stats+0x3b/0x60 [zfs]
 [<ffffffffa0327309>] ? zfsdev_ioctl+0x489/0x4c0 [zfs]
 [<ffffffff811ba2ff>] ? do_vfs_ioctl+0x2cf/0x4b0
 [<ffffffff811ba561>] ? SyS_ioctl+0x81/0xa0
 [<ffffffff81512e68>] ? page_fault+0x28/0x30
 [<ffffffff81510e4d>] ? system_call_fast_compare_end+0x10/0x15

The zfs list process then became stuck and unkillable; I had to reboot the system to make it go away. It's worth noting that this didn't seem to affect anything else, though - in fact, I was able to run the same command just fine even while the other zfs list process was stuck.

I suspect this is a regression from 0.6.3 to 0.6.4.1, because this absolutely never happened before I upgraded.

@dechamps dechamps changed the title from "zfs list sometimes results in SPL panic in zfs_ioc_pool_stats" to "zfs list sometimes hangs with SPL panic in zfs_ioc_pool_stats" on May 12, 2015
@dechamps
Contributor Author

I should note that I have other crontab entries that make snapshots every 5 minutes, so it might be some kind of race condition between zfs list and zfs snapshot.
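
For illustration, the two jobs are scheduled roughly like this (the script names are placeholders; the real scripts are longer):

* * * * * /usr/local/sbin/zfs-list-report.sh        # runs the zfs list command above every minute
*/5 * * * * /usr/local/sbin/zfs-make-snapshots.sh   # runs zfs snapshot every 5 minutes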

@nedbass
Contributor

nedbass commented May 12, 2015

Possibly related to #3335.

@behlendorf
Contributor

EFAULT from nvlist_pack() sure does suggest that something was concurrently messing with the config nvlist. At this point in the code, though, we should be working on a private copy, and it looks like things were locked properly, so I must be missing something.
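
For reference, the reason a transient packing error shows up as a panic rather than an ioctl failure is that fnvlist_pack() asserts that nvlist_pack() succeeds. A rough sketch (paraphrased, not the verbatim fnvpair.c source):

char *
fnvlist_pack(nvlist_t *nvl, size_t *sizep)
{
        char *packed = NULL;

        /* This assertion is what fires the "failed (14 == 0)" message above */
        VERIFY0(nvlist_pack(nvl, &packed, sizep, NV_ENCODE_NATIVE, KM_SLEEP));
        return (packed);
}

Presumably, if the nvlist changes between the internal size calculation and the actual packing (i.e. something else is modifying it concurrently), the pack overruns its precomputed buffer and errors out with EFAULT, which then trips the VERIFY.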

@dechamps
Contributor Author

Got it again after 64 hours of uptime. This time it got blocked in zfs snapshot (which apparently also calls zfs_ioc_pool_stats at some point). Same error, same stack trace.

@dechamps
Contributor Author

I think I managed to make it reproducible. Running the following script in a Debian Jessie VM (4 virtual CPUs) with zfs-0.6.4.1 manages to trigger the panic within seconds:

#!/bin/bash

set -e

dd if=/dev/zero of=/tmp/disk bs=1 count=1 seek="$((3 * 1024 * 1024 * 1024))"
zpool create racetest /tmp/disk

spawn_list_thread() {
    local ID="$1"
    while :
    do
        echo LIST $ID
        zfs list -H -o name -t filesystem >/dev/null
    done &
}

spawn_snapshot_thread() {
    local ID="$1"
    zfs create "racetest/$ID"
    while :
    do
        echo SNAPSHOT $ID
        zfs snapshot "racetest/${ID}@$(date '+%s')_$RANDOM"
    done &
}

# Spawn 8 zfs list threads, and 64 zfs snapshot threads
for I in $(seq 1 8)
do  
    spawn_list_thread "$I"
done
for I in $(seq 1 64)
do
    spawn_snapshot_thread "$I"
done

wait

Now I can start bisecting the thing.
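
For anyone re-running this: after interrupting the script (Ctrl-C), cleanup is roughly the following, assuming no zfs process got stuck (if one did, a reboot is needed anyway):

zpool destroy racetest
rm -f /tmp/disk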

@dechamps
Contributor Author

Okay, this is interesting. Starting from zfs-0.6.4.1, reverting the offending commit from #3335 makes the issue impossible to reproduce using the above script. By that I mean, the following fixes the issue:

git revert 417104bdd3c7ce07ec58674dd078f9891c3bc780

However, @nedbass's fix in #3339 doesn't seem to work in my case. By that I mean, the following does NOT fix the issue:

git cherry-pick 22095809d18851ecfb75dacfcadf18b8cde326f5

The error and stack trace are still exactly the same with 2209580.

@nigoroll

I got this with 0.6.4-1-2:

Message from syslogd@haggis at May 17 14:17:48 ...
 kernel:[ 1575.105017] VERIFY3(nvlist_pack(nvl, &packed, sizep, 0, 0x0000) == 0) failed (14 == 0)

Message from syslogd@haggis at May 17 14:17:48 ...
 kernel:[ 1575.105021] PANIC at fnvpair.c:81:fnvlist_pack()
May 17 14:17:48 haggis kernel: [ 1575.105022] Showing stack for process 6606
May 17 14:17:48 haggis kernel: [ 1575.105025] CPU: 0 PID: 6606 Comm: zfs Tainted: P           O 3.13-1-amd64 #1 Debian 3.13.10-1
May 17 14:17:48 haggis kernel: [ 1575.105026] Hardware name: LENOVO 20BGCTO1WW/20BGCTO1WW, BIOS GNET32WW (1.14 ) 12/09/2013
May 17 14:17:48 haggis kernel: [ 1575.105028]  ffff880296d41e10 ffffffff814a1997 ffffffffa083a9e8 ffffffffa080d6da
May 17 14:17:48 haggis kernel: [ 1575.105031]  ffff88028d4f0f00 ffffffff00000030 ffff880296d41e20 ffff880296d41dc0
May 17 14:17:48 haggis kernel: [ 1575.105034]  2833594649524556 705f7473696c766e 2c6c766e286b6361 64656b6361702620
May 17 14:17:48 haggis kernel: [ 1575.105037] Call Trace:
May 17 14:17:48 haggis kernel: [ 1575.105042]  [<ffffffff814a1997>] ? dump_stack+0x41/0x51
May 17 14:17:48 haggis kernel: [ 1575.105052]  [<ffffffffa080d6da>] ? spl_panic+0xba/0xf0 [spl]
May 17 14:17:48 haggis kernel: [ 1575.105057]  [<ffffffffa0835434>] ? nvlist_common.part.102+0xe4/0x200 [znvpair]
May 17 14:17:48 haggis kernel: [ 1575.105061]  [<ffffffffa0835725>] ? nvlist_xpack+0x115/0x120 [znvpair]
May 17 14:17:48 haggis kernel: [ 1575.105065]  [<ffffffffa0835de2>] ? fnvlist_pack+0x62/0x70 [znvpair]
May 17 14:17:48 haggis kernel: [ 1575.105079]  [<ffffffffa0977075>] ? put_nvlist+0x55/0xa0 [zfs]
May 17 14:17:48 haggis kernel: [ 1575.105089]  [<ffffffffa0978546>] ? zfs_ioc_pool_stats+0x36/0x60 [zfs]
May 17 14:17:48 haggis kernel: [ 1575.105099]  [<ffffffffa097b249>] ? zfsdev_ioctl+0x479/0x4b0 [zfs]
May 17 14:17:48 haggis kernel: [ 1575.105102]  [<ffffffff8118b94f>] ? do_vfs_ioctl+0x2cf/0x4a0
May 17 14:17:48 haggis kernel: [ 1575.105104]  [<ffffffff8118bba0>] ? SyS_ioctl+0x80/0xa0
May 17 14:17:48 haggis kernel: [ 1575.105108]  [<ffffffff814aeb79>] ? system_call_fastpath+0x16/0x1b

This happened when running zfs destroy and zfs list concurrently, using the following snippet:

    # ${snapprefix} and ${keep} are set earlier in the (longer) script.
    for baseds in "$@"; do
        for ds in $(zfs list -rt filesystem,volume -H -o name "${baseds}"); do
            # Collect the snapshots of $ds whose names match the prefix.
            sn=($(zfs list -H -d 1 -t snapshot -o name "$ds" | egrep "@${snapprefix}"))
            if [[ ${#sn[@]} -le ${keep} ]]; then
                continue
            fi
            # Destroy all but the ${keep} most recent, up to 20 at a time in the background.
            l=$((${#sn[@]} - ${keep}))
            for ((i = 0; i < l; i++)); do
                zfs destroy "${sn[$i]}" &
                if [[ $(($i % 20)) -eq 0 ]]; then
                    wait
                fi
            done
        done
    done
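
(The snippet is invoked with one or more base dataset names as arguments, e.g. snapclean.sh tank/home tank/vm, where the script name and dataset names are just examples.)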

@behlendorf
Contributor

@dechamps thanks for confirming where this was accidentally introduced. Looks like we overlooked something and we'll definitely want to get this resolved in the next point release. Thanks for the reproducer.

@nedbass
Contributor

nedbass commented May 18, 2015

@dechamps the patch you cherry-picked still has a race that was fixed before merging to master. Please try 4eb30c6.
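
That is, on top of zfs-0.6.4.1, something along the lines of:

git cherry-pick 4eb30c6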

@behlendorf
Contributor

Closing; this was fixed in 0.6.4.2.
