Integer overflow in dmu.c #8778

mazouffre · 2019-05-21T13:44:20Z

System information

Type	Version/Name
Distribution Name	openSUSE Tumbleweed
Distribution Version	20190423
Linux Kernel	4.19.42 with MPTCP (https://www.multipath-tcp.org/)
Architecture	x86_64
ZFS Version	0.8.0-rc5
SPL Version	0.8.0-rc5

Describe the problem you're observing

I encountered a kernel panic (divide-by-zero exception) when importing a specific pool. The panic was triggered while processing one entry in the pool's delete queue. I managed to import the pool with the module parameter zfs_unlink_suspend_progress set. The zpool status command reported that the pool was healthy, and a scrub detected no errors.

After digging, the panic seems to come from an integer overflow on this line: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/dmu.c#L723.

	uint64_t iblkrange =
	    dn->dn_datablksz * EPB(dn->dn_indblkshift, SPA_BLKPTRSHIFT);

The arithmetic is done in unsigned 32-bit, so with dn->dn_indblkshift equal to 17 and SPA_BLKPTRSHIFT equal to 7, EPB() evaluates to 1024 and the largest value of dn->dn_datablksz that does not overflow is 4 MiB - 1.
But the record size of the pool is 4 MiB, hence the overflow and the panic. The manual of the zfs kernel module states that record sizes up to 16 MiB are supported.

The kernel log, with a few debug printk calls added to the get_next_chunk function in dmu.c:

[ 2899.818313] ZFS DEBUG: Passed get_next_chunk line 727
[ 2899.818457] ZFS DEBUG: dn->dn_datablksz: 4194304
[ 2899.818599] ZFS DEBUG: sizeof(dn->dn_datablksz): 4
[ 2899.818599] ZFS DEBUG: dn->dn_indblkshift: 17
[ 2899.818600] ZFS DEBUG: sizeof(dn->dn_indblkshift): 1
[ 2899.818601] ZFS DEBUG: iblkrange: 0
[ 2899.818601] ZFS DEBUG: minimum: 0
[ 2899.818602] ZFS DEBUG: EPB(dn->dn_indblkshift, SPA_BLKPTRSHIFT): 1024
[ 2899.818603] ZFS DEBUG: sizeof(EPB(dn->dn_indblkshift, SPA_BLKPTRSHIFT)): 4
[ 2899.818604] ZFS DEBUG: Passed get_next_chunk line 742

In my case, because of the overflow and the values of the other variables, there are several 0/0 operations on this line: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/dmu.c#L733. So, depending on the code generated by the compiler, a divide-by-zero exception may or may not be triggered. I have encountered both cases. I used a dump (zdb -bcc -x) of the pool to reproduce the bug at will.

After adding a cast:

	uint64_t iblkrange =
	    (uint64_t) dn->dn_datablksz * EPB(dn->dn_indblkshift, SPA_BLKPTRSHIFT);

the bug seems to be gone:

[10595.977775] ZFS DEBUG: dn->dn_datablksz: 4194304
[10595.977931] ZFS DEBUG: sizeof(dn->dn_datablksz): 4
[10595.978093] ZFS DEBUG: dn->dn_indblkshift: 17
[10595.978240] ZFS DEBUG: sizeof(dn->dn_indblkshift): 1
[10595.978407] ZFS DEBUG: iblkrange: 4294967296
[10595.978552] ZFS DEBUG: minimum: 0
[10595.978665] ZFS DEBUG: EPB(dn->dn_indblkshift, SPA_BLKPTRSHIFT): 1024
[10595.978906] ZFS DEBUG: sizeof(EPB(dn->dn_indblkshift, SPA_BLKPTRSHIFT)): 4
[10595.979139] ZFS DEBUG: Passed get_next_chunk line 738

I can import the pool without error.

Include any warning/errors/backtraces from the system logs

The kernel log of the panic:

[ 2545.992358] divide error: 0000 [#1] PREEMPT SMP NOPTI
[ 2546.001037] CPU: 1 PID: 15489 Comm: z_unlinked_drai Tainted: G           O      4.19.42-mptcp_zfs+ #1
[ 2546.001354] Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.60 02/24/2016
[ 2546.001730] RIP: 0010:dmu_free_long_range+0x153/0x510 [zfs]
[ 2546.001925] Code: 04 45 8b 6f 70 31 d2 4c 8d 54 05 00 41 0f b6 47 6b 4c 89 54 24 60 8d 48 01 41 d3 fe 8d 48 f9 41 d3 e5 4d 63 f6 4b 8d 44 2a ff <49> f7 f5 31 d2 49 89 c4 48 89 e8 49 f7 f5 31 d2 49 29 c4 4c 89 e0
[ 2546.002565] RSP: 0018:ffffa53da0f37ac8 EFLAGS: 00010247
[ 2546.002748] RAX: 0000000000bfffff RBX: ffff95e7e0d8e000 RCX: 000000000000000a
[ 2546.002995] RDX: 0000000000000000 RSI: 0000000014000000 RDI: ffff95e7e0d8e508
[ 2546.003238] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 2546.003478] R10: 0000000000c00000 R11: 0000000000000008 R12: ffff95e7c60bd000
[ 2546.003717] R13: 0000000000000000 R14: 0000000000000100 R15: ffff95e7bf1f3b00
[ 2546.003957] FS:  0000000000000000(0000) GS:ffff95e80b640000(0000) knlGS:0000000000000000
[ 2546.004229] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2546.004423] CR2: 00007f52f2f85a60 CR3: 000000005820a000 CR4: 00000000000006e0
[ 2546.004663] Call Trace:
[ 2546.004791]  ? zfs_znode_hold_enter+0x115/0x170 [zfs]
[ 2546.005001]  zfs_rmnode+0x260/0x330 [zfs]
[ 2546.005177]  ? zfs_zinactive+0xd7/0xf0 [zfs]
[ 2546.005361]  zfs_inactive+0x82/0x200 [zfs]
[ 2546.005504]  ? unmap_mapping_pages+0x5e/0x130
[ 2546.006257]  zpl_evict_inode+0x3c/0x50 [zfs]
[ 2546.006451]  evict+0xc4/0x190
[ 2546.006592]  zfs_unlinked_drain_task+0x89/0xf0 [zfs]
[ 2546.006764]  ? account_entity_dequeue+0x63/0xd0
[ 2546.006919]  ? dequeue_task_fair+0x137/0xe10
[ 2546.007065]  ? pick_next_task_fair+0x2b7/0x5d0
[ 2546.007218]  ? __switch_to+0x8c/0x470
[ 2546.007343]  ? __update_idle_core+0x20/0xb0
[ 2546.007487]  ? finish_task_switch+0x74/0x260
[ 2546.007635]  ? __schedule+0x2ac/0x8a0
[ 2546.007762]  ? __wake_up_common_lock+0x89/0xc0
[ 2546.007914]  ? remove_wait_queue+0x12/0x50
[ 2546.008060]  taskq_thread+0x2ca/0x490 [spl]
[ 2546.008204]  ? account_entity_dequeue+0x63/0xd0
[ 2546.008358]  ? wake_up_q+0x70/0x70
[ 2546.008479]  ? taskq_thread_should_stop+0x70/0x70 [spl]
[ 2546.008658]  kthread+0x112/0x130
[ 2546.008770]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 2546.008942]  ret_from_fork+0x1f/0x40
[ 2546.009065] Modules linked in: fuse mpt3sas raid_class af_packet iscsi_ibft iscsi_boot_sysfs vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) smsc47b397 joydev pktcdvd snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec intel_powerclamp snd_hda_core snd_hwdep coretemp snd_pcm kvm_intel kvm snd_timer hp_wmi sparse_keymap gpio_ich snd lpc_ich irqbypass pcc_cpufreq mfd_core mptctl i7core_edac soundcore pcspkr rfkill wmi_bmof acpi_cpufreq sch_fq_codel zfs btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq hid_generic usbhid crc32c_intel nouveau serio_raw video mxm_wmi i2c_algo_bit drm_kms_helper syscopyarea mptsas sysfillrect mptscsih sysimgblt fb_sys_fops mptbase xhci_pci firewire_ohci scsi_transport_iscsi ehci_pci uhci_hcd ttm scsi_transport_sas megaraid_sas sr_mod
[ 2546.011958]  tg3 firewire_core ehci_hcd cdrom xhci_hcd crc_itu_t libphy drm usbcore wmi button sg zunicode zlua sunrpc dm_mirror dm_region_hash dm_log zcommon znvpair zavl icp spl softdog iTCO_wdt iTCO_vendor_support tcp_highspeed tcp_illinois tcp_hybla tcp_htcp tcp_bic tcp_cdg tcp_dctcp tcp_yeah tcp_vegas tcp_westwood mptcp_wvegas mptcp_rr mptcp_redundant mptcp_olia mptcp_ndiffports mptcp_fullmesh mptcp_coupled mptcp_binder mptcp_balia dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 2546.014913] ---[ end trace b85db039b4f1f0a9 ]---

behlendorf · 2019-05-21T15:53:21Z

@mazouffre your analysis looks correct to me. Thank you very much for taking the time to get to the root cause of the issue. Would you mind opening a PR which adds the needed cast?

dn->dn_datablksz type is uint32_t and need to be casted to uint64_t to avoid an overflow when the record size is greater than 4 MiB.

Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olivier Mazouffre <[email protected]>
Closes #8778
Closes #8797

Issue: openzfs/zfs#8816 Issue: openzfs/zfs#8778 Bug: https://bugs.gentoo.org/635002 Package-Manager: Portage-2.3.67, Repoman-2.3.12 Signed-off-by: Georgy Yakovlev <[email protected]>

behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label May 21, 2019

mazouffre mentioned this issue May 23, 2019

Fix integer overflow in get_next_chunk() #8797

Merged

loli10K mentioned this issue May 29, 2019

Segmentation fault when deleting files that are bigger than 4MB #8832

Closed

behlendorf closed this as completed in #8797 May 29, 2019
