Kernel panic in dm_softirq_done [dm_mod] when rebooting target server on UEK6 with node.session.nr_sessions > 1 #27

Closed
mvelikikh opened this issue Aug 30, 2024 · 1 comment

mvelikikh commented Aug 30, 2024

Traces with the latest 5.4.17-2136.334.6.1.el8uek.x86_64 UEK6 kernel.

kernel NULL pointer dereference

[ 1924.256982] BUG: kernel NULL pointer dereference, address: 0000000000000058
[ 1924.260317] #PF: supervisor read access in kernel mode
[ 1924.260317] #PF: error_code(0x0000) - not-present page
[ 1924.260317] PGD 897b1e067 P4D 0 
[ 1924.260317] Oops: 0000 [#1] SMP NOPTI
[ 1924.260317] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.4.17-2136.334.6.1.el8uek.x86_64 #3
[ 1924.260317] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/13/2024
[ 1924.260317] RIP: 0010:dm_softirq_done+0x4f/0x240 [dm_mod]
[ 1924.260317] Code: 51 01 00 00 44 0f b6 bf 60 01 00 00 4d 8b ac 24 10 01 00 00 45 89 fe f6 47 1d 04 75 58 49 8b 7d 08 48 85 ff 74 4f 48 8b 47 08 <48> 8b 40 58 48 85 c0 74 42 49 8d 4d 50 44 89 fa 4c 89 e6 e8 69 ff
[ 1924.260317] RSP: 0018:ff66070e00210ee0 EFLAGS: 00010282
[ 1924.260317] RAX: 0000000000000000 RBX: ff3464adca1f0540 RCX: dead000000000122
[ 1924.260317] RDX: ff66070e00210f20 RSI: ff3464adca1f0598 RDI: ff66070e0009b040
[ 1924.260317] RBP: ff66070e00210f10 R08: ff3464ae1fbedfc0 R09: 0000000000000100
[ 1924.260317] R10: 0000000000000001 R11: 0000000000000230 R12: ff3464adc3ea0a80
[ 1924.260317] R13: ff3464adca1f0658 R14: 0000000000000000 R15: 0000000000000000
[ 1924.260317] FS:  0000000000000000(0000) GS:ff3464ae1fbc0000(0000) knlGS:0000000000000000
[ 1924.260317] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1924.260317] CR2: 0000000000000058 CR3: 000000087fd7e003 CR4: 0000000000361ee0
[ 1924.260317] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1924.260317] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1924.260317] Call Trace:
[ 1924.260317]  <IRQ>
[ 1924.260317]  ? show_regs.cold.12+0x1a/0x1c
[ 1924.260317]  ? __die+0x86/0xd2
[ 1924.260317]  ? no_context.isra.25+0x13f/0x552
[ 1924.260317]  ? ftrace_ops_assist_func+0x78/0x112
[ 1924.260317]  ? __bad_area_nosemaphore+0x43/0x1d8
[ 1924.260317]  ? bad_area_nosemaphore+0x16/0x1c
[ 1924.260317]  ? __do_page_fault+0x2c8/0x4b8
[ 1924.260317]  ? do_page_fault+0x36/0x122
[ 1924.358485]  ? page_fault+0x13d/0x142
[ 1924.358485]  ? dm_softirq_done+0x4f/0x240 [dm_mod]
[ 1924.363490]  blk_done_softirq+0xa5/0xd1
[ 1924.363490]  __do_softirq+0xd4/0x2cc
[ 1924.368482]  irq_exit+0x103/0x108
[ 1924.370487]  do_IRQ+0x59/0xe4
[ 1924.373481]  common_interrupt+0xf/0x1d2
[ 1924.373481]  </IRQ>
[ 1924.373481] RIP: 0010:native_safe_halt+0x12/0x18
[ 1924.373481] Code: 48 02 20 48 8b 00 a8 08 75 bc e9 60 ff ff ff cc cc cc cc cc cc cc cc cc 55 48 89 e5 0f 1f 44 00 00 0f 00 2d b2 c3 57 00 fb f4 <5d> c3 cc cc cc cc 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00
[ 1924.388482] RSP: 0018:ff66070e000d3e70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffde
[ 1924.390490] RAX: ffffffffa5a8fc90 RBX: 0000000000000007 RCX: 0000000000000001
[ 1924.390490] RDX: 000000000048dd7a RSI: ff66070e000d3e60 RDI: 0000000000000000
[ 1924.400415] RBP: ff66070e000d3e70 R08: fffffffffff396a8 R09: 00ea9b5c0bd41f3f
[ 1924.402483] R10: 00000000000000ec R11: 000000000000075a R12: 0000000000000007
[ 1924.405481] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1924.405481]  ? __sched_text_end+0x1/0x0
[ 1924.405481]  default_idle+0x22/0x151
[ 1924.416481]  arch_cpu_idle+0x15/0x1b
[ 1924.417974]  default_idle_call+0x30/0x36
[ 1924.420489]  do_idle+0x1e3/0x25a
[ 1924.422485]  cpu_startup_entry+0x1d/0x1f
[ 1924.422485]  start_secondary+0x177/0x1cb
[ 1924.422485]  secondary_startup_64+0xb6/0xb6
[ 1924.430482] Modules linked in: dm_queue_length iscsi_tcp libiscsi_tcp libiscsi target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod dm_multipath vxlan ip6_udp_tunnel udp_tunnel act_mirred sch_ingress ifb cls_u32 act_gact cls_bpf sch_hfsc rfkill scsi_transport_iscsi nft_counter nft_chain_nat xt_nat nf_nat nft_compat sunrpc intel_rapl_msr intel_rapl_common nfit libnvdimm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vfat fat mlx5_ib ib_uverbs aesni_intel crypto_simd cryptd glue_helper ib_core pcspkr joydev hv_utils sch_fq_codel binfmt_misc xfs mlx5_core mlxfw tls psample sr_mod cdrom sd_mod pci_hyperv pci_hyperv_intf sg serio_raw hv_storvsc hv_netvsc hyperv_keyboard scsi_transport_fc hid_hyperv hv_vmbus dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
[ 1924.466487] CR2: 0000000000000058
[ 1924.469483] ---[ end trace 7b55ee81713bcee6 ]---
[ 1924.469483] RIP: 0010:dm_softirq_done+0x4f/0x240 [dm_mod]
[ 1924.469483] Code: 51 01 00 00 44 0f b6 bf 60 01 00 00 4d 8b ac 24 10 01 00 00 45 89 fe f6 47 1d 04 75 58 49 8b 7d 08 48 85 ff 74 4f 48 8b 47 08 <48> 8b 40 58 48 85 c0 74 42 49 8d 4d 50 44 89 fa 4c 89 e6 e8 69 ff
[ 1924.486483] RSP: 0018:ff66070e00210ee0 EFLAGS: 00010282
[ 1924.486483] RAX: 0000000000000000 RBX: ff3464adca1f0540 RCX: dead000000000122
[ 1924.491484] RDX: ff66070e00210f20 RSI: ff3464adca1f0598 RDI: ff66070e0009b040
[ 1924.495487] RBP: ff66070e00210f10 R08: ff3464ae1fbedfc0 R09: 0000000000000100
[ 1924.500484] R10: 0000000000000001 R11: 0000000000000230 R12: ff3464adc3ea0a80
[ 1924.505277] R13: ff3464adca1f0658 R14: 0000000000000000 R15: 0000000000000000
[ 1924.505277] FS:  0000000000000000(0000) GS:ff3464ae1fbc0000(0000) knlGS:0000000000000000
[ 1924.505277] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1924.514485] CR2: 0000000000000058 CR3: 000000087fd7e003 CR4: 0000000000361ee0
[ 1924.518489] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1924.524494] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1924.524494] Kernel panic - not syncing: Fatal exception in interrupt
[ 1924.531484] Kernel Offset: 0x24000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1924.531484] Rebooting in 1 seconds..

unable to handle page fault for address

[ 4499.650701] BUG: unable to handle page fault for address: ff6dba89c0223048
[ 4499.657016] #PF: supervisor read access in kernel mode
[ 4499.660694] #PF: error_code(0x0000) - not-present page
[ 4499.661044] scsi 76:0:0:0: Direct-Access     LIO-ORG  IBLOCK           4.0  PQ: 0 ANSI: 5
[ 4499.661853] PGD 107d65067 P4D 107d66067 PUD 107d67067 PMD 107499067 PTE 0
[ 4499.661853] Oops: 0000 [#1] SMP NOPTI
[ 4499.661853] CPU: 5 PID: 1897 Comm: flashgrid_initi Not tainted 5.4.17-2136.334.6.1.el8uek.x86_64 #3
[ 4499.661853] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/13/2024
[ 4499.661853] RIP: 0010:dm_softirq_done+0x4b/0x240 [dm_mod]
[ 4499.661853] Code: 85 e4 0f 84 51 01 00 00 44 0f b6 bf 60 01 00 00 4d 8b ac 24 10 01 00 00 45 89 fe f6 47 1d 04 75 58 49 8b 7d 08 48 85 ff 74 4f <48> 8b 47 08 48 8b 40 58 48 85 c0 74 42 49 8d 4d 50 44 89 fa 4c 89
[ 4499.661853] RSP: 0000:ff6dba89c01b8ee0 EFLAGS: 00010282
[ 4499.661853] RAX: ffffffffc0248c90 RBX: ff4862149ae40000 RCX: dead000000000122
[ 4499.661853] RDX: ff6dba89c01b8f20 RSI: ff4862149ae40058 RDI: ff6dba89c0223040
[ 4499.661853] RBP: ff6dba89c01b8f10 R08: ff4862149fb6dfc0 R09: 0000000000000100
[ 4499.661853] R10: 0000000000000001 R11: 00000000000004d0 R12: ff48621476b40a80
[ 4499.661853] R13: ff4862149ae40118 R14: 0000000000000000 R15: 0000000000000000
[ 4499.661853] FS:  00007f604bc61740(0000) GS:ff4862149fb40000(0000) knlGS:0000000000000000
[ 4499.661853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4499.661853] CR2: ff6dba89c0223048 CR3: 000000088430e004 CR4: 0000000000361ee0
[ 4499.661853] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4499.661853] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4499.661853] Call Trace:
[ 4499.661853]  <IRQ>
[ 4499.661853]  ? show_regs.cold.12+0x1a/0x1c
[ 4499.661853]  ? __die+0x86/0xd2
[ 4499.661853]  ? no_context.isra.25+0x13f/0x552
[ 4499.661853]  ? kprobe_ftrace_handler+0xa1/0xff
[ 4499.668591] scsi 76:0:0:0: alua: supports implicit and explicit TPGS
[ 4499.668563]  ? __bad_area_nosemaphore+0x43/0x1d8
[ 4499.668563]  ? bad_area_nosemaphore+0x16/0x1c
[ 4499.668563]  ? do_kern_addr_fault+0x72/0x81
[ 4499.668563]  ? __do_page_fault+0x276/0x4b8
[ 4499.668563]  ? do_page_fault+0x36/0x122
[ 4499.668563]  ? page_fault+0x13d/0x142
[ 4499.674446] scsi 76:0:0:0: alua: device naa.60014059cc3a0c2d09041e1bea47f0bf port group 0 rel port 1
[ 4499.668563]  ? dm_mq_queue_rq+0x410/0x410 [dm_mod]
[ 4499.668563]  ? dm_softirq_done+0x4b/0x240 [dm_mod]
[ 4499.668563]  blk_done_softirq+0xa5/0xd1
[ 4499.668563]  __do_softirq+0xd4/0x2cc
[ 4499.668563]  irq_exit+0x103/0x108
[ 4499.684898] sd 76:0:0:0: Attached scsi generic sg92 type 0
[ 4499.685696] sd 76:0:0:0: [sdcn] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[ 4499.685698] sd 76:0:0:0: [sdcn] 4096-byte physical blocks
[ 4499.685838] sd 76:0:0:0: [sdcn] Write Protect is off
[ 4499.685840] sd 76:0:0:0: [sdcn] Mode Sense: 43 00 00 08
[ 4499.686113] sd 76:0:0:0: [sdcn] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 4499.686479]  do_IRQ+0x59/0xe4
[ 4499.686479]  common_interrupt+0xf/0x1d2
[ 4499.686479]  </IRQ>
[ 4499.686479] RIP: 0033:0x7f604b3c0940
[ 4499.686479] Code: 25 7d 9f 50 00 0f 1f 44 00 00 f3 0f 1e fa f2 ff 25 75 9f 50 00 0f 1f 44 00 00 f3 0f 1e fa f2 ff 25 6d 9f 50 00 0f 1f 44 00 00 <f3> 0f 1e fa f2 ff 25 65 9f 50 00 0f 1f 44 00 00 f3 0f 1e fa f2 ff
[ 4499.686479] RSP: 002b:00007ffdc0c3ac78 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffde
[ 4499.686479] RAX: 0000000000c79d12 RBX: 00007f604bb91260 RCX: 00007f603d639660
[ 4499.686479] RDX: 0000000000d28270 RSI: 0000000000d28270 RDI: 00007f603d639660
[ 4499.686479] RBP: 00007f604ba9ed58 R08: 0000000000c79d06 R09: 0000000000000002
[ 4499.686479] R10: b6152e7475dc8841 R11: 000000000000000f R12: 00007f604ba9ec80
[ 4499.686479] R13: 00007f604ba9ed50 R14: 00007f604b9fca58 R15: 0000000000c79d04
[ 4499.686479] Modules linked in: dm_queue_length iscsi_tcp libiscsi_tcp libiscsi target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod dm_multipath vxlan ip6_udp_tunnel udp_tunnel act_mirred sch_ingress ifb cls_u32 act_gact cls_bpf sch_hfsc rfkill scsi_transport_iscsi nft_counter nft_chain_nat xt_nat nf_nat nft_compat sunrpc intel_rapl_msr intel_rapl_common nfit libnvdimm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul vfat fat ghash_clmulni_intel mlx5_ib aesni_intel crypto_simd ib_uverbs cryptd ib_core hv_utils glue_helper pcspkr joydev sch_fq_codel binfmt_misc xfs mlx5_core sr_mod mlxfw cdrom tls sd_mod psample sg pci_hyperv pci_hyperv_intf hv_storvsc serio_raw hv_netvsc scsi_transport_fc hid_hyperv hyperv_keyboard hv_vmbus dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
[ 4499.691301] sd 76:0:0:0: [sdcn] Optimal transfer size 524288 bytes
[ 4499.883451] CR2: ff6dba89c0223048
[ 4499.892959] ---[ end trace 48e68afa59564894 ]---
[ 4499.892959] RIP: 0010:dm_softirq_done+0x4b/0x240 [dm_mod]
[ 4499.892959] Code: 85 e4 0f 84 51 01 00 00 44 0f b6 bf 60 01 00 00 4d 8b ac 24 10 01 00 00 45 89 fe f6 47 1d 04 75 58 49 8b 7d 08 48 85 ff 74 4f <48> 8b 47 08 48 8b 40 58 48 85 c0 74 42 49 8d 4d 50 44 89 fa 4c 89
[ 4499.892959] RSP: 0000:ff6dba89c01b8ee0 EFLAGS: 00010282
[ 4499.892959] RAX: ffffffffc0248c90 RBX: ff4862149ae40000 RCX: dead000000000122
[ 4499.892959] RDX: ff6dba89c01b8f20 RSI: ff4862149ae40058 RDI: ff6dba89c0223040
[ 4499.892959] RBP: ff6dba89c01b8f10 R08: ff4862149fb6dfc0 R09: 0000000000000100
[ 4499.892959] R10: 0000000000000001 R11: 00000000000004d0 R12: ff48621476b40a80
[ 4499.892959] R13: ff4862149ae40118 R14: 0000000000000000 R15: 0000000000000000
[ 4499.892959] FS:  00007f604bc61740(0000) GS:ff4862149fb40000(0000) knlGS:0000000000000000
[ 4499.892959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4499.892959] CR2: ff6dba89c0223048 CR3: 000000088430e004 CR4: 0000000000361ee0
[ 4499.892959] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4499.892959] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4499.892959] Kernel panic - not syncing: Fatal exception in interrupt
[ 4499.892959] Kernel Offset: 0x15800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 4499.892959] Rebooting in 1 seconds..

Details

The panic happens on the initiator side when the target server is unexpectedly rebooted on OL8 UEK6. We tested different OS and kernel combinations with different values of the iSCSI node.session.nr_sessions parameter and concluded that only UEK6 with nr_sessions > 1 is affected. The failure rate is about 5 panics per 25 reboots (20%).
The configurations we tested are summarized below:

| OS | Kernel | Kernel Acronym | node.session.nr_sessions | Kernel panic? |
| --- | --- | --- | --- | --- |
| OL7 | 3.10.0-1160.119.1.0.1.el7.x86_64 | RHCK | 4 | 😃 No |
| OL8 | 5.4.17-2136.334.6.1.el8uek.x86_64 | UEK6 | 1 | 😃 No |
| OL8 | 5.4.17-2136.334.6.1.el8uek.x86_64 | UEK6 | 2 | 😡 Yes |
| OL8 | 5.4.17-2136.334.6.1.el8uek.x86_64 | UEK6 | 4 | 😡 Yes |
| RHEL8 | 4.18.0-553.8.1.el8_10.x86_64 | RHCK | 4 | 😃 No |
| OL9 | 5.15.0-205.149.5.4.el9uek.x86_64 | UEK7 | 4 | 😃 No |
| RHEL9 | 5.14.0-427.16.1.el9_4.x86_64 | RHCK | 4 | 😃 No |
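
The conclusion drawn from this matrix can be checked mechanically. The following is a minimal sketch that encodes the rows above as data and verifies that the panicking configurations are exactly the UEK6 runs with nr_sessions > 1. The `Run` type and the `predicted_affected` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Run(NamedTuple):
    os: str
    kernel: str
    flavor: str        # the "Kernel Acronym" column
    nr_sessions: int   # node.session.nr_sessions
    panicked: bool


# Rows copied from the test matrix above.
RUNS = [
    Run("OL7", "3.10.0-1160.119.1.0.1.el7.x86_64", "RHCK", 4, False),
    Run("OL8", "5.4.17-2136.334.6.1.el8uek.x86_64", "UEK6", 1, False),
    Run("OL8", "5.4.17-2136.334.6.1.el8uek.x86_64", "UEK6", 2, True),
    Run("OL8", "5.4.17-2136.334.6.1.el8uek.x86_64", "UEK6", 4, True),
    Run("RHEL8", "4.18.0-553.8.1.el8_10.x86_64", "RHCK", 4, False),
    Run("OL9", "5.15.0-205.149.5.4.el9uek.x86_64", "UEK7", 4, False),
    Run("RHEL9", "5.14.0-427.16.1.el9_4.x86_64", "RHCK", 4, False),
]


def predicted_affected(run):
    """The hypothesis stated in the report: UEK6 with more than one session."""
    return run.flavor == "UEK6" and run.nr_sessions > 1


# The hypothesis holds if it agrees with the observed result on every row.
mismatches = [r for r in RUNS if predicted_affected(r) != r.panicked]
print("hypothesis consistent with all rows:", not mismatches)
```

Note that UEK6 was the only kernel tested with more than one value of nr_sessions, so the matrix does not show how the other kernels behave with 1 or 2 sessions.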

These specific kernel traces are from Azure, but we have encountered this issue on AWS too, so it is not Azure-specific.

YoderExMachina (Member) commented

Oracle Linux customers, please file your issue at https://support.oracle.com

Thanks for filing an issue with Oracle Linux.

GitHub Issues is not an official support channel and we don't offer
product support here. If you're not yet an Oracle Linux customer,
consider signing up at https://linux.oracle.com.

Even if you're not a customer, if we can confirm that an issue is a
bug we will do our best to fix it and to update this issue
once it has been fixed. We don't guarantee a fix or feedback and
for now, we will close this issue. If you have Oracle Linux support,
please use support.oracle.com to report issues.
