sfptpd kernel crash - null pointer - kernel 5.14.0-344.el9.x86_64 openonload 8.1.1.17 #168

Closed
agronaught opened this issue Aug 24, 2023 · 3 comments
Labels
sfc-bug Bug in sfc net driver component

Comments

@agronaught

Repeatable kernel crash as soon as sfptpd starts trying to send delay information.

uname -a
Linux omxps901 5.14.0-344.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jul 24 09:26:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

[root@omxps901 127.0.0.1-2023-08-24-05:07:38]# rpm -ql onload-kmod-5.14.0-344.el9
/lib/modules/5.14.0-344.el9.x86_64/extra/driverlink_api.h
/lib/modules/5.14.0-344.el9.x86_64/extra/filter.h
/lib/modules/5.14.0-344.el9.x86_64/extra/onload.ko
/lib/modules/5.14.0-344.el9.x86_64/extra/onload.symvers
/lib/modules/5.14.0-344.el9.x86_64/extra/sfc.ko
/lib/modules/5.14.0-344.el9.x86_64/extra/sfc.symvers
/lib/modules/5.14.0-344.el9.x86_64/extra/sfc_char.ko
/lib/modules/5.14.0-344.el9.x86_64/extra/sfc_char.symvers
/lib/modules/5.14.0-344.el9.x86_64/extra/sfc_driverlink.ko
/lib/modules/5.14.0-344.el9.x86_64/extra/sfc_resource.ko
/lib/modules/5.14.0-344.el9.x86_64/extra/sfc_resource.symvers

vmcore extract:

[ 4427.831812] sfc 0000:08:00.1 et1: unknown private ioctl cmd ef29
[ 4427.831828] sfc 0000:08:00.1 et1: unknown private ioctl cmd ef29
[ 4427.832188] sfc 0000:08:00.1 et1: unknown private ioctl cmd ef29
[ 4427.832210] sfc 0000:08:00.1 et1: unknown private ioctl cmd ef29
[ 4427.832239] sfc 0000:08:00.1 et1: unknown private ioctl cmd ef29
[ 4530.887654] device bond0 entered promiscuous mode
[ 4530.887660] device et1 entered promiscuous mode
[ 4536.460709] device bond0 left promiscuous mode
[ 4536.460714] device et1 left promiscuous mode
[ 4780.692666] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 4780.694045] #PF: supervisor write access in kernel mode
[ 4780.695291] #PF: error_code(0x0002) - not-present page
[ 4780.696498] PGD 800000013ee5e067 P4D 800000013ee5e067 PUD 13ecd1067 PMD 0 
[ 4780.697716] Oops: 0002 [#1] PREEMPT SMP PTI
[ 4780.698922] CPU: 0 PID: 87021 Comm: ptp Kdump: loaded Tainted: G S      W  OE     -------  ---  5.14.0-344.el9.x86_64 #1
[ 4780.700153] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/29/2021
[ 4780.701388] RIP: 0010:skb_queue_tail+0x31/0x50
[ 4780.703848] Code: 4c 8d 67 14 55 48 89 f5 53 48 89 fb 4c 89 e7 e8 b5 ba 2a 00 4c 89 e7 48 89 c6 48 8b 43 08 48 89 5d 00 48 89 45 08 48 89 6b 08 <48> 89 28 8b 43 10 83 c0 01 89 43 10 5b 5d 41 5c e9 3a bb 2a 00 66
[ 4780.705179] RSP: 0018:ffffb06488c3b820 EFLAGS: 00010046
[ 4780.706474] RAX: 0000000000000000 RBX: ffff8c639c8de848 RCX: ffff8c677df9d400
[ 4780.707793] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff8c639c8de85c
[ 4780.709115] RBP: ffff8c639cf83d00 R08: ffff8c6c813374ac R09: 0000000000000001
[ 4780.710429] R10: 2c04cc78ba2a6697 R11: 736f6d6570736575 R12: ffff8c639c8de85c
[ 4780.711737] R13: 0000000000000000 R14: ffff8c639c8e5000 R15: ffff8c639cf83d00
[ 4780.713047] FS:  00007fd6ce5fe640(0000) GS:ffff8c6adfc00000(0000) knlGS:0000000000000000
[ 4780.714354] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4780.715642] CR2: 0000000000000000 CR3: 000000014802c001 CR4: 00000000001706f0
[ 4780.716988] Call Trace:
[ 4780.718349]  <TASK>
[ 4780.719701]  ? show_trace_log_lvl+0x1c4/0x2df
[ 4780.721060]  ? show_trace_log_lvl+0x1c4/0x2df
[ 4780.722414]  ? efx_ptp_tx+0x1a/0x50 [sfc]
[ 4780.723853]  ? __die_body.cold+0x8/0xd
[ 4780.725197]  ? page_fault_oops+0x134/0x170
[ 4780.726551]  ? kernelmode_fixup_or_oops+0x84/0x110
[ 4780.727913]  ? exc_page_fault+0x62/0x150
[ 4780.729269]  ? asm_exc_page_fault+0x22/0x30
[ 4780.730632]  ? skb_queue_tail+0x31/0x50
[ 4780.731986]  efx_ptp_tx+0x1a/0x50 [sfc]
[ 4780.733741]  dev_hard_start_xmit+0xc7/0x210
[ 4780.735433]  sch_direct_xmit+0x9e/0x370
[ 4780.737096]  __dev_xmit_skb+0x2b2/0x520
[ 4780.738758]  __dev_queue_xmit+0x362/0x6a0
[ 4780.740416]  ? ip_generic_getfrag+0x62/0x100
[ 4780.742091]  bond_start_xmit+0x40/0xa0 [bonding]
[ 4780.743695]  dev_hard_start_xmit+0xc7/0x210
[ 4780.745318]  __dev_queue_xmit+0x5b0/0x6a0
[ 4780.746924]  ? __ip_make_skb+0x2ff/0x490
[ 4780.748549]  ? ip_mc_output+0xbf/0x2e0
[ 4780.750087]  ip_finish_output2+0x21d/0x420
[ 4780.751757]  ip_send_skb+0x8a/0x90
[ 4780.753440]  udp_send_skb+0x154/0x370
[ 4780.755117]  udp_sendmsg+0xc41/0xf70
[ 4780.756779]  ? __pfx_ip_generic_getfrag+0x10/0x10
[ 4780.758474]  ? finish_task_switch.isra.0+0x207/0x2a0
[ 4780.760170]  ? sock_sendmsg+0x5b/0x70
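
For readers unfamiliar with the `#PF: error_code(0x0002)` line in the oops, here is a minimal sketch of how the x86 page-fault error code bits decode. The function name `decode_pf_error_code` is hypothetical and only the low, commonly documented bits are handled.

```python
def decode_pf_error_code(code: int) -> dict:
    """Decode the low bits of an x86 page-fault error code.

    Bit 0: 0 = page not present, 1 = protection violation
    Bit 1: 0 = read access,      1 = write access
    Bit 2: 0 = supervisor mode,  1 = user mode
    Bit 4: 1 = fault caused by an instruction fetch
    """
    return {
        "cause": "protection violation" if code & 0x1 else "not-present page",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "supervisor",
        "instruction_fetch": bool(code & 0x10),
    }


# The value reported in the oops above.
decoded = decode_pf_error_code(0x0002)
print(decoded)
```

For 0x0002 this gives a supervisor-mode write to a not-present page, which agrees with the two `#PF:` lines in the log. CR2 (the faulting address) and RAX are both zero in the register dump, consistent with a write through a NULL pointer inside skb_queue_tail.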
@abower-amd
Collaborator

Hi @agronaught, thanks for the report!

For a supported Onload version like this I would normally suggest raising a ticket with [email protected] but this is not yet a supported kernel!

Although you have shown that Onload is installed, along with the out-of-tree sfc driver, is the out-of-tree sfc module actually loaded at this point?

The message:

[ 4427.831812] sfc 0000:08:00.1 et1: unknown private ioctl cmd ef29

suggests either that the in-tree sfc module is loaded or that an ancient version of sfptpd is running.

Could you check that the out-of-tree module is actually loaded, e.g. with ethtool -i <intf>? This reports the kernel version if the in-tree driver is in use, or the real sfc version if the out-of-tree driver is in use. That will determine whether this is an issue with the in-tree or the out-of-tree driver.
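
The check described above could be scripted roughly as follows. This is only a sketch: the function name `sfc_driver_origin` is hypothetical, and it assumes that the in-tree driver reports exactly the running kernel release in the `version:` field of `ethtool -i`, as described in this comment.

```python
def sfc_driver_origin(ethtool_output: str, kernel_release: str) -> str:
    """Classify `ethtool -i <intf>` output.

    Returns 'in-tree' when the reported driver version equals the kernel
    release, 'out-of-tree' when the sfc driver reports any other version,
    and 'not-sfc' when the interface is not driven by sfc at all.
    """
    fields = {}
    for line in ethtool_output.splitlines():
        # Split on the first colon only; values such as bus-info contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    if fields.get("driver") != "sfc":
        return "not-sfc"
    if fields.get("version") == kernel_release:
        return "in-tree"
    return "out-of-tree"


# Possible usage on a live system (not run here):
#   import platform, subprocess
#   out = subprocess.run(["ethtool", "-i", "et1"],
#                        capture_output=True, text=True, check=True).stdout
#   print(sfc_driver_origin(out, platform.release()))
```

The interface name `et1` in the usage comment is taken from the log above; substitute whichever interface sfptpd is using.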

Thanks!

@agronaught
Author

Found the cause anyway.

Our build was setting the sfc module option:

options sfc rss_cpus=2

This was a legacy setup.

Its effect was the following message in the boot log:

Aug 25 02:43:58 omxps901 kernel: sfc 0000:08:00.3 eth7: ERROR: PTP requires MSI-X and 1 additional interruptvector. PTP disabled

When sfptpd was started we got the kernel crash, presumably because the PTP code path was never initialised during module initialisation.

So the root cause was a configuration issue (fixed by changing to options sfc rss_cpus=32); however, a configuration error like this shouldn't result in a kernel crash.
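
On a driver build without the fix, one way to catch this condition before starting sfptpd is to scan the kernel log for the message quoted above. The following is a sketch only: the function name `interfaces_with_ptp_disabled` is hypothetical, and the pattern is based solely on the single log line in this comment, so the wording may differ in other driver versions.

```python
import re

# Pattern derived from the boot-log line quoted in this comment (assumption:
# other driver versions may word the message differently).
PTP_DISABLED = re.compile(r"sfc \S+ (\S+): ERROR: PTP requires MSI-X.*PTP disabled")


def interfaces_with_ptp_disabled(log_text: str) -> list[str]:
    """Return interface names for which the sfc driver logged 'PTP disabled'."""
    found = []
    for line in log_text.splitlines():
        match = PTP_DISABLED.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


sample = (
    "Aug 25 02:43:58 omxps901 kernel: sfc 0000:08:00.3 eth7: ERROR: "
    "PTP requires MSI-X and 1 additional interruptvector. PTP disabled"
)
print(interfaces_with_ptp_disabled(sample))
```

In practice the input would be the output of dmesg or journalctl -k rather than a hard-coded string. Any interface returned here should not be handed to sfptpd until the module options are corrected.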

@abower-amd added the bug (Something isn't working) and sfc-bug (Bug in sfc net driver component) labels and removed the bug (Something isn't working) label on Apr 16, 2024
@abower-amd
Collaborator

This is fixed in 88ca3c8.
