Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth TX Timeout Issue #66

Closed
ratherDashing opened this issue Jul 24, 2012 · 36 comments
Closed

Bluetooth TX Timeout Issue #66

ratherDashing opened this issue Jul 24, 2012 · 36 comments
Assignees

Comments

@ratherDashing
Copy link

Seems to be an issue with the 3.1 kernel: https://bugs.archlinux.org/task/27449

Steps to recreate:

  • Run bluetooth-agent 0000
  • Try to pair phone with computer

Background information:

hci0: Type: BR/EDR Bus: USB
BD Address: 00:02:72:BB:E5:54 ACL MTU: 1022:8 SCO MTU: 121:3
UP RUNNING PSCAN
RX bytes:18134 acl:205 sco:0 events:632 errors:0
TX bytes:14192 acl:221 sco:0 commands:195 errors:0
Features: 0xff 0xfe 0x0d 0xfe 0x98 0x7f 0x79 0x87
Packet type: DM1 DM3 DM5 DH1 DH3 DH5 HV1 HV2 HV3
Link policy: RSWITCH HOLD SNIFF
Link mode: SLAVE ACCEPT
Name: 'raspberrypi-0'
Class: 0x420100
Service Classes: Networking, Telephony
Device Class: Computer, Uncategorized
HCI Version: 3.0 (0x5) Revision: 0x9999
LMP Version: 3.0 (0x5) Subversion: 0x9999
Manufacturer: Atheros Communications, Inc. (69)

Distro: Raspbian
Kernel: [ 0.000000] Linux version 3.1.9+ (dc4@dc4-arm-01) (gcc version 4.5.1 (Broadcom-2708) ) #168 PREEMPT Sat Jul 14 18:56:31 BST 2012

@popcornmix
Copy link
Collaborator

Well, latest firmware is now 3.2.27. Also the USB driver has been udpated. Can you test again?
You could also try adding:
dwc_otg.microframe_schedule=1
to cmdline.txt.

@gregd72002
Copy link

I'm facing the same issue. Just tested it with the latest kernel:
Linux raspberrypi 3.2.27+ #60 PREEMPT Thu Aug 23 15:33:51 BST 2012 armv6l GNU/Linux

And the issues is still there.

It looks like we need kernel 3.3 to fix it (based on the archlinux bug report)
or the following patches:
http://article.gmane.org/gmane.linux.bluez.kernel/19950/
http://article.gmane.org/gmane.linux.bluez.kernel/19951/

@popcornmix
Copy link
Collaborator

@gregd72002
Have you tried applying the patches? Does it fix it?

@gregd72002
Copy link

No, I have not. I do not have the environment just now to compile it.

@gregd72002
Copy link

Tried it overnight and it works! However the patches do no apply cleanly.. Will try to issue a diff in a moment

@gregd72002
Copy link

This is a consolidated patch that fixes the issue in the current kernel sources.

I also tested it myself by issuing some scan commands and all seems to be fine!

diff -rupN linux.old/include/net/bluetooth/hci_core.h linux.new/include/net/bluetooth/hci_core.h
--- linux.old/include/net/bluetooth/hci_core.h  2012-08-25 10:09:56.532713215 +0100
+++ linux.new/include/net/bluetooth/hci_core.h  2012-08-25 10:17:33.167036880 +0100
@@ -130,7 +130,7 @@ struct hci_dev {
    __u8        major_class;
    __u8        minor_class;
    __u8        features[8];
-   __u8        extfeatures[8];
+   __u8        host_features[8];
    __u8        commands[64];
    __u8        ssp_mode;
    __u8        hci_ver;
@@ -618,7 +618,7 @@ void hci_conn_del_sysfs(struct hci_conn
 #define lmp_le_capable(dev)        ((dev)->features[4] & LMP_LE)

 /* ----- Extended LMP capabilities ----- */
-#define lmp_host_le_capable(dev)   ((dev)->extfeatures[0] & LMP_HOST_LE)
+#define lmp_host_le_capable(dev)   ((dev)->host_features[0] & LMP_HOST_LE)

 /* ----- HCI protocols ----- */
 struct hci_proto {
diff -rupN linux.old/net/bluetooth/hci_event.c linux.new/net/bluetooth/hci_event.c
--- linux.old/net/bluetooth/hci_event.c 2012-08-25 10:11:25.409662054 +0100
+++ linux.new/net/bluetooth/hci_event.c 2012-08-25 10:18:06.035908486 +0100
@@ -695,7 +695,14 @@ static void hci_cc_read_local_ext_featur
    if (rp->status)
        return;

-   memcpy(hdev->extfeatures, rp->features, 8);
+   switch (rp->page) {
+   case 0:
+       memcpy(hdev->features, rp->features, 8);
+       break;
+   case 1:
+       memcpy(hdev->host_features, rp->features, 8);
+       break;
+   }

    hci_req_complete(hdev, HCI_OP_READ_LOCAL_EXT_FEATURES, rp->status);
 }

@gregd72002
Copy link

Right, this does work partially on my brand-less dongle. The timeout issue now appears randomly. While I was able to pair one device once now the timeout issue is back. Not sure what it depends on.

Also, I tried to use a different bluetooth dongle (Belkin mini) and with this one nothing works at all..

I believe we really need kernel-3.3/3.4 in here... any timeframes for this?

@gregd72002
Copy link

Did some debugging:

[  940.548222] Bluetooth: Generic Bluetooth USB driver ver 0.6
[  940.548424] hci_register_dev: c8594800 name  bus 1 owner bf296d08
[  940.557617] hci_register_sysfs: c8594800 name hci0 bus 1
[  940.560798] hci_power_on: hci0
[  940.560857] hci_dev_get: 0
[  940.560878] hci_dev_open: hci0 c8594800
[  940.560970] __hci_request: hci0 start
[  940.561014] hci_init_req: hci0 0
[  940.561054] hci_send_cmd: hci0 opcode 0xc03 plen 0
[  940.561112] hci_send_cmd: skb len 3
[  940.561181] hci_cmd_task: hci0 cmd 1
[  940.561248] hci_send_frame: hci0 type 1 len 3
[  940.561387] hci_send_cmd: hci0 opcode 0x1003 plen 0
[  940.561484] hci_send_cmd: skb len 3
[  940.561527] hci_send_cmd: hci0 opcode 0x1001 plen 0
[  940.561600] hci_cmd_task: hci0 cmd 0
[  940.561655] hci_send_cmd: skb len 3
[  940.561722] hci_cmd_task: hci0 cmd 0
[  940.561756] hci_send_cmd: hci0 opcode 0x1005 plen 0
[  940.561792] hci_send_cmd: skb len 3
[  940.561810] hci_send_cmd: hci0 opcode 0x1009 plen 0
[  940.561862] hci_cmd_task: hci0 cmd 0
[  940.561915] hci_send_cmd: skb len 3
[  940.561934] hci_send_cmd: hci0 opcode 0xc23 plen 0
[  940.561969] hci_cmd_task: hci0 cmd 0
[  940.562007] hci_send_cmd: skb len 3
[  940.562042] hci_send_cmd: hci0 opcode 0xc14 plen 0
[  940.562084] hci_cmd_task: hci0 cmd 0
[  940.562136] hci_send_cmd: skb len 3
[  940.562172] hci_send_cmd: hci0 opcode 0xc25 plen 0
[  940.562207] hci_cmd_task: hci0 cmd 0
[  940.562260] hci_send_cmd: skb len 3
[  940.562296] hci_send_cmd: hci0 opcode 0xc05 plen 1
[  940.562332] hci_cmd_task: hci0 cmd 0
[  940.562384] hci_send_cmd: skb len 4
[  940.562420] hci_send_cmd: hci0 opcode 0xc16 plen 2
[  940.562455] hci_cmd_task: hci0 cmd 0
[  940.562507] hci_send_cmd: skb len 5
[  940.562543] hci_send_cmd: hci0 opcode 0xc12 plen 7
[  940.562587] hci_cmd_task: hci0 cmd 0
[  940.562618] hci_send_cmd: skb len 10
[  940.562674] hci_cmd_task: hci0 cmd 0
[  940.562793] hci_sock_dev_event: hdev hci0 event 1
[  940.562861] hci_send_to_sock: hdev   (null) len 8
[  940.563467] usbcore: registered new interface driver btusb
[  940.861966] hci_rx_task: hci0
[  940.862040] hci_cc_reset: hci0 status 0x0
[  940.862079] hci_req_complete: hci0 command 0x0c03 result 0x00
[  940.862145] hci_cmd_task: hci0 cmd 1
[  940.862192] hci_send_frame: hci0 type 1 len 3
[  940.863942] hci_rx_task: hci0
[  940.864011] hci_cc_read_local_features: hci0 status 0x0
[  940.864056] hci_cc_read_local_features: hci0 features 0xffff8ffe9bff7983
[  940.864103] hci_cmd_task: hci0 cmd 1
[  940.864164] hci_send_frame: hci0 type 1 len 3
[  940.865963] hci_rx_task: hci0
[  940.866034] hci_cc_read_local_version: hci0 status 0x0
[  940.866075] hci_cc_read_local_version: hci0 manufacturer 15 hci ver 4:20868
[  940.866121] hci_send_cmd: hci0 opcode 0xc01 plen 8
[  940.866175] hci_send_cmd: skb len 11
[  940.866211] hci_send_cmd: hci0 opcode 0x1002 plen 0
[  940.866248] hci_send_cmd: skb len 3
[  940.866284] hci_send_cmd: hci0 opcode 0xc56 plen 1
[  940.866444] hci_send_cmd: skb len 4
[  940.866508] hci_send_cmd: hci0 opcode 0xc45 plen 1
[  940.866561] hci_send_cmd: skb len 4
[  940.866596] hci_send_cmd: hci0 opcode 0xc58 plen 0
[  940.866641] hci_send_cmd: skb len 3
[  940.866676] hci_send_cmd: hci0 opcode 0x1004 plen 1
[  940.866727] hci_send_cmd: skb len 4
[  940.866802] hci_cmd_task: hci0 cmd 1
[  940.866874] hci_send_frame: hci0 type 1 len 3
[  940.867920] hci_rx_task: hci0
[  940.867979] hci_cc_read_buffer_size: hci0 status 0x0
[  940.868024] hci_cc_read_buffer_size: hci0 acl mtu 1021:8 sco mtu 64:1
[  940.868066] hci_cmd_task: hci0 cmd 1
[  940.868117] hci_send_frame: hci0 type 1 len 3
[  940.868939] hci_rx_task: hci0
[  940.868994] hci_cc_read_bd_addr: hci0 status 0x0
[  940.869033] hci_req_complete: hci0 command 0x1009 result 0x00
[  940.869091] hci_cmd_task: hci0 cmd 1
[  940.869156] hci_send_frame: hci0 type 1 len 3
[  940.870930] hci_rx_task: hci0
[  940.870977] hci_cc_read_class_of_dev: hci0 status 0x0
[  940.871018] hci_cc_read_class_of_dev: hci0 class 0x000000
[  940.871075] hci_cmd_task: hci0 cmd 1
[  940.871138] hci_send_frame: hci0 type 1 len 3
[  940.876953] hci_rx_task: hci0
[  940.877025] hci_event_packet: hci0 event 0x69
[  940.877969] hci_rx_task: hci0
[  940.878035] hci_event_packet: hci0 event 0x0
[  940.878090] hci_event_packet: hci0 event 0x0
[  940.878134] hci_event_packet: hci0 event 0x0
[  940.878170] hci_event_packet: hci0 event 0x0
[  940.878208] hci_event_packet: hci0 event 0x0
[  940.878249] hci_event_packet: hci0 event 0x0
[  940.878285] hci_event_packet: hci0 event 0x0
[  940.878320] hci_event_packet: hci0 event 0x0
[  940.878964] hci_rx_task: hci0
[  940.879014] hci_event_packet: hci0 event 0x0
[  940.879054] hci_event_packet: hci0 event 0x0
[  940.879091] hci_event_packet: hci0 event 0x0
[  940.879129] hci_event_packet: hci0 event 0x0
[  940.879179] hci_event_packet: hci0 event 0x0
[  940.879215] hci_event_packet: hci0 event 0x0
[  940.879251] hci_event_packet: hci0 event 0x0
[  940.879287] hci_event_packet: hci0 event 0x0
[  940.879956] hci_rx_task: hci0
[  940.880020] hci_event_packet: hci0 event 0x0
[  940.880073] hci_event_packet: hci0 event 0x0
[  940.880118] hci_event_packet: hci0 event 0x0
[  940.880154] hci_event_packet: hci0 event 0x0
[  940.880191] hci_event_packet: hci0 event 0x0
[  940.880226] hci_event_packet: hci0 event 0x0
[  940.880268] hci_event_packet: hci0 event 0x0
[  940.880304] hci_event_packet: hci0 event 0x0
[  940.880964] hci_rx_task: hci0
[  940.881028] hci_event_packet: hci0 event 0x0
[  940.881081] hci_event_packet: hci0 event 0x0
[  940.881102] hci_event_packet: hci0 event 0x0
[  940.881154] hci_event_packet: hci0 event 0x0
[  940.881190] hci_event_packet: hci0 event 0x0
[  940.881226] hci_event_packet: hci0 event 0x0
[  940.881268] hci_event_packet: hci0 event 0x0
[  940.881303] hci_event_packet: hci0 event 0x0
[  940.883971] hci_rx_task: hci0
[  940.884040] hci_event_packet: hci0 event 0x0
[  940.884094] hci_event_packet: hci0 event 0x0
[  940.884139] hci_event_packet: hci0 event 0x0
[  940.884175] hci_event_packet: hci0 event 0x0
[  940.884211] hci_event_packet: hci0 event 0x0
[  940.884253] hci_event_packet: hci0 event 0x0
[  940.884289] hci_event_packet: hci0 event 0x0
[  940.884325] hci_event_packet: hci0 event 0x0
[  940.884958] hci_rx_task: hci0
[  940.885003] hci_event_packet: hci0 event 0x0
[  940.885042] hci_event_packet: hci0 event 0x0
[  940.885095] hci_event_packet: hci0 event 0x0
[  940.885123] hci_event_packet: hci0 event 0x0
[  940.885159] hci_event_packet: hci0 event 0x0
[  940.885195] hci_event_packet: hci0 event 0x0
[  940.885232] hci_event_packet: hci0 event 0x0
[  940.885269] hci_event_packet: hci0 event 0x0
[  940.885967] hci_rx_task: hci0
[  940.886028] hci_event_packet: hci0 event 0x0
[  940.886081] hci_event_packet: hci0 event 0x0
[  940.886135] hci_event_packet: hci0 event 0x0
[  940.886172] hci_event_packet: hci0 event 0x0
[  940.886208] hci_event_packet: hci0 event 0x0
[  940.886261] hci_event_packet: hci0 event 0x0
[  940.886298] hci_event_packet: hci0 event 0x0
[  940.886450] hci_event_packet: hci0 event 0x0
[  940.886966] hci_rx_task: hci0
[  940.887017] hci_event_packet: hci0 event 0x0
[  940.887057] hci_event_packet: hci0 event 0x0
[  940.887108] hci_event_packet: hci0 event 0x0
[  940.887150] hci_event_packet: hci0 event 0x0
[  940.887186] hci_event_packet: hci0 event 0x0
[  940.887221] hci_event_packet: hci0 event 0x0
[  940.887259] hci_event_packet: hci0 event 0x0
[  940.887295] hci_event_packet: hci0 event 0x0
[  940.887965] hci_rx_task: hci0
[  940.888015] hci_event_packet: hci0 event 0x0
[  940.888054] hci_event_packet: hci0 event 0x0
[  940.888092] hci_event_packet: hci0 event 0x0
[  940.888131] hci_event_packet: hci0 event 0x0
[  940.888166] hci_event_packet: hci0 event 0x0
[  940.888202] hci_event_packet: hci0 event 0x0
[  940.888244] hci_event_packet: hci0 event 0x0
[  940.888279] hci_event_packet: hci0 event 0x0
[  940.888961] hci_rx_task: hci0
[  940.889010] hci_event_packet: hci0 event 0x0
[  940.889049] hci_event_packet: hci0 event 0x0
[  940.889085] hci_event_packet: hci0 event 0x0
[  940.889127] hci_event_packet: hci0 event 0x0
[  940.889163] hci_event_packet: hci0 event 0x0
[  940.889198] hci_event_packet: hci0 event 0x0
[  940.889242] hci_event_packet: hci0 event 0x0
[  940.889283] hci_event_packet: hci0 event 0x0
[  940.889967] hci_rx_task: hci0
[  940.890014] hci_event_packet: hci0 event 0x0
[  940.890053] hci_event_packet: hci0 event 0x0
[  940.890104] hci_event_packet: hci0 event 0x0
[  940.890147] hci_event_packet: hci0 event 0x0
[  940.890181] hci_event_packet: hci0 event 0x0
[  940.890217] hci_event_packet: hci0 event 0x0
[  940.890254] hci_event_packet: hci0 event 0x0
[  941.866443] Bluetooth: hci0 command tx timeout
[  941.866522] hci_cmd_task: hci0 cmd 1
[  941.866585] hci_send_frame: hci0 type 1 len 3
[  942.866455] Bluetooth: hci0 command tx timeout
[  942.866564] hci_cmd_task: hci0 cmd 1
[  942.866638] hci_send_frame: hci0 type 1 len 4
[  943.866492] Bluetooth: hci0 command tx timeout
[  943.866569] hci_cmd_task: hci0 cmd 1
[  943.866635] hci_send_frame: hci0 type 1 len 5
[  943.868358] hci_rx_task: hci0
[  943.868426] hci_event_packet: hci0 event 0x0
[  943.868491] hci_conn_request_evt: hci0 bdaddr 00:00:CF:E4:00:16 type 0x0
[  943.868553] sco_connect_ind: hdev hci0, bdaddr 00:00:CF:E4:00:16
[  943.868613] hci_inquiry_cache_lookup: cache c8594bdc, 00:00:CF:E4:00:16
[  943.868673] hci_conn_add: hci0 dst 00:00:CF:E4:00:16
[  943.868789] hci_conn_init_sysfs: conn c8587200
[  943.868848] hci_send_cmd: hci0 opcode 0x429 plen 21
[  943.868915] hci_send_cmd: skb len 24
[  943.868969] hci_remote_version_evt: hci0
[  943.869010] hci_cmd_task: hci0 cmd 0
[  944.866530] Bluetooth: hci0 command tx timeout
[  944.866607] hci_cmd_task: hci0 cmd 1
[  944.866681] hci_send_frame: hci0 type 1 len 10
[  944.944545] hci_rx_task: hci0
[  944.944601] hci_cc_delete_stored_link_key: hci0 status 0x0
[  944.944641] hci_req_complete: hci0 command 0x0c12 result 0x00
[  944.944708] hci_cmd_task: hci0 cmd 1
[  944.944755] hci_send_frame: hci0 type 1 len 11
[  944.946609] hci_rx_task: hci0
[  944.946658] hci_cc_set_event_mask: hci0 status 0x0
[  944.946699] hci_req_complete: hci0 command 0x0c01 result 0x00
[  944.946818] hci_cmd_task: hci0 cmd 1
[  944.946878] hci_send_frame: hci0 type 1 len 3
[  945.316991] usb 1-1.3.2: reset full-speed USB device number 6 using dwc_otg
[  945.647063] usb 1-1.3.1: reset full-speed USB device number 5 using dwc_otg
[  945.857047] usb 1-1.3.1: reset full-speed USB device number 5 using dwc_otg
[  945.946586] Bluetooth: hci0 command tx timeout
[  945.946659] hci_cmd_task: hci0 cmd 1
[  945.946738] hci_send_frame: hci0 type 1 len 4
[  946.946611] Bluetooth: hci0 command tx timeout
[  946.946671] hci_cmd_task: hci0 cmd 1
[  946.946747] hci_send_frame: hci0 type 1 len 4
[  947.717055] usb 1-1.3.1: reset full-speed USB device number 5 using dwc_otg
[  947.937177] usb 1-1.3.1: reset full-speed USB device number 5 using dwc_otg
[  947.946651] Bluetooth: hci0 command tx timeout
[  947.946788] hci_cmd_task: hci0 cmd 1
[  947.946885] hci_send_frame: hci0 type 1 len 3
[  948.457167] usb 1-1.3.2: reset full-speed USB device number 6 using dwc_otg
[  948.946676] Bluetooth: hci0 command tx timeout
[  948.946749] hci_cmd_task: hci0 cmd 1
[  948.946814] hci_send_frame: hci0 type 1 len 4
[  948.948068] hci_rx_task: hci0
[  948.948135] hci_cc_read_local_commands: hci0 status 0x0
[  948.948190] hci_send_cmd: hci0 opcode 0x80f plen 2
[  948.948236] hci_send_cmd: skb len 5
[  948.948274] hci_req_complete: hci0 command 0x1002 result 0x00
[  948.948339] hci_event_packet: hci0 event 0x10
[  948.948392] hci_inquiry_complete_evt: hci0 status 1
[  948.948427] hci_req_complete: hci0 command 0x0401 result 0x01
[  948.948467] hci_conn_check_pending: hdev hci0
[  948.948519] hci_event_packet: hci0 event 0x0
[  948.948554] hci_event_packet: hci0 event 0x0
[  948.948617] hci_event_packet: hci0 event 0x0
[  948.948655] hci_cmd_task: hci0 cmd 1
[  948.948698] hci_send_frame: hci0 type 1 len 24
[  949.227131] usb 1-1.3.1: reset full-speed USB device number 5 using dwc_otg
[  949.946695] Bluetooth: hci0 command tx timeout
[  949.946786] hci_cmd_task: hci0 cmd 1
[  949.946837] hci_send_frame: hci0 type 1 len 5
[  950.556840] __hci_request: TIMEOUT
[  950.556893] __hci_request: hci0 end: err -110
[  950.946690] Bluetooth: hci0 command tx timeout
[  950.946786] hci_cmd_task: hci0 cmd 1

and one more run of the same case but it fails in a different stage?

[  164.541215] Bluetooth: Core ver 2.16
[  164.543562] NET: Registered protocol family 31
[  164.543591] Bluetooth: HCI device and connection manager initialized
[  164.543611] Bluetooth: HCI socket layer initialized
[  164.543625] Bluetooth: L2CAP socket layer initialized
[  164.543698] Bluetooth: SCO socket layer initialized
[  205.951115] Bluetooth: Generic Bluetooth USB driver ver 0.6
[  205.951335] bluetooth:hci_register_dev: c84da000 name  bus 1 owner bf239d08
[  205.951619] bluetooth:hci_register_sysfs: c84da000 name hci0 bus 1
[  205.954824] bluetooth:hci_power_on: hci0
[  205.954861] bluetooth:hci_dev_get: 0
[  205.954880] bluetooth:hci_dev_open: hci0 c84da000
[  205.954974] bluetooth:__hci_request: hci0 start
[  205.954997] bluetooth:hci_init_req: hci0 0
[  205.955047] bluetooth:hci_send_cmd: hci0 opcode 0xc03 plen 0
[  205.955114] bluetooth:hci_send_cmd: skb len 3
[  205.955191] bluetooth:hci_cmd_task: hci0 cmd 1
[  205.955252] bluetooth:hci_send_frame: hci0 type 1 len 3
[  205.955350] bluetooth:hci_send_cmd: hci0 opcode 0x1003 plen 0
[  205.955528] bluetooth:hci_send_cmd: skb len 3
[  205.955585] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.955621] bluetooth:hci_send_cmd: hci0 opcode 0x1001 plen 0
[  205.955660] bluetooth:hci_send_cmd: skb len 3
[  205.955698] bluetooth:hci_send_cmd: hci0 opcode 0x1005 plen 0
[  205.955735] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.955792] bluetooth:hci_send_cmd: skb len 3
[  205.955829] bluetooth:hci_send_cmd: hci0 opcode 0x1009 plen 0
[  205.955865] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.955907] bluetooth:hci_send_cmd: skb len 3
[  205.955943] bluetooth:hci_send_cmd: hci0 opcode 0xc23 plen 0
[  205.955979] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.956054] bluetooth:hci_send_cmd: skb len 3
[  205.956093] bluetooth:hci_send_cmd: hci0 opcode 0xc14 plen 0
[  205.956130] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.956170] bluetooth:hci_send_cmd: skb len 3
[  205.956208] bluetooth:hci_send_cmd: hci0 opcode 0xc25 plen 0
[  205.956243] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.956289] bluetooth:hci_send_cmd: skb len 3
[  205.956326] bluetooth:hci_send_cmd: hci0 opcode 0xc05 plen 1
[  205.956362] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.956428] bluetooth:hci_send_cmd: skb len 4
[  205.956466] bluetooth:hci_send_cmd: hci0 opcode 0xc16 plen 2
[  205.956502] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.956542] bluetooth:hci_send_cmd: skb len 5
[  205.956579] bluetooth:hci_send_cmd: hci0 opcode 0xc12 plen 7
[  205.956614] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.956661] bluetooth:hci_send_cmd: skb len 10
[  205.956735] bluetooth:hci_cmd_task: hci0 cmd 0
[  205.956839] bluetooth:hci_sock_dev_event: hdev hci0 event 1
[  205.956887] bluetooth:hci_send_to_sock: hdev   (null) len 8
[  205.957509] usbcore: registered new interface driver btusb
[  205.959127] bluetooth:hci_rx_task: hci0
[  205.959195] bluetooth:hci_cc_reset: hci0 status 0x0
[  205.959237] bluetooth:hci_req_complete: hci0 command 0x0c03 result 0x00
[  205.959304] bluetooth:hci_cmd_task: hci0 cmd 1
[  205.959350] bluetooth:hci_send_frame: hci0 type 1 len 3
[  206.950997] Bluetooth: hci0 command tx timeout
[  206.951085] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.951166] bluetooth:hci_send_frame: hci0 type 1 len 3
[  206.952235] bluetooth:hci_rx_task: hci0
[  206.952304] bluetooth:hci_cc_read_local_version: hci0 status 0x0
[  206.952367] bluetooth:hci_cc_read_local_version: hci0 manufacturer 15 hci ver 4:20868
[  206.952428] bluetooth:hci_send_cmd: hci0 opcode 0xc01 plen 8
[  206.952484] bluetooth:hci_send_cmd: skb len 11
[  206.952521] bluetooth:hci_send_cmd: hci0 opcode 0x1002 plen 0
[  206.952565] bluetooth:hci_send_cmd: skb len 3
[  206.952625] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.952692] bluetooth:hci_send_frame: hci0 type 1 len 3
[  206.955263] bluetooth:hci_rx_task: hci0
[  206.955335] bluetooth:hci_cc_read_buffer_size: hci0 status 0x0
[  206.955378] bluetooth:hci_cc_read_buffer_size: hci0 acl mtu 1021:8 sco mtu 64:1
[  206.955444] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.955491] bluetooth:hci_send_frame: hci0 type 1 len 3
[  206.957223] bluetooth:hci_rx_task: hci0
[  206.957301] bluetooth:hci_cc_read_bd_addr: hci0 status 0x0
[  206.957341] bluetooth:hci_req_complete: hci0 command 0x1009 result 0x00
[  206.957387] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.957455] bluetooth:hci_send_frame: hci0 type 1 len 3
[  206.959230] bluetooth:hci_rx_task: hci0
[  206.959301] bluetooth:hci_cc_read_class_of_dev: hci0 status 0x0
[  206.959340] bluetooth:hci_cc_read_class_of_dev: hci0 class 0x000000
[  206.959384] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.959447] bluetooth:hci_send_frame: hci0 type 1 len 3
[  206.976246] bluetooth:hci_rx_task: hci0
[  206.976320] bluetooth:hci_cc_read_local_name: hci0 status 0x0
[  206.976388] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.976454] bluetooth:hci_send_frame: hci0 type 1 len 3
[  206.978251] bluetooth:hci_rx_task: hci0
[  206.978316] bluetooth:hci_cc_read_voice_setting: hci0 status 0x0
[  206.978356] bluetooth:hci_cc_read_voice_setting: hci0 voice setting 0x0060
[  206.978411] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.978471] bluetooth:hci_send_frame: hci0 type 1 len 4
[  206.980220] bluetooth:hci_rx_task: hci0
[  206.980269] bluetooth:hci_cc_set_event_flt: hci0 status 0x0
[  206.980310] bluetooth:hci_req_complete: hci0 command 0x0c05 result 0x00
[  206.980368] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.980433] bluetooth:hci_send_frame: hci0 type 1 len 5
[  206.982259] bluetooth:hci_rx_task: hci0
[  206.982326] bluetooth:hci_cc_write_ca_timeout: hci0 status 0x0
[  206.982365] bluetooth:hci_req_complete: hci0 command 0x0c16 result 0x00
[  206.982420] bluetooth:hci_cmd_task: hci0 cmd 1
[  206.982480] bluetooth:hci_send_frame: hci0 type 1 len 10
[  207.060284] bluetooth:hci_rx_task: hci0
[  207.060357] bluetooth:hci_cc_delete_stored_link_key: hci0 status 0x0
[  207.060397] bluetooth:hci_req_complete: hci0 command 0x0c12 result 0x00
[  207.060485] bluetooth:hci_cmd_task: hci0 cmd 1
[  207.060552] bluetooth:hci_send_frame: hci0 type 1 len 11
[  207.062256] bluetooth:hci_rx_task: hci0
[  207.062330] bluetooth:hci_cc_set_event_mask: hci0 status 0x0
[  207.062373] bluetooth:hci_req_complete: hci0 command 0x0c01 result 0x00
[  207.062429] bluetooth:hci_cmd_task: hci0 cmd 1
[  207.062489] bluetooth:hci_send_frame: hci0 type 1 len 3
[  208.061037] Bluetooth: hci0 command tx timeout
[  208.061111] bluetooth:hci_cmd_task: hci0 cmd 1
[  215.951426] bluetooth:__hci_request: TIMEOUT
[  215.951479] bluetooth:__hci_request: hci0 end: err -110


So, what does it tell us? What is the next step?

@Diaoul
Copy link

Diaoul commented Aug 27, 2012

@gregd72002: I am interested in how you did to enable that much logging, how did you achieve this?

@gregd72002
Copy link

Dialoul: you need to compile kernel with CONFIG_DYNAMIC_DEBUG option (http://www.mjmwired.net/kernel/Documentation/dynamic-debug-howto.txt)

then once using the new kernel mount debugfs:

mount -t debugfs none /sys/kernel/debug/

and activate debug for items you interested in (refer to dynamic-debug-howto.txt for more info)):

echo -n "file hci_core.c +pf" > /sys/kernel/debug/dynamic_debug/control
echo -n "file hci_event.c +pf" > /sys/kernel/debug/dynamic_debug/control
echo -n "file mgmt.c +pf" > /sys/kernel/debug/dynamic_debug/control
echo -n "file btusb.c +pf" > /sys/kernel/debug/dynamic_debug/control
echo -n "module bluetooth +pf" > /sys/kernel/debug/dynamic_debug/control
echo -n "module btusb +pf" > /sys/kernel/debug/dynamic_debug/control

Then load the module you interested in and view the debug in 'dmesg'

@gregd72002
Copy link

All my investigations are leading to an issue with USB subsystem and not bluetooth subsystem...

Cross referencing to a possible root cause: #19
raspberrypi/firmware#19

@gregd72002
Copy link

Good progress!! Resolved!!

As thought - nothing to do with bluetooth subsystem.

Modify your cmdline.txt to have:

dwc_otg.microframe_schedule=1 dwc_otg.speed=1

my full cmdline.txt looks like:

dwc_otg.lpm_enable=0 console=ttyAMA0,115200 kgdboc=ttyAMA0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline dwc_otg.microframe_schedule=1 dwc_otg.speed=1 rootwait

and everything works now as expected!

Just to share my findings:

  1. the issue with kernel 3.1.x and 3.2.x that is mentioned in the initial comment is to do with broken bluetooth dongles; this is ones that have issues with providing features lists (extfeatures).
  2. the above issue was also resolved in kernels >3.3; so I tried to incorporate kernel 3.5.4 bluetooth system into RPi kernel and this did not resolve my problem either
  3. the timeout issue I had was appearing randomly on a different occasions so nothing to do really with initiation
  4. thus, usb was suspected to be broken...

@popcornmix
Copy link
Collaborator

@gregd72002
Glad it's working. Are both microframe_schedule and speed required?
(in latest firmware microframe_schedule is actually on by default, but can be disabled with =0)

@gregd72002
Copy link

@popcornmix
I just tried it and yes, indeed one just need 'speed' option, microframe_schedule is not required based on my last hour testing

@ratherDashing
Copy link
Author

dwc_otg.speed=1

my guess is that this pushes thing back to USB 1.1 speed. I have a USB hard drive also attached to the system so that would be an issue with me.

Is that the case?

@popcornmix
Copy link
Collaborator

Yes speed=1 will limit you to USB 1 speeds.

@gregd72002
Copy link

@ratherDashing
I wouldn't be surprised to see USB1 being faster than USB2 on RPi as of USB2 having discussed issues, so they might affecting the speed...

@sanderjo
Copy link

sanderjo commented Sep 3, 2012

When I add "dwc_otg.microframe_schedule=1 dwc_otg.speed=1" to /boot/cmdline.txt and reboot my raspi (3.1.9+, firmware newest version), the raspi boot gets into an endless loop, saying

ERROR::handle_hc_chhltd_intr_dma:2040: handle_hc_chhltd_intr_dma: Channel 0, DMA Mode -- ChHltd set, but reason for halting is unkown, hcint 0x00000002, intsts 0x06000021

Am I doing something wrong? Or is the entry only valid voor kernels above 3.1?

@popcornmix
Copy link
Collaborator

@sanderjo
microframe_schedule is not supported on 3.1.9. Can you update with rpi-update?

@sanderjo
Copy link

sanderjo commented Sep 3, 2012

"sudo rpi-update" says
Your firmware is already up to date

But do you mean I should update the kernel? Can I do that with rpi-update?

FWIW: just adding "dwc_otg.speed=1" (so no microframe_schedule) to cmdline.txt also results in the loop with error messages.

@gregd72002
Copy link

@sanderjo

sudo rm /boot/.firmware_version
sudo rpi-update

reboot and paste the output of:
uname -a

@sanderjo
Copy link

sanderjo commented Sep 3, 2012

Ah, cool:

Linux raspbian-armhf-SJ 3.2.27+ #102 PREEMPT Sat Sep 1 01:00:50 BST 2012 armv6l GNU/Linux

That's good, isn't it?

However, adding "dwc_otg.speed=1" to /boot/cmdline.txt and rebooting, still results in the endless loop.

@gregd72002
Copy link

@sanderjo

Unplug all USB devices from you RPi and check if it starts normally with speed=1 ..

@sanderjo
Copy link

sanderjo commented Sep 3, 2012

OK, unplugging the USB devices (keyboard + mouse) before the boot and then booting makes the boot work. I can ssh into the Raspi. Good.

However, plugging in the USB devices later on (so: after the succesful boot), results in this:

  • keyboard and mouse not working (even no red light under the mouse)
  • ssh stops reacting, no network
  • nothing extra on tail -f /var/log/messages in the ssh session (probably because the network is gone?)
    So: connecting the USB device stops all USB stuff, including the ethernet-on-USB?

Anything else I can do?

Just to be sure, here's some info from my system:

pi@raspbian-armhf-SJ ~ $ cat /boot/cmdline.txt
dwc_otg.lpm_enable=0 dwc_otg.speed=1  console=ttyAMA0,115200 kgdboc=ttyAMA0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline rootwait

pi@raspbian-armhf-SJ ~ $ uname -a
Linux raspbian-armhf-SJ 3.2.27+ #102 PREEMPT Sat Sep 1 01:00:50 BST 2012 armv6l GNU/Linux
pi@raspbian-armhf-SJ ~ $

@gregd72002
Copy link

What is the sequence - do you firstly plug keyboard which works and then plug mouse and everything stops to work? What if you do it way around? Plug mouse - does it work? then plug keyboard does it still work?

My guess is power supply - get a better power supply and your problem should be gone

@sanderjo
Copy link

sanderjo commented Sep 3, 2012

The Raspi already blocks with one device: keyboard or mouse.

I tried three different power supplies (one HTC, one Nokia, one Jabra), two different Raspi's, and so far no different response.

I removed the dwc_otg.speed=1 from cmdline.txt and the Raspi booted succesfully, with and without keyboard / mouse connected during boot.

I know a bad power supply can cause problems, but how can it cause the above problems?

pi@raspbian-armhf-SJ ~ $ lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 0424:9512 Standard Microsystems Corp. 
Bus 001 Device 003: ID 0424:ec00 Standard Microsystems Corp. 
Bus 001 Device 004: ID 046d:c00e Logitech, Inc. M-BJ58/M-BJ69 Optical Wheel Mouse
Bus 001 Device 005: ID 03f0:0024 Hewlett-Packard KU-0316 Keyboard
pi@raspbian-armhf-SJ ~ $

@popcornmix
Copy link
Collaborator

Can you measure the voltage on board:
http://elinux.org/R-Pi_Troubleshooting#Troubleshooting_power_problems

If you boot with no USB plugged in, does plugging in just the keyboard work? Does plugging in just the mouse work?

@sanderjo
Copy link

sanderjo commented Sep 5, 2012

I measured the voltage:

  • 4.85 V with USB keyboard and mouse connected
  • 4.88 V with no USB keyboard and mouse connected
    So only a small voltage drop of 0.03V. Seems OK to me.

With a plain /boot/cmdline.txt, the Raspi works well with the USB keyboard and/or mouse:

  • booting with both plugged in works,
  • first booting and then connecting them works too

The problem appears as soon as I add "dwc_otg.speed=1" to /boot/cmdline.txt and reboot:

  • booting without USB mouse and/or keyboard goes well, ... until I plug in USB mouse or keyboard: my SSH session dies, nothing in my SSH-ed /var/log/messages or syslog (but: I can't check what's going on onside the Raspi ... maybe there is some message?) ... no ping, no ssh to the Raspi. Nothing on the physically connected screen. When I remove the keyboard, ping/ssh etc do NOT come back. So it looks like the USB (and thus ethernet-via-USB) is really dead. The Raspi itself is still running: still output on my physically connected screen, and the four LEDs are on.
  • booting with USB mouse and/or keyboard connected, results in a loop of the mentioned error message. Disconnecting the USB mouse/keyboard after booting does NOT end that loop. No response from the Raspi via ping or ssh.

So I really think "dwc_otg.speed=1" is the cause of the problem. Not my power supply.

PS: something strange: with "dwc_otg.speed=1" activated but no keyboard/mouse connected, the power is 4.85 V. After connecting the USB keyboard, the power goes up to 4.94 V. So, hypothesis: by connecting the USB keyboard the whole USB system turns off, causing a lower load on the power supply, causing a lower voltage drop.

I wish I had a powered USB hub to see if that makes a difference.

I still repeat my statement that "dwc_otg.speed=1" is a cause of the problem.

@sanderjo
Copy link

sanderjo commented Sep 7, 2012

Some more tests:

I built a power supply with a 7805, with on the 7805's input side a big power supply, just to make sure the power supply to the Raspi is not the bottle neck.
All works well with with "dwc_otg.speed=1" with no USB connected. The current drawn by the Raspi is 0.37A - 0.40A (which happens to be the same current as charging my HTC smartphone). Everything is stable and works.

When I connect the USB keyboard, the current drops to 0.28A, and the Raspi immediately stops responding via SSH. So that proofs my earlier hypothesis "USB keyboard connected => Raspi's USB system turned off completely => lower current => lower voltage drop"
Needless to say the keyboard is not working.

So I rebooted without keyboard, removed "dwc_otg.speed=1" from /boot/cmdline.txt, rebooted, and - as expected - the keyboard is no problem: connecting it does not change the current (stays around 0.37A) so apparently a USB keyboard draws very little current, and dmesg shows:

[   57.612008] usb 1-1.2: new low-speed USB device number 4 using dwc_otg
[   57.721419] usb 1-1.2: New USB device found, idVendor=03f0, idProduct=0024
[   57.721460] usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[   57.721478] usb 1-1.2: Product: HP Basic USB Keyboard
[   57.721491] usb 1-1.2: Manufacturer: CHICONY
[   57.738476] input: CHICONY HP Basic USB Keyboard as /devices/platform/bcm2708_usb/usb1/1-1/1-1.2/1-1.2:1.0/input/input0
[   57.738624] generic-usb 0003:03F0:0024.0001: input: USB HID v1.10 Keyboard [CHICONY HP Basic USB Keyboard] on usb-bcm2708_usb-1.2/input0

So:

  • putting in a bigger power supply does not take away the problem with "dwc_otg.speed=1" active
  • the current does NOT change when connecting a USB keyboard with "dwc_otg.speed=1" disabled

Tips welcome ...

popcornmix pushed a commit that referenced this issue Jul 5, 2014
commit 86f4062 upstream.

When enable LPAE and big-endian in a hisilicon board, while specify
mem=384M mem=512M@7680M, will get bad page state:

Freeing unused kernel memory: 180K (c0466000 - c0493000)
BUG: Bad page state in process init  pfn:fa442
page:c7749840 count:0 mapcount:-1 mapping:  (null) index:0x0
page flags: 0x40000400(reserved)
Modules linked in:
CPU: 0 PID: 1 Comm: init Not tainted 3.10.27+ #66
[<c000f5f0>] (unwind_backtrace+0x0/0x11c) from [<c000cbc4>] (show_stack+0x10/0x14)
[<c000cbc4>] (show_stack+0x10/0x14) from [<c009e448>] (bad_page+0xd4/0x104)
[<c009e448>] (bad_page+0xd4/0x104) from [<c009e520>] (free_pages_prepare+0xa8/0x14c)
[<c009e520>] (free_pages_prepare+0xa8/0x14c) from [<c009f8ec>] (free_hot_cold_page+0x18/0xf0)
[<c009f8ec>] (free_hot_cold_page+0x18/0xf0) from [<c00b5444>] (handle_pte_fault+0xcf4/0xdc8)
[<c00b5444>] (handle_pte_fault+0xcf4/0xdc8) from [<c00b6458>] (handle_mm_fault+0xf4/0x120)
[<c00b6458>] (handle_mm_fault+0xf4/0x120) from [<c0013754>] (do_page_fault+0xfc/0x354)
[<c0013754>] (do_page_fault+0xfc/0x354) from [<c0008400>] (do_DataAbort+0x2c/0x90)
[<c0008400>] (do_DataAbort+0x2c/0x90) from [<c0008fb4>] (__dabt_usr+0x34/0x40)

The bad pfn:fa442 is not system memory(mem=384M mem=512M@7680M), after debugging,
I find in page fault handler, will get wrong pfn from pte just after set pte,
as follow:
do_anonymous_page()
{
	...
	set_pte_at(mm, address, page_table, entry);

	//debug code
	pfn = pte_pfn(entry);
	pr_info("pfn:0x%lx, pte:0x%llxn", pfn, pte_val(entry));

	//read out the pte just set
	new_pte = pte_offset_map(pmd, address);
	new_pfn = pte_pfn(*new_pte);
	pr_info("new pfn:0x%lx, new pte:0x%llxn", pfn, pte_val(entry));
	...
}

pfn:   0x1fa4f5,     pte:0xc00001fa4f575f
new_pfn:0xfa4f5, new_pte:0xc00000fa4f5f5f	//new pfn/pte is wrong.

The bug is happened in cpu_v7_set_pte_ext(ptep, pte):
An LPAE PTE is a 64bit quantity, passed to cpu_v7_set_pte_ext in the r2 and r3 registers.
On an LE kernel, r2 contains the LSB of the PTE, and r3 the MSB.
On a BE kernel, the assignment is reversed.

Unfortunately, the current code always assumes the LE case,
leading to corruption of the PTE when clearing/setting bits.

This patch fixes this issue much like it has been done already in the
cpu_v7_switch_mm case.

Signed-off-by: Jianguo Wu <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Russell King <[email protected]>
Signed-off-by: Jiri Slaby <[email protected]>
popcornmix pushed a commit that referenced this issue Aug 4, 2014
When enable LPAE and big-endian in a hisilicon board, while specify
mem=384M mem=512M@7680M, will get bad page state:

Freeing unused kernel memory: 180K (c0466000 - c0493000)
BUG: Bad page state in process init  pfn:fa442
page:c7749840 count:0 mapcount:-1 mapping:  (null) index:0x0
page flags: 0x40000400(reserved)
Modules linked in:
CPU: 0 PID: 1 Comm: init Not tainted 3.10.27+ #66
[<c000f5f0>] (unwind_backtrace+0x0/0x11c) from [<c000cbc4>] (show_stack+0x10/0x14)
[<c000cbc4>] (show_stack+0x10/0x14) from [<c009e448>] (bad_page+0xd4/0x104)
[<c009e448>] (bad_page+0xd4/0x104) from [<c009e520>] (free_pages_prepare+0xa8/0x14c)
[<c009e520>] (free_pages_prepare+0xa8/0x14c) from [<c009f8ec>] (free_hot_cold_page+0x18/0xf0)
[<c009f8ec>] (free_hot_cold_page+0x18/0xf0) from [<c00b5444>] (handle_pte_fault+0xcf4/0xdc8)
[<c00b5444>] (handle_pte_fault+0xcf4/0xdc8) from [<c00b6458>] (handle_mm_fault+0xf4/0x120)
[<c00b6458>] (handle_mm_fault+0xf4/0x120) from [<c0013754>] (do_page_fault+0xfc/0x354)
[<c0013754>] (do_page_fault+0xfc/0x354) from [<c0008400>] (do_DataAbort+0x2c/0x90)
[<c0008400>] (do_DataAbort+0x2c/0x90) from [<c0008fb4>] (__dabt_usr+0x34/0x40)

The bad pfn:fa442 is not system memory(mem=384M mem=512M@7680M), after debugging,
I find in page fault handler, will get wrong pfn from pte just after set pte,
as follow:
do_anonymous_page()
{
	...
	set_pte_at(mm, address, page_table, entry);

	//debug code
	pfn = pte_pfn(entry);
	pr_info("pfn:0x%lx, pte:0x%llxn", pfn, pte_val(entry));

	//read out the pte just set
	new_pte = pte_offset_map(pmd, address);
	new_pfn = pte_pfn(*new_pte);
	pr_info("new pfn:0x%lx, new pte:0x%llxn", pfn, pte_val(entry));
	...
}

pfn:   0x1fa4f5,     pte:0xc00001fa4f575f
new_pfn:0xfa4f5, new_pte:0xc00000fa4f5f5f	//new pfn/pte is wrong.

The bug is happened in cpu_v7_set_pte_ext(ptep, pte):
An LPAE PTE is a 64bit quantity, passed to cpu_v7_set_pte_ext in the r2 and r3 registers.
On an LE kernel, r2 contains the LSB of the PTE, and r3 the MSB.
On a BE kernel, the assignment is reversed.

Unfortunately, the current code always assumes the LE case,
leading to corruption of the PTE when clearing/setting bits.

This patch fixes this issue much like it has been done already in the
cpu_v7_switch_mm case.

CC stable <[email protected]>

Signed-off-by: Jianguo Wu <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Russell King <[email protected]>
popcornmix pushed a commit that referenced this issue Nov 7, 2014
commit 86f4062 upstream.

When enable LPAE and big-endian in a hisilicon board, while specify
mem=384M mem=512M@7680M, will get bad page state:

Freeing unused kernel memory: 180K (c0466000 - c0493000)
BUG: Bad page state in process init  pfn:fa442
page:c7749840 count:0 mapcount:-1 mapping:  (null) index:0x0
page flags: 0x40000400(reserved)
Modules linked in:
CPU: 0 PID: 1 Comm: init Not tainted 3.10.27+ #66
[<c000f5f0>] (unwind_backtrace+0x0/0x11c) from [<c000cbc4>] (show_stack+0x10/0x14)
[<c000cbc4>] (show_stack+0x10/0x14) from [<c009e448>] (bad_page+0xd4/0x104)
[<c009e448>] (bad_page+0xd4/0x104) from [<c009e520>] (free_pages_prepare+0xa8/0x14c)
[<c009e520>] (free_pages_prepare+0xa8/0x14c) from [<c009f8ec>] (free_hot_cold_page+0x18/0xf0)
[<c009f8ec>] (free_hot_cold_page+0x18/0xf0) from [<c00b5444>] (handle_pte_fault+0xcf4/0xdc8)
[<c00b5444>] (handle_pte_fault+0xcf4/0xdc8) from [<c00b6458>] (handle_mm_fault+0xf4/0x120)
[<c00b6458>] (handle_mm_fault+0xf4/0x120) from [<c0013754>] (do_page_fault+0xfc/0x354)
[<c0013754>] (do_page_fault+0xfc/0x354) from [<c0008400>] (do_DataAbort+0x2c/0x90)
[<c0008400>] (do_DataAbort+0x2c/0x90) from [<c0008fb4>] (__dabt_usr+0x34/0x40)

The bad pfn:fa442 is not system memory(mem=384M mem=512M@7680M), after debugging,
I find in page fault handler, will get wrong pfn from pte just after set pte,
as follow:
do_anonymous_page()
{
	...
	set_pte_at(mm, address, page_table, entry);

	//debug code
	pfn = pte_pfn(entry);
	pr_info("pfn:0x%lx, pte:0x%llxn", pfn, pte_val(entry));

	//read out the pte just set
	new_pte = pte_offset_map(pmd, address);
	new_pfn = pte_pfn(*new_pte);
	pr_info("new pfn:0x%lx, new pte:0x%llxn", pfn, pte_val(entry));
	...
}

pfn:   0x1fa4f5,     pte:0xc00001fa4f575f
new_pfn:0xfa4f5, new_pte:0xc00000fa4f5f5f	//new pfn/pte is wrong.

The bug is happened in cpu_v7_set_pte_ext(ptep, pte):
An LPAE PTE is a 64bit quantity, passed to cpu_v7_set_pte_ext in the r2 and r3 registers.
On an LE kernel, r2 contains the LSB of the PTE, and r3 the MSB.
On a BE kernel, the assignment is reversed.

Unfortunately, the current code always assumes the LE case,
leading to corruption of the PTE when clearing/setting bits.

This patch fixes this issue much like it has been done already in the
cpu_v7_switch_mm case.

Signed-off-by: Jianguo Wu <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Russell King <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
@P33M
Copy link
Contributor

P33M commented Jul 14, 2015

There have been no reported issues with bluetooth since fiq_fsm (USB rewrite) went into general usage.

@P33M P33M closed this as completed Jul 14, 2015
anholt referenced this issue in anholt/linux Jan 28, 2016
The code produces the following trace:

[1750924.419007] general protection fault: 0000 [#3] SMP
[1750924.420364] Modules linked in: nfnetlink autofs4 rpcsec_gss_krb5 nfsv4
dcdbas rfcomm bnep bluetooth nfsd auth_rpcgss nfs_acl dm_multipath nfs lockd
scsi_dh sunrpc fscache radeon ttm drm_kms_helper drm serio_raw parport_pc
ppdev i2c_algo_bit lpc_ich ipmi_si ib_mthca ib_qib dca lp parport ib_ipoib
mac_hid ib_cm i3000_edac ib_sa ib_uverbs edac_core ib_umad ib_mad ib_core
ib_addr tg3 ptp dm_mirror dm_region_hash dm_log psmouse pps_core
[1750924.420364] CPU: 1 PID: 8401 Comm: python Tainted: G D
3.13.0-39-generic #66-Ubuntu
[1750924.420364] Hardware name: Dell Computer Corporation PowerEdge
860/0XM089, BIOS A04 07/24/2007
[1750924.420364] task: ffff8800366a9800 ti: ffff88007af1c000 task.ti:
ffff88007af1c000
[1750924.420364] RIP: 0010:[<ffffffffa0131d51>] [<ffffffffa0131d51>]
qib_mcast_qp_free+0x11/0x50 [ib_qib]
[1750924.420364] RSP: 0018:ffff88007af1dd70  EFLAGS: 00010246
[1750924.420364] RAX: 0000000000000001 RBX: ffff88007b822688 RCX:
000000000000000f
[1750924.420364] RDX: ffff88007b822688 RSI: ffff8800366c15a0 RDI:
6764697200000000
[1750924.420364] RBP: ffff88007af1dd78 R08: 0000000000000001 R09:
0000000000000000
[1750924.420364] R10: 0000000000000011 R11: 0000000000000246 R12:
ffff88007baa1d98
[1750924.420364] R13: ffff88003ecab000 R14: ffff88007b822660 R15:
0000000000000000
[1750924.420364] FS:  00007ffff7fd8740(0000) GS:ffff88007fc80000(0000)
knlGS:0000000000000000
[1750924.420364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1750924.420364] CR2: 00007ffff597c750 CR3: 000000006860b000 CR4:
00000000000007e0
[1750924.420364] Stack:
[1750924.420364]  ffff88007b822688 ffff88007af1ddf0 ffffffffa0132429
000000007af1de20
[1750924.420364]  ffff88007baa1dc8 ffff88007baa0000 ffff88007af1de70
ffffffffa00cb313
[1750924.420364]  00007fffffffde88 0000000000000000 0000000000000008
ffff88003ecab000
[1750924.420364] Call Trace:
[1750924.420364]  [<ffffffffa0132429>] qib_multicast_detach+0x1e9/0x350
[ib_qib]
[1750924.568035]  [<ffffffffa00cb313>] ? ib_uverbs_modify_qp+0x323/0x3d0
[ib_uverbs]
[1750924.568035]  [<ffffffffa0092d61>] ib_detach_mcast+0x31/0x50 [ib_core]
[1750924.568035]  [<ffffffffa00cc213>] ib_uverbs_detach_mcast+0x93/0x170
[ib_uverbs]
[1750924.568035]  [<ffffffffa00c61f6>] ib_uverbs_write+0xc6/0x2c0 [ib_uverbs]
[1750924.568035]  [<ffffffff81312e68>] ? apparmor_file_permission+0x18/0x20
[1750924.568035]  [<ffffffff812d4cd3>] ? security_file_permission+0x23/0xa0
[1750924.568035]  [<ffffffff811bd214>] vfs_write+0xb4/0x1f0
[1750924.568035]  [<ffffffff811bdc49>] SyS_write+0x49/0xa0
[1750924.568035]  [<ffffffff8172f7ed>] system_call_fastpath+0x1a/0x1f
[1750924.568035] Code: 66 2e 0f 1f 84 00 00 00 00 00 31 c0 5d c3 66 2e 0f 1f
84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 8b 7f 10
<f0> ff 8f 40 01 00 00 74 0e 48 89 df e8 8e f8 06 e1 5b 5d c3 0f
[1750924.568035] RIP  [<ffffffffa0131d51>] qib_mcast_qp_free+0x11/0x50
[ib_qib]
[1750924.568035]  RSP <ffff88007af1dd70>
[1750924.650439] ---[ end trace 73d5d4b3f8ad4851 ]

The fix is to note the qib_mcast_qp that was found.   If none is found, then
return EINVAL indicating the error.

Cc: <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Reported-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
popcornmix pushed a commit that referenced this issue Feb 17, 2016
[ Upstream commit 09dc9cd ]

The code produces the following trace:

[1750924.419007] general protection fault: 0000 [#3] SMP
[1750924.420364] Modules linked in: nfnetlink autofs4 rpcsec_gss_krb5 nfsv4
dcdbas rfcomm bnep bluetooth nfsd auth_rpcgss nfs_acl dm_multipath nfs lockd
scsi_dh sunrpc fscache radeon ttm drm_kms_helper drm serio_raw parport_pc
ppdev i2c_algo_bit lpc_ich ipmi_si ib_mthca ib_qib dca lp parport ib_ipoib
mac_hid ib_cm i3000_edac ib_sa ib_uverbs edac_core ib_umad ib_mad ib_core
ib_addr tg3 ptp dm_mirror dm_region_hash dm_log psmouse pps_core
[1750924.420364] CPU: 1 PID: 8401 Comm: python Tainted: G D
3.13.0-39-generic #66-Ubuntu
[1750924.420364] Hardware name: Dell Computer Corporation PowerEdge
860/0XM089, BIOS A04 07/24/2007
[1750924.420364] task: ffff8800366a9800 ti: ffff88007af1c000 task.ti:
ffff88007af1c000
[1750924.420364] RIP: 0010:[<ffffffffa0131d51>] [<ffffffffa0131d51>]
qib_mcast_qp_free+0x11/0x50 [ib_qib]
[1750924.420364] RSP: 0018:ffff88007af1dd70  EFLAGS: 00010246
[1750924.420364] RAX: 0000000000000001 RBX: ffff88007b822688 RCX:
000000000000000f
[1750924.420364] RDX: ffff88007b822688 RSI: ffff8800366c15a0 RDI:
6764697200000000
[1750924.420364] RBP: ffff88007af1dd78 R08: 0000000000000001 R09:
0000000000000000
[1750924.420364] R10: 0000000000000011 R11: 0000000000000246 R12:
ffff88007baa1d98
[1750924.420364] R13: ffff88003ecab000 R14: ffff88007b822660 R15:
0000000000000000
[1750924.420364] FS:  00007ffff7fd8740(0000) GS:ffff88007fc80000(0000)
knlGS:0000000000000000
[1750924.420364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1750924.420364] CR2: 00007ffff597c750 CR3: 000000006860b000 CR4:
00000000000007e0
[1750924.420364] Stack:
[1750924.420364]  ffff88007b822688 ffff88007af1ddf0 ffffffffa0132429
000000007af1de20
[1750924.420364]  ffff88007baa1dc8 ffff88007baa0000 ffff88007af1de70
ffffffffa00cb313
[1750924.420364]  00007fffffffde88 0000000000000000 0000000000000008
ffff88003ecab000
[1750924.420364] Call Trace:
[1750924.420364]  [<ffffffffa0132429>] qib_multicast_detach+0x1e9/0x350
[ib_qib]
[1750924.568035]  [<ffffffffa00cb313>] ? ib_uverbs_modify_qp+0x323/0x3d0
[ib_uverbs]
[1750924.568035]  [<ffffffffa0092d61>] ib_detach_mcast+0x31/0x50 [ib_core]
[1750924.568035]  [<ffffffffa00cc213>] ib_uverbs_detach_mcast+0x93/0x170
[ib_uverbs]
[1750924.568035]  [<ffffffffa00c61f6>] ib_uverbs_write+0xc6/0x2c0 [ib_uverbs]
[1750924.568035]  [<ffffffff81312e68>] ? apparmor_file_permission+0x18/0x20
[1750924.568035]  [<ffffffff812d4cd3>] ? security_file_permission+0x23/0xa0
[1750924.568035]  [<ffffffff811bd214>] vfs_write+0xb4/0x1f0
[1750924.568035]  [<ffffffff811bdc49>] SyS_write+0x49/0xa0
[1750924.568035]  [<ffffffff8172f7ed>] system_call_fastpath+0x1a/0x1f
[1750924.568035] Code: 66 2e 0f 1f 84 00 00 00 00 00 31 c0 5d c3 66 2e 0f 1f
84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 8b 7f 10
<f0> ff 8f 40 01 00 00 74 0e 48 89 df e8 8e f8 06 e1 5b 5d c3 0f
[1750924.568035] RIP  [<ffffffffa0131d51>] qib_mcast_qp_free+0x11/0x50
[ib_qib]
[1750924.568035]  RSP <ffff88007af1dd70>
[1750924.650439] ---[ end trace 73d5d4b3f8ad4851 ]

The fix is to note the qib_mcast_qp that was found.   If none is found, then
return EINVAL indicating the error.

Cc: <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Reported-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Mar 4, 2016
commit 09dc9cd upstream.

The code produces the following trace:

[1750924.419007] general protection fault: 0000 [#3] SMP
[1750924.420364] Modules linked in: nfnetlink autofs4 rpcsec_gss_krb5 nfsv4
dcdbas rfcomm bnep bluetooth nfsd auth_rpcgss nfs_acl dm_multipath nfs lockd
scsi_dh sunrpc fscache radeon ttm drm_kms_helper drm serio_raw parport_pc
ppdev i2c_algo_bit lpc_ich ipmi_si ib_mthca ib_qib dca lp parport ib_ipoib
mac_hid ib_cm i3000_edac ib_sa ib_uverbs edac_core ib_umad ib_mad ib_core
ib_addr tg3 ptp dm_mirror dm_region_hash dm_log psmouse pps_core
[1750924.420364] CPU: 1 PID: 8401 Comm: python Tainted: G D
3.13.0-39-generic #66-Ubuntu
[1750924.420364] Hardware name: Dell Computer Corporation PowerEdge
860/0XM089, BIOS A04 07/24/2007
[1750924.420364] task: ffff8800366a9800 ti: ffff88007af1c000 task.ti:
ffff88007af1c000
[1750924.420364] RIP: 0010:[<ffffffffa0131d51>] [<ffffffffa0131d51>]
qib_mcast_qp_free+0x11/0x50 [ib_qib]
[1750924.420364] RSP: 0018:ffff88007af1dd70  EFLAGS: 00010246
[1750924.420364] RAX: 0000000000000001 RBX: ffff88007b822688 RCX:
000000000000000f
[1750924.420364] RDX: ffff88007b822688 RSI: ffff8800366c15a0 RDI:
6764697200000000
[1750924.420364] RBP: ffff88007af1dd78 R08: 0000000000000001 R09:
0000000000000000
[1750924.420364] R10: 0000000000000011 R11: 0000000000000246 R12:
ffff88007baa1d98
[1750924.420364] R13: ffff88003ecab000 R14: ffff88007b822660 R15:
0000000000000000
[1750924.420364] FS:  00007ffff7fd8740(0000) GS:ffff88007fc80000(0000)
knlGS:0000000000000000
[1750924.420364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1750924.420364] CR2: 00007ffff597c750 CR3: 000000006860b000 CR4:
00000000000007e0
[1750924.420364] Stack:
[1750924.420364]  ffff88007b822688 ffff88007af1ddf0 ffffffffa0132429
000000007af1de20
[1750924.420364]  ffff88007baa1dc8 ffff88007baa0000 ffff88007af1de70
ffffffffa00cb313
[1750924.420364]  00007fffffffde88 0000000000000000 0000000000000008
ffff88003ecab000
[1750924.420364] Call Trace:
[1750924.420364]  [<ffffffffa0132429>] qib_multicast_detach+0x1e9/0x350
[ib_qib]
[1750924.568035]  [<ffffffffa00cb313>] ? ib_uverbs_modify_qp+0x323/0x3d0
[ib_uverbs]
[1750924.568035]  [<ffffffffa0092d61>] ib_detach_mcast+0x31/0x50 [ib_core]
[1750924.568035]  [<ffffffffa00cc213>] ib_uverbs_detach_mcast+0x93/0x170
[ib_uverbs]
[1750924.568035]  [<ffffffffa00c61f6>] ib_uverbs_write+0xc6/0x2c0 [ib_uverbs]
[1750924.568035]  [<ffffffff81312e68>] ? apparmor_file_permission+0x18/0x20
[1750924.568035]  [<ffffffff812d4cd3>] ? security_file_permission+0x23/0xa0
[1750924.568035]  [<ffffffff811bd214>] vfs_write+0xb4/0x1f0
[1750924.568035]  [<ffffffff811bdc49>] SyS_write+0x49/0xa0
[1750924.568035]  [<ffffffff8172f7ed>] system_call_fastpath+0x1a/0x1f
[1750924.568035] Code: 66 2e 0f 1f 84 00 00 00 00 00 31 c0 5d c3 66 2e 0f 1f
84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 8b 7f 10
<f0> ff 8f 40 01 00 00 74 0e 48 89 df e8 8e f8 06 e1 5b 5d c3 0f
[1750924.568035] RIP  [<ffffffffa0131d51>] qib_mcast_qp_free+0x11/0x50
[ib_qib]
[1750924.568035]  RSP <ffff88007af1dd70>
[1750924.650439] ---[ end trace 73d5d4b3f8ad4851 ]

The fix is to note the qib_mcast_qp that was found.   If none is found, then
return EINVAL indicating the error.

Reviewed-by: Dennis Dalessandro <[email protected]>
Reported-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
ED6E0F17 pushed a commit to ED6E0F17/linux that referenced this issue Dec 23, 2019
…_cache destruction

commit a264df7 upstream.

Christian reported a warning like the following obtained during running
some KVM-related tests on s390:

    WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
    Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
    CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ raspberrypi#66
    Hardware name: IBM 2964 NC9 712 (LPAR)
    Workqueue: events sysfs_slab_remove_workfn
    Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
               R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
               0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
               0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
               0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
    Krnl Code: 0000001529746842: f0a0000407fe        srp        4(11,%r0),2046,0
               0000001529746848: 47000700            bc         0,1792
              #000000152974684c: a7f40001            brc        15,152974684e
              >0000001529746850: a7f4fff2            brc        15,1529746834
               0000001529746854: 0707                bcr        0,%r7
               0000001529746856: 0707                bcr        0,%r7
               0000001529746858: eb8ff0580024        stmg       %r8,%r15,88(%r15)
               000000152974685e: a738ffff            lhi        %r3,-1
    Call Trace:
    ([<000003e009263d00>] 0x3e009263d00)
     [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
     [<0000001529b04882>] kobject_put+0xaa/0xe8
     [<000000152918cf28>] process_one_work+0x1e8/0x428
     [<000000152918d1b0>] worker_thread+0x48/0x460
     [<00000015291942c6>] kthread+0x126/0x160
     [<0000001529b22344>] ret_from_fork+0x28/0x30
     [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
    Last Breaking-Event-Address:
     [<000000152974684c>] percpu_ref_exit+0x4c/0x58
    ---[ end trace b035e7da5788eb09 ]---

The problem occurs because kmem_cache_destroy() is called immediately
after deleting of a memcg, so it races with the memcg kmem_cache
deactivation.

flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
supposed to guarantee that all deactivation processes are finished, but
failed to do so.  It waits for an rcu grace period, after which all
children kmem_caches should be deactivated.  During the deactivation
percpu_ref_kill() is called for non root kmem_cache refcounters, but it
requires yet another rcu grace period to finish the transition to the
atomic (dead) state.

So in a rare case when not all children kmem_caches are destroyed at the
moment when the root kmem_cache is about to be gone, we need to wait
another rcu grace period before destroying the root kmem_cache.

This issue can be triggered only with dynamically created kmem_caches
which are used with memcg accounting.  In this case per-memcg child
kmem_caches are created.  They are deactivated from the cgroup removing
path.  If the destruction of the root kmem_cache is racing with the
removal of the cgroup (both are quite complicated multi-stage
processes), the described issue can occur.  The only known way to
trigger it in the real life, is to unload some kernel module which
creates a dedicated kmem_cache, used from different memory cgroups with
GFP_ACCOUNT flag.  If the unloading happens immediately after calling
rmdir on the corresponding cgroup, there is some chance to trigger the
issue.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: f0a3a24 ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
Signed-off-by: Roman Gushchin <[email protected]>
Reported-by: Christian Borntraeger <[email protected]>
Tested-by: Christian Borntraeger <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
popcornmix pushed a commit that referenced this issue Jan 6, 2020
…_cache destruction

commit a264df7 upstream.

Christian reported a warning like the following obtained during running
some KVM-related tests on s390:

    WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
    Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
    CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
    Hardware name: IBM 2964 NC9 712 (LPAR)
    Workqueue: events sysfs_slab_remove_workfn
    Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
               R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
               0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
               0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
               0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
    Krnl Code: 0000001529746842: f0a0000407fe        srp        4(11,%r0),2046,0
               0000001529746848: 47000700            bc         0,1792
              #000000152974684c: a7f40001            brc        15,152974684e
              >0000001529746850: a7f4fff2            brc        15,1529746834
               0000001529746854: 0707                bcr        0,%r7
               0000001529746856: 0707                bcr        0,%r7
               0000001529746858: eb8ff0580024        stmg       %r8,%r15,88(%r15)
               000000152974685e: a738ffff            lhi        %r3,-1
    Call Trace:
    ([<000003e009263d00>] 0x3e009263d00)
     [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
     [<0000001529b04882>] kobject_put+0xaa/0xe8
     [<000000152918cf28>] process_one_work+0x1e8/0x428
     [<000000152918d1b0>] worker_thread+0x48/0x460
     [<00000015291942c6>] kthread+0x126/0x160
     [<0000001529b22344>] ret_from_fork+0x28/0x30
     [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
    Last Breaking-Event-Address:
     [<000000152974684c>] percpu_ref_exit+0x4c/0x58
    ---[ end trace b035e7da5788eb09 ]---

The problem occurs because kmem_cache_destroy() is called immediately
after deleting of a memcg, so it races with the memcg kmem_cache
deactivation.

flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
supposed to guarantee that all deactivation processes are finished, but
failed to do so.  It waits for an rcu grace period, after which all
children kmem_caches should be deactivated.  During the deactivation
percpu_ref_kill() is called for non root kmem_cache refcounters, but it
requires yet another rcu grace period to finish the transition to the
atomic (dead) state.

So in a rare case when not all children kmem_caches are destroyed at the
moment when the root kmem_cache is about to be gone, we need to wait
another rcu grace period before destroying the root kmem_cache.

This issue can be triggered only with dynamically created kmem_caches
which are used with memcg accounting.  In this case per-memcg child
kmem_caches are created.  They are deactivated from the cgroup removing
path.  If the destruction of the root kmem_cache is racing with the
removal of the cgroup (both are quite complicated multi-stage
processes), the described issue can occur.  The only known way to
trigger it in the real life, is to unload some kernel module which
creates a dedicated kmem_cache, used from different memory cgroups with
GFP_ACCOUNT flag.  If the unloading happens immediately after calling
rmdir on the corresponding cgroup, there is some chance to trigger the
issue.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: f0a3a24 ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
Signed-off-by: Roman Gushchin <[email protected]>
Reported-by: Christian Borntraeger <[email protected]>
Tested-by: Christian Borntraeger <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
popcornmix pushed a commit that referenced this issue Jun 17, 2020
After previous fix for zero extension test_verifier tests #65 and #66 now
fail. Before the fix we can see the alu32 mov op at insn 10

10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
    R1_w=invP(id=0,
              smin_value=4294967168,smax_value=4294967423,
              umin_value=4294967168,umax_value=4294967423,
              var_off=(0x0; 0x1ffffffff),
              s32_min_value=-2147483648,s32_max_value=2147483647,
              u32_min_value=0,u32_max_value=-1)
    R10=fp0 fp-8_w=mmmmmmmm
10: (bc) w1 = w1
11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
    R1_w=invP(id=0,
              smin_value=0,smax_value=2147483647,
              umin_value=0,umax_value=4294967295,
              var_off=(0x0; 0xffffffff),
              s32_min_value=-2147483648,s32_max_value=2147483647,
              u32_min_value=0,u32_max_value=-1)
    R10=fp0 fp-8_w=mmmmmmmm

After the fix at insn 10 because we have 's32_min_value < 0' the following
step 11 now has 'smax_value=U32_MAX' where before we pulled the s32_max_value
bound into the smax_value as seen above in 11 with smax_value=2147483647.

10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
    R1_w=inv(id=0,
             smin_value=4294967168,smax_value=4294967423,
             umin_value=4294967168,umax_value=4294967423,
             var_off=(0x0; 0x1ffffffff),
             s32_min_value=-2147483648, s32_max_value=2147483647,
             u32_min_value=0,u32_max_value=-1)
    R10=fp0 fp-8_w=mmmmmmmm
10: (bc) w1 = w1
11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
    R1_w=inv(id=0,
             smin_value=0,smax_value=4294967295,
             umin_value=0,umax_value=4294967295,
             var_off=(0x0; 0xffffffff),
             s32_min_value=-2147483648, s32_max_value=2147483647,
             u32_min_value=0, u32_max_value=-1)
    R10=fp0 fp-8_w=mmmmmmmm

The fall out of this is by the time we get to the failing instruction at
step 14 where previously we had the following:

14: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
    R1_w=inv(id=0,
             smin_value=72057594021150720,smax_value=72057594029539328,
             umin_value=72057594021150720,umax_value=72057594029539328,
             var_off=(0xffffffff000000; 0xffffff),
             s32_min_value=-16777216,s32_max_value=-1,
             u32_min_value=-16777216,u32_max_value=-1)
    R10=fp0 fp-8_w=mmmmmmmm
14: (0f) r0 += r1

We now have,

14: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
    R1_w=inv(id=0,
             smin_value=0,smax_value=72057594037927935,
             umin_value=0,umax_value=72057594037927935,
             var_off=(0x0; 0xffffffffffffff),
             s32_min_value=-2147483648,s32_max_value=2147483647,
             u32_min_value=0,u32_max_value=-1)
    R10=fp0 fp-8_w=mmmmmmmm
14: (0f) r0 += r1

In the original step 14 'smin_value=72057594021150720' this trips the logic
in the verifier function check_reg_sane_offset(),

 if (smin >= BPF_MAX_VAR_OFF || smin <= -BPF_MAX_VAR_OFF) {
	verbose(env, "value %lld makes %s pointer be out of bounds\n",
		smin, reg_type_str[type]);
	return false;
 }

Specifically, the 'smin <= -BPF_MAX_VAR_OFF' check. But with the fix
at step 14 we have bounds 'smin_value=0' so the above check is not tripped
because BPF_MAX_VAR_OFF=1<<29.

We have a smin_value=0 here because at step 10 the smaller smin_value=0 means
the subtractions at steps 11 and 12 bring the smin_value negative.

11: (17) r1 -= 2147483584
12: (17) r1 -= 2147483584
13: (77) r1 >>= 8

Then the shift clears the top bit and smin_value is set to 0. Note we still
have the smax_value in the fixed code so any reads will fail. An alternative
would be to have reg_sane_check() do both smin and smax value tests.

To fix the test we can omit the 'r1 >>=8' at line 13. This will change the
err string, but keeps the intention of the test as suggseted by the title,
"check after truncation of boundary-crossing range". If the verifier logic
changes a different value is likely to be thrown in the error or the error
will no longer be thrown forcing this test to be examined. With this change
we see the new state at step 13.

13: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
    R1_w=invP(id=0,
              smin_value=-4294967168,smax_value=127,
              umin_value=0,umax_value=18446744073709551615,
              s32_min_value=-2147483648,s32_max_value=2147483647,
              u32_min_value=0,u32_max_value=-1)
    R10=fp0 fp-8_w=mmmmmmmm

Giving the expected out of bounds error, "value -4294967168 makes map_value
pointer be out of bounds" However, for unpriv case we see a different error
now because of the mixed signed bounds pointer arithmatic. This seems OK so
I've only added the unpriv_errstr for this. Another optino may have been to
do addition on r1 instead of subtraction but I favor the approach above
slightly.

Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/159077333942.6014.14004320043595756079.stgit@john-Precision-5820-Tower
popcornmix pushed a commit that referenced this issue Sep 28, 2020
Commit 0cb2f13 ("kprobes: Fix NULL pointer dereference at
kprobe_ftrace_handler") fixed one bug but not completely fixed yet.
If we run a kprobe_module.tc of ftracetest, kernel showed a warning
as below.

# ./ftracetest test.d/kprobe/kprobe_module.tc
=== Ftrace unit tests ===
[1] Kprobe dynamic event - probing module
...
[   22.400215] ------------[ cut here ]------------
[   22.400962] Failed to disarm kprobe-ftrace at trace_printk_irq_work+0x0/0x7e [trace_printk] (-2)
[   22.402139] WARNING: CPU: 7 PID: 200 at kernel/kprobes.c:1091 __disarm_kprobe_ftrace.isra.0+0x7e/0xa0
[   22.403358] Modules linked in: trace_printk(-)
[   22.404028] CPU: 7 PID: 200 Comm: rmmod Not tainted 5.9.0-rc2+ #66
[   22.404870] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[   22.406139] RIP: 0010:__disarm_kprobe_ftrace.isra.0+0x7e/0xa0
[   22.406947] Code: 30 8b 03 eb c9 80 3d e5 09 1f 01 00 75 dc 49 8b 34 24 89 c2 48 c7 c7 a0 c2 05 82 89 45 e4 c6 05 cc 09 1f 01 01 e8 a9 c7 f0 ff <0f> 0b 8b 45 e4 eb b9 89 c6 48 c7 c7 70 c2 05 82 89 45 e4 e8 91 c7
[   22.409544] RSP: 0018:ffffc90000237df0 EFLAGS: 00010286
[   22.410385] RAX: 0000000000000000 RBX: ffffffff83066024 RCX: 0000000000000000
[   22.411434] RDX: 0000000000000001 RSI: ffffffff810de8d3 RDI: ffffffff810de8d3
[   22.412687] RBP: ffffc90000237e10 R08: 0000000000000001 R09: 0000000000000001
[   22.413762] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807c478640
[   22.414852] R13: ffffffff8235ebc0 R14: ffffffffa00060c0 R15: 0000000000000000
[   22.415941] FS:  00000000019d48c0(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000
[   22.417264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.418176] CR2: 00000000005bb7e3 CR3: 0000000078f7a000 CR4: 00000000000006a0
[   22.419309] Call Trace:
[   22.419990]  kill_kprobe+0x94/0x160
[   22.420652]  kprobes_module_callback+0x64/0x230
[   22.421470]  notifier_call_chain+0x4f/0x70
[   22.422184]  blocking_notifier_call_chain+0x49/0x70
[   22.422979]  __x64_sys_delete_module+0x1ac/0x240
[   22.423733]  do_syscall_64+0x38/0x50
[   22.424366]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   22.425176] RIP: 0033:0x4bb81d
[   22.425741] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e0 ff ff ff f7 d8 64 89 01 48
[   22.428726] RSP: 002b:00007ffc70fef008 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
[   22.430169] RAX: ffffffffffffffda RBX: 00000000019d48a0 RCX: 00000000004bb81d
[   22.431375] RDX: 0000000000000000 RSI: 0000000000000880 RDI: 00007ffc70fef028
[   22.432543] RBP: 0000000000000880 R08: 00000000ffffffff R09: 00007ffc70fef320
[   22.433692] R10: 0000000000656300 R11: 0000000000000246 R12: 00007ffc70fef028
[   22.434635] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
[   22.435682] irq event stamp: 1169
[   22.436240] hardirqs last  enabled at (1179): [<ffffffff810df542>] console_unlock+0x422/0x580
[   22.437466] hardirqs last disabled at (1188): [<ffffffff810df19b>] console_unlock+0x7b/0x580
[   22.438608] softirqs last  enabled at (866): [<ffffffff81c0038e>] __do_softirq+0x38e/0x490
[   22.439637] softirqs last disabled at (859): [<ffffffff81a00f42>] asm_call_on_stack+0x12/0x20
[   22.440690] ---[ end trace 1e7ce7e1e4567276 ]---
[   22.472832] trace_kprobe: This probe might be able to register after target module is loaded. Continue.

This is because the kill_kprobe() calls disarm_kprobe_ftrace() even
if the given probe is not enabled. In that case, ftrace_set_filter_ip()
fails because the given probe point is not registered to ftrace.

Fix to check the given (going) probe is enabled before invoking
disarm_kprobe_ftrace().

Link: https://lkml.kernel.org/r/159888672694.1411785.5987998076694782591.stgit@devnote2

Fixes: 0cb2f13 ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")
Cc: Ingo Molnar <[email protected]>
Cc: "Naveen N . Rao" <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: David Miller <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: [email protected]
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
popcornmix pushed a commit that referenced this issue Oct 2, 2020
commit 3031313 upstream.

Commit 0cb2f13 ("kprobes: Fix NULL pointer dereference at
kprobe_ftrace_handler") fixed one bug but not completely fixed yet.
If we run a kprobe_module.tc of ftracetest, kernel showed a warning
as below.

# ./ftracetest test.d/kprobe/kprobe_module.tc
=== Ftrace unit tests ===
[1] Kprobe dynamic event - probing module
...
[   22.400215] ------------[ cut here ]------------
[   22.400962] Failed to disarm kprobe-ftrace at trace_printk_irq_work+0x0/0x7e [trace_printk] (-2)
[   22.402139] WARNING: CPU: 7 PID: 200 at kernel/kprobes.c:1091 __disarm_kprobe_ftrace.isra.0+0x7e/0xa0
[   22.403358] Modules linked in: trace_printk(-)
[   22.404028] CPU: 7 PID: 200 Comm: rmmod Not tainted 5.9.0-rc2+ #66
[   22.404870] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[   22.406139] RIP: 0010:__disarm_kprobe_ftrace.isra.0+0x7e/0xa0
[   22.406947] Code: 30 8b 03 eb c9 80 3d e5 09 1f 01 00 75 dc 49 8b 34 24 89 c2 48 c7 c7 a0 c2 05 82 89 45 e4 c6 05 cc 09 1f 01 01 e8 a9 c7 f0 ff <0f> 0b 8b 45 e4 eb b9 89 c6 48 c7 c7 70 c2 05 82 89 45 e4 e8 91 c7
[   22.409544] RSP: 0018:ffffc90000237df0 EFLAGS: 00010286
[   22.410385] RAX: 0000000000000000 RBX: ffffffff83066024 RCX: 0000000000000000
[   22.411434] RDX: 0000000000000001 RSI: ffffffff810de8d3 RDI: ffffffff810de8d3
[   22.412687] RBP: ffffc90000237e10 R08: 0000000000000001 R09: 0000000000000001
[   22.413762] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807c478640
[   22.414852] R13: ffffffff8235ebc0 R14: ffffffffa00060c0 R15: 0000000000000000
[   22.415941] FS:  00000000019d48c0(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000
[   22.417264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.418176] CR2: 00000000005bb7e3 CR3: 0000000078f7a000 CR4: 00000000000006a0
[   22.419309] Call Trace:
[   22.419990]  kill_kprobe+0x94/0x160
[   22.420652]  kprobes_module_callback+0x64/0x230
[   22.421470]  notifier_call_chain+0x4f/0x70
[   22.422184]  blocking_notifier_call_chain+0x49/0x70
[   22.422979]  __x64_sys_delete_module+0x1ac/0x240
[   22.423733]  do_syscall_64+0x38/0x50
[   22.424366]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   22.425176] RIP: 0033:0x4bb81d
[   22.425741] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e0 ff ff ff f7 d8 64 89 01 48
[   22.428726] RSP: 002b:00007ffc70fef008 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
[   22.430169] RAX: ffffffffffffffda RBX: 00000000019d48a0 RCX: 00000000004bb81d
[   22.431375] RDX: 0000000000000000 RSI: 0000000000000880 RDI: 00007ffc70fef028
[   22.432543] RBP: 0000000000000880 R08: 00000000ffffffff R09: 00007ffc70fef320
[   22.433692] R10: 0000000000656300 R11: 0000000000000246 R12: 00007ffc70fef028
[   22.434635] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
[   22.435682] irq event stamp: 1169
[   22.436240] hardirqs last  enabled at (1179): [<ffffffff810df542>] console_unlock+0x422/0x580
[   22.437466] hardirqs last disabled at (1188): [<ffffffff810df19b>] console_unlock+0x7b/0x580
[   22.438608] softirqs last  enabled at (866): [<ffffffff81c0038e>] __do_softirq+0x38e/0x490
[   22.439637] softirqs last disabled at (859): [<ffffffff81a00f42>] asm_call_on_stack+0x12/0x20
[   22.440690] ---[ end trace 1e7ce7e1e4567276 ]---
[   22.472832] trace_kprobe: This probe might be able to register after target module is loaded. Continue.

This is because the kill_kprobe() calls disarm_kprobe_ftrace() even
if the given probe is not enabled. In that case, ftrace_set_filter_ip()
fails because the given probe point is not registered to ftrace.

Fix to check the given (going) probe is enabled before invoking
disarm_kprobe_ftrace().

Link: https://lkml.kernel.org/r/159888672694.1411785.5987998076694782591.stgit@devnote2

Fixes: 0cb2f13 ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")
Cc: Ingo Molnar <[email protected]>
Cc: "Naveen N . Rao" <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: David Miller <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: [email protected]
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
popcornmix pushed a commit that referenced this issue Oct 2, 2020
commit 3031313 upstream.

Commit 0cb2f13 ("kprobes: Fix NULL pointer dereference at
kprobe_ftrace_handler") fixed one bug but not completely fixed yet.
If we run a kprobe_module.tc of ftracetest, kernel showed a warning
as below.

# ./ftracetest test.d/kprobe/kprobe_module.tc
=== Ftrace unit tests ===
[1] Kprobe dynamic event - probing module
...
[   22.400215] ------------[ cut here ]------------
[   22.400962] Failed to disarm kprobe-ftrace at trace_printk_irq_work+0x0/0x7e [trace_printk] (-2)
[   22.402139] WARNING: CPU: 7 PID: 200 at kernel/kprobes.c:1091 __disarm_kprobe_ftrace.isra.0+0x7e/0xa0
[   22.403358] Modules linked in: trace_printk(-)
[   22.404028] CPU: 7 PID: 200 Comm: rmmod Not tainted 5.9.0-rc2+ #66
[   22.404870] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[   22.406139] RIP: 0010:__disarm_kprobe_ftrace.isra.0+0x7e/0xa0
[   22.406947] Code: 30 8b 03 eb c9 80 3d e5 09 1f 01 00 75 dc 49 8b 34 24 89 c2 48 c7 c7 a0 c2 05 82 89 45 e4 c6 05 cc 09 1f 01 01 e8 a9 c7 f0 ff <0f> 0b 8b 45 e4 eb b9 89 c6 48 c7 c7 70 c2 05 82 89 45 e4 e8 91 c7
[   22.409544] RSP: 0018:ffffc90000237df0 EFLAGS: 00010286
[   22.410385] RAX: 0000000000000000 RBX: ffffffff83066024 RCX: 0000000000000000
[   22.411434] RDX: 0000000000000001 RSI: ffffffff810de8d3 RDI: ffffffff810de8d3
[   22.412687] RBP: ffffc90000237e10 R08: 0000000000000001 R09: 0000000000000001
[   22.413762] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807c478640
[   22.414852] R13: ffffffff8235ebc0 R14: ffffffffa00060c0 R15: 0000000000000000
[   22.415941] FS:  00000000019d48c0(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000
[   22.417264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.418176] CR2: 00000000005bb7e3 CR3: 0000000078f7a000 CR4: 00000000000006a0
[   22.419309] Call Trace:
[   22.419990]  kill_kprobe+0x94/0x160
[   22.420652]  kprobes_module_callback+0x64/0x230
[   22.421470]  notifier_call_chain+0x4f/0x70
[   22.422184]  blocking_notifier_call_chain+0x49/0x70
[   22.422979]  __x64_sys_delete_module+0x1ac/0x240
[   22.423733]  do_syscall_64+0x38/0x50
[   22.424366]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   22.425176] RIP: 0033:0x4bb81d
[   22.425741] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e0 ff ff ff f7 d8 64 89 01 48
[   22.428726] RSP: 002b:00007ffc70fef008 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
[   22.430169] RAX: ffffffffffffffda RBX: 00000000019d48a0 RCX: 00000000004bb81d
[   22.431375] RDX: 0000000000000000 RSI: 0000000000000880 RDI: 00007ffc70fef028
[   22.432543] RBP: 0000000000000880 R08: 00000000ffffffff R09: 00007ffc70fef320
[   22.433692] R10: 0000000000656300 R11: 0000000000000246 R12: 00007ffc70fef028
[   22.434635] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
[   22.435682] irq event stamp: 1169
[   22.436240] hardirqs last  enabled at (1179): [<ffffffff810df542>] console_unlock+0x422/0x580
[   22.437466] hardirqs last disabled at (1188): [<ffffffff810df19b>] console_unlock+0x7b/0x580
[   22.438608] softirqs last  enabled at (866): [<ffffffff81c0038e>] __do_softirq+0x38e/0x490
[   22.439637] softirqs last disabled at (859): [<ffffffff81a00f42>] asm_call_on_stack+0x12/0x20
[   22.440690] ---[ end trace 1e7ce7e1e4567276 ]---
[   22.472832] trace_kprobe: This probe might be able to register after target module is loaded. Continue.

This is because the kill_kprobe() calls disarm_kprobe_ftrace() even
if the given probe is not enabled. In that case, ftrace_set_filter_ip()
fails because the given probe point is not registered to ftrace.

Fix to check the given (going) probe is enabled before invoking
disarm_kprobe_ftrace().

Link: https://lkml.kernel.org/r/159888672694.1411785.5987998076694782591.stgit@devnote2

Fixes: 0cb2f13 ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")
Cc: Ingo Molnar <[email protected]>
Cc: "Naveen N . Rao" <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: David Miller <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: [email protected]
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
popcornmix pushed a commit that referenced this issue Jan 27, 2022
[ Upstream commit f1cb347 ]

rsi_get_* functions rely on an offset variable from usb
input. The size of usb input is RSI_MAX_RX_USB_PKT_SIZE(3000),
while 2-byte offset can be up to 0xFFFF. Thus a large offset
can cause out-of-bounds read.

The patch adds a bound checking condition when rcv_pkt_len is 0,
indicating it's USB. It's unclear whether this is triggerable
from other type of bus. The following check might help in that case.
offset > rcv_pkt_len - FRAME_DESC_SZ

The bug is trigerrable with conpromised/malfunctioning USB devices.
I tested the patch with the crashing input and got no more bug report.

Attached is the KASAN report from fuzzing.

BUG: KASAN: slab-out-of-bounds in rsi_read_pkt+0x42e/0x500 [rsi_91x]
Read of size 2 at addr ffff888019439fdb by task RX-Thread/227

CPU: 0 PID: 227 Comm: RX-Thread Not tainted 5.6.0 #66
Call Trace:
 dump_stack+0x76/0xa0
 print_address_description.constprop.0+0x16/0x200
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 __kasan_report.cold+0x37/0x7c
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 kasan_report+0xe/0x20
 rsi_read_pkt+0x42e/0x500 [rsi_91x]
 rsi_usb_rx_thread+0x1b1/0x2fc [rsi_usb]
 ? rsi_probe+0x16a0/0x16a0 [rsi_usb]
 ? _raw_spin_lock_irqsave+0x7b/0xd0
 ? _raw_spin_trylock_bh+0x120/0x120
 ? __wake_up_common+0x10b/0x520
 ? rsi_probe+0x16a0/0x16a0 [rsi_usb]
 kthread+0x2b5/0x3b0
 ? kthread_create_on_node+0xd0/0xd0
 ret_from_fork+0x22/0x40

Reported-by: Brendan Dolan-Gavitt <[email protected]>
Signed-off-by: Zekun Shen <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Jan 28, 2022
[ Upstream commit f1cb347 ]

rsi_get_* functions rely on an offset variable from usb
input. The size of usb input is RSI_MAX_RX_USB_PKT_SIZE(3000),
while 2-byte offset can be up to 0xFFFF. Thus a large offset
can cause out-of-bounds read.

The patch adds a bound checking condition when rcv_pkt_len is 0,
indicating it's USB. It's unclear whether this is triggerable
from other type of bus. The following check might help in that case.
offset > rcv_pkt_len - FRAME_DESC_SZ

The bug is trigerrable with conpromised/malfunctioning USB devices.
I tested the patch with the crashing input and got no more bug report.

Attached is the KASAN report from fuzzing.

BUG: KASAN: slab-out-of-bounds in rsi_read_pkt+0x42e/0x500 [rsi_91x]
Read of size 2 at addr ffff888019439fdb by task RX-Thread/227

CPU: 0 PID: 227 Comm: RX-Thread Not tainted 5.6.0 #66
Call Trace:
 dump_stack+0x76/0xa0
 print_address_description.constprop.0+0x16/0x200
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 __kasan_report.cold+0x37/0x7c
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 kasan_report+0xe/0x20
 rsi_read_pkt+0x42e/0x500 [rsi_91x]
 rsi_usb_rx_thread+0x1b1/0x2fc [rsi_usb]
 ? rsi_probe+0x16a0/0x16a0 [rsi_usb]
 ? _raw_spin_lock_irqsave+0x7b/0xd0
 ? _raw_spin_trylock_bh+0x120/0x120
 ? __wake_up_common+0x10b/0x520
 ? rsi_probe+0x16a0/0x16a0 [rsi_usb]
 kthread+0x2b5/0x3b0
 ? kthread_create_on_node+0xd0/0xd0
 ret_from_fork+0x22/0x40

Reported-by: Brendan Dolan-Gavitt <[email protected]>
Signed-off-by: Zekun Shen <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Jan 28, 2022
[ Upstream commit f1cb347 ]

rsi_get_* functions rely on an offset variable from usb
input. The size of usb input is RSI_MAX_RX_USB_PKT_SIZE(3000),
while 2-byte offset can be up to 0xFFFF. Thus a large offset
can cause out-of-bounds read.

The patch adds a bound checking condition when rcv_pkt_len is 0,
indicating it's USB. It's unclear whether this is triggerable
from other type of bus. The following check might help in that case.
offset > rcv_pkt_len - FRAME_DESC_SZ

The bug is trigerrable with conpromised/malfunctioning USB devices.
I tested the patch with the crashing input and got no more bug report.

Attached is the KASAN report from fuzzing.

BUG: KASAN: slab-out-of-bounds in rsi_read_pkt+0x42e/0x500 [rsi_91x]
Read of size 2 at addr ffff888019439fdb by task RX-Thread/227

CPU: 0 PID: 227 Comm: RX-Thread Not tainted 5.6.0 #66
Call Trace:
 dump_stack+0x76/0xa0
 print_address_description.constprop.0+0x16/0x200
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 __kasan_report.cold+0x37/0x7c
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 kasan_report+0xe/0x20
 rsi_read_pkt+0x42e/0x500 [rsi_91x]
 rsi_usb_rx_thread+0x1b1/0x2fc [rsi_usb]
 ? rsi_probe+0x16a0/0x16a0 [rsi_usb]
 ? _raw_spin_lock_irqsave+0x7b/0xd0
 ? _raw_spin_trylock_bh+0x120/0x120
 ? __wake_up_common+0x10b/0x520
 ? rsi_probe+0x16a0/0x16a0 [rsi_usb]
 kthread+0x2b5/0x3b0
 ? kthread_create_on_node+0xd0/0xd0
 ret_from_fork+0x22/0x40

Reported-by: Brendan Dolan-Gavitt <[email protected]>
Signed-off-by: Zekun Shen <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
whdgmawkd pushed a commit to whdgmawkd/linux that referenced this issue Feb 11, 2022
[ Upstream commit f1cb347 ]

rsi_get_* functions rely on an offset variable from usb
input. The size of usb input is RSI_MAX_RX_USB_PKT_SIZE(3000),
while 2-byte offset can be up to 0xFFFF. Thus a large offset
can cause out-of-bounds read.

The patch adds a bound checking condition when rcv_pkt_len is 0,
indicating it's USB. It's unclear whether this is triggerable
from other type of bus. The following check might help in that case.
offset > rcv_pkt_len - FRAME_DESC_SZ

The bug is trigerrable with conpromised/malfunctioning USB devices.
I tested the patch with the crashing input and got no more bug report.

Attached is the KASAN report from fuzzing.

BUG: KASAN: slab-out-of-bounds in rsi_read_pkt+0x42e/0x500 [rsi_91x]
Read of size 2 at addr ffff888019439fdb by task RX-Thread/227

CPU: 0 PID: 227 Comm: RX-Thread Not tainted 5.6.0 raspberrypi#66
Call Trace:
 dump_stack+0x76/0xa0
 print_address_description.constprop.0+0x16/0x200
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 __kasan_report.cold+0x37/0x7c
 ? rsi_read_pkt+0x42e/0x500 [rsi_91x]
 kasan_report+0xe/0x20
 rsi_read_pkt+0x42e/0x500 [rsi_91x]
 rsi_usb_rx_thread+0x1b1/0x2fc [rsi_usb]
 ? rsi_probe+0x16a0/0x16a0 [rsi_usb]
 ? _raw_spin_lock_irqsave+0x7b/0xd0
 ? _raw_spin_trylock_bh+0x120/0x120
 ? __wake_up_common+0x10b/0x520
 ? rsi_probe+0x16a0/0x16a0 [rsi_usb]
 kthread+0x2b5/0x3b0
 ? kthread_create_on_node+0xd0/0xd0
 ret_from_fork+0x22/0x40

Reported-by: Brendan Dolan-Gavitt <[email protected]>
Signed-off-by: Zekun Shen <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Jul 20, 2023
[ Upstream commit 85fadc0 ]

The initial memblock metadata is accessed from kernel image mapping. The
regions arrays need to "reallocated" from memblock and accessed through
linear mapping to cover more memblock regions. So the resizing should
not be allowed until linear mapping is ready. Note that there are
memblock allocations when building linear mapping.

This patch is similar to 24cc61d ("arm64: memblock: don't permit
memblock resizing until linear mapping is up").

In following log, many memblock regions are reserved before
create_linear_mapping_page_table(). And then it triggered reallocation
of memblock.reserved.regions and memcpy the old array in kernel image
mapping to the new array in linear mapping which caused a page fault.

[    0.000000] memblock_reserve: [0x00000000bf01f000-0x00000000bf01ffff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf021000-0x00000000bf021fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf023000-0x00000000bf023fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf025000-0x00000000bf025fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf027000-0x00000000bf027fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf029000-0x00000000bf029fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf02b000-0x00000000bf02bfff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf02d000-0x00000000bf02dfff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf02f000-0x00000000bf02ffff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf030000-0x00000000bf030fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] OF: reserved mem: 0x0000000080000000..0x000000008007ffff (512 KiB) map non-reusable mmode_resv0@80000000
[    0.000000] memblock_reserve: [0x00000000bf000000-0x00000000bf001fed] paging_init+0x19a/0x5ae
[    0.000000] memblock_phys_alloc_range: 4096 bytes align=0x1000 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_pmd_fixmap+0x14/0x1c
[    0.000000] memblock_reserve: [0x000000017ffff000-0x000000017fffffff] memblock_alloc_range_nid+0xb8/0x128
[    0.000000] memblock: reserved is doubled to 256 at [0x000000017fffd000-0x000000017fffe7ff]
[    0.000000] Unable to handle kernel paging request at virtual address ff600000ffffd000
[    0.000000] Oops [#1]
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.4.0-rc1-00011-g99a670b2069c #66
[    0.000000] Hardware name: riscv-virtio,qemu (DT)
[    0.000000] epc : __memcpy+0x60/0xf8
[    0.000000]  ra : memblock_double_array+0x192/0x248
[    0.000000] epc : ffffffff8081d214 ra : ffffffff80a3dfc0 sp : ffffffff81403bd0
[    0.000000]  gp : ffffffff814fbb38 tp : ffffffff8140dac0 t0 : 0000000001600000
[    0.000000]  t1 : 0000000000000000 t2 : 000000008f001000 s0 : ffffffff81403c60
[    0.000000]  s1 : ffffffff80c0bc98 a0 : ff600000ffffd000 a1 : ffffffff80c0bcd8
[    0.000000]  a2 : 0000000000000c00 a3 : ffffffff80c0c8d8 a4 : 0000000080000000
[    0.000000]  a5 : 0000000000080000 a6 : 0000000000000000 a7 : 0000000080200000
[    0.000000]  s2 : ff600000ffffd000 s3 : 0000000000002000 s4 : 0000000000000c00
[    0.000000]  s5 : ffffffff80c0bc60 s6 : ffffffff80c0bcc8 s7 : 0000000000000000
[    0.000000]  s8 : ffffffff814fd0a8 s9 : 000000017fffe7ff s10: 0000000000000000
[    0.000000]  s11: 0000000000001000 t3 : 0000000000001000 t4 : 0000000000000000
[    0.000000]  t5 : 000000008f003000 t6 : ff600000ffffd000
[    0.000000] status: 0000000200000100 badaddr: ff600000ffffd000 cause: 000000000000000f
[    0.000000] [<ffffffff8081d214>] __memcpy+0x60/0xf8
[    0.000000] [<ffffffff80a3e1a2>] memblock_add_range.isra.14+0x12c/0x162
[    0.000000] [<ffffffff80a3e36a>] memblock_reserve+0x6e/0x8c
[    0.000000] [<ffffffff80a123fc>] memblock_alloc_range_nid+0xb8/0x128
[    0.000000] [<ffffffff80a1256a>] memblock_phys_alloc_range+0x5e/0x6a
[    0.000000] [<ffffffff80a04732>] alloc_pmd_fixmap+0x14/0x1c
[    0.000000] [<ffffffff80a0475a>] alloc_p4d_fixmap+0xc/0x14
[    0.000000] [<ffffffff80a04a36>] create_pgd_mapping+0x98/0x17c
[    0.000000] [<ffffffff80a04e9e>] create_linear_mapping_range.constprop.10+0xe4/0x112
[    0.000000] [<ffffffff80a05bb8>] paging_init+0x3ec/0x5ae
[    0.000000] [<ffffffff80a03354>] setup_arch+0xb2/0x576
[    0.000000] [<ffffffff80a00726>] start_kernel+0x72/0x57e
[    0.000000] Code: b303 0285 b383 0305 be03 0385 be83 0405 bf03 0485 (b023) 00ef
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Fixes: 671f9a3 ("RISC-V: Setup initial page tables in two stages")
Signed-off-by: Woody Zhang <[email protected]>
Tested-by: Song Shuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Jul 20, 2023
[ Upstream commit 85fadc0 ]

The initial memblock metadata is accessed from kernel image mapping. The
regions arrays need to "reallocated" from memblock and accessed through
linear mapping to cover more memblock regions. So the resizing should
not be allowed until linear mapping is ready. Note that there are
memblock allocations when building linear mapping.

This patch is similar to 24cc61d ("arm64: memblock: don't permit
memblock resizing until linear mapping is up").

In following log, many memblock regions are reserved before
create_linear_mapping_page_table(). And then it triggered reallocation
of memblock.reserved.regions and memcpy the old array in kernel image
mapping to the new array in linear mapping which caused a page fault.

[    0.000000] memblock_reserve: [0x00000000bf01f000-0x00000000bf01ffff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf021000-0x00000000bf021fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf023000-0x00000000bf023fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf025000-0x00000000bf025fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf027000-0x00000000bf027fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf029000-0x00000000bf029fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf02b000-0x00000000bf02bfff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf02d000-0x00000000bf02dfff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf02f000-0x00000000bf02ffff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] memblock_reserve: [0x00000000bf030000-0x00000000bf030fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6
[    0.000000] OF: reserved mem: 0x0000000080000000..0x000000008007ffff (512 KiB) map non-reusable mmode_resv0@80000000
[    0.000000] memblock_reserve: [0x00000000bf000000-0x00000000bf001fed] paging_init+0x19a/0x5ae
[    0.000000] memblock_phys_alloc_range: 4096 bytes align=0x1000 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_pmd_fixmap+0x14/0x1c
[    0.000000] memblock_reserve: [0x000000017ffff000-0x000000017fffffff] memblock_alloc_range_nid+0xb8/0x128
[    0.000000] memblock: reserved is doubled to 256 at [0x000000017fffd000-0x000000017fffe7ff]
[    0.000000] Unable to handle kernel paging request at virtual address ff600000ffffd000
[    0.000000] Oops [#1]
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.4.0-rc1-00011-g99a670b2069c #66
[    0.000000] Hardware name: riscv-virtio,qemu (DT)
[    0.000000] epc : __memcpy+0x60/0xf8
[    0.000000]  ra : memblock_double_array+0x192/0x248
[    0.000000] epc : ffffffff8081d214 ra : ffffffff80a3dfc0 sp : ffffffff81403bd0
[    0.000000]  gp : ffffffff814fbb38 tp : ffffffff8140dac0 t0 : 0000000001600000
[    0.000000]  t1 : 0000000000000000 t2 : 000000008f001000 s0 : ffffffff81403c60
[    0.000000]  s1 : ffffffff80c0bc98 a0 : ff600000ffffd000 a1 : ffffffff80c0bcd8
[    0.000000]  a2 : 0000000000000c00 a3 : ffffffff80c0c8d8 a4 : 0000000080000000
[    0.000000]  a5 : 0000000000080000 a6 : 0000000000000000 a7 : 0000000080200000
[    0.000000]  s2 : ff600000ffffd000 s3 : 0000000000002000 s4 : 0000000000000c00
[    0.000000]  s5 : ffffffff80c0bc60 s6 : ffffffff80c0bcc8 s7 : 0000000000000000
[    0.000000]  s8 : ffffffff814fd0a8 s9 : 000000017fffe7ff s10: 0000000000000000
[    0.000000]  s11: 0000000000001000 t3 : 0000000000001000 t4 : 0000000000000000
[    0.000000]  t5 : 000000008f003000 t6 : ff600000ffffd000
[    0.000000] status: 0000000200000100 badaddr: ff600000ffffd000 cause: 000000000000000f
[    0.000000] [<ffffffff8081d214>] __memcpy+0x60/0xf8
[    0.000000] [<ffffffff80a3e1a2>] memblock_add_range.isra.14+0x12c/0x162
[    0.000000] [<ffffffff80a3e36a>] memblock_reserve+0x6e/0x8c
[    0.000000] [<ffffffff80a123fc>] memblock_alloc_range_nid+0xb8/0x128
[    0.000000] [<ffffffff80a1256a>] memblock_phys_alloc_range+0x5e/0x6a
[    0.000000] [<ffffffff80a04732>] alloc_pmd_fixmap+0x14/0x1c
[    0.000000] [<ffffffff80a0475a>] alloc_p4d_fixmap+0xc/0x14
[    0.000000] [<ffffffff80a04a36>] create_pgd_mapping+0x98/0x17c
[    0.000000] [<ffffffff80a04e9e>] create_linear_mapping_range.constprop.10+0xe4/0x112
[    0.000000] [<ffffffff80a05bb8>] paging_init+0x3ec/0x5ae
[    0.000000] [<ffffffff80a03354>] setup_arch+0xb2/0x576
[    0.000000] [<ffffffff80a00726>] start_kernel+0x72/0x57e
[    0.000000] Code: b303 0285 b383 0305 be03 0385 be83 0405 bf03 0485 (b023) 00ef
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Fixes: 671f9a3 ("RISC-V: Setup initial page tables in two stages")
Signed-off-by: Woody Zhang <[email protected]>
Tested-by: Song Shuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
0lxb pushed a commit to 0lxb/rpi_linux that referenced this issue Jan 30, 2024
scx: Print scheduler state in panic message
popcornmix pushed a commit that referenced this issue Feb 19, 2024
When a PCI device is dynamically added, the kernel oopses with a NULL
pointer dereference:

  BUG: Kernel NULL pointer dereference on read at 0x00000030
  Faulting instruction address: 0xc0000000006bbe5c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
  CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
  NIP:  c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8
  REGS: c00000009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002220  XER: 20040006
  CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0
  ...
  NIP sysfs_add_link_to_group+0x34/0x94
  LR  iommu_device_link+0x5c/0x118
  Call Trace:
   iommu_init_device+0x26c/0x318 (unreliable)
   iommu_device_link+0x5c/0x118
   iommu_init_device+0xa8/0x318
   iommu_probe_device+0xc0/0x134
   iommu_bus_notifier+0x44/0x104
   notifier_call_chain+0xb8/0x19c
   blocking_notifier_call_chain+0x64/0x98
   bus_notify+0x50/0x7c
   device_add+0x640/0x918
   pci_device_add+0x23c/0x298
   of_create_pci_dev+0x400/0x884
   of_scan_pci_dev+0x124/0x1b0
   __of_scan_bus+0x78/0x18c
   pcibios_scan_phb+0x2a4/0x3b0
   init_phb_dynamic+0xb8/0x110
   dlpar_add_slot+0x170/0x3b8 [rpadlpar_io]
   add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
   kobj_attr_store+0x2c/0x48
   sysfs_kf_write+0x64/0x78
   kernfs_fop_write_iter+0x1b0/0x290
   vfs_write+0x350/0x4a0
   ksys_write+0x84/0x140
   system_call_exception+0x124/0x330
   system_call_vectored_common+0x15c/0x2ec

Commit a940904 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR add of PCI devices.

The above added iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().

During DLPAR add of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.

Fix is to register the iommu device during DLPAR add as well.

Fixes: a940904 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra <[email protected]>
[mpe: Trim oops and tweak some change log wording]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
popcornmix pushed a commit that referenced this issue Feb 23, 2024
[ Upstream commit ed8b94f ]

When a PCI device is dynamically added, the kernel oopses with a NULL
pointer dereference:

  BUG: Kernel NULL pointer dereference on read at 0x00000030
  Faulting instruction address: 0xc0000000006bbe5c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
  CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
  NIP:  c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8
  REGS: c00000009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002220  XER: 20040006
  CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0
  ...
  NIP sysfs_add_link_to_group+0x34/0x94
  LR  iommu_device_link+0x5c/0x118
  Call Trace:
   iommu_init_device+0x26c/0x318 (unreliable)
   iommu_device_link+0x5c/0x118
   iommu_init_device+0xa8/0x318
   iommu_probe_device+0xc0/0x134
   iommu_bus_notifier+0x44/0x104
   notifier_call_chain+0xb8/0x19c
   blocking_notifier_call_chain+0x64/0x98
   bus_notify+0x50/0x7c
   device_add+0x640/0x918
   pci_device_add+0x23c/0x298
   of_create_pci_dev+0x400/0x884
   of_scan_pci_dev+0x124/0x1b0
   __of_scan_bus+0x78/0x18c
   pcibios_scan_phb+0x2a4/0x3b0
   init_phb_dynamic+0xb8/0x110
   dlpar_add_slot+0x170/0x3b8 [rpadlpar_io]
   add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
   kobj_attr_store+0x2c/0x48
   sysfs_kf_write+0x64/0x78
   kernfs_fop_write_iter+0x1b0/0x290
   vfs_write+0x350/0x4a0
   ksys_write+0x84/0x140
   system_call_exception+0x124/0x330
   system_call_vectored_common+0x15c/0x2ec

Commit a940904 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR add of PCI devices.

The above added iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().

During DLPAR add of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.

Fix is to register the iommu device during DLPAR add as well.

Fixes: a940904 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra <[email protected]>
[mpe: Trim oops and tweak some change log wording]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Feb 23, 2024
[ Upstream commit ed8b94f ]

When a PCI device is dynamically added, the kernel oopses with a NULL
pointer dereference:

  BUG: Kernel NULL pointer dereference on read at 0x00000030
  Faulting instruction address: 0xc0000000006bbe5c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
  CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
  NIP:  c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8
  REGS: c00000009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002220  XER: 20040006
  CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0
  ...
  NIP sysfs_add_link_to_group+0x34/0x94
  LR  iommu_device_link+0x5c/0x118
  Call Trace:
   iommu_init_device+0x26c/0x318 (unreliable)
   iommu_device_link+0x5c/0x118
   iommu_init_device+0xa8/0x318
   iommu_probe_device+0xc0/0x134
   iommu_bus_notifier+0x44/0x104
   notifier_call_chain+0xb8/0x19c
   blocking_notifier_call_chain+0x64/0x98
   bus_notify+0x50/0x7c
   device_add+0x640/0x918
   pci_device_add+0x23c/0x298
   of_create_pci_dev+0x400/0x884
   of_scan_pci_dev+0x124/0x1b0
   __of_scan_bus+0x78/0x18c
   pcibios_scan_phb+0x2a4/0x3b0
   init_phb_dynamic+0xb8/0x110
   dlpar_add_slot+0x170/0x3b8 [rpadlpar_io]
   add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
   kobj_attr_store+0x2c/0x48
   sysfs_kf_write+0x64/0x78
   kernfs_fop_write_iter+0x1b0/0x290
   vfs_write+0x350/0x4a0
   ksys_write+0x84/0x140
   system_call_exception+0x124/0x330
   system_call_vectored_common+0x15c/0x2ec

Commit a940904 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR add of PCI devices.

The above added iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().

During DLPAR add of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.

Fix is to register the iommu device during DLPAR add as well.

Fixes: a940904 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra <[email protected]>
[mpe: Trim oops and tweak some change log wording]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Mar 5, 2024
…ntroller

[ Upstream commit a5c57fd ]

When a PCI device is dynamically added, the kernel oopses with a NULL
pointer dereference:

  BUG: Kernel NULL pointer dereference on read at 0x00000030
  Faulting instruction address: 0xc0000000006bbe5c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
  CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
  NIP:  c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8
  REGS: c00000009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002220  XER: 20040006
  CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0
  ...
  NIP sysfs_add_link_to_group+0x34/0x94
  LR  iommu_device_link+0x5c/0x118
  Call Trace:
   iommu_init_device+0x26c/0x318 (unreliable)
   iommu_device_link+0x5c/0x118
   iommu_init_device+0xa8/0x318
   iommu_probe_device+0xc0/0x134
   iommu_bus_notifier+0x44/0x104
   notifier_call_chain+0xb8/0x19c
   blocking_notifier_call_chain+0x64/0x98
   bus_notify+0x50/0x7c
   device_add+0x640/0x918
   pci_device_add+0x23c/0x298
   of_create_pci_dev+0x400/0x884
   of_scan_pci_dev+0x124/0x1b0
   __of_scan_bus+0x78/0x18c
   pcibios_scan_phb+0x2a4/0x3b0
   init_phb_dynamic+0xb8/0x110
   dlpar_add_slot+0x170/0x3b8 [rpadlpar_io]
   add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
   kobj_attr_store+0x2c/0x48
   sysfs_kf_write+0x64/0x78
   kernfs_fop_write_iter+0x1b0/0x290
   vfs_write+0x350/0x4a0
   ksys_write+0x84/0x140
   system_call_exception+0x124/0x330
   system_call_vectored_common+0x15c/0x2ec

Commit a940904 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR add of PCI devices.

The above added iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().

During DLPAR add of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.

Fix is to register the iommu device during DLPAR add as well.

Fixes: a940904 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra <[email protected]>
Reviewed-by: Brian King <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Mar 5, 2024
…ntroller

When a PCI device is dynamically added, the kernel oopses with a NULL
pointer dereference:

  BUG: Kernel NULL pointer dereference on read at 0x00000030
  Faulting instruction address: 0xc0000000006bbe5c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
  CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
  NIP:  c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8
  REGS: c00000009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002220  XER: 20040006
  CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0
  ...
  NIP sysfs_add_link_to_group+0x34/0x94
  LR  iommu_device_link+0x5c/0x118
  Call Trace:
   iommu_init_device+0x26c/0x318 (unreliable)
   iommu_device_link+0x5c/0x118
   iommu_init_device+0xa8/0x318
   iommu_probe_device+0xc0/0x134
   iommu_bus_notifier+0x44/0x104
   notifier_call_chain+0xb8/0x19c
   blocking_notifier_call_chain+0x64/0x98
   bus_notify+0x50/0x7c
   device_add+0x640/0x918
   pci_device_add+0x23c/0x298
   of_create_pci_dev+0x400/0x884
   of_scan_pci_dev+0x124/0x1b0
   __of_scan_bus+0x78/0x18c
   pcibios_scan_phb+0x2a4/0x3b0
   init_phb_dynamic+0xb8/0x110
   dlpar_add_slot+0x170/0x3b8 [rpadlpar_io]
   add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
   kobj_attr_store+0x2c/0x48
   sysfs_kf_write+0x64/0x78
   kernfs_fop_write_iter+0x1b0/0x290
   vfs_write+0x350/0x4a0
   ksys_write+0x84/0x140
   system_call_exception+0x124/0x330
   system_call_vectored_common+0x15c/0x2ec

Commit a940904 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR add of PCI devices.

The above added iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().

During DLPAR add of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.

Fix is to register the iommu device during DLPAR add as well.

Fixes: a940904 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra <[email protected]>
Reviewed-by: Brian King <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
popcornmix pushed a commit that referenced this issue Mar 5, 2024
…ntroller

[ Upstream commit a5c57fd ]

When a PCI device is dynamically added, the kernel oopses with a NULL
pointer dereference:

  BUG: Kernel NULL pointer dereference on read at 0x00000030
  Faulting instruction address: 0xc0000000006bbe5c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
  CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
  NIP:  c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8
  REGS: c00000009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002220  XER: 20040006
  CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0
  ...
  NIP sysfs_add_link_to_group+0x34/0x94
  LR  iommu_device_link+0x5c/0x118
  Call Trace:
   iommu_init_device+0x26c/0x318 (unreliable)
   iommu_device_link+0x5c/0x118
   iommu_init_device+0xa8/0x318
   iommu_probe_device+0xc0/0x134
   iommu_bus_notifier+0x44/0x104
   notifier_call_chain+0xb8/0x19c
   blocking_notifier_call_chain+0x64/0x98
   bus_notify+0x50/0x7c
   device_add+0x640/0x918
   pci_device_add+0x23c/0x298
   of_create_pci_dev+0x400/0x884
   of_scan_pci_dev+0x124/0x1b0
   __of_scan_bus+0x78/0x18c
   pcibios_scan_phb+0x2a4/0x3b0
   init_phb_dynamic+0xb8/0x110
   dlpar_add_slot+0x170/0x3b8 [rpadlpar_io]
   add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
   kobj_attr_store+0x2c/0x48
   sysfs_kf_write+0x64/0x78
   kernfs_fop_write_iter+0x1b0/0x290
   vfs_write+0x350/0x4a0
   ksys_write+0x84/0x140
   system_call_exception+0x124/0x330
   system_call_vectored_common+0x15c/0x2ec

Commit a940904 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR add of PCI devices.

The above added iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().

During DLPAR add of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.

Fix is to register the iommu device during DLPAR add as well.

Fixes: a940904 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra <[email protected]>
Reviewed-by: Brian King <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Apr 29, 2024
Florian reported the following kernel NULL pointer dereference issue on
a BCM7250 board:
[    2.829744] Unable to handle kernel NULL pointer dereference at virtual address 0000000c when read
[    2.838740] [0000000c] *pgd=80000000004003, *pmd=00000000
[    2.844178] Internal error: Oops: 206 [#1] SMP ARM
[    2.848990] Modules linked in:
[    2.852061] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-next-20240305-gd95fcdf4961d #66
[    2.860436] Hardware name: Broadcom STB (Flattened Device Tree)
[    2.866371] PC is at brcmnand_read_by_pio+0x180/0x278
[    2.871449] LR is at __wait_for_common+0x9c/0x1b0
[    2.876178] pc : [<c094b6cc>]    lr : [<c0e66310>]    psr: 60000053
[    2.882460] sp : f0811a80  ip : 00000012  fp : 00000000
[    2.887699] r10: 00000000  r9 : 00000000  r8 : c3790000
[    2.892936] r7 : 00000000  r6 : 00000000  r5 : c35db440  r4 : ffe00000
[    2.899479] r3 : f15cb814  r2 : 00000000  r1 : 00000000  r0 : 00000000

The issue only happens when dma mode is disabled or not supported on STB
chip. The pio mode transfer calls brcmnand_read_data_bus function which
dereferences ctrl->soc->read_data_bus. But the soc member in STB chip is
NULL hence triggers the access violation. The function needs to check
the soc pointer first.

Fixes: 546e425 ("mtd: rawnand: brcmnand: Add BCMBCA read data bus interface")

Reported-by: Florian Fainelli <[email protected]>
Tested-by: Florian Fainelli <[email protected]>
Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Link: https://lore.kernel.org/linux-mtd/[email protected]
popcornmix pushed a commit that referenced this issue Jul 16, 2024
…b folio

A kernel crash was observed when migrating hugetlb folio:

BUG: kernel NULL pointer dereference, address: 0000000000000008
PGD 0 P4D 0
Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 3435 Comm: bash Not tainted 6.10.0-rc6-00450-g8578ca01f21f #66
RIP: 0010:__folio_undo_large_rmappable+0x70/0xb0
RSP: 0018:ffffb165c98a7b38 EFLAGS: 00000097
RAX: fffffbbc44528090 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffa30e000a2800 RSI: 0000000000000246 RDI: ffffa3153ffffcc0
RBP: fffffbbc44528000 R08: 0000000000002371 R09: ffffffffbe4e5868
R10: 0000000000000001 R11: 0000000000000001 R12: ffffa3153ffffcc0
R13: fffffbbc44468000 R14: 0000000000000001 R15: 0000000000000001
FS:  00007f5b3a716740(0000) GS:ffffa3151fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000010959a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __folio_migrate_mapping+0x59e/0x950
 __migrate_folio.constprop.0+0x5f/0x120
 move_to_new_folio+0xfd/0x250
 migrate_pages+0x383/0xd70
 soft_offline_page+0x2ab/0x7f0
 soft_offline_page_store+0x52/0x90
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x380/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xb9/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5b3a514887
RSP: 002b:00007ffe138fce68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f5b3a514887
RDX: 000000000000000c RSI: 0000556ab809ee10 RDI: 0000000000000001
RBP: 0000556ab809ee10 R08: 00007f5b3a5d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007f5b3a61b780 R14: 00007f5b3a617600 R15: 00007f5b3a616a00

It's because hugetlb folio is passed to __folio_undo_large_rmappable()
unexpectedly.  large_rmappable flag is imperceptibly set to hugetlb folio
since commit f6a8dd9 ("hugetlb: convert alloc_buddy_hugetlb_folio to
use a folio").  Then commit be9581e ("mm: fix crashes from deferred
split racing folio migration") makes folio_migrate_mapping() call
folio_undo_large_rmappable() triggering the bug.  Fix this issue by
clearing large_rmappable flag for hugetlb folios.  They don't need that
flag set anyway.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: f6a8dd9 ("hugetlb: convert alloc_buddy_hugetlb_folio to use a folio")
Fixes: be9581e ("mm: fix crashes from deferred split racing folio migration")
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
kimocoder pushed a commit to kimocoder/rpi-linux that referenced this issue Aug 15, 2024
On resume of the Raspberry Pi the dwc2 driver fails to enable
HCD_FLAG_HW_ACCESSIBLE before re-enabling the interrupts.
This causes a situation where both handler ignore a incoming port
interrupt and force the upper layers to disable the dwc2 interrupt line.
This leaves the USB interface in a unusable state:

irq 66: nobody cared (try booting with the "irqpoll" option)
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W          6.10.0-rc3
Hardware name: BCM2835
Call trace:
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x50/0x64
dump_stack_lvl from __report_bad_irq+0x38/0xc0
__report_bad_irq from note_interrupt+0x2ac/0x2f4
note_interrupt from handle_irq_event+0x88/0x8c
handle_irq_event from handle_level_irq+0xb4/0x1ac
handle_level_irq from generic_handle_domain_irq+0x24/0x34
generic_handle_domain_irq from bcm2836_chained_handle_irq+0x24/0x28
bcm2836_chained_handle_irq from generic_handle_domain_irq+0x24/0x34
generic_handle_domain_irq from generic_handle_arch_irq+0x34/0x44
generic_handle_arch_irq from __irq_svc+0x88/0xb0
Exception stack(0xc1b01f20 to 0xc1b01f68)
1f20: 0005c0d4 00000001 00000000 00000000 c1b09780 c1d6b32c c1b04e54 c1a5eae8
1f40: c1b04e90 00000000 00000000 00000000 c1d6a8a0 c1b01f70 c11d2da8 c11d4160
1f60: 60000013 ffffffff
__irq_svc from default_idle_call+0x1c/0xb0
default_idle_call from do_idle+0x21c/0x284
do_idle from cpu_startup_entry+0x28/0x2c
cpu_startup_entry from kernel_init+0x0/0x12c
handlers:
[<f539e0f4>] dwc2_handle_common_intr
[<75cd278b>] usb_hcd_irq
Disabling IRQ raspberrypi#66

Disabling clock gating workaround this issue.

Fixes: 0112b7c ("usb: dwc2: Update dwc2_handle_usb_suspend_intr function.")
Link: https://lore.kernel.org/linux-usb/[email protected]/T/
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Stefan Wahren <[email protected]>
Acked-by: Minas Harutyunyan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests