Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Starlight DTS cleanups and LED support #1

Merged
merged 6 commits into from
May 7, 2021
Merged

Conversation

geertu
Copy link

@geertu geertu commented May 7, 2021

No description provided.

geertu added 6 commits May 7, 2021 15:40
There is no need to use the "clock-output-names" property for fixed
clocks, as the clock's name can be derived from the clock node's name.

Fixes: c653fe8 ("riscv: dts: Add JH7100 and BeagleV Starlight support")
Signed-off-by: Geert Uytterhoeven <[email protected]>
"make dtbs_check":

    arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: /: qspi-clk@0: 'anyOf' conditional failed, one must be
     fixed:
	'reg' is a required property
	'ranges' is a required property
	From schema: dtschema/schemas/root-node.yaml

Fix this by removing the bogus unit-address from the qspi-clk node.

Fixes: c653fe8 ("riscv: dts: Add JH7100 and BeagleV Starlight support")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Interrupt controllers should have "#interrupt-cells".

arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: gpio@11910000: 'gpio-controller' is a dependency of '#gpio-cells'

GPIO controllers should be marked with "gpio-controller".

Fixes: c653fe8 ("riscv: dts: Add JH7100 and BeagleV Starlight support")
Signed-off-by: Geert Uytterhoeven <[email protected]>
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: soc: cache-controller@2010000:reg:0: [0, 33619968, 0, 4096, 0, 134217728, 0, 33554432] is too long
        From schema: dtschema/schemas/simple-bus.yaml
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: soc: spi@11860000:reg:0: [0, 293994496, 0, 65536, 0, 536870912, 0, 536870912] is too long
        From schema: dtschema/schemas/simple-bus.yaml
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: soc: xrp@f0000000:reg:0: [0, 4026531840, 0, 33554431, 16, 1912602624, 0, 4096, 16, 1912606720, 0, 16773120, 0, 306905088, 0, 65536] is too long

To improve human readability and enable automatic validation, the tuples
in "reg" properties should be grouped using angle brackets.

Fixes: c653fe8 ("riscv: dts: Add JH7100 and BeagleV Starlight support")
Signed-off-by: Geert Uytterhoeven <[email protected]>
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: hs_serial@11870000: $nodename:0: 'hs_serial@11870000' does not match '^serial(@[0-9a-f,]+)*$'
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: hs_serial@11880000: $nodename:0: 'hs_serial@11880000' does not match '^serial(@[0-9a-f,]+)*$'
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: sgdma1p@10500000: $nodename:0: 'sgdma1p@10500000' does not match '^dma-controller(@.*)?$'
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: sgdma2p@100b0000: $nodename:0: 'sgdma2p@100b0000' does not match '^dma-controller(@.*)?$'
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: gmac@10020000: $nodename:0: 'gmac@10020000' does not match '^ethernet(@.*)?$'
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: sdio0@10000000: $nodename:0: 'sdio0@10000000' does not match '^mmc(@.*)?$'
arch/riscv/boot/dts/starfive/jh7100-starlight.dt.yaml: sdio1@10010000: $nodename:0: 'sdio1@10010000' does not match '^mmc(@.*)?$'

Use generic node names, to comply with the generic names recommendation
in the devicetree specification.

Fixes: c653fe8 ("riscv: dts: Add JH7100 and BeagleV Starlight support")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Add a device node for the LED labeled "ack", and use it as a heartbeat.

Signed-off-by: Geert Uytterhoeven <[email protected]>
@esmil
Copy link
Owner

esmil commented May 7, 2021

This looks great, thank you. I was just about to look at adding the LED.

I'll just merge this as is for now, but before sending it upstream we'll probably need to merge it into 1 or 2 patches. Do you mind if squash these into a single devicetree commit in my tree before that happens?

@geertu
Copy link
Author

geertu commented May 7, 2021

Feel free to squash them.

@esmil esmil merged commit 0494f98 into esmil:starlight May 7, 2021
esmil pushed a commit that referenced this pull request May 11, 2021
commit eaab1d4 upstream.

Since commit 6815f47 ("ovl: use only uppermetacopy state in
ovl_lookup()"), overlayfs doesn't put temporary dentry when there is a
metacopy error, which leads to dentry leaks when shutting down the related
superblock:

  overlayfs: refusing to follow metacopy origin for (/file0)
  ...
  BUG: Dentry (____ptrval____){i=3f33,n=file3}  still in use (1) [unmount of overlay overlay]
  ...
  WARNING: CPU: 1 PID: 432 at umount_check.cold+0x107/0x14d
  CPU: 1 PID: 432 Comm: unmount-overlay Not tainted 5.12.0-rc5 #1
  ...
  RIP: 0010:umount_check.cold+0x107/0x14d
  ...
  Call Trace:
   d_walk+0x28c/0x950
   ? dentry_lru_isolate+0x2b0/0x2b0
   ? __kasan_slab_free+0x12/0x20
   do_one_tree+0x33/0x60
   shrink_dcache_for_umount+0x78/0x1d0
   generic_shutdown_super+0x70/0x440
   kill_anon_super+0x3e/0x70
   deactivate_locked_super+0xc4/0x160
   deactivate_super+0xfa/0x140
   cleanup_mnt+0x22e/0x370
   __cleanup_mnt+0x1a/0x30
   task_work_run+0x139/0x210
   do_exit+0xb0c/0x2820
   ? __kasan_check_read+0x1d/0x30
   ? find_held_lock+0x35/0x160
   ? lock_release+0x1b6/0x660
   ? mm_update_next_owner+0xa20/0xa20
   ? reacquire_held_locks+0x3f0/0x3f0
   ? __sanitizer_cov_trace_const_cmp4+0x22/0x30
   do_group_exit+0x135/0x380
   __do_sys_exit_group.isra.0+0x20/0x20
   __x64_sys_exit_group+0x3c/0x50
   do_syscall_64+0x45/0x70
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  ...
  VFS: Busy inodes after unmount of overlay. Self-destruct in 5 seconds.  Have a nice day...

This fix has been tested with a syzkaller reproducer.

Cc: Amir Goldstein <[email protected]>
Cc: <[email protected]> # v5.8+
Reported-by: syzbot <[email protected]>
Fixes: 6815f47 ("ovl: use only uppermetacopy state in ovl_lookup()")
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Vivek Goyal <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
commit f02d408 upstream.

Commit a6dcfe0 ("scsi: qla2xxx: Limit interrupt vectors to number of
CPUs") lowers the number of allocated MSI-X vectors to the number of CPUs.

That breaks vector allocation assumptions in qla83xx_iospace_config(),
qla24xx_enable_msix() and qla2x00_iospace_config(). Either of the functions
computes maximum number of qpairs as:

  ha->max_qpairs = ha->msix_count - 1 (MB interrupt) - 1 (default
                   response queue) - 1 (ATIO, in dual or pure target mode)

max_qpairs is set to zero in case of two CPUs and initiator mode. The
number is then used to allocate ha->queue_pair_map inside
qla2x00_alloc_queues(). No allocation happens and ha->queue_pair_map is
left NULL but the driver thinks there are queue pairs available.

qla2xxx_queuecommand() tries to find a qpair in the map and crashes:

  if (ha->mqenable) {
          uint32_t tag;
          uint16_t hwq;
          struct qla_qpair *qpair = NULL;

          tag = blk_mq_unique_tag(cmd->request);
          hwq = blk_mq_unique_tag_to_hwq(tag);
          qpair = ha->queue_pair_map[hwq]; # <- HERE

          if (qpair)
                  return qla2xxx_mqueuecommand(host, cmd, qpair);
  }

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 0 PID: 72 Comm: kworker/u4:3 Tainted: G        W         5.10.0-rc1+ #25
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  Workqueue: scsi_wq_7 fc_scsi_scan_rport [scsi_transport_fc]
  RIP: 0010:qla2xxx_queuecommand+0x16b/0x3f0 [qla2xxx]
  Call Trace:
   scsi_queue_rq+0x58c/0xa60
   blk_mq_dispatch_rq_list+0x2b7/0x6f0
   ? __sbitmap_get_word+0x2a/0x80
   __blk_mq_sched_dispatch_requests+0xb8/0x170
   blk_mq_sched_dispatch_requests+0x2b/0x50
   __blk_mq_run_hw_queue+0x49/0xb0
   __blk_mq_delay_run_hw_queue+0xfb/0x150
   blk_mq_sched_insert_request+0xbe/0x110
   blk_execute_rq+0x45/0x70
   __scsi_execute+0x10e/0x250
   scsi_probe_and_add_lun+0x228/0xda0
   __scsi_scan_target+0xf4/0x620
   ? __pm_runtime_resume+0x4f/0x70
   scsi_scan_target+0x100/0x110
   fc_scsi_scan_rport+0xa1/0xb0 [scsi_transport_fc]
   process_one_work+0x1ea/0x3b0
   worker_thread+0x28/0x3b0
   ? process_one_work+0x3b0/0x3b0
   kthread+0x112/0x130
   ? kthread_park+0x80/0x80
   ret_from_fork+0x22/0x30

The driver should allocate enough vectors to provide every CPU it's own HW
queue and still handle reserved (MB, RSP, ATIO) interrupts.

The change fixes the crash on dual core VM and prevents unbalanced QP
allocation where nr_hw_queues is two less than the number of CPUs.

Link: https://lore.kernel.org/r/[email protected]
Fixes: a6dcfe0 ("scsi: qla2xxx: Limit interrupt vectors to number of CPUs")
Cc: Daniel Wagner <[email protected]>
Cc: Himanshu Madhani <[email protected]>
Cc: Quinn Tran <[email protected]>
Cc: Nilesh Javali <[email protected]>
Cc: Martin K. Petersen <[email protected]>
Cc: [email protected] # 5.11+
Reported-by: Aleksandr Volkov <[email protected]>
Reported-by: Aleksandr Miloserdov <[email protected]>
Reviewed-by: Daniel Wagner <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Roman Bolshakov <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
commit a97709f upstream.

We triggered the following error while running our 4.19 kernel
with the pseudo-NMI patches backported to it:

[   14.816231] ------------[ cut here ]------------
[   14.816231] kernel BUG at irq.c:99!
[   14.816232] Internal error: Oops - BUG: 0 [#1] SMP
[   14.816232] Process swapper/0 (pid: 0, stack limit = 0x(____ptrval____))
[   14.816233] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O      4.19.95.aarch64 #14
[   14.816233] Hardware name: evb (DT)
[   14.816234] pstate: 80400085 (Nzcv daIf +PAN -UAO)
[   14.816234] pc : asm_nmi_enter+0x94/0x98
[   14.816235] lr : asm_nmi_enter+0x18/0x98
[   14.816235] sp : ffff000008003c50
[   14.816235] pmr_save: 00000070
[   14.816237] x29: ffff000008003c50 x28: ffff0000095f56c0
[   14.816238] x27: 0000000000000000 x26: ffff000008004000
[   14.816239] x25: 00000000015e0000 x24: ffff8008fb916000
[   14.816240] x23: 0000000020400005 x22: ffff0000080817cc
[   14.816241] x21: ffff000008003da0 x20: 0000000000000060
[   14.816242] x19: 00000000000003ff x18: ffffffffffffffff
[   14.816243] x17: 0000000000000008 x16: 003d090000000000
[   14.816244] x15: ffff0000095ea6c8 x14: ffff8008fff5ab40
[   14.816244] x13: ffff8008fff58b9d x12: 0000000000000000
[   14.816245] x11: ffff000008c8a200 x10: 000000008e31fca5
[   14.816246] x9 : ffff000008c8a208 x8 : 000000000000000f
[   14.816247] x7 : 0000000000000004 x6 : ffff8008fff58b9e
[   14.816248] x5 : 0000000000000000 x4 : 0000000080000000
[   14.816249] x3 : 0000000000000000 x2 : 0000000080000000
[   14.816250] x1 : 0000000000120000 x0 : ffff0000095f56c0
[   14.816251] Call trace:
[   14.816251]  asm_nmi_enter+0x94/0x98
[   14.816251]  el1_irq+0x8c/0x180                    (IRQ C)
[   14.816252]  gic_handle_irq+0xbc/0x2e4
[   14.816252]  el1_irq+0xcc/0x180                    (IRQ B)
[   14.816253]  arch_timer_handler_virt+0x38/0x58
[   14.816253]  handle_percpu_devid_irq+0x90/0x240
[   14.816253]  generic_handle_irq+0x34/0x50
[   14.816254]  __handle_domain_irq+0x68/0xc0
[   14.816254]  gic_handle_irq+0xf8/0x2e4
[   14.816255]  el1_irq+0xcc/0x180                    (IRQ A)
[   14.816255]  arch_cpu_idle+0x34/0x1c8
[   14.816255]  default_idle_call+0x24/0x44
[   14.816256]  do_idle+0x1d0/0x2c8
[   14.816256]  cpu_startup_entry+0x28/0x30
[   14.816256]  rest_init+0xb8/0xc8
[   14.816257]  start_kernel+0x4c8/0x4f4
[   14.816257] Code: 940587f1 d5384100 b9401001 36a7fd01 (d4210000)
[   14.816258] Modules linked in: start_dp(O) smeth(O)
[   15.103092] ---[ end trace 701753956cb14aa8 ]---
[   15.103093] Kernel panic - not syncing: Fatal exception in interrupt
[   15.103099] SMP: stopping secondary CPUs
[   15.103100] Kernel Offset: disabled
[   15.103100] CPU features: 0x36,a2400218
[   15.103100] Memory Limit: none

which is cause by a 'BUG_ON(in_nmi())' in nmi_enter().

From the call trace, we can find three interrupts (noted A, B, C above):
interrupt (A) is preempted by (B), which is further interrupted by (C).

Subsequent investigations show that (B) results in nmi_enter() being
called, but that it actually is a spurious interrupt. Furthermore,
interrupts are reenabled in the context of (B), and (C) fires with
NMI priority. We end-up with a nested NMI situation, something
we definitely do not want to (and cannot) handle.

The bug here is that spurious interrupts should never result in any
state change, and we should just return to the interrupted context.
Moving the handling of spurious interrupts as early as possible in
the GICv3 handler fixes this issue.

Fixes: 3f1f323 ("irqchip/gic-v3: Switch to PMR masking before calling IRQ handler")
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: He Ying <[email protected]>
[maz: rewrote commit message, corrected Fixes: tag]
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
…after-free

commit 061dde8 upstream.

There is a race between a task aborting a transaction during a commit,
a task doing an fsync and the transaction kthread, which leads to an
use-after-free of the log root tree. When this happens, it results in a
stack trace like the following:

  BTRFS info (device dm-0): forced readonly
  BTRFS warning (device dm-0): Skipping commit of aborted transaction.
  BTRFS: error (device dm-0) in cleanup_transaction:1958: errno=-5 IO failure
  BTRFS warning (device dm-0): lost page write due to IO error on /dev/mapper/error-test (-5)
  BTRFS warning (device dm-0): Skipping commit of aborted transaction.
  BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0xa4e8 len 4096 err no 10
  BTRFS error (device dm-0): error writing primary super block to device 1
  BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e000 len 4096 err no 10
  BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e008 len 4096 err no 10
  BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e010 len 4096 err no 10
  BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 IO failure (1 errors while writing supers)
  BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO failure
  general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b68: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 2 PID: 2458471 Comm: fsstress Not tainted 5.12.0-rc5-btrfs-next-84 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  RIP: 0010:__mutex_lock+0x139/0xa40
  Code: c0 74 19 (...)
  RSP: 0018:ffff9f18830d7b00 EFLAGS: 00010202
  RAX: 6b6b6b6b6b6b6b68 RBX: 0000000000000001 RCX: 0000000000000002
  RDX: ffffffffb9c54d13 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffff9f18830d7bc0 R08: 0000000000000000 R09: 0000000000000000
  R10: ffff9f18830d7be0 R11: 0000000000000001 R12: ffff8c6cd199c040
  R13: ffff8c6c95821358 R14: 00000000fffffffb R15: ffff8c6cbcf01358
  FS:  00007fa9140c2b80(0000) GS:ffff8c6fac600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fa913d52000 CR3: 000000013d2b4003 CR4: 0000000000370ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? __btrfs_handle_fs_error+0xde/0x146 [btrfs]
   ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
   ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
   btrfs_sync_log+0x7c1/0xf20 [btrfs]
   btrfs_sync_file+0x40c/0x580 [btrfs]
   do_fsync+0x38/0x70
   __x64_sys_fsync+0x10/0x20
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7fa9142a55c3
  Code: 8b 15 09 (...)
  RSP: 002b:00007fff26278d48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
  RAX: ffffffffffffffda RBX: 0000563c83cb4560 RCX: 00007fa9142a55c3
  RDX: 00007fff26278cb0 RSI: 00007fff26278cb0 RDI: 0000000000000005
  RBP: 0000000000000005 R08: 0000000000000001 R09: 00007fff26278d5c
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000340
  R13: 00007fff26278de0 R14: 00007fff26278d96 R15: 0000563c83ca57c0
  Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...)
  ---[ end trace ee2f1b19327d791d ]---

The steps that lead to this crash are the following:

1) We are at transaction N;

2) We have two tasks with a transaction handle attached to transaction N.
   Task A and Task B. Task B is doing an fsync;

3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree
   into a local variable named 'log_root_tree' at the top of
   btrfs_sync_log(). Task B is about to call write_all_supers(), but
   before that...

4) Task A calls btrfs_commit_transaction(), and after it sets the
   transaction state to TRANS_STATE_COMMIT_START, an error happens before
   it waits for the transaction's 'num_writers' counter to reach a value
   of 1 (no one else attached to the transaction), so it jumps to the
   label "cleanup_transaction";

5) Task A then calls cleanup_transaction(), where it aborts the
   transaction, setting BTRFS_FS_STATE_TRANS_ABORTED on fs_info->fs_state,
   setting the ->aborted field of the transaction and the handle to an
   errno value and also setting BTRFS_FS_STATE_ERROR on fs_info->fs_state.

   After that, at cleanup_transaction(), it deletes the transaction from
   the list of transactions (fs_info->trans_list), sets the transaction
   to the state TRANS_STATE_COMMIT_DOING and then waits for the number
   of writers to go down to 1, as it's currently 2 (1 for task A and 1
   for task B);

6) The transaction kthread is running and sees that BTRFS_FS_STATE_ERROR
   is set in fs_info->fs_state, so it calls btrfs_cleanup_transaction().

   There it sees the list fs_info->trans_list is empty, and then proceeds
   into calling btrfs_drop_all_logs(), which frees the log root tree with
   a call to btrfs_free_log_root_tree();

7) Task B calls write_all_supers() and, shortly after, under the label
   'out_wake_log_root', it deferences the pointer stored in
   'log_root_tree', which was already freed in the previous step by the
   transaction kthread. This results in a use-after-free leading to a
   crash.

Fix this by deleting the transaction from the list of transactions at
cleanup_transaction() only after setting the transaction state to
TRANS_STATE_COMMIT_DOING and waiting for all existing tasks that are
attached to the transaction to release their transaction handles.
This makes the transaction kthread wait for all the tasks attached to
the transaction to be done with the transaction before dropping the
log roots and doing other cleanups.

Fixes: ef67963 ("btrfs: drop logs when we've aborted a transaction")
CC: [email protected] # 5.10+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
…ocations

[ Upstream commit eafa4fd ]

When we are running out of space for updating the chunk tree, that is,
when we are low on available space in the system space info, if we have
many task concurrently allocating block groups, via fallocate for example,
many of them can end up all allocating new system chunks when only one is
needed. In extreme cases this can lead to exhaustion of the system chunk
array, which has a size limit of 2048 bytes, and results in a transaction
abort with errno EFBIG, producing a trace in dmesg like the following,
which was triggered on a PowerPC machine with a node/leaf size of 64K:

  [1359.518899] ------------[ cut here ]------------
  [1359.518980] BTRFS: Transaction aborted (error -27)
  [1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
  [1359.519152] Modules linked in: (...)
  [1359.519239] Supported: Yes, External
  [1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G               X    5.3.18-47-default #1 SLE15-SP3
  [1359.519274] NIP:  c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8
  [1359.519293] REGS: c00000056890b700 TRAP: 0700   Tainted: G               X     (5.3.18-47-default)
  [1359.519317] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48008222  XER: 00000007
  [1359.519356] CFAR: c00000000013e170 IRQMASK: 0
  [1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026
  [1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007
  [1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000
  [1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0
  [1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff
  [1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000
  [1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0
  [1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08
  [1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
  [1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
  [1359.519626] Call Trace:
  [1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
  [1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
  [1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
  [1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs]
  [1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350
  [1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0
  [1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40
  [1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170
  [1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278
  [1359.520037] Instruction dump:
  [1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000
  [1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378
  [1359.520122] ---[ end trace d6c186e151022e20 ]---

The following steps explain how we can end up in this situation:

1) Task A is at check_system_chunk(), either because it is allocating a
   new data or metadata block group, at btrfs_chunk_alloc(), or because
   it is removing a block group or turning a block group RO. It does not
   matter why;

2) Task A sees that there is not enough free space in the system
   space_info object, that is 'left' is < 'thresh'. And at this point
   the system space_info has a value of 0 for its 'bytes_may_use'
   counter;

3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
   a new system block group (chunk) and then reserves 'thresh' bytes in
   the chunk block reserve with the call to btrfs_block_rsv_add(). This
   changes the chunk block reserve's 'reserved' and 'size' counters by an
   amount of 'thresh', and changes the 'bytes_may_use' counter of the
   system space_info object from 0 to 'thresh'.

   Also during its call to btrfs_alloc_chunk(), we end up increasing the
   value of the 'total_bytes' counter of the system space_info object by
   8MiB (the size of a system chunk stripe). This happens through the
   call chain:

   btrfs_alloc_chunk()
       create_chunk()
           btrfs_make_block_group()
               btrfs_update_space_info()

4) After it finishes the first phase of the block group allocation, at
   btrfs_chunk_alloc(), task A unlocks the chunk mutex;

5) At this point the new system block group was added to the transaction
   handle's list of new block groups, but its block group item, device
   items and chunk item were not yet inserted in the extent, device and
   chunk trees, respectively. That only happens later when we call
   btrfs_finish_chunk_alloc() through a call to
   btrfs_create_pending_block_groups();

   Note that only when we update the chunk tree, through the call to
   btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
   of the chunk block reserve as we COW/allocate extent buffers,
   through:

   btrfs_alloc_tree_block()
      btrfs_use_block_rsv()
         btrfs_block_rsv_use_bytes()

   And the system space_info's 'bytes_may_use' is decremented everytime
   we allocate an extent buffer for COW operations on the chunk tree,
   through:

   btrfs_alloc_tree_block()
      btrfs_reserve_extent()
         find_free_extent()
            btrfs_add_reserved_bytes()

   If we end up COWing less chunk btree nodes/leaves than expected, which
   is the typical case since the amount of space we reserve is always
   pessimistic to account for the worst possible case, we release the
   unused space through:

   btrfs_create_pending_block_groups()
      btrfs_trans_release_chunk_metadata()
         btrfs_block_rsv_release()
            block_rsv_release_bytes()
                btrfs_space_info_free_bytes_may_use()

   But before task A gets into btrfs_create_pending_block_groups()...

6) Many other tasks start allocating new block groups through fallocate,
   each one does the first phase of block group allocation in a
   serialized way, since btrfs_chunk_alloc() takes the chunk mutex
   before calling check_system_chunk() and btrfs_alloc_chunk().

   However before everyone enters the final phase of the block group
   allocation, that is, before calling btrfs_create_pending_block_groups(),
   new tasks keep coming to allocate new block groups and while at
   check_system_chunk(), the system space_info's 'bytes_may_use' keeps
   increasing each time a task reserves space in the chunk block reserve.
   This means that eventually some other task can end up not seeing enough
   free space in the system space_info and decide to allocate yet another
   system chunk.

   This may repeat several times if yet more new tasks keep allocating
   new block groups before task A, and all the other tasks, finish the
   creation of the pending block groups, which is when reserved space
   in excess is released. Eventually this can result in exhaustion of
   system chunk array in the superblock, with btrfs_add_system_chunk()
   returning EFBIG, resulting later in a transaction abort.

   Even when we don't reach the extreme case of exhausting the system
   array, most, if not all, unnecessarily created system block groups
   end up being unused since when finishing creation of the first
   pending system block group, the creation of the following ones end
   up not needing to COW nodes/leaves of the chunk tree, so we never
   allocate and deallocate from them, resulting in them never being
   added to the list of unused block groups - as a consequence they
   don't get deleted by the cleaner kthread - the only exceptions are
   if we unmount and mount the filesystem again, which adds any unused
   block groups to the list of unused block groups, if a scrub is
   run, which also adds unused block groups to the unused list, and
   under some circumstances when using a zoned filesystem or async
   discard, which may also add unused block groups to the unused list.

So fix this by:

*) Tracking the number of reserved bytes for the chunk tree per
   transaction, which is the sum of reserved chunk bytes by each
   transaction handle currently being used;

*) When there is not enough free space in the system space_info,
   if there are other transaction handles which reserved chunk space,
   wait for some of them to complete in order to have enough excess
   reserved space released, and then try again. Otherwise proceed with
   the creation of a new system chunk.

Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
[ Upstream commit a2b2cc6 ]

This patch fixes the following Coverity warning:

    CID 361199 (#1 of 1): Unchecked return value (CHECKED_RETURN)
    3. check_return: Calling qla24xx_get_isp_stats without checking return
    value (as is done elsewhere 4 out of 5 times).

Link: https://lore.kernel.org/r/[email protected]
Cc: Quinn Tran <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Himanshu Madhani <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: Lee Duncan <[email protected]>
Reviewed-by: Daniel Wagner <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
[ Upstream commit 2ce35c0 ]

On bsg command completion, bsg_job_done() was called while qla driver
continued to access the bsg_job buffer. bsg_job_done() would free up
resources that ended up being reused by other task while the driver
continued to access the buffers. As a result, driver was reading garbage
data.

localhost kernel: BUG: KASAN: use-after-free in sg_next+0x64/0x80
localhost kernel: Read of size 8 at addr ffff8883228a3330 by task swapper/26/0
localhost kernel:
localhost kernel: CPU: 26 PID: 0 Comm: swapper/26 Kdump:
loaded Tainted: G          OE    --------- -  - 4.18.0-193.el8.x86_64+debug #1
localhost kernel: Hardware name: HP ProLiant DL360
Gen9/ProLiant DL360 Gen9, BIOS P89 08/12/2016
localhost kernel: Call Trace:
localhost kernel: <IRQ>
localhost kernel: dump_stack+0x9a/0xf0
localhost kernel: print_address_description.cold.3+0x9/0x23b
localhost kernel: kasan_report.cold.4+0x65/0x95
localhost kernel: debug_dma_unmap_sg.part.12+0x10d/0x2d0
localhost kernel: qla2x00_bsg_sp_free+0xaf6/0x1010 [qla2xxx]

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Quinn Tran <[email protected]>
Signed-off-by: Saurav Kashyap <[email protected]>
Signed-off-by: Nilesh Javali <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
[ Upstream commit d736102 ]

Here is the system crash log:
[ 1272.884438] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 1272.884444] IP: [<          (null)>]           (null)
[ 1272.884447] PGD 825b09067 PUD 8267c8067 PMD 0
[ 1272.884452] Oops: 0010 [#1] SMP
[ 1272.884509] CPU: 13 PID: 3485 Comm: cat Kdump: loaded Tainted: G
[ 1272.884515] task: ffff9a38dbd4d140 ti: ffff9a37cd3b8000 task.ti:
ffff9a37cd3b8000
[ 1272.884517] RIP: 0010:[<0000000000000000>]  [<          (null)>]
(null)
[ 1272.884520] RSP: 0018:ffff9a37cd3bbe68  EFLAGS: 00010203
[ 1272.884522] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
0000000000014d5f
[ 1272.884524] RDX: fffffffffffffff4 RSI: 0000000000000001 RDI:
ffff9a38aca4d200
[ 1272.884526] RBP: ffff9a37cd3bbed0 R08: ffff9a38dcd5f1a0 R09:
ffff9a31ffc07300
[ 1272.884527] R10: ffff9a31ffc07300 R11: ffffffffaddd5e9d R12:
ffff9a38b4e0fb00
[ 1272.884529] R13: 0000000000000001 R14: ffff9a37cd3bbf18 R15:
ffff9a38aca4d200
[ 1272.884532] FS:  00007feccaa67740(0000) GS:ffff9a38dcd40000(0000)
knlGS:0000000000000000
[ 1272.884534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1272.884536] CR2: 0000000000000000 CR3: 00000008267c0000 CR4:
00000000003407e0
[ 1272.884537] Call Trace:
[ 1272.884544]  [<ffffffffade68940>] ? seq_read+0x130/0x440
[ 1272.884548]  [<ffffffffade40f8f>] vfs_read+0x9f/0x170
[ 1272.884552]  [<ffffffffade41e4f>] SyS_read+0x7f/0xf0
[ 1272.884557]  [<ffffffffae374ddb>] system_call_fastpath+0x22/0x27
[ 1272.884558] Code:  Bad RIP value.
[ 1272.884562] RIP  [<          (null)>]           (null)
[ 1272.884564]  RSP <ffff9a37cd3bbe68>
[ 1272.884566] CR2: 0000000000000000

Signed-off-by: Qu Huang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
…rallel

[ Upstream commit b168e1d ]

We got a warning message below.
When server tries to close one session by force, it locks the sysfs
interface and locks the srv_sess lock.
The problem is that client can send a request to close at the same time.
By close request, server locks the srv_sess lock and locks the sysfs
to remove the sysfs interfaces.

The simplest way to prevent that situation could be just use
mutex_trylock.

[  234.153965] ======================================================
[  234.154093] WARNING: possible circular locking dependency detected
[  234.154219] 5.4.84-storage #5.4.84-1+feature+linux+5.4.y+dbg+20201216.1319+b6b887b~deb10 Tainted: G           O
[  234.154381] ------------------------------------------------------
[  234.154531] kworker/1:1H/618 is trying to acquire lock:
[  234.154651] ffff8887a09db0a8 (kn->count#132){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
[  234.154819]
               but task is already holding lock:
[  234.154965] ffff8887ae5f6518 (&srv_sess->lock){+.+.}, at: rnbd_srv_rdma_ev+0x144/0x1590 [rnbd_server]
[  234.155132]
               which lock already depends on the new lock.

[  234.155311]
               the existing dependency chain (in reverse order) is:
[  234.155462]
               -> #1 (&srv_sess->lock){+.+.}:
[  234.155614]        __mutex_lock+0x134/0xcb0
[  234.155761]        rnbd_srv_sess_dev_force_close+0x36/0x50 [rnbd_server]
[  234.155889]        rnbd_srv_dev_session_force_close_store+0x69/0xc0 [rnbd_server]
[  234.156042]        kernfs_fop_write+0x13f/0x240
[  234.156162]        vfs_write+0xf3/0x280
[  234.156278]        ksys_write+0xba/0x150
[  234.156395]        do_syscall_64+0x62/0x270
[  234.156513]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  234.156632]
               -> #0 (kn->count#132){++++}:
[  234.156782]        __lock_acquire+0x129e/0x23a0
[  234.156900]        lock_acquire+0xf3/0x210
[  234.157043]        __kernfs_remove+0x42b/0x4c0
[  234.157161]        kernfs_remove_by_name_ns+0x40/0x80
[  234.157282]        remove_files+0x3f/0xa0
[  234.157399]        sysfs_remove_group+0x4a/0xb0
[  234.157519]        rnbd_srv_destroy_dev_session_sysfs+0x19/0x30 [rnbd_server]
[  234.157648]        rnbd_srv_rdma_ev+0x14c/0x1590 [rnbd_server]
[  234.157775]        process_io_req+0x29a/0x6a0 [rtrs_server]
[  234.157924]        __ib_process_cq+0x8c/0x100 [ib_core]
[  234.158709]        ib_cq_poll_work+0x31/0xb0 [ib_core]
[  234.158834]        process_one_work+0x4e5/0xaa0
[  234.158958]        worker_thread+0x65/0x5c0
[  234.159078]        kthread+0x1e0/0x200
[  234.159194]        ret_from_fork+0x24/0x30
[  234.159309]
               other info that might help us debug this:

[  234.159513]  Possible unsafe locking scenario:

[  234.159658]        CPU0                    CPU1
[  234.159775]        ----                    ----
[  234.159891]   lock(&srv_sess->lock);
[  234.160005]                                lock(kn->count#132);
[  234.160128]                                lock(&srv_sess->lock);
[  234.160250]   lock(kn->count#132);
[  234.160364]
                *** DEADLOCK ***

[  234.160536] 3 locks held by kworker/1:1H/618:
[  234.160677]  #0: ffff8883ca1ed528 ((wq_completion)ib-comp-wq){+.+.}, at: process_one_work+0x40a/0xaa0
[  234.160840]  #1: ffff8883d2d5fe10 ((work_completion)(&cq->work)){+.+.}, at: process_one_work+0x40a/0xaa0
[  234.161003]  #2: ffff8887ae5f6518 (&srv_sess->lock){+.+.}, at: rnbd_srv_rdma_ev+0x144/0x1590 [rnbd_server]
[  234.161168]
               stack backtrace:
[  234.161312] CPU: 1 PID: 618 Comm: kworker/1:1H Tainted: G           O      5.4.84-storage #5.4.84-1+feature+linux+5.4.y+dbg+20201216.1319+b6b887b~deb10
[  234.161490] Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.00       09/04/2012
[  234.161643] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[  234.161765] Call Trace:
[  234.161910]  dump_stack+0x96/0xe0
[  234.162028]  check_noncircular+0x29e/0x2e0
[  234.162148]  ? print_circular_bug+0x100/0x100
[  234.162267]  ? register_lock_class+0x1ad/0x8a0
[  234.162385]  ? __lock_acquire+0x68e/0x23a0
[  234.162505]  ? trace_event_raw_event_lock+0x190/0x190
[  234.162626]  __lock_acquire+0x129e/0x23a0
[  234.162746]  ? register_lock_class+0x8a0/0x8a0
[  234.162866]  lock_acquire+0xf3/0x210
[  234.162982]  ? kernfs_remove_by_name_ns+0x40/0x80
[  234.163127]  __kernfs_remove+0x42b/0x4c0
[  234.163243]  ? kernfs_remove_by_name_ns+0x40/0x80
[  234.163363]  ? kernfs_fop_readdir+0x3b0/0x3b0
[  234.163482]  ? strlen+0x1f/0x40
[  234.163596]  ? strcmp+0x30/0x50
[  234.163712]  kernfs_remove_by_name_ns+0x40/0x80
[  234.163832]  remove_files+0x3f/0xa0
[  234.163948]  sysfs_remove_group+0x4a/0xb0
[  234.164068]  rnbd_srv_destroy_dev_session_sysfs+0x19/0x30 [rnbd_server]
[  234.164196]  rnbd_srv_rdma_ev+0x14c/0x1590 [rnbd_server]
[  234.164345]  ? _raw_spin_unlock_irqrestore+0x43/0x50
[  234.164466]  ? lockdep_hardirqs_on+0x1a8/0x290
[  234.164597]  ? mlx4_ib_poll_cq+0x927/0x1280 [mlx4_ib]
[  234.164732]  ? rnbd_get_sess_dev+0x270/0x270 [rnbd_server]
[  234.164859]  process_io_req+0x29a/0x6a0 [rtrs_server]
[  234.164982]  ? rnbd_get_sess_dev+0x270/0x270 [rnbd_server]
[  234.165130]  __ib_process_cq+0x8c/0x100 [ib_core]
[  234.165279]  ib_cq_poll_work+0x31/0xb0 [ib_core]
[  234.165404]  process_one_work+0x4e5/0xaa0
[  234.165550]  ? pwq_dec_nr_in_flight+0x160/0x160
[  234.165675]  ? do_raw_spin_lock+0x119/0x1d0
[  234.165796]  worker_thread+0x65/0x5c0
[  234.165914]  ? process_one_work+0xaa0/0xaa0
[  234.166031]  kthread+0x1e0/0x200
[  234.166147]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  234.166268]  ret_from_fork+0x24/0x30
[  234.251591] rnbd_server L243: </dev/loop1@close_device_session>: Device closed
[  234.604221] rnbd_server L264: RTRS Session close_device_session disconnected

Signed-off-by: Gioh Kim <[email protected]>
Signed-off-by: Md Haris Iqbal <[email protected]>
Signed-off-by: Jack Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
… root

[ Upstream commit f9690f4 ]

Commit dbcc7d5 ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.

However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:

   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/tree-mod-log.c:675!
   invalid opcode: 0000 [#1] SMP KASAN PTI
   CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G        W         5.12.0-7d1efdf501f8-misc-next+ torvalds#99
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
   RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
   Code: 05 48 8d 74 10 (...)
   RSP: 0018:ffffc90001027090 EFLAGS: 00010293
   RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
   RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
   RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
   R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
   R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
   FS:  00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
   Call Trace:
    btrfs_get_old_root+0x16a/0x5c0
    ? lock_downgrade+0x400/0x400
    btrfs_search_old_slot+0x192/0x520
    ? btrfs_search_slot+0x1090/0x1090
    ? free_extent_buffer.part.61+0xd7/0x140
    ? free_extent_buffer+0x13/0x20
    resolve_indirect_refs+0x3e9/0xfc0
    ? lock_downgrade+0x400/0x400
    ? __kasan_check_read+0x11/0x20
    ? add_prelim_ref.part.11+0x150/0x150
    ? lock_downgrade+0x400/0x400
    ? __kasan_check_read+0x11/0x20
    ? lock_acquired+0xbb/0x620
    ? __kasan_check_write+0x14/0x20
    ? do_raw_spin_unlock+0xa8/0x140
    ? rb_insert_color+0x340/0x360
    ? prelim_ref_insert+0x12d/0x430
    find_parent_nodes+0x5c3/0x1830
    ? stack_trace_save+0x87/0xb0
    ? resolve_indirect_refs+0xfc0/0xfc0
    ? fs_reclaim_acquire+0x67/0xf0
    ? __kasan_check_read+0x11/0x20
    ? lockdep_hardirqs_on_prepare+0x210/0x210
    ? fs_reclaim_acquire+0x67/0xf0
    ? __kasan_check_read+0x11/0x20
    ? ___might_sleep+0x10f/0x1e0
    ? __kasan_kmalloc+0x9d/0xd0
    ? trace_hardirqs_on+0x55/0x120
    btrfs_find_all_roots_safe+0x142/0x1e0
    ? find_parent_nodes+0x1830/0x1830
    ? trace_hardirqs_on+0x55/0x120
    ? ulist_free+0x1f/0x30
    ? btrfs_inode_flags_to_xflags+0x50/0x50
    iterate_extent_inodes+0x20e/0x580
    ? tree_backref_for_extent+0x230/0x230
    ? release_extent_buffer+0x225/0x280
    ? read_extent_buffer+0xdd/0x110
    ? lock_downgrade+0x400/0x400
    ? __kasan_check_read+0x11/0x20
    ? lock_acquired+0xbb/0x620
    ? __kasan_check_write+0x14/0x20
    ? do_raw_spin_unlock+0xa8/0x140
    ? _raw_spin_unlock+0x22/0x30
    ? release_extent_buffer+0x225/0x280
    iterate_inodes_from_logical+0x129/0x170
    ? iterate_inodes_from_logical+0x129/0x170
    ? btrfs_inode_flags_to_xflags+0x50/0x50
    ? iterate_extent_inodes+0x580/0x580
    ? __vmalloc_node+0x92/0xb0
    ? init_data_container+0x34/0xb0
    ? init_data_container+0x34/0xb0
    ? kvmalloc_node+0x60/0x80
    btrfs_ioctl_logical_to_ino+0x158/0x230
    btrfs_ioctl+0x2038/0x4360
    ? __kasan_check_write+0x14/0x20
    ? mmput+0x3b/0x220
    ? btrfs_ioctl_get_supported_features+0x30/0x30
    ? __kasan_check_read+0x11/0x20
    ? __kasan_check_read+0x11/0x20
    ? lock_release+0xc8/0x650
    ? __might_fault+0x64/0xd0
    ? __kasan_check_read+0x11/0x20
    ? lock_downgrade+0x400/0x400
    ? lockdep_hardirqs_on_prepare+0x210/0x210
    ? lockdep_hardirqs_on_prepare+0x13/0x210
    ? _raw_spin_unlock_irqrestore+0x51/0x63
    ? __kasan_check_read+0x11/0x20
    ? do_vfs_ioctl+0xfc/0x9d0
    ? ioctl_file_clone+0xe0/0xe0
    ? lock_downgrade+0x400/0x400
    ? lockdep_hardirqs_on_prepare+0x210/0x210
    ? __kasan_check_read+0x11/0x20
    ? lock_release+0xc8/0x650
    ? __task_pid_nr_ns+0xd3/0x250
    ? __kasan_check_read+0x11/0x20
    ? __fget_files+0x160/0x230
    ? __fget_light+0xf2/0x110
    __x64_sys_ioctl+0xc3/0x100
    do_syscall_64+0x37/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
   RIP: 0033:0x7f29ae85b427
   Code: 00 00 90 48 8b (...)
   RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
   RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
   RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
   R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
   R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
   Modules linked in:
   ---[ end trace 85e5fce078dfbe04 ]---

  (gdb) l *(tree_mod_log_rewind+0x3b1)
  0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
  670                      * the modification. As we're going backwards, we do the
  671                      * opposite of each operation here.
  672                      */
  673                     switch (tm->op) {
  674                     case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
  675                             BUG_ON(tm->slot < n);
  676                             fallthrough;
  677                     case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
  678                     case BTRFS_MOD_LOG_KEY_REMOVE:
  679                             btrfs_set_node_key(eb, &tm->key, tm->slot);
  (gdb) quit

The following steps explain in more detail how it happens:

1) We have one tree mod log user (through fiemap or the logical ino ioctl),
   with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
   This is task A;

2) Another task is at ctree.c:balance_level() and we have eb X currently as
   the root of the tree, and we promote its single child, eb Y, as the new
   root.

   Then, at ctree.c:balance_level(), we call:

      ret = btrfs_tree_mod_log_insert_root(root->node, child, true);

3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
   of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
   pointing to ebX->start. We only have one item in eb X, so we create
   only one tree mod log operation, and store in the "tm_list" array;

4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
   log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
   to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
   set to the level of eb X and ->generation set to the generation of eb X;

5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
   "tm_list" as argument. After that, tree_mod_log_free_eb() calls
   tree_mod_log_insert(). This inserts the mod log operation of type
   BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
   with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);

6) Then, after inserting the "tm_list" single element into the tree mod
   log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
   gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);

7) Back to ctree.c:balance_level(), we free eb X by calling
   btrfs_free_tree_block() on it. Because eb X was created in the current
   transaction, has no other references and writeback did not happen for
   it, we add it back to the free space cache/tree;

8) Later some other task B allocates the metadata extent from eb X, since
   it is marked as free space in the space cache/tree, and uses it as a
   node for some other btree;

9) The tree mod log user task calls btrfs_search_old_slot(), which calls
   btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
   with time_seq == 1 and eb_root == eb Y;

10) The first iteration of the while loop finds the tree mod log element
    with sequence number 3, for the logical address of eb Y and of type
    BTRFS_MOD_LOG_ROOT_REPLACE;

11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
    break out of the loop, and set root_logical to point to
    tm->old_root.logical, which corresponds to the logical address of
    eb X;

12) On the next iteration of the while loop, the call to
    tree_mod_log_search_oldest() returns the smallest tree mod log element
    for the logical address of eb X, which has a sequence number of 2, an
    operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
    corresponds to the old slot 0 of eb X (eb X had only 1 item in it
    before being freed at step 7);

13) We then break out of the while loop and return the tree mod log
    operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
    for slot 0 of eb X, to btrfs_get_old_root();

14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
    operation and set "logical" to the logical address of eb X, which was
    the old root. We then call tree_mod_log_search() passing it the logical
    address of eb X and time_seq == 1;

15) But before calling tree_mod_log_search(), task B locks eb X, adds a
    key to eb X, which results in adding a tree mod log operation of type
    BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
    log, and increments the number of items in eb X from 0 to 1.
    Now fs_info->tree_mod_seq has a value of 4;

16) Task A then calls tree_mod_log_search(), which returns the most recent
    tree mod log operation for eb X, which is the one just added by task B
    at the previous step, with a sequence number of 4, a type of
    BTRFS_MOD_LOG_KEY_ADD and for slot 0;

17) Before task A locks and clones eb X, task A adds another key to eb X,
    which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
    with a sequence number of 5, for slot 1 of eb X, increments the
    number of items in eb X from 1 to 2, and unlocks eb X.
    Now fs_info->tree_mod_seq has a value of 5;

18) Task A then locks eb X and clones it. The clone has a value of 2 for
    the number of items and the pointer "tm" points to the tree mod log
    operation with sequence number 4, not the most recent one with a
    sequence number of 5, so there is mismatch between the number of
    mod log operations that are going to be applied to the cloned version
    of eb X and the number of items in the clone;

19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
    tree mod log operation with sequence number 4 and a type of
    BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;

20) At tree_mod_log_rewind(), we set the local variable "n" with a value
    of 2, which is the number of items in the clone of eb X.

    Then in the first iteration of the while loop, we process the mod log
    operation with sequence number 4, which is targeted at slot 0 and has
    a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
    2 to 1.

    Then we pick the next tree mod log operation for eb X, which is the
    tree mod log operation with a sequence number of 2, a type of
    BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
    added in step 5 to the tree mod log tree.

    We go back to the top of the loop to process this mod log operation,
    and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:

        (...)
        switch (tm->op) {
        case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                BUG_ON(tm->slot < n);
                fallthrough;
	(...)

Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.

Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.

Reported-by: Zygo Blaxell <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: 834328a ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: [email protected] # 4.4+
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
commit b1ebaa0 upstream.

The execution of sys_read end up hitting a BUG_ON() in __find_get_block
after installing kprobe at sys_read, the BUG message like the following:

[   65.708663] ------------[ cut here ]------------
[   65.709987] kernel BUG at fs/buffer.c:1251!
[   65.711283] Kernel BUG [#1]
[   65.712032] Modules linked in:
[   65.712925] CPU: 0 PID: 51 Comm: sh Not tainted 5.12.0-rc4 #1
[   65.714407] Hardware name: riscv-virtio,qemu (DT)
[   65.715696] epc : __find_get_block+0x218/0x2c8
[   65.716835]  ra : __getblk_gfp+0x1c/0x4a
[   65.717831] epc : ffffffe00019f11e ra : ffffffe00019f56a sp : ffffffe002437930
[   65.719553]  gp : ffffffe000f06030 tp : ffffffe0015abc00 t0 : ffffffe00191e038
[   65.721290]  t1 : ffffffe00191e038 t2 : 000000000000000a s0 : ffffffe002437960
[   65.723051]  s1 : ffffffe00160ad00 a0 : ffffffe00160ad00 a1 : 000000000000012a
[   65.724772]  a2 : 0000000000000400 a3 : 0000000000000008 a4 : 0000000000000040
[   65.726545]  a5 : 0000000000000000 a6 : ffffffe00191e000 a7 : 0000000000000000
[   65.728308]  s2 : 000000000000012a s3 : 0000000000000400 s4 : 0000000000000008
[   65.730049]  s5 : 000000000000006c s6 : ffffffe00240f800 s7 : ffffffe000f080a8
[   65.731802]  s8 : 0000000000000001 s9 : 000000000000012a s10: 0000000000000008
[   65.733516]  s11: 0000000000000008 t3 : 00000000000003ff t4 : 000000000000000f
[   65.734434]  t5 : 00000000000003ff t6 : 0000000000040000
[   65.734613] status: 0000000000000100 badaddr: 0000000000000000 cause: 0000000000000003
[   65.734901] Call Trace:
[   65.735076] [<ffffffe00019f11e>] __find_get_block+0x218/0x2c8
[   65.735417] [<ffffffe00020017a>] __ext4_get_inode_loc+0xb2/0x2f6
[   65.735618] [<ffffffe000201b6c>] ext4_get_inode_loc+0x3a/0x8a
[   65.735802] [<ffffffe000203380>] ext4_reserve_inode_write+0x2e/0x8c
[   65.735999] [<ffffffe00020357a>] __ext4_mark_inode_dirty+0x4c/0x18e
[   65.736208] [<ffffffe000206bb0>] ext4_dirty_inode+0x46/0x66
[   65.736387] [<ffffffe000192914>] __mark_inode_dirty+0x12c/0x3da
[   65.736576] [<ffffffe000180dd2>] touch_atime+0x146/0x150
[   65.736748] [<ffffffe00010d762>] filemap_read+0x234/0x246
[   65.736920] [<ffffffe00010d834>] generic_file_read_iter+0xc0/0x114
[   65.737114] [<ffffffe0001f5d7a>] ext4_file_read_iter+0x42/0xea
[   65.737310] [<ffffffe000163f2c>] new_sync_read+0xe2/0x15a
[   65.737483] [<ffffffe000165814>] vfs_read+0xca/0xf2
[   65.737641] [<ffffffe000165bae>] ksys_read+0x5e/0xc8
[   65.737816] [<ffffffe000165c26>] sys_read+0xe/0x16
[   65.737973] [<ffffffe000003972>] ret_from_syscall+0x0/0x2
[   65.738858] ---[ end trace fe93f985456c935d ]---

A simple reproducer looks like:
	echo 'p:myprobe sys_read fd=%a0 buf=%a1 count=%a2' > /sys/kernel/debug/tracing/kprobe_events
	echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
	cat /sys/kernel/debug/tracing/trace

Here's what happens to hit that BUG_ON():

1) After installing kprobe at entry of sys_read, the first instruction
   is replaced by 'ebreak' instruction on riscv64 platform.

2) Once kernel reach the 'ebreak' instruction at the entry of sys_read,
   it trap into the riscv breakpoint handler, where it do something to
   setup for coming single-step of origin instruction, including backup
   the 'sstatus' in pt_regs, followed by disable interrupt during single
   stepping via clear 'SIE' bit of 'sstatus' in pt_regs.

3) Then kernel restore to the instruction slot contains two instructions,
   one is original instruction at entry of sys_read, the other is 'ebreak'.
   Here it trigger a 'Instruction page fault' exception (value at 'scause'
   is '0xc'), if PF is not filled into PageTabe for that slot yet.

4) Again kernel trap into page fault exception handler, where it choose
   different policy according to the state of running kprobe. Because
   afte 2) the state is KPROBE_HIT_SS, so kernel reset the current kprobe
   and 'pc' points back to the probe address.

5) Because 'epc' point back to 'ebreak' instrution at sys_read probe,
   kernel trap into breakpoint handler again, and repeat the operations
   at 2), however 'sstatus' without 'SIE' is keep at 4), it cause the
   real 'sstatus' saved at 2) is overwritten by the one withou 'SIE'.

6) When kernel cross the probe the 'sstatus' CSR restore with value
   without 'SIE', and reach __find_get_block where it requires the
   interrupt must be enabled.

Fix this is very trivial, just restore the value of 'sstatus' in pt_regs
with backup one at 2) when the instruction being single stepped cause a
page fault.

Fixes: c22b0bc ("riscv: Add kprobes supported")
Signed-off-by: Liao Chang <[email protected]>
Cc: [email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
commit 44200f2 upstream.

Debian's clang carries a patch that makes the default FPU mode
'vfp3-d16' instead of 'neon' for 'armv7-a' to avoid generating NEON
instructions on hardware that does not support them:

https://salsa.debian.org/pkg-llvm-team/llvm-toolchain/-/raw/5a61ca6f21b4ad8c6ac4970e5ea5a7b5b4486d22/debian/patches/clang-arm-default-vfp3-on-armv7a.patch
https://bugs.debian.org/841474
https://bugs.debian.org/842142
https://bugs.debian.org/914268

This results in the following build error when clang's integrated
assembler is used because the '.arch' directive overrides the '.fpu'
directive:

arch/arm/crypto/curve25519-core.S:25:2: error: instruction requires: NEON
 vmov.i32 q0, #1
 ^
arch/arm/crypto/curve25519-core.S:26:2: error: instruction requires: NEON
 vshr.u64 q1, q0, #7
 ^
arch/arm/crypto/curve25519-core.S:27:2: error: instruction requires: NEON
 vshr.u64 q0, q0, #8
 ^
arch/arm/crypto/curve25519-core.S:28:2: error: instruction requires: NEON
 vmov.i32 d4, #19
 ^

Shuffle the order of the '.arch' and '.fpu' directives so that the code
builds regardless of the default FPU mode. This has been tested against
both clang with and without Debian's patch and GCC.

Cc: [email protected]
Fixes: d8f1308 ("crypto: arm/curve25519 - wire up NEON implementation")
Link: https://github.com/ClangBuiltLinux/continuous-integration2/issues/118
Reported-by: Arnd Bergmann <[email protected]>
Suggested-by: Arnd Bergmann <[email protected]>
Suggested-by: Jessica Clarke <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Acked-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
…ed queues

commit 99ba0ea upstream.

efx->xdp_tx_queue_count is initially initialized to num_possible_cpus() and is
later used to allocate and traverse efx->xdp_tx_queues lookup array. However,
we may end up not initializing all the array slots with real queues during
probing. This results, for example, in a NULL pointer dereference, when running
"# ethtool -S <iface>", similar to below

[2570283.664955][T4126959] BUG: kernel NULL pointer dereference, address: 00000000000000f8
[2570283.681283][T4126959] #PF: supervisor read access in kernel mode
[2570283.695678][T4126959] #PF: error_code(0x0000) - not-present page
[2570283.710013][T4126959] PGD 0 P4D 0
[2570283.721649][T4126959] Oops: 0000 [#1] SMP PTI
[2570283.734108][T4126959] CPU: 23 PID: 4126959 Comm: ethtool Tainted: G           O      5.10.20-cloudflare-2021.3.1 #1
[2570283.752641][T4126959] Hardware name: <redacted>
[2570283.781408][T4126959] RIP: 0010:efx_ethtool_get_stats+0x2ca/0x330 [sfc]
[2570283.796073][T4126959] Code: 00 85 c0 74 39 48 8b 95 a8 0f 00 00 48 85 d2 74 2d 31 c0 eb 07 48 8b 95 a8 0f 00 00 48 63 c8 49 83 c4 08 83 c0 01 48 8b 14 ca <48> 8b 92 f8 00 00 00 49 89 54 24 f8 39 85 a0 0f 00 00 77 d7 48 8b
[2570283.831259][T4126959] RSP: 0018:ffffb79a77657ce8 EFLAGS: 00010202
[2570283.845121][T4126959] RAX: 0000000000000019 RBX: ffffb799cd0c9280 RCX: 0000000000000018
[2570283.860872][T4126959] RDX: 0000000000000000 RSI: ffff96dd970ce000 RDI: 0000000000000005
[2570283.876525][T4126959] RBP: ffff96dd86f0a000 R08: ffff96dd970ce480 R09: 000000000000005f
[2570283.892014][T4126959] R10: ffffb799cd0c9fff R11: ffffb799cd0c9000 R12: ffffb799cd0c94f8
[2570283.907406][T4126959] R13: ffffffffc11b1090 R14: ffff96dd970ce000 R15: ffffffffc11cd66c
[2570283.922705][T4126959] FS:  00007fa7723f8740(0000) GS:ffff96f51fac0000(0000) knlGS:0000000000000000
[2570283.938848][T4126959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2570283.952524][T4126959] CR2: 00000000000000f8 CR3: 0000001a73e6e006 CR4: 00000000007706e0
[2570283.967529][T4126959] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2570283.982400][T4126959] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[2570283.997308][T4126959] PKRU: 55555554
[2570284.007649][T4126959] Call Trace:
[2570284.017598][T4126959]  dev_ethtool+0x1832/0x2830

Fix this by adjusting efx->xdp_tx_queue_count after probing to reflect the true
value of initialized slots in efx->xdp_tx_queues.

Signed-off-by: Ignat Korchagin <[email protected]>
Fixes: e26ca4b ("sfc: reduce the number of requested xdp ev queues")
Cc: <[email protected]> # 5.12.x
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
commit ad91849 upstream.

If there in an error during a set_fmt, do not overwrite the previous
sizes with the invalid config.

Without this patch, v4l2-compliance ends up allocating 4GiB of RAM and
causing the following OOPs

[   38.662975] ipu3-imgu 0000:00:05.0: swiotlb buffer is full (sz: 4096 bytes)
[   38.662980] DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:00:05.0
[   38.663010] general protection fault: 0000 [#1] PREEMPT SMP

Cc: [email protected]
Fixes: 6d5f26f ("media: staging/intel-ipu3-v4l: reduce kernel stack usage")
Signed-off-by: Ricardo Ribalda <[email protected]>
Signed-off-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 12, 2021
…ad fails

commit 8e947c8 upstream.

When loading a device-mapper table for a request-based mapped device,
and the allocation/initialization of the blk_mq_tag_set for the device
fails, a following device remove will cause a double free.

E.g. (dmesg):
  device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device
  device-mapper: ioctl: unable to set up device queue for new table.
  Unable to handle kernel pointer dereference in virtual kernel address space
  Failing address: 0305e098835de000 TEID: 0305e098835de803
  Fault in home space mode while using kernel ASCE.
  AS:000000025efe0007 R3:0000000000000024
  Oops: 0038 ilc:3 [#1] SMP
  Modules linked in: ... lots of modules ...
  Supported: Yes, External
  CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G        W      X    5.3.18-53-default #1 SLE15-SP3
  Hardware name: IBM 8561 T01 7I2 (LPAR)
  Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330)
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
  Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000
             000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000
             000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640
             00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8
  Krnl Code: 000000025e368eb8: c4180041e100       lgrl    %r1,25eba50b8
             000000025e368ebe: ecba06b93a55       risbg   %r11,%r10,6,185,58
            #000000025e368ec4: e3b010000008       ag      %r11,0(%r1)
            >000000025e368eca: e310b0080004       lg      %r1,8(%r11)
             000000025e368ed0: a7110001           tmll    %r1,1
             000000025e368ed4: a7740129           brc     7,25e369126
             000000025e368ed8: e320b0080004       lg      %r2,8(%r11)
             000000025e368ede: b904001b           lgr     %r1,%r11
  Call Trace:
   [<000000025e368eca>] kfree+0x42/0x330
   [<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8
   [<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod]
   [<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod]
   [<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod]
   [<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod]
   [<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod]
   [<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod]
   [<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0
   [<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40
   [<000000025e8c15ac>] system_call+0xd8/0x2c8
  Last Breaking-Event-Address:
   [<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8
  Kernel panic - not syncing: Fatal exception: panic_on_oops

When allocation/initialization of the blk_mq_tag_set fails in
dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer
is not reset to NULL; so when dev_remove() later gets into
dm_mq_cleanup_mapped_device() it sees the pointer and tries to
uninitialize and free it again.

Fix this by setting the pointer to NULL in dm_mq_init_request_queue()
error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device().

Cc: <[email protected]> # 4.6+
Fixes: 1c357a1 ("dm: allocate blk_mq_tag_set rather than embed in mapped_device")
Signed-off-by: Benjamin Block <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2021
commit f7c7a2f upstream.

md_kick_rdev_from_array will remove rdev, so we should
use rdev_for_each_safe to search list.

How to trigger:

env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).

```
node2=192.168.0.3

for i in {1..20}; do
    echo ==== $i `date` ====;

    mdadm -Ss && ssh ${node2} "mdadm -Ss"
    wipefs -a /dev/sda /dev/sdb

    mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
       /dev/sdb --assume-clean
    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
    mdadm --wait /dev/md0
    ssh ${node2} "mdadm --wait /dev/md0"

    mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
    sleep 1
done
```

Crash stack:

```
stack segment: 0000 [#1] SMP
... ...
RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
... ...
RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 raid1d+0x5c/0xd40 [raid1]
 ? finish_task_switch+0x75/0x2a0
 ? lock_timer_base+0x67/0x80
 ? try_to_del_timer_sync+0x4d/0x80
 ? del_timer_sync+0x41/0x50
 ? schedule_timeout+0x254/0x2d0
 ? md_start_sync+0xe0/0xe0 [md_mod]
 ? md_thread+0x127/0x160 [md_mod]
 md_thread+0x127/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 kthread+0x10d/0x130
 ? kthread_park+0xa0/0xa0
 ret_from_fork+0x1f/0x40
```

Fixes: dbb64f8 ("md-cluster: Fix adding of new disk with new reload code")
Fixes: 659b254 ("md-cluster: remove a disk asynchronously from cluster environment")
Cc: [email protected]
Reviewed-by: Gang He <[email protected]>
Signed-off-by: Heming Zhao <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2021
…aster_register"

commit 0d95f41 upstream.

Adding the destroy_workqueue call in i3c_master_register introduced below
kernel warning because it makes duplicate destroy_workqueue calls when
i3c_master_register fails after allocating the workqueue. The workqueue will
be destroyed by i3c_masterdev_release which is called by put_device at the
end of the i3c_master_register function eventually in failure cases so the
workqueue doesn't need to be destroyed in i3c_master_register.

[    6.972952] WARNING: CPU: 1 PID: 1 at lib/list_debug.c:48 __list_del_entry_valid+0x9c/0xf4
[    6.982205] list_del corruption, 8fe03c08->prev is LIST_POISON2 (00000122)
[    6.989910] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G        W         5.10.23-c12838a-dirty-31dc772 #1
[    7.000295] Hardware name: Generic DT based system
[    7.005638] Backtrace:
[    7.008369] [<809133f0>] (dump_backtrace) from [<80913644>] (show_stack+0x20/0x24)
[    7.016819]  r7:00000030 r6:60000013 r5:00000000 r4:813b5d40
[    7.023137] [<80913624>] (show_stack) from [<8091e1a0>] (dump_stack+0x9c/0xb0)
[    7.031201] [<8091e104>] (dump_stack) from [<8011fa30>] (__warn+0xf8/0x154)
[    7.038972]  r7:00000030 r6:00000009 r5:804fa1c8 r4:80b6eca4
[    7.045289] [<8011f938>] (__warn) from [<80913d14>] (warn_slowpath_fmt+0x8c/0xc0)
[    7.053641]  r7:00000030 r6:80b6eca4 r5:80b6ed74 r4:818cc000
[    7.059960] [<80913c8c>] (warn_slowpath_fmt) from [<804fa1c8>] (__list_del_entry_valid+0x9c/0xf4)
[    7.069866]  r9:96becf8c r8:818cc000 r7:8fe03c10 r6:8fe03c00 r5:8fe03ba0 r4:ff7ead4c
[    7.078513] [<804fa12c>] (__list_del_entry_valid) from [<8013f0b4>] (destroy_workqueue+0x1c4/0x23c)
[    7.088615] [<8013eef0>] (destroy_workqueue) from [<806aa124>] (i3c_masterdev_release+0x40/0xb0)
[    7.098421]  r7:00000000 r6:81a43b80 r5:8fe65360 r4:8fe65048
[    7.104740] [<806aa0e4>] (i3c_masterdev_release) from [<805f3f04>] (device_release+0x40/0xb0)
[    7.114254]  r5:00000000 r4:8fe65048
[    7.118245] [<805f3ec4>] (device_release) from [<808fe754>] (kobject_put+0xc8/0x204)
[    7.126885]  r5:813978dc r4:8fe65048
[    7.130877] [<808fe68c>] (kobject_put) from [<805f5fbc>] (put_device+0x20/0x24)
[    7.139037]  r7:8fe65358 r6:8fe65368 r5:8fe65358 r4:8fe65048
[    7.145355] [<805f5f9c>] (put_device) from [<806abac4>] (i3c_master_register+0x338/0xb00)
[    7.154487] [<806ab78c>] (i3c_master_register) from [<806ae084>] (dw_i3c_probe+0x224/0x24c)
[    7.163811]  r10:00000000 r9:8fe7a100 r8:00000032 r7:819fa810 r6:819fa800 r5:8fe65040
[    7.172547]  r4:00000000
[    7.175376] [<806ade60>] (dw_i3c_probe) from [<805fdc14>] (platform_drv_probe+0x44/0x80)
[    7.184409]  r9:813a25c0 r8:00000000 r7:815ec114 r6:00000000 r5:813a25c0 r4:819fa810
[    7.193053] [<805fdbd0>] (platform_drv_probe) from [<805fb83c>] (really_probe+0x108/0x50c)
[    7.202275]  r5:815ec004 r4:819fa810
[    7.206265] [<805fb734>] (really_probe) from [<805fc180>] (driver_probe_device+0xb4/0x190)
[    7.215492]  r10:813dc000 r9:80c4385c r8:000000d9 r7:813a25c0 r6:819fa810 r5:00000000
[    7.224228]  r4:813a25c0
[    7.227055] [<805fc0cc>] (driver_probe_device) from [<805fc5cc>] (device_driver_attach+0xb8/0xc0)
[    7.236959]  r9:80c4385c r8:000000d9 r7:813a25c0 r6:819fa854 r4:819fa810
[    7.244439] [<805fc514>] (device_driver_attach) from [<805fc65c>] (__driver_attach+0x88/0x16c)
[    7.254051]  r7:00000000 r6:819fa810 r5:00000000 r4:813a25c0
[    7.260369] [<805fc5d4>] (__driver_attach) from [<805f954c>] (bus_for_each_dev+0x88/0xc8)
[    7.269489]  r7:00000000 r6:818cc000 r5:805fc5d4 r4:813a25c0
[    7.275806] [<805f94c4>] (bus_for_each_dev) from [<805fc76c>] (driver_attach+0x2c/0x30)
[    7.284739]  r7:81397c98 r6:00000000 r5:8fe7db80 r4:813a25c0
[    7.291057] [<805fc740>] (driver_attach) from [<805f9eec>] (bus_add_driver+0x120/0x200)
[    7.299984] [<805f9dcc>] (bus_add_driver) from [<805fce44>] (driver_register+0x98/0x128)
[    7.309005]  r7:80c4383c r6:00000000 r5:00000000 r4:813a25c0
[    7.315323] [<805fcdac>] (driver_register) from [<805fedb4>] (__platform_driver_register+0x50/0x58)
[    7.325410]  r5:818cc000 r4:81397c98
[    7.329404] [<805fed64>] (__platform_driver_register) from [<80c23398>] (dw_i3c_driver_init+0x24/0x28)
[    7.339790]  r5:818cc000 r4:80c23374
[    7.343784] [<80c23374>] (dw_i3c_driver_init) from [<80c01300>] (do_one_initcall+0xac/0x1d0)
[    7.353206] [<80c01254>] (do_one_initcall) from [<80c01630>] (kernel_init_freeable+0x1a8/0x204)
[    7.362916]  r8:000000d9 r7:80c4383c r6:00000007 r5:819ca2c0 r4:80c67680
[    7.370398] [<80c01488>] (kernel_init_freeable) from [<8091eb18>] (kernel_init+0x18/0x12c)
[    7.379616]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:8091eb00
[    7.388343]  r4:00000000
[    7.391170] [<8091eb00>] (kernel_init) from [<80100148>] (ret_from_fork+0x14/0x2c)
[    7.399607] Exception stack(0x818cdfb0 to 0x818cdff8)
[    7.405243] dfa0:                                     00000000 00000000 00000000 00000000
[    7.414371] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[    7.423499] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[    7.430879]  r5:8091eb00 r4:00000000

This reverts commit 59165d1.

Fixes: 59165d1 ("i3c master: fix missing destroy_workqueue() on error in i3c_master_register")
Signed-off-by: Jae Hyun Yoo <[email protected]>
Signed-off-by: Alexandre Belloni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2021
[ Upstream commit 126bdb6 ]

The spi controller supports 44-bit address space on AXI in DMA mode,
so set dma_addr_t width to 44-bit to avoid using a swiotlb mapping.
In addition, if dma_map_single fails, it should return immediately
instead of continuing doing the DMA operation which bases on invalid
address.

This fixes the following crash which occurs in reading a big block
from flash:

[  123.633577] zynqmp-qspi ff0f0000.spi: swiotlb buffer is full (sz: 4194304 bytes), total 32768 (slots), used 0 (slots)
[  123.644230] zynqmp-qspi ff0f0000.spi: ERR:rxdma:memory not mapped
[  123.784625] Unable to handle kernel paging request at virtual address 00000000003fffc0
[  123.792536] Mem abort info:
[  123.795313]   ESR = 0x96000145
[  123.798351]   EC = 0x25: DABT (current EL), IL = 32 bits
[  123.803655]   SET = 0, FnV = 0
[  123.806693]   EA = 0, S1PTW = 0
[  123.809818] Data abort info:
[  123.812683]   ISV = 0, ISS = 0x00000145
[  123.816503]   CM = 1, WnR = 1
[  123.819455] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000805047000
[  123.825887] [00000000003fffc0] pgd=0000000803b45003, p4d=0000000803b45003, pud=0000000000000000
[  123.834586] Internal error: Oops: 96000145 [#1] PREEMPT SMP

Fixes: 1c26372 ("spi: spi-zynqmp-gqspi: Update driver to use spi-mem framework")
Signed-off-by: Quanyang Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2021
[ Upstream commit b5332a9 ]

We are not changing anything in the TCP connection state so
we should not take a write_lock but rather a read lock.

This caused a deadlock when running nvmet-tcp and nvme-tcp
on the same system, where state_change callbacks on the
host and on the controller side have causal relationship
and made lockdep report on this with blktests:

================================
WARNING: inconsistent lock state
5.12.0-rc3 #1 Tainted: G          I
--------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-R} usage.
nvme/1324 [HC0[0]:SC0[0]:HE1:SE1] takes:
ffff888363151000 (clock-AF_INET){++-?}-{2:2}, at: nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
{IN-SOFTIRQ-W} state was registered at:
  __lock_acquire+0x79b/0x18d0
  lock_acquire+0x1ca/0x480
  _raw_write_lock_bh+0x39/0x80
  nvmet_tcp_state_change+0x21/0x170 [nvmet_tcp]
  tcp_fin+0x2a8/0x780
  tcp_data_queue+0xf94/0x1f20
  tcp_rcv_established+0x6ba/0x1f00
  tcp_v4_do_rcv+0x502/0x760
  tcp_v4_rcv+0x257e/0x3430
  ip_protocol_deliver_rcu+0x69/0x6a0
  ip_local_deliver_finish+0x1e2/0x2f0
  ip_local_deliver+0x1a2/0x420
  ip_rcv+0x4fb/0x6b0
  __netif_receive_skb_one_core+0x162/0x1b0
  process_backlog+0x1ff/0x770
  __napi_poll.constprop.0+0xa9/0x5c0
  net_rx_action+0x7b3/0xb30
  __do_softirq+0x1f0/0x940
  do_softirq+0xa1/0xd0
  __local_bh_enable_ip+0xd8/0x100
  ip_finish_output2+0x6b7/0x18a0
  __ip_queue_xmit+0x706/0x1aa0
  __tcp_transmit_skb+0x2068/0x2e20
  tcp_write_xmit+0xc9e/0x2bb0
  __tcp_push_pending_frames+0x92/0x310
  inet_shutdown+0x158/0x300
  __nvme_tcp_stop_queue+0x36/0x270 [nvme_tcp]
  nvme_tcp_stop_queue+0x87/0xb0 [nvme_tcp]
  nvme_tcp_teardown_admin_queue+0x69/0xe0 [nvme_tcp]
  nvme_do_delete_ctrl+0x100/0x10c [nvme_core]
  nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
  kernfs_fop_write_iter+0x2c7/0x460
  new_sync_write+0x36c/0x610
  vfs_write+0x5c0/0x870
  ksys_write+0xf9/0x1d0
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xae
irq event stamp: 10687
hardirqs last  enabled at (10687): [<ffffffff9ec376bd>] _raw_spin_unlock_irqrestore+0x2d/0x40
hardirqs last disabled at (10686): [<ffffffff9ec374d8>] _raw_spin_lock_irqsave+0x68/0x90
softirqs last  enabled at (10684): [<ffffffff9f000608>] __do_softirq+0x608/0x940
softirqs last disabled at (10649): [<ffffffff9cdedd31>] do_softirq+0xa1/0xd0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(clock-AF_INET);
  <Interrupt>
    lock(clock-AF_INET);

 *** DEADLOCK ***

5 locks held by nvme/1324:
 #0: ffff8884a01fe470 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
 #1: ffff8886e435c090 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x216/0x460
 #2: ffff888104d90c38 (kn->active#255){++++}-{0:0}, at: kernfs_remove_self+0x22d/0x330
 #3: ffff8884634538d0 (&queue->queue_lock){+.+.}-{3:3}, at: nvme_tcp_stop_queue+0x52/0xb0 [nvme_tcp]
 #4: ffff888363150d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_shutdown+0x59/0x300

stack backtrace:
CPU: 26 PID: 1324 Comm: nvme Tainted: G          I       5.12.0-rc3 #1
Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS 2.10.0 11/12/2020
Call Trace:
 dump_stack+0x93/0xc2
 mark_lock_irq.cold+0x2c/0xb3
 ? verify_lock_unused+0x390/0x390
 ? stack_trace_consume_entry+0x160/0x160
 ? lock_downgrade+0x100/0x100
 ? save_trace+0x88/0x5e0
 ? _raw_spin_unlock_irqrestore+0x2d/0x40
 mark_lock+0x530/0x1470
 ? mark_lock_irq+0x1d10/0x1d10
 ? enqueue_timer+0x660/0x660
 mark_usage+0x215/0x2a0
 __lock_acquire+0x79b/0x18d0
 ? tcp_schedule_loss_probe.part.0+0x38c/0x520
 lock_acquire+0x1ca/0x480
 ? nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
 ? rcu_read_unlock+0x40/0x40
 ? tcp_mtu_probe+0x1ae0/0x1ae0
 ? kmalloc_reserve+0xa0/0xa0
 ? sysfs_file_ops+0x170/0x170
 _raw_read_lock+0x3d/0xa0
 ? nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
 nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
 ? sysfs_file_ops+0x170/0x170
 inet_shutdown+0x189/0x300
 __nvme_tcp_stop_queue+0x36/0x270 [nvme_tcp]
 nvme_tcp_stop_queue+0x87/0xb0 [nvme_tcp]
 nvme_tcp_teardown_admin_queue+0x69/0xe0 [nvme_tcp]
 nvme_do_delete_ctrl+0x100/0x10c [nvme_core]
 nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
 kernfs_fop_write_iter+0x2c7/0x460
 new_sync_write+0x36c/0x610
 ? new_sync_read+0x600/0x600
 ? lock_acquire+0x1ca/0x480
 ? rcu_read_unlock+0x40/0x40
 ? lock_is_held_type+0x9a/0x110
 vfs_write+0x5c0/0x870
 ksys_write+0xf9/0x1d0
 ? __ia32_sys_read+0xa0/0xa0
 ? lockdep_hardirqs_on_prepare.part.0+0x198/0x340
 ? syscall_enter_from_user_mode+0x27/0x70
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: 872d26a ("nvmet-tcp: add NVMe over TCP target driver")
Reported-by: Yi Zhang <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2021
[ Upstream commit 1748696 ]

Commit eab2404 ("Bluetooth: Add BT_PHY socket option") added a
dependency between socket lock and hci_dev->lock that could lead to
deadlock.

It turns out that hci_conn_get_phy() is not in any way relying on hdev
being immutable during the runtime of this function, neither does it even
look at any of the members of hdev, and as such there is no need to hold
that lock.

This fixes the lockdep splat below:

 ======================================================
 WARNING: possible circular locking dependency detected
 5.12.0-rc1-00026-g73d464503354 #10 Not tainted
 ------------------------------------------------------
 bluetoothd/1118 is trying to acquire lock:
 ffff8f078383c078 (&hdev->lock){+.+.}-{3:3}, at: hci_conn_get_phy+0x1c/0x150 [bluetooth]

 but task is already holding lock:
 ffff8f07e831d920 (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.}-{0:0}, at: l2cap_sock_getsockopt+0x8b/0x610

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #3 (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.}-{0:0}:
        lock_sock_nested+0x72/0xa0
        l2cap_sock_ready_cb+0x18/0x70 [bluetooth]
        l2cap_config_rsp+0x27a/0x520 [bluetooth]
        l2cap_sig_channel+0x658/0x1330 [bluetooth]
        l2cap_recv_frame+0x1ba/0x310 [bluetooth]
        hci_rx_work+0x1cc/0x640 [bluetooth]
        process_one_work+0x244/0x5f0
        worker_thread+0x3c/0x380
        kthread+0x13e/0x160
        ret_from_fork+0x22/0x30

 -> #2 (&chan->lock#2/1){+.+.}-{3:3}:
        __mutex_lock+0xa3/0xa10
        l2cap_chan_connect+0x33a/0x940 [bluetooth]
        l2cap_sock_connect+0x141/0x2a0 [bluetooth]
        __sys_connect+0x9b/0xc0
        __x64_sys_connect+0x16/0x20
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #1 (&conn->chan_lock){+.+.}-{3:3}:
        __mutex_lock+0xa3/0xa10
        l2cap_chan_connect+0x322/0x940 [bluetooth]
        l2cap_sock_connect+0x141/0x2a0 [bluetooth]
        __sys_connect+0x9b/0xc0
        __x64_sys_connect+0x16/0x20
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #0 (&hdev->lock){+.+.}-{3:3}:
        __lock_acquire+0x147a/0x1a50
        lock_acquire+0x277/0x3d0
        __mutex_lock+0xa3/0xa10
        hci_conn_get_phy+0x1c/0x150 [bluetooth]
        l2cap_sock_getsockopt+0x5a9/0x610 [bluetooth]
        __sys_getsockopt+0xcc/0x200
        __x64_sys_getsockopt+0x20/0x30
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 other info that might help us debug this:

 Chain exists of:
   &hdev->lock --> &chan->lock#2/1 --> sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP);
                                lock(&chan->lock#2/1);
                                lock(sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP);
   lock(&hdev->lock);

  *** DEADLOCK ***

 1 lock held by bluetoothd/1118:
  #0: ffff8f07e831d920 (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.}-{0:0}, at: l2cap_sock_getsockopt+0x8b/0x610 [bluetooth]

 stack backtrace:
 CPU: 3 PID: 1118 Comm: bluetoothd Not tainted 5.12.0-rc1-00026-g73d464503354 #10
 Hardware name: LENOVO 20K5S22R00/20K5S22R00, BIOS R0IET38W (1.16 ) 05/31/2017
 Call Trace:
  dump_stack+0x7f/0xa1
  check_noncircular+0x105/0x120
  ? __lock_acquire+0x147a/0x1a50
  __lock_acquire+0x147a/0x1a50
  lock_acquire+0x277/0x3d0
  ? hci_conn_get_phy+0x1c/0x150 [bluetooth]
  ? __lock_acquire+0x2e1/0x1a50
  ? lock_is_held_type+0xb4/0x120
  ? hci_conn_get_phy+0x1c/0x150 [bluetooth]
  __mutex_lock+0xa3/0xa10
  ? hci_conn_get_phy+0x1c/0x150 [bluetooth]
  ? lock_acquire+0x277/0x3d0
  ? mark_held_locks+0x49/0x70
  ? mark_held_locks+0x49/0x70
  ? hci_conn_get_phy+0x1c/0x150 [bluetooth]
  hci_conn_get_phy+0x1c/0x150 [bluetooth]
  l2cap_sock_getsockopt+0x5a9/0x610 [bluetooth]
  __sys_getsockopt+0xcc/0x200
  __x64_sys_getsockopt+0x20/0x30
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7fb73df33eee
 Code: 48 8b 0d 85 0f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 37 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 52 0f 0c 00 f7 d8 64 89 01 48
 RSP: 002b:00007fffcfbbbf08 EFLAGS: 00000203 ORIG_RAX: 0000000000000037
 RAX: ffffffffffffffda RBX: 0000000000000019 RCX: 00007fb73df33eee
 RDX: 000000000000000e RSI: 0000000000000112 RDI: 0000000000000018
 RBP: 0000000000000000 R08: 00007fffcfbbbf44 R09: 0000000000000000
 R10: 00007fffcfbbbf3c R11: 0000000000000203 R12: 0000000000000000
 R13: 0000000000000018 R14: 0000000000000000 R15: 0000556fcefc70d0

Fixes: eab2404 ("Bluetooth: Add BT_PHY socket option")
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2021
[ Upstream commit b8b2f37 ]

When adding a PTE a ptesync is needed to order the update of the PTE
with subsequent accesses otherwise a spurious fault may be raised.

radix__set_pte_at() does not do this for performance gains. For
non-kernel memory this is not an issue as any faults of this kind are
corrected by the page fault handler. For kernel memory these faults
are not handled. The current solution is that there is a ptesync in
flush_cache_vmap() which should be called when mapping from the
vmalloc region.

However, map_kernel_page() does not call flush_cache_vmap(). This is
troublesome in particular for code patching with Strict RWX on radix.
In do_patch_instruction() the page frame that contains the instruction
to be patched is mapped and then immediately patched. With no ordering
or synchronization between setting up the PTE and writing to the page
it is possible for faults.

As the code patching is done using __put_user_asm_goto() the resulting
fault is obscured - but using a normal store instead it can be seen:

  BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
  Faulting instruction address: 0xc00000000008bd74
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in: nop_module(PO+) [last unloaded: nop_module]
  CPU: 4 PID: 757 Comm: sh Tainted: P           O      5.10.0-rc5-01361-ge3c1b78c8440-dirty torvalds#43
  NIP:  c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
  REGS: c000000016f634a0 TRAP: 0300   Tainted: P           O       (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
  MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44002884  XER: 00000000
  CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1

This results in the kind of issue reported here:
  https://lore.kernel.org/linuxppc-dev/[email protected]/

Chris Riedl suggested a reliable way to reproduce the issue:
  $ mount -t debugfs none /sys/kernel/debug
  $ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done) &

Turning ftrace on and off does a large amount of code patching which
in usually less then 5min will crash giving a trace like:

   ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
   ------------[ ftrace bug ]------------
   ftrace failed to modify
   [<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
    actual:   11:3b:47:4b
   Setting ftrace call site to call ftrace function
   ftrace record flags: 80000001
    (1)
    expected tramp: c00000000006c96c
   ------------[ cut here ]------------
   WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
   Modules linked in: nop_module(PO-) [last unloaded: nop_module]
   CPU: 4 PID: 809 Comm: sh Tainted: P           O      5.10.0-rc5-01360-gf878ccaf250a #1
   NIP:  c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
   REGS: c000000004c8b760 TRAP: 0700   Tainted: P           O       (5.10.0-rc5-01360-gf878ccaf250a)
   MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28008848  XER: 20040000
   CFAR: c0000000001a9c98 IRQMASK: 0
   GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
   GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
   GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
   GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
   GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
   GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
   GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
   GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
   NIP ftrace_bug+0x28c/0x2e8
   LR  ftrace_bug+0x288/0x2e8
   Call Trace:
     ftrace_bug+0x288/0x2e8 (unreliable)
     ftrace_modify_all_code+0x168/0x210
     arch_ftrace_update_code+0x18/0x30
     ftrace_run_update_code+0x44/0xc0
     ftrace_startup+0xf8/0x1c0
     register_ftrace_function+0x4c/0xc0
     function_trace_init+0x80/0xb0
     tracing_set_tracer+0x2a4/0x4f0
     tracing_set_trace_write+0xd4/0x130
     vfs_write+0xf0/0x330
     ksys_write+0x84/0x140
     system_call_exception+0x14c/0x230
     system_call_common+0xf0/0x27c

To fix this when updating kernel memory PTEs using ptesync.

Fixes: f1cb8f9 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
Signed-off-by: Jordan Niethe <[email protected]>
Reviewed-by: Nicholas Piggin <[email protected]>
[mpe: Tidy up change log slightly]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
esmil pushed a commit that referenced this pull request May 6, 2024
Using restricted DMA pools (CONFIG_DMA_RESTRICTED_POOL=y) in conjunction
with dynamic SWIOTLB (CONFIG_SWIOTLB_DYNAMIC=y) leads to the following
crash when initialising the restricted pools at boot-time:

  | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
  | Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
  | pc : rmem_swiotlb_device_init+0xfc/0x1ec
  | lr : rmem_swiotlb_device_init+0xf0/0x1ec
  | Call trace:
  |  rmem_swiotlb_device_init+0xfc/0x1ec
  |  of_reserved_mem_device_init_by_idx+0x18c/0x238
  |  of_dma_configure_id+0x31c/0x33c
  |  platform_dma_configure+0x34/0x80

faddr2line reveals that the crash is in the list validation code:

  include/linux/list.h:83
  include/linux/rculist.h:79
  include/linux/rculist.h:106
  kernel/dma/swiotlb.c:306
  kernel/dma/swiotlb.c:1695

because add_mem_pool() is trying to list_add_rcu() to a NULL
'mem->pools'.

Fix the crash by initialising the 'mem->pools' list_head in
rmem_swiotlb_device_init() before calling add_mem_pool().

Reported-by: Nikita Ioffe <[email protected]>
Tested-by: Nikita Ioffe <[email protected]>
Fixes: 1aaa736 ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2024
We found below OOB crash:

[   33.452494] ==================================================================
[   33.453513] BUG: KASAN: stack-out-of-bounds in refresh_cpu_vm_stats.constprop.0+0xcc/0x2ec
[   33.454660] Write of size 164 at addr c1d03d30 by task swapper/0/0
[   33.455515]
[   33.455767] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O       6.1.25-mainline #1
[   33.456880] Hardware name: Generic DT based system
[   33.457555]  unwind_backtrace from show_stack+0x18/0x1c
[   33.458326]  show_stack from dump_stack_lvl+0x40/0x4c
[   33.459072]  dump_stack_lvl from print_report+0x158/0x4a4
[   33.459863]  print_report from kasan_report+0x9c/0x148
[   33.460616]  kasan_report from kasan_check_range+0x94/0x1a0
[   33.461424]  kasan_check_range from memset+0x20/0x3c
[   33.462157]  memset from refresh_cpu_vm_stats.constprop.0+0xcc/0x2ec
[   33.463064]  refresh_cpu_vm_stats.constprop.0 from tick_nohz_idle_stop_tick+0x180/0x53c
[   33.464181]  tick_nohz_idle_stop_tick from do_idle+0x264/0x354
[   33.465029]  do_idle from cpu_startup_entry+0x20/0x24
[   33.465769]  cpu_startup_entry from rest_init+0xf0/0xf4
[   33.466528]  rest_init from arch_post_acpi_subsys_init+0x0/0x18
[   33.467397]
[   33.467644] The buggy address belongs to stack of task swapper/0/0
[   33.468493]  and is located at offset 112 in frame:
[   33.469172]  refresh_cpu_vm_stats.constprop.0+0x0/0x2ec
[   33.469917]
[   33.470165] This frame has 2 objects:
[   33.470696]  [32, 76) 'global_zone_diff'
[   33.470729]  [112, 276) 'global_node_diff'
[   33.471294]
[   33.472095] The buggy address belongs to the physical page:
[   33.472862] page:3cd72da8 refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x41d03
[   33.473944] flags: 0x1000(reserved|zone=0)
[   33.474565] raw: 00001000 ed741470 ed741470 00000000 00000000 00000000 ffffffff 00000001
[   33.475656] raw: 00000000
[   33.476050] page dumped because: kasan: bad access detected
[   33.476816]
[   33.477061] Memory state around the buggy address:
[   33.477732]  c1d03c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   33.478630]  c1d03c80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00
[   33.479526] >c1d03d00: 00 04 f2 f2 f2 f2 00 00 00 00 00 00 f1 f1 f1 f1
[   33.480415]                                                ^
[   33.481195]  c1d03d80: 00 00 00 00 00 00 00 00 00 00 04 f3 f3 f3 f3 f3
[   33.482088]  c1d03e00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
[   33.482978] ==================================================================

We find the root cause of this OOB is that arm does not clear stale stack
poison in the case of cpuidle.

This patch refer to arch/arm64/kernel/sleep.S to resolve this issue.

From cited commit [1] that explain the problem

Functions which the compiler has instrumented for KASAN place poison on
the stack shadow upon entry and remove this poison prior to returning.

In the case of cpuidle, CPUs exit the kernel a number of levels deep in
C code.  Any instrumented functions on this critical path will leave
portions of the stack shadow poisoned.

If CPUs lose context and return to the kernel via a cold path, we
restore a prior context saved in __cpu_suspend_enter are forgotten, and
we never remove the poison they placed in the stack shadow area by
functions calls between this and the actual exit of the kernel.

Thus, (depending on stackframe layout) subsequent calls to instrumented
functions may hit this stale poison, resulting in (spurious) KASAN
splats to the console.

To avoid this, clear any stale poison from the idle thread for a CPU
prior to bringing a CPU online.

From cited commit [2]

Extend to check for CONFIG_KASAN_STACK

[1] commit 0d97e6d ("arm64: kasan: clear stale stack poison")
[2] commit d56a9ef ("kasan, arm64: unpoison stack only with CONFIG_KASAN_STACK")

Signed-off-by: Boy Wu <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Acked-by: Andrey Ryabinin <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Fixes: 5615f69 ("ARM: 9016/2: Initialize the mapping of KASan shadow memory")
Signed-off-by: Russell King (Oracle) <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2024
We previously would call btrfs_check_leaf() if we had the check
integrity code enabled, which meant that we could only run the extended
leaf checks if we had WRITTEN set on the header flags.

This leaves a gap in our checking, because we could end up with
corruption on disk where WRITTEN isn't set on the leaf, and then the
extended leaf checks don't get run which we rely on to validate all of
the item pointers to make sure we don't access memory outside of the
extent buffer.

However, since 732fab9 ("btrfs: check-integrity: remove
CONFIG_BTRFS_FS_CHECK_INTEGRITY option") we no longer call
btrfs_check_leaf() from btrfs_mark_buffer_dirty(), which means we only
ever call it on blocks that are being written out, and thus have WRITTEN
set, or that are being read in, which should have WRITTEN set.

Add checks to make sure we have WRITTEN set appropriately, and then make
sure __btrfs_check_leaf() always does the item checking.  This will
protect us from file systems that have been corrupted and no longer have
WRITTEN set on some of the blocks.

This was hit on a crafted image tweaking the WRITTEN bit and reported by
KASAN as out-of-bound access in the eb accessors. The example is a dir
item at the end of an eb.

  [2.042] BTRFS warning (device loop1): bad eb member start: ptr 0x3fff start 30572544 member offset 16410 size 2
  [2.040] general protection fault, probably for non-canonical address 0xe0009d1000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI
  [2.537] KASAN: maybe wild-memory-access in range [0x0005088000000018-0x000508800000001f]
  [2.729] CPU: 0 PID: 2587 Comm: mount Not tainted 6.8.2 #1
  [2.729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  [2.621] RIP: 0010:btrfs_get_16+0x34b/0x6d0
  [2.621] RSP: 0018:ffff88810871fab8 EFLAGS: 00000206
  [2.621] RAX: 0000a11000000003 RBX: ffff888104ff8720 RCX: ffff88811b2288c0
  [2.621] RDX: dffffc0000000000 RSI: ffffffff81dd8aca RDI: ffff88810871f748
  [2.621] RBP: 000000000000401a R08: 0000000000000001 R09: ffffed10210e3ee9
  [2.621] R10: ffff88810871f74f R11: 205d323430333737 R12: 000000000000001a
  [2.621] R13: 000508800000001a R14: 1ffff110210e3f5d R15: ffffffff850011e8
  [2.621] FS:  00007f56ea275840(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000
  [2.621] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [2.621] CR2: 00007febd13b75c0 CR3: 000000010bb50000 CR4: 00000000000006f0
  [2.621] Call Trace:
  [2.621]  <TASK>
  [2.621]  ? show_regs+0x74/0x80
  [2.621]  ? die_addr+0x46/0xc0
  [2.621]  ? exc_general_protection+0x161/0x2a0
  [2.621]  ? asm_exc_general_protection+0x26/0x30
  [2.621]  ? btrfs_get_16+0x33a/0x6d0
  [2.621]  ? btrfs_get_16+0x34b/0x6d0
  [2.621]  ? btrfs_get_16+0x33a/0x6d0
  [2.621]  ? __pfx_btrfs_get_16+0x10/0x10
  [2.621]  ? __pfx_mutex_unlock+0x10/0x10
  [2.621]  btrfs_match_dir_item_name+0x101/0x1a0
  [2.621]  btrfs_lookup_dir_item+0x1f3/0x280
  [2.621]  ? __pfx_btrfs_lookup_dir_item+0x10/0x10
  [2.621]  btrfs_get_tree+0xd25/0x1910

Reported-by: lei lu <[email protected]>
CC: [email protected] # 6.7+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
[ copy more details from report ]
Signed-off-by: David Sterba <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2024
TCP_SYN_RECV state is really special, it is only used by
cross-syn connections, mostly used by fuzzers.

In the following crash [1], syzbot managed to trigger a divide
by zero in tcp_rcv_space_adjust()

A socket makes the following state transitions,
without ever calling tcp_init_transfer(),
meaning tcp_init_buffer_space() is also not called.

         TCP_CLOSE
connect()
         TCP_SYN_SENT
         TCP_SYN_RECV
shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN)
         TCP_FIN_WAIT1

To fix this issue, change tcp_shutdown() to not
perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition,
which makes no sense anyway.

When tcp_rcv_state_process() later changes socket state
from TCP_SYN_RECV to TCP_ESTABLISH, then look at
sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state,
and send a FIN packet from a sane socket state.

This means tcp_send_fin() can now be called from BH
context, and must use GFP_ATOMIC allocations.

[1]
divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
 RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767
Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48
RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246
RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7
R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30
R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da
FS:  00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
  tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513
  tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578
  inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680
  sock_recvmsg_nosec net/socket.c:1046 [inline]
  sock_recvmsg+0x109/0x280 net/socket.c:1068
  ____sys_recvmsg+0x1db/0x470 net/socket.c:2803
  ___sys_recvmsg net/socket.c:2845 [inline]
  do_recvmmsg+0x474/0xae0 net/socket.c:2939
  __sys_recvmmsg net/socket.c:3018 [inline]
  __do_sys_recvmmsg net/socket.c:3041 [inline]
  __se_sys_recvmmsg net/socket.c:3034 [inline]
  __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7faeb6363db9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9
RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c
R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001

Fixes: 1da177e ("Linux-2.6.12-rc2")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2024
Anderson Nascimento reported a use-after-free splat in tcp_twsk_unique()
with nice analysis.

Since commit ec94c26 ("tcp/dccp: avoid one atomic operation for
timewait hashdance"), inet_twsk_hashdance() sets TIME-WAIT socket's
sk_refcnt after putting it into ehash and releasing the bucket lock.

Thus, there is a small race window where other threads could try to
reuse the port during connect() and call sock_hold() in tcp_twsk_unique()
for the TIME-WAIT socket with zero refcnt.

If that happens, the refcnt taken by tcp_twsk_unique() is overwritten
and sock_put() will cause underflow, triggering a real use-after-free
somewhere else.

To avoid the use-after-free, we need to use refcount_inc_not_zero() in
tcp_twsk_unique() and give up on reusing the port if it returns false.

[0]:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1039313 at lib/refcount.c:25 refcount_warn_saturate+0xe5/0x110
CPU: 0 PID: 1039313 Comm: trigger Not tainted 6.8.6-200.fc39.x86_64 #1
Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.21805430.B64.2305221830 05/22/2023
RIP: 0010:refcount_warn_saturate+0xe5/0x110
Code: 42 8e ff 0f 0b c3 cc cc cc cc 80 3d aa 13 ea 01 00 0f 85 5e ff ff ff 48 c7 c7 f8 8e b7 82 c6 05 96 13 ea 01 01 e8 7b 42 8e ff <0f> 0b c3 cc cc cc cc 48 c7 c7 50 8f b7 82 c6 05 7a 13 ea 01 01 e8
RSP: 0018:ffffc90006b43b60 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff888009bb3ef0 RCX: 0000000000000027
RDX: ffff88807be218c8 RSI: 0000000000000001 RDI: ffff88807be218c0
RBP: 0000000000069d70 R08: 0000000000000000 R09: ffffc90006b439f0
R10: ffffc90006b439e8 R11: 0000000000000003 R12: ffff8880029ede84
R13: 0000000000004e20 R14: ffffffff84356dc0 R15: ffff888009bb3ef0
FS:  00007f62c10926c0(0000) GS:ffff88807be00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020ccb000 CR3: 000000004628c005 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? refcount_warn_saturate+0xe5/0x110
 ? __warn+0x81/0x130
 ? refcount_warn_saturate+0xe5/0x110
 ? report_bug+0x171/0x1a0
 ? refcount_warn_saturate+0xe5/0x110
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x17/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? refcount_warn_saturate+0xe5/0x110
 tcp_twsk_unique+0x186/0x190
 __inet_check_established+0x176/0x2d0
 __inet_hash_connect+0x74/0x7d0
 ? __pfx___inet_check_established+0x10/0x10
 tcp_v4_connect+0x278/0x530
 __inet_stream_connect+0x10f/0x3d0
 inet_stream_connect+0x3a/0x60
 __sys_connect+0xa8/0xd0
 __x64_sys_connect+0x18/0x20
 do_syscall_64+0x83/0x170
 entry_SYSCALL_64_after_hwframe+0x78/0x80
RIP: 0033:0x7f62c11a885d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a3 45 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f62c1091e58 EFLAGS: 00000296 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000020ccb004 RCX: 00007f62c11a885d
RDX: 0000000000000010 RSI: 0000000020ccb000 RDI: 0000000000000003
RBP: 00007f62c1091e90 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000296 R12: 00007f62c10926c0
R13: ffffffffffffff88 R14: 0000000000000000 R15: 00007ffe237885b0
 </TASK>

Fixes: ec94c26 ("tcp/dccp: avoid one atomic operation for timewait hashdance")
Reported-by: Anderson Nascimento <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2024
There is a race condition between l2cap_chan_timeout() and
l2cap_chan_del(). When we use l2cap_chan_del() to delete the
channel, the chan->conn will be set to null. But the conn could
be dereferenced again in the mutex_lock() of l2cap_chan_timeout().
As a result the null pointer dereference bug will happen. The
KASAN report triggered by POC is shown below:

[  472.074580] ==================================================================
[  472.075284] BUG: KASAN: null-ptr-deref in mutex_lock+0x68/0xc0
[  472.075308] Write of size 8 at addr 0000000000000158 by task kworker/0:0/7
[  472.075308]
[  472.075308] CPU: 0 PID: 7 Comm: kworker/0:0 Not tainted 6.9.0-rc5-00356-g78c0094a146b #36
[  472.075308] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu4
[  472.075308] Workqueue: events l2cap_chan_timeout
[  472.075308] Call Trace:
[  472.075308]  <TASK>
[  472.075308]  dump_stack_lvl+0x137/0x1a0
[  472.075308]  print_report+0x101/0x250
[  472.075308]  ? __virt_addr_valid+0x77/0x160
[  472.075308]  ? mutex_lock+0x68/0xc0
[  472.075308]  kasan_report+0x139/0x170
[  472.075308]  ? mutex_lock+0x68/0xc0
[  472.075308]  kasan_check_range+0x2c3/0x2e0
[  472.075308]  mutex_lock+0x68/0xc0
[  472.075308]  l2cap_chan_timeout+0x181/0x300
[  472.075308]  process_one_work+0x5d2/0xe00
[  472.075308]  worker_thread+0xe1d/0x1660
[  472.075308]  ? pr_cont_work+0x5e0/0x5e0
[  472.075308]  kthread+0x2b7/0x350
[  472.075308]  ? pr_cont_work+0x5e0/0x5e0
[  472.075308]  ? kthread_blkcg+0xd0/0xd0
[  472.075308]  ret_from_fork+0x4d/0x80
[  472.075308]  ? kthread_blkcg+0xd0/0xd0
[  472.075308]  ret_from_fork_asm+0x11/0x20
[  472.075308]  </TASK>
[  472.075308] ==================================================================
[  472.094860] Disabling lock debugging due to kernel taint
[  472.096136] BUG: kernel NULL pointer dereference, address: 0000000000000158
[  472.096136] #PF: supervisor write access in kernel mode
[  472.096136] #PF: error_code(0x0002) - not-present page
[  472.096136] PGD 0 P4D 0
[  472.096136] Oops: 0002 [#1] PREEMPT SMP KASAN NOPTI
[  472.096136] CPU: 0 PID: 7 Comm: kworker/0:0 Tainted: G    B              6.9.0-rc5-00356-g78c0094a146b #36
[  472.096136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu4
[  472.096136] Workqueue: events l2cap_chan_timeout
[  472.096136] RIP: 0010:mutex_lock+0x88/0xc0
[  472.096136] Code: be 08 00 00 00 e8 f8 23 1f fd 4c 89 f7 be 08 00 00 00 e8 eb 23 1f fd 42 80 3c 23 00 74 08 48 88
[  472.096136] RSP: 0018:ffff88800744fc78 EFLAGS: 00000246
[  472.096136] RAX: 0000000000000000 RBX: 1ffff11000e89f8f RCX: ffffffff8457c865
[  472.096136] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88800744fc78
[  472.096136] RBP: 0000000000000158 R08: ffff88800744fc7f R09: 1ffff11000e89f8f
[  472.096136] R10: dffffc0000000000 R11: ffffed1000e89f90 R12: dffffc0000000000
[  472.096136] R13: 0000000000000158 R14: ffff88800744fc78 R15: ffff888007405a00
[  472.096136] FS:  0000000000000000(0000) GS:ffff88806d200000(0000) knlGS:0000000000000000
[  472.096136] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  472.096136] CR2: 0000000000000158 CR3: 000000000da32000 CR4: 00000000000006f0
[  472.096136] Call Trace:
[  472.096136]  <TASK>
[  472.096136]  ? __die_body+0x8d/0xe0
[  472.096136]  ? page_fault_oops+0x6b8/0x9a0
[  472.096136]  ? kernelmode_fixup_or_oops+0x20c/0x2a0
[  472.096136]  ? do_user_addr_fault+0x1027/0x1340
[  472.096136]  ? _printk+0x7a/0xa0
[  472.096136]  ? mutex_lock+0x68/0xc0
[  472.096136]  ? add_taint+0x42/0xd0
[  472.096136]  ? exc_page_fault+0x6a/0x1b0
[  472.096136]  ? asm_exc_page_fault+0x26/0x30
[  472.096136]  ? mutex_lock+0x75/0xc0
[  472.096136]  ? mutex_lock+0x88/0xc0
[  472.096136]  ? mutex_lock+0x75/0xc0
[  472.096136]  l2cap_chan_timeout+0x181/0x300
[  472.096136]  process_one_work+0x5d2/0xe00
[  472.096136]  worker_thread+0xe1d/0x1660
[  472.096136]  ? pr_cont_work+0x5e0/0x5e0
[  472.096136]  kthread+0x2b7/0x350
[  472.096136]  ? pr_cont_work+0x5e0/0x5e0
[  472.096136]  ? kthread_blkcg+0xd0/0xd0
[  472.096136]  ret_from_fork+0x4d/0x80
[  472.096136]  ? kthread_blkcg+0xd0/0xd0
[  472.096136]  ret_from_fork_asm+0x11/0x20
[  472.096136]  </TASK>
[  472.096136] Modules linked in:
[  472.096136] CR2: 0000000000000158
[  472.096136] ---[ end trace 0000000000000000 ]---
[  472.096136] RIP: 0010:mutex_lock+0x88/0xc0
[  472.096136] Code: be 08 00 00 00 e8 f8 23 1f fd 4c 89 f7 be 08 00 00 00 e8 eb 23 1f fd 42 80 3c 23 00 74 08 48 88
[  472.096136] RSP: 0018:ffff88800744fc78 EFLAGS: 00000246
[  472.096136] RAX: 0000000000000000 RBX: 1ffff11000e89f8f RCX: ffffffff8457c865
[  472.096136] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88800744fc78
[  472.096136] RBP: 0000000000000158 R08: ffff88800744fc7f R09: 1ffff11000e89f8f
[  472.132932] R10: dffffc0000000000 R11: ffffed1000e89f90 R12: dffffc0000000000
[  472.132932] R13: 0000000000000158 R14: ffff88800744fc78 R15: ffff888007405a00
[  472.132932] FS:  0000000000000000(0000) GS:ffff88806d200000(0000) knlGS:0000000000000000
[  472.132932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  472.132932] CR2: 0000000000000158 CR3: 000000000da32000 CR4: 00000000000006f0
[  472.132932] Kernel panic - not syncing: Fatal exception
[  472.132932] Kernel Offset: disabled
[  472.132932] ---[ end Kernel panic - not syncing: Fatal exception ]---

Add a check to judge whether the conn is null in l2cap_chan_timeout()
in order to mitigate the bug.

Fixes: 3df91ea ("Bluetooth: Revert to mutexes from RCU list")
Signed-off-by: Duoming Zhou <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2024
syzbot is able to trigger the following crash [1],
caused by unsafe ip6_dst_idev() use.

Indeed ip6_dst_idev() can return NULL, and must always be checked.

[1]

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 31648 Comm: syz-executor.0 Not tainted 6.9.0-rc4-next-20240417-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
 RIP: 0010:__fib6_rule_action net/ipv6/fib6_rules.c:237 [inline]
 RIP: 0010:fib6_rule_action+0x241/0x7b0 net/ipv6/fib6_rules.c:267
Code: 02 00 00 49 8d 9f d8 00 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 f9 32 bf f7 48 8b 1b 48 89 d8 48 c1 e8 03 <42> 80 3c 20 00 74 08 48 89 df e8 e0 32 bf f7 4c 8b 03 48 89 ef 4c
RSP: 0018:ffffc9000fc1f2f0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 1a772f98c8186700
RDX: 0000000000000003 RSI: ffffffff8bcac4e0 RDI: ffffffff8c1f9760
RBP: ffff8880673fb980 R08: ffffffff8fac15ef R09: 1ffffffff1f582bd
R10: dffffc0000000000 R11: fffffbfff1f582be R12: dffffc0000000000
R13: 0000000000000080 R14: ffff888076509000 R15: ffff88807a029a00
FS:  00007f55e82ca6c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b31d23000 CR3: 0000000022b66000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
  fib_rules_lookup+0x62c/0xdb0 net/core/fib_rules.c:317
  fib6_rule_lookup+0x1fd/0x790 net/ipv6/fib6_rules.c:108
  ip6_route_output_flags_noref net/ipv6/route.c:2637 [inline]
  ip6_route_output_flags+0x38e/0x610 net/ipv6/route.c:2649
  ip6_route_output include/net/ip6_route.h:93 [inline]
  ip6_dst_lookup_tail+0x189/0x11a0 net/ipv6/ip6_output.c:1120
  ip6_dst_lookup_flow+0xb9/0x180 net/ipv6/ip6_output.c:1250
  sctp_v6_get_dst+0x792/0x1e20 net/sctp/ipv6.c:326
  sctp_transport_route+0x12c/0x2e0 net/sctp/transport.c:455
  sctp_assoc_add_peer+0x614/0x15c0 net/sctp/associola.c:662
  sctp_connect_new_asoc+0x31d/0x6c0 net/sctp/socket.c:1099
  __sctp_connect+0x66d/0xe30 net/sctp/socket.c:1197
  sctp_connect net/sctp/socket.c:4819 [inline]
  sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834
  __sys_connect_file net/socket.c:2048 [inline]
  __sys_connect+0x2df/0x310 net/socket.c:2065
  __do_sys_connect net/socket.c:2075 [inline]
  __se_sys_connect net/socket.c:2072 [inline]
  __x64_sys_connect+0x7a/0x90 net/socket.c:2072
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 5e5f3f0 ("[IPV6] ADDRCONF: Convert ipv6_get_saddr() to ipv6_dev_get_saddr().")
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2024
According to syzbot, there is a chance that ip6_dst_idev()
returns NULL in ip6_output(). Most places in IPv6 stack
deal with a NULL idev just fine, but not here.

syzbot reported:

general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7]
CPU: 0 PID: 9775 Comm: syz-executor.4 Not tainted 6.9.0-rc5-syzkaller-00157-g6a30653b604a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
 RIP: 0010:ip6_output+0x231/0x3f0 net/ipv6/ip6_output.c:237
Code: 3c 1e 00 49 89 df 74 08 4c 89 ef e8 19 58 db f7 48 8b 44 24 20 49 89 45 00 49 89 c5 48 8d 9d e0 05 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 38 84 c0 4c 8b 74 24 28 0f 85 61 01 00 00 8b 1b 31 ff
RSP: 0018:ffffc9000927f0d8 EFLAGS: 00010202
RAX: 00000000000000bc RBX: 00000000000005e0 RCX: 0000000000040000
RDX: ffffc900131f9000 RSI: 0000000000004f47 RDI: 0000000000004f48
RBP: 0000000000000000 R08: ffffffff8a1f0b9a R09: 1ffffffff1f51fad
R10: dffffc0000000000 R11: fffffbfff1f51fae R12: ffff8880293ec8c0
R13: ffff88805d7fc000 R14: 1ffff1100527d91a R15: dffffc0000000000
FS:  00007f135c6856c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000080 CR3: 0000000064096000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
  NF_HOOK include/linux/netfilter.h:314 [inline]
  ip6_xmit+0xefe/0x17f0 net/ipv6/ip6_output.c:358
  sctp_v6_xmit+0x9f2/0x13f0 net/sctp/ipv6.c:248
  sctp_packet_transmit+0x26ad/0x2ca0 net/sctp/output.c:653
  sctp_packet_singleton+0x22c/0x320 net/sctp/outqueue.c:783
  sctp_outq_flush_ctrl net/sctp/outqueue.c:914 [inline]
  sctp_outq_flush+0x6d5/0x3e20 net/sctp/outqueue.c:1212
  sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline]
  sctp_do_sm+0x59cc/0x60c0 net/sctp/sm_sideeffect.c:1169
  sctp_primitive_ASSOCIATE+0x95/0xc0 net/sctp/primitive.c:73
  __sctp_connect+0x9cd/0xe30 net/sctp/socket.c:1234
  sctp_connect net/sctp/socket.c:4819 [inline]
  sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834
  __sys_connect_file net/socket.c:2048 [inline]
  __sys_connect+0x2df/0x310 net/socket.c:2065
  __do_sys_connect net/socket.c:2075 [inline]
  __se_sys_connect net/socket.c:2072 [inline]
  __x64_sys_connect+0x7a/0x90 net/socket.c:2072
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 778d80b ("ipv6: Add disable_ipv6 sysctl to disable IPv6 operaion on specific interface.")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Larysa Zaremba <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
esmil pushed a commit that referenced this pull request May 14, 2024
This was missed because of the function pointer indirection.

nvidia_smmu_context_fault() is also installed as a irq function, and the
'void *' was changed to a struct arm_smmu_domain. Since the iommu_domain
is embedded at a non-zero offset this causes nvidia_smmu_context_fault()
to miscompute the offset. Fixup the types.

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000120
  Mem abort info:
    ESR = 0x0000000096000004
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    FSC = 0x04: level 0 translation fault
  Data abort info:
    ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
    CM = 0, WnR = 0, TnD = 0, TagAccess = 0
    GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
  user pgtable: 4k pages, 48-bit VAs, pgdp=0000000107c9f000
  [0000000000000120] pgd=0000000000000000, p4d=0000000000000000
  Internal error: Oops: 0000000096000004 [#1] SMP
  Modules linked in:
  CPU: 1 PID: 47 Comm: kworker/u25:0 Not tainted 6.9.0-0.rc7.58.eln136.aarch64 #1
  Hardware name: Unknown NVIDIA Jetson Orin NX/NVIDIA Jetson Orin NX, BIOS 3.1-32827747 03/19/2023
  Workqueue: events_unbound deferred_probe_work_func
  pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : nvidia_smmu_context_fault+0x1c/0x158
  lr : __free_irq+0x1d4/0x2e8
  sp : ffff80008044b6f0
  x29: ffff80008044b6f0 x28: ffff000080a60b18 x27: ffffd32b5172e970
  x26: 0000000000000000 x25: ffff0000802f5aac x24: ffff0000802f5a30
  x23: ffff0000802f5b60 x22: 0000000000000057 x21: 0000000000000000
  x20: ffff0000802f5a00 x19: ffff000087d4cd80 x18: ffffffffffffffff
  x17: 6234362066666666 x16: 6630303078302d30 x15: ffff00008156d888
  x14: 0000000000000000 x13: ffff0000801db910 x12: ffff00008156d6d0
  x11: 0000000000000003 x10: ffff0000801db918 x9 : ffffd32b50f94d9c
  x8 : 1fffe0001032fda1 x7 : ffff00008197ed00 x6 : 000000000000000f
  x5 : 000000000000010e x4 : 000000000000010e x3 : 0000000000000000
  x2 : ffffd32b51720cd8 x1 : ffff000087e6f700 x0 : 0000000000000057
  Call trace:
   nvidia_smmu_context_fault+0x1c/0x158
   __free_irq+0x1d4/0x2e8
   free_irq+0x3c/0x80
   devm_free_irq+0x64/0xa8
   arm_smmu_domain_free+0xc4/0x158
   iommu_domain_free+0x44/0xa0
   iommu_deinit_device+0xd0/0xf8
   __iommu_group_remove_device+0xcc/0xe0
   iommu_bus_notifier+0x64/0xa8
   notifier_call_chain+0x78/0x148
   blocking_notifier_call_chain+0x4c/0x90
   bus_notify+0x44/0x70
   device_del+0x264/0x3e8
   pci_remove_bus_device+0x84/0x120
   pci_remove_root_bus+0x5c/0xc0
   dw_pcie_host_deinit+0x38/0xe0
   tegra_pcie_config_rp+0xc0/0x1f0
   tegra_pcie_dw_probe+0x34c/0x700
   platform_probe+0x70/0xe8
   really_probe+0xc8/0x3a0
   __driver_probe_device+0x84/0x160
   driver_probe_device+0x44/0x130
   __device_attach_driver+0xc4/0x170
   bus_for_each_drv+0x90/0x100
   __device_attach+0xa8/0x1c8
   device_initial_probe+0x1c/0x30
   bus_probe_device+0xb0/0xc0
   deferred_probe_work_func+0xbc/0x120
   process_one_work+0x194/0x490
   worker_thread+0x284/0x3b0
   kthread+0xf4/0x108
   ret_from_fork+0x10/0x20
  Code: a9b97bfd 910003fd a9025bf5 f85a0035 (b94122a1)

Cc: [email protected]
Fixes: e097633 ("iommu/arm-smmu: Pass arm_smmu_domain to internal functions")
Reported-by: Jerry Snitselaar <[email protected]>
Closes: https://lore.kernel.org/all/jto5e3ili4auk6sbzpnojdvhppgwuegir7mpd755anfhwcbkfz@2u5gh7bxb4iv
Signed-off-by: Jason Gunthorpe <[email protected]>
Tested-by: Jerry Snitselaar <[email protected]>
Acked-by: Jerry Snitselaar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Error out if kvm_mmu_reload() fails when pre-faulting memory, as trying to
fault-in SPTEs will fail miserably due to root.hpa pointing at garbage.

Note, kvm_mmu_reload() can return -EIO and thus trigger the WARN on -EIO
in kvm_vcpu_pre_fault_memory(), but all such paths also WARN, i.e. the
WARN isn't user-triggerable and won't run afoul of warn-on-panic because
the kernel would already be panicking.

  BUG: unable to handle page fault for address: 000029ffffffffe8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP
  CPU: 22 PID: 1069 Comm: pre_fault_memor Not tainted 6.10.0-rc7-332d2c1d713e-next-vm torvalds#548
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:is_page_fault_stale+0x3e/0xe0 [kvm]
  RSP: 0018:ffffc9000114bd48 EFLAGS: 00010206
  RAX: 00003fffffffffc0 RBX: ffff88810a07c080 RCX: ffffc9000114bd78
  RDX: ffff88810a07c080 RSI: ffffea0000000000 RDI: ffff88810a07c080
  RBP: ffffc9000114bd78 R08: 00007fa3c8c00000 R09: 8000000000000225
  R10: ffffea00043d7d80 R11: 0000000000000000 R12: ffff88810a07c080
  R13: 0000000100000000 R14: ffffc9000114be58 R15: 0000000000000000
  FS:  00007fa3c9da0740(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000029ffffffffe8 CR3: 000000011d698000 CR4: 0000000000352eb0
  Call Trace:
   <TASK>
   kvm_tdp_page_fault+0xcc/0x160 [kvm]
   kvm_mmu_do_page_fault+0xfb/0x1f0 [kvm]
   kvm_arch_vcpu_pre_fault_memory+0xd0/0x1a0 [kvm]
   kvm_vcpu_ioctl+0x761/0x8c0 [kvm]
   __x64_sys_ioctl+0x82/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  Modules linked in: kvm_intel kvm
  CR2: 000029ffffffffe8
  ---[ end trace 0000000000000000 ]---

Fixes: 6e01b76 ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()")
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]
Reviewed-by: Kai Huang <[email protected]>
Tested-by: xingwei lee <[email protected]>
Tested-by: yuxin wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
This fixes the random kernel crash seen while removing the driver, when
running the load/unload test over multiple iterations.

1) modprobe btnxpuart
2) hciconfig hci0 reset
3) hciconfig (check hci0 interface up with valid BD address)
4) modprobe -r btnxpuart
Repeat steps 1 to 4

The ps_wakeup() call in btnxpuart_close() schedules the psdata->work(),
which gets scheduled after module is removed, causing a kernel crash.

This hidden issue got highlighted after enabling Power Save by default
in 4183a7b (Bluetooth: btnxpuart: Enable Power Save feature on
startup)

The new ps_cleanup() deasserts UART break immediately while closing
serdev device, cancels any scheduled ps_work and destroys the ps_lock
mutex.

[   85.884604] Unable to handle kernel paging request at virtual address ffffd4a61638f258
[   85.884624] Mem abort info:
[   85.884625]   ESR = 0x0000000086000007
[   85.884628]   EC = 0x21: IABT (current EL), IL = 32 bits
[   85.884633]   SET = 0, FnV = 0
[   85.884636]   EA = 0, S1PTW = 0
[   85.884638]   FSC = 0x07: level 3 translation fault
[   85.884642] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000041dd0000
[   85.884646] [ffffd4a61638f258] pgd=1000000095fff003, p4d=1000000095fff003, pud=100000004823d003, pmd=100000004823e003, pte=0000000000000000
[   85.884662] Internal error: Oops: 0000000086000007 [#1] PREEMPT SMP
[   85.890932] Modules linked in: algif_hash algif_skcipher af_alg overlay fsl_jr_uio caam_jr caamkeyblob_desc caamhash_desc caamalg_desc crypto_engine authenc libdes crct10dif_ce polyval_ce polyval_generic snd_soc_imx_spdif snd_soc_imx_card snd_soc_ak5558 snd_soc_ak4458 caam secvio error snd_soc_fsl_spdif snd_soc_fsl_micfil snd_soc_fsl_sai snd_soc_fsl_utils gpio_ir_recv rc_core fuse [last unloaded: btnxpuart(O)]
[   85.927297] CPU: 1 PID: 67 Comm: kworker/1:3 Tainted: G           O       6.1.36+g937b1be4345a #1
[   85.936176] Hardware name: FSL i.MX8MM EVK board (DT)
[   85.936182] Workqueue: events 0xffffd4a61638f380
[   85.936198] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   85.952817] pc : 0xffffd4a61638f258
[   85.952823] lr : 0xffffd4a61638f258
[   85.952827] sp : ffff8000084fbd70
[   85.952829] x29: ffff8000084fbd70 x28: 0000000000000000 x27: 0000000000000000
[   85.963112] x26: ffffd4a69133f000 x25: ffff4bf1c8540990 x24: ffff4bf215b87305
[   85.963119] x23: ffff4bf215b87300 x22: ffff4bf1c85409d0 x21: ffff4bf1c8540970
[   85.977382] x20: 0000000000000000 x19: ffff4bf1c8540880 x18: 0000000000000000
[   85.977391] x17: 0000000000000000 x16: 0000000000000133 x15: 0000ffffe2217090
[   85.977399] x14: 0000000000000001 x13: 0000000000000133 x12: 0000000000000139
[   85.977407] x11: 0000000000000001 x10: 0000000000000a60 x9 : ffff8000084fbc50
[   85.977417] x8 : ffff4bf215b7d000 x7 : ffff4bf215b83b40 x6 : 00000000000003e8
[   85.977424] x5 : 00000000410fd030 x4 : 0000000000000000 x3 : 0000000000000000
[   85.977432] x2 : 0000000000000000 x1 : ffff4bf1c4265880 x0 : 0000000000000000
[   85.977443] Call trace:
[   85.977446]  0xffffd4a61638f258
[   85.977451]  0xffffd4a61638f3e8
[   85.977455]  process_one_work+0x1d4/0x330
[   85.977464]  worker_thread+0x6c/0x430
[   85.977471]  kthread+0x108/0x10c
[   85.977476]  ret_from_fork+0x10/0x20
[   85.977488] Code: bad PC value
[   85.977491] ---[ end trace 0000000000000000 ]---

Preset since v6.9.11
Fixes: 86d55f1 ("Bluetooth: btnxpuart: Deasset UART break before closing serdev device")
Signed-off-by: Neeraj Sanjay Kale <[email protected]>
Reviewed-by: Paul Menzel <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
In the cited commit, bond->ipsec_lock is added to protect ipsec_list,
hence xdo_dev_state_add and xdo_dev_state_delete are called inside
this lock. As ipsec_lock is a spin lock and such xfrmdev ops may sleep,
"scheduling while atomic" will be triggered when changing bond's
active slave.

[  101.055189] BUG: scheduling while atomic: bash/902/0x00000200
[  101.055726] Modules linked in:
[  101.058211] CPU: 3 PID: 902 Comm: bash Not tainted 6.9.0-rc4+ #1
[  101.058760] Hardware name:
[  101.059434] Call Trace:
[  101.059436]  <TASK>
[  101.060873]  dump_stack_lvl+0x51/0x60
[  101.061275]  __schedule_bug+0x4e/0x60
[  101.061682]  __schedule+0x612/0x7c0
[  101.062078]  ? __mod_timer+0x25c/0x370
[  101.062486]  schedule+0x25/0xd0
[  101.062845]  schedule_timeout+0x77/0xf0
[  101.063265]  ? asm_common_interrupt+0x22/0x40
[  101.063724]  ? __bpf_trace_itimer_state+0x10/0x10
[  101.064215]  __wait_for_common+0x87/0x190
[  101.064648]  ? usleep_range_state+0x90/0x90
[  101.065091]  cmd_exec+0x437/0xb20 [mlx5_core]
[  101.065569]  mlx5_cmd_do+0x1e/0x40 [mlx5_core]
[  101.066051]  mlx5_cmd_exec+0x18/0x30 [mlx5_core]
[  101.066552]  mlx5_crypto_create_dek_key+0xea/0x120 [mlx5_core]
[  101.067163]  ? bonding_sysfs_store_option+0x4d/0x80 [bonding]
[  101.067738]  ? kmalloc_trace+0x4d/0x350
[  101.068156]  mlx5_ipsec_create_sa_ctx+0x33/0x100 [mlx5_core]
[  101.068747]  mlx5e_xfrm_add_state+0x47b/0xaa0 [mlx5_core]
[  101.069312]  bond_change_active_slave+0x392/0x900 [bonding]
[  101.069868]  bond_option_active_slave_set+0x1c2/0x240 [bonding]
[  101.070454]  __bond_opt_set+0xa6/0x430 [bonding]
[  101.070935]  __bond_opt_set_notify+0x2f/0x90 [bonding]
[  101.071453]  bond_opt_tryset_rtnl+0x72/0xb0 [bonding]
[  101.071965]  bonding_sysfs_store_option+0x4d/0x80 [bonding]
[  101.072567]  kernfs_fop_write_iter+0x10c/0x1a0
[  101.073033]  vfs_write+0x2d8/0x400
[  101.073416]  ? alloc_fd+0x48/0x180
[  101.073798]  ksys_write+0x5f/0xe0
[  101.074175]  do_syscall_64+0x52/0x110
[  101.074576]  entry_SYSCALL_64_after_hwframe+0x4b/0x53

As bond_ipsec_add_sa_all and bond_ipsec_del_sa_all are only called
from bond_change_active_slave, which requires holding the RTNL lock.
And bond_ipsec_add_sa and bond_ipsec_del_sa are xfrm state
xdo_dev_state_add and xdo_dev_state_delete APIs, which are in user
context. So ipsec_lock doesn't have to be spin lock, change it to
mutex, and thus the above issue can be resolved.

Fixes: 9a56055 ("bonding: Add struct bond_ipesc to manage SA")
Signed-off-by: Jianbo Liu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Jay Vosburgh <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Ethtool callbacks can be executed while reset is in progress and try to
access deleted resources, e.g. getting coalesce settings can result in a
NULL pointer dereference seen below.

Reproduction steps:
Once the driver is fully initialized, trigger reset:
	# echo 1 > /sys/class/net/<interface>/device/reset
when reset is in progress try to get coalesce settings using ethtool:
	# ethtool -c <interface>

BUG: kernel NULL pointer dereference, address: 0000000000000020
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 11 PID: 19713 Comm: ethtool Tainted: G S                 6.10.0-rc7+ #7
RIP: 0010:ice_get_q_coalesce+0x2e/0xa0 [ice]
RSP: 0018:ffffbab1e9bcf6a8 EFLAGS: 00010206
RAX: 000000000000000c RBX: ffff94512305b028 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9451c3f2e588 RDI: ffff9451c3f2e588
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffff9451c3f2e580 R11: 000000000000001f R12: ffff945121fa9000
R13: ffffbab1e9bcf760 R14: 0000000000000013 R15: ffffffff9e65dd40
FS:  00007faee5fbe740(0000) GS:ffff94546fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 0000000106c2e005 CR4: 00000000001706f0
Call Trace:
<TASK>
ice_get_coalesce+0x17/0x30 [ice]
coalesce_prepare_data+0x61/0x80
ethnl_default_doit+0xde/0x340
genl_family_rcv_msg_doit+0xf2/0x150
genl_rcv_msg+0x1b3/0x2c0
netlink_rcv_skb+0x5b/0x110
genl_rcv+0x28/0x40
netlink_unicast+0x19c/0x290
netlink_sendmsg+0x222/0x490
__sys_sendto+0x1df/0x1f0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7faee60d8e27

Calling netif_device_detach() before reset makes the net core not call
the driver when ethtool command is issued, the attempt to execute an
ethtool command during reset will result in the following message:

    netlink error: No such device

instead of NULL pointer dereference. Once reset is done and
ice_rebuild() is executing, the netif_device_attach() is called to allow
for ethtool operations to occur again in a safe manner.

Fixes: fcea6f3 ("ice: Add stats and ethtool support")
Suggested-by: Jakub Kicinski <[email protected]>
Reviewed-by: Igor Bagnucki <[email protected]>
Signed-off-by: Dawid Osuchowski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Reviewed-by: Michal Schmidt <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 sets on NFT_PKTINFO_L4PROTO for UDP packets less than 4 bytes
payload from netdev/egress by subtracting skb_network_offset() when
validating IPv4 packet length, otherwise 'meta l4proto udp' never
matches.

Patch #2 subtracts skb_network_offset() when validating IPv6 packet
length for netdev/egress.

netfilter pull request 24-08-28

* tag 'nf-24-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables_ipv6: consider network offset in netdev/egress validation
  netfilter: nf_tables: restore IP sanity checks for netdev/egress
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Steve French reported null pointer dereference error from sha256 lib.
cifs.ko can send session setup requests on reused connection.
If reused connection is used for binding session, conn->binding can
still remain true and generate_preauth_hash() will not set
sess->Preauth_HashValue and it will be NULL.
It is used as a material to create an encryption key in
ksmbd_gen_smb311_encryptionkey. ->Preauth_HashValue cause null pointer
dereference error from crypto_shash_update().

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 8 PID: 429254 Comm: kworker/8:39
Hardware name: LENOVO 20MAS08500/20MAS08500, BIOS N2CET69W (1.52 )
Workqueue: ksmbd-io handle_ksmbd_work [ksmbd]
RIP: 0010:lib_sha256_base_do_update.isra.0+0x11e/0x1d0 [sha256_ssse3]
<TASK>
? show_regs+0x6d/0x80
? __die+0x24/0x80
? page_fault_oops+0x99/0x1b0
? do_user_addr_fault+0x2ee/0x6b0
? exc_page_fault+0x83/0x1b0
? asm_exc_page_fault+0x27/0x30
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
? lib_sha256_base_do_update.isra.0+0x11e/0x1d0 [sha256_ssse3]
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
_sha256_update+0x77/0xa0 [sha256_ssse3]
sha256_avx2_update+0x15/0x30 [sha256_ssse3]
crypto_shash_update+0x1e/0x40
hmac_update+0x12/0x20
crypto_shash_update+0x1e/0x40
generate_key+0x234/0x380 [ksmbd]
generate_smb3encryptionkey+0x40/0x1c0 [ksmbd]
ksmbd_gen_smb311_encryptionkey+0x72/0xa0 [ksmbd]
ntlm_authenticate.isra.0+0x423/0x5d0 [ksmbd]
smb2_sess_setup+0x952/0xaa0 [ksmbd]
__process_request+0xa3/0x1d0 [ksmbd]
__handle_ksmbd_work+0x1c4/0x2f0 [ksmbd]
handle_ksmbd_work+0x2d/0xa0 [ksmbd]
process_one_work+0x16c/0x350
worker_thread+0x306/0x440
? __pfx_worker_thread+0x10/0x10
kthread+0xef/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x44/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>

Fixes: f5a544e ("ksmbd: add support for SMB3 multichannel")
Cc: [email protected] # v5.15+
Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
When we cork messages in psock->cork, the last message triggers the
flushing will result in sending a sk_msg larger than the current
message size. In this case, in tcp_bpf_send_verdict(), 'copied' becomes
negative at least in the following case:

468         case __SK_DROP:
469         default:
470                 sk_msg_free_partial(sk, msg, tosend);
471                 sk_msg_apply_bytes(psock, tosend);
472                 *copied -= (tosend + delta); // <==== HERE
473                 return -EACCES;

Therefore, it could lead to the following BUG with a proper value of
'copied' (thanks to syzbot). We should not use negative 'copied' as a
return value here.

  ------------[ cut here ]------------
  kernel BUG at net/socket.c:733!
  Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 UID: 0 PID: 3265 Comm: syz-executor510 Not tainted 6.11.0-rc3-syzkaller-00060-gd07b43284ab3 #0
  Hardware name: linux,dummy-virt (DT)
  pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : sock_sendmsg_nosec net/socket.c:733 [inline]
  pc : sock_sendmsg_nosec net/socket.c:728 [inline]
  pc : __sock_sendmsg+0x5c/0x60 net/socket.c:745
  lr : sock_sendmsg_nosec net/socket.c:730 [inline]
  lr : __sock_sendmsg+0x54/0x60 net/socket.c:745
  sp : ffff800088ea3b30
  x29: ffff800088ea3b30 x28: fbf00000062bc900 x27: 0000000000000000
  x26: ffff800088ea3bc0 x25: ffff800088ea3bc0 x24: 0000000000000000
  x23: f9f00000048dc000 x22: 0000000000000000 x21: ffff800088ea3d90
  x20: f9f00000048dc000 x19: ffff800088ea3d90 x18: 0000000000000001
  x17: 0000000000000000 x16: 0000000000000000 x15: 000000002002ffaf
  x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
  x11: 0000000000000000 x10: ffff8000815849c0 x9 : ffff8000815b49c0
  x8 : 0000000000000000 x7 : 000000000000003f x6 : 0000000000000000
  x5 : 00000000000007e0 x4 : fff07ffffd239000 x3 : fbf00000062bc900
  x2 : 0000000000000000 x1 : 0000000000000000 x0 : 00000000fffffdef
  Call trace:
   sock_sendmsg_nosec net/socket.c:733 [inline]
   __sock_sendmsg+0x5c/0x60 net/socket.c:745
   ____sys_sendmsg+0x274/0x2ac net/socket.c:2597
   ___sys_sendmsg+0xac/0x100 net/socket.c:2651
   __sys_sendmsg+0x84/0xe0 net/socket.c:2680
   __do_sys_sendmsg net/socket.c:2689 [inline]
   __se_sys_sendmsg net/socket.c:2687 [inline]
   __arm64_sys_sendmsg+0x24/0x30 net/socket.c:2687
   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
   invoke_syscall+0x48/0x110 arch/arm64/kernel/syscall.c:49
   el0_svc_common.constprop.0+0x40/0xe0 arch/arm64/kernel/syscall.c:132
   do_el0_svc+0x1c/0x28 arch/arm64/kernel/syscall.c:151
   el0_svc+0x34/0xec arch/arm64/kernel/entry-common.c:712
   el0t_64_sync_handler+0x100/0x12c arch/arm64/kernel/entry-common.c:730
   el0t_64_sync+0x19c/0x1a0 arch/arm64/kernel/entry.S:598
  Code: f9404463 d63f0060 3108441f 54fffe81 (d4210000)
  ---[ end trace 0000000000000000 ]---

Fixes: 4f738ad ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Reported-by: [email protected]
Cc: Jakub Sitnicki <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
The fscache_cookie_lru_timer is initialized when the fscache module
is inserted, but is not deleted when the fscache module is removed.
If timer_reduce() is called before removing the fscache module,
the fscache_cookie_lru_timer will be added to the timer list of
the current cpu. Afterwards, a use-after-free will be triggered
in the softIRQ after removing the fscache module, as follows:

==================================================================
BUG: unable to handle page fault for address: fffffbfff803c9e9
 PF: supervisor read access in kernel mode
 PF: error_code(0x0000) - not-present page
PGD 21ffea067 P4D 21ffea067 PUD 21ffe6067 PMD 110a7c067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G W 6.11.0-rc3 torvalds#855
Tainted: [W]=WARN
RIP: 0010:__run_timer_base.part.0+0x254/0x8a0
Call Trace:
 <IRQ>
 tmigr_handle_remote_up+0x627/0x810
 __walk_groups.isra.0+0x47/0x140
 tmigr_handle_remote+0x1fa/0x2f0
 handle_softirqs+0x180/0x590
 irq_exit_rcu+0x84/0xb0
 sysvec_apic_timer_interrupt+0x6e/0x90
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1a/0x20
RIP: 0010:default_idle+0xf/0x20
 default_idle_call+0x38/0x60
 do_idle+0x2b5/0x300
 cpu_startup_entry+0x54/0x60
 start_secondary+0x20d/0x280
 common_startup_64+0x13e/0x148
 </TASK>
Modules linked in: [last unloaded: netfs]
==================================================================

Therefore delete fscache_cookie_lru_timer when removing the fscahe module.

Fixes: 12bb21a ("fscache: Implement cookie user counting and resource pinning")
Cc: [email protected]
Signed-off-by: Baokun Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: David Howells <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Commit 8c61291 ("mm: fix incorrect vbq reference in
purge_fragmented_block") extended the 'vmap_block' structure to contain a
'cpu' field which is set at allocation time to the id of the initialising
CPU.

When a new 'vmap_block' is being instantiated by new_vmap_block(), the
partially initialised structure is added to the local 'vmap_block_queue'
xarray before the 'cpu' field has been initialised.  If another CPU is
concurrently walking the xarray (e.g.  via vm_unmap_aliases()), then it
may perform an out-of-bounds access to the remote queue thanks to an
uninitialised index.

This has been observed as UBSAN errors in Android:

 | Internal error: UBSAN: array index out of bounds: 00000000f2005512 [#1] PREEMPT SMP
 |
 | Call trace:
 |  purge_fragmented_block+0x204/0x21c
 |  _vm_unmap_aliases+0x170/0x378
 |  vm_unmap_aliases+0x1c/0x28
 |  change_memory_common+0x1dc/0x26c
 |  set_memory_ro+0x18/0x24
 |  module_enable_ro+0x98/0x238
 |  do_init_module+0x1b0/0x310

Move the initialisation of 'vb->cpu' in new_vmap_block() ahead of the
addition to the xarray.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8c61291 ("mm: fix incorrect vbq reference in purge_fragmented_block")
Signed-off-by: Will Deacon <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Zhaoyang Huang <[email protected]>
Cc: Hailong.Liu <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
When enable CONFIG_MEMCG & CONFIG_KFENCE & CONFIG_KMEMLEAK, the following
warning always occurs,This is because the following call stack occurred:
mem_pool_alloc
    kmem_cache_alloc_noprof
        slab_alloc_node
            kfence_alloc

Once the kfence allocation is successful,slab->obj_exts will not be empty,
because it has already been assigned a value in kfence_init_pool.

Since in the prepare_slab_obj_exts_hook function,we perform a check for
s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE),the alloc_tag_add function
will not be called as a result.Therefore,ref->ct remains NULL.

However,when we call mem_pool_free,since obj_ext is not empty, it
eventually leads to the alloc_tag_sub scenario being invoked.  This is
where the warning occurs.

So we should add corresponding checks in the alloc_tagging_slab_free_hook.
For __GFP_NO_OBJ_EXT case,I didn't see the specific case where it's using
kfence,so I won't add the corresponding check in
alloc_tagging_slab_free_hook for now.

[    3.734349] ------------[ cut here ]------------
[    3.734807] alloc_tag was not set
[    3.735129] WARNING: CPU: 4 PID: 40 at ./include/linux/alloc_tag.h:130 kmem_cache_free+0x444/0x574
[    3.735866] Modules linked in: autofs4
[    3.736211] CPU: 4 UID: 0 PID: 40 Comm: ksoftirqd/4 Tainted: G        W          6.11.0-rc3-dirty #1
[    3.736969] Tainted: [W]=WARN
[    3.737258] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
[    3.737875] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    3.738501] pc : kmem_cache_free+0x444/0x574
[    3.738951] lr : kmem_cache_free+0x444/0x574
[    3.739361] sp : ffff80008357bb60
[    3.739693] x29: ffff80008357bb70 x28: 0000000000000000 x27: 0000000000000000
[    3.740338] x26: ffff80008207f000 x25: ffff000b2eb2fd60 x24: ffff0000c0005700
[    3.740982] x23: ffff8000804229e4 x22: ffff800082080000 x21: ffff800081756000
[    3.741630] x20: fffffd7ff8253360 x19: 00000000000000a8 x18: ffffffffffffffff
[    3.742274] x17: ffff800ab327f000 x16: ffff800083398000 x15: ffff800081756df0
[    3.742919] x14: 0000000000000000 x13: 205d344320202020 x12: 5b5d373038343337
[    3.743560] x11: ffff80008357b650 x10: 000000000000005d x9 : 00000000ffffffd0
[    3.744231] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008237bad0 x6 : c0000000ffff7fff
[    3.744907] x5 : ffff80008237ba78 x4 : ffff8000820bbad0 x3 : 0000000000000001
[    3.745580] x2 : 68d66547c09f7800 x1 : 68d66547c09f7800 x0 : 0000000000000000
[    3.746255] Call trace:
[    3.746530]  kmem_cache_free+0x444/0x574
[    3.746931]  mem_pool_free+0x44/0xf4
[    3.747306]  free_object_rcu+0xc8/0xdc
[    3.747693]  rcu_do_batch+0x234/0x8a4
[    3.748075]  rcu_core+0x230/0x3e4
[    3.748424]  rcu_core_si+0x14/0x1c
[    3.748780]  handle_softirqs+0x134/0x378
[    3.749189]  run_ksoftirqd+0x70/0x9c
[    3.749560]  smpboot_thread_fn+0x148/0x22c
[    3.749978]  kthread+0x10c/0x118
[    3.750323]  ret_from_fork+0x10/0x20
[    3.750696] ---[ end trace 0000000000000000 ]---

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4b87369 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Hao Ge <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Binder objects are processed and copied individually into the target
buffer during transactions. Any raw data in-between these objects is
copied as well. However, this raw data copy lacks an out-of-bounds
check. If the raw data exceeds the data section size then the copy
overwrites the offsets section. This eventually triggers an error that
attempts to unwind the processed objects. However, at this point the
offsets used to index these objects are now corrupted.

Unwinding with corrupted offsets can result in decrements of arbitrary
nodes and lead to their premature release. Other users of such nodes are
left with a dangling pointer triggering a use-after-free. This issue is
made evident by the following KASAN report (trimmed):

  ==================================================================
  BUG: KASAN: slab-use-after-free in _raw_spin_lock+0xe4/0x19c
  Write of size 4 at addr ffff47fc91598f04 by task binder-util/743

  CPU: 9 UID: 0 PID: 743 Comm: binder-util Not tainted 6.11.0-rc4 #1
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   _raw_spin_lock+0xe4/0x19c
   binder_free_buf+0x128/0x434
   binder_thread_write+0x8a4/0x3260
   binder_ioctl+0x18f0/0x258c
  [...]

  Allocated by task 743:
   __kmalloc_cache_noprof+0x110/0x270
   binder_new_node+0x50/0x700
   binder_transaction+0x413c/0x6da8
   binder_thread_write+0x978/0x3260
   binder_ioctl+0x18f0/0x258c
  [...]

  Freed by task 745:
   kfree+0xbc/0x208
   binder_thread_read+0x1c5c/0x37d4
   binder_ioctl+0x16d8/0x258c
  [...]
  ==================================================================

To avoid this issue, let's check that the raw data copy is within the
boundaries of the data section.

Fixes: 6d98eb9 ("binder: avoid potential data leakage when copying txn")
Cc: Todd Kjos <[email protected]>
Cc: [email protected]
Signed-off-by: Carlos Llamas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
The main threat to data consistency in ice_xdp() is a possible asynchronous
PF reset. It can be triggered by a user or by TX timeout handler.

XDP setup and PF reset code access the same resources in the following
sections:
* ice_vsi_close() in ice_prepare_for_reset() - already rtnl-locked
* ice_vsi_rebuild() for the PF VSI - not protected
* ice_vsi_open() - already rtnl-locked

With an unfortunate timing, such accesses can result in a crash such as the
one below:

[ +1.999878] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 14
[ +2.002992] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 18
[Mar15 18:17] ice 0000:b1:00.0 ens801f0np0: NETDEV WATCHDOG: CPU: 38: transmit queue 14 timed out 80692736 ms
[ +0.000093] ice 0000:b1:00.0 ens801f0np0: tx_timeout: VSI_num: 6, Q 14, NTC: 0x0, HW_HEAD: 0x0, NTU: 0x0, INT: 0x4000001
[ +0.000012] ice 0000:b1:00.0 ens801f0np0: tx_timeout recovery level 1, txqueue 14
[ +0.394718] ice 0000:b1:00.0: PTP reset successful
[ +0.006184] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ +0.000045] #PF: supervisor read access in kernel mode
[ +0.000023] #PF: error_code(0x0000) - not-present page
[ +0.000023] PGD 0 P4D 0
[ +0.000018] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000023] CPU: 38 PID: 7540 Comm: kworker/38:1 Not tainted 6.8.0-rc7 #1
[ +0.000031] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[ +0.000036] Workqueue: ice ice_service_task [ice]
[ +0.000183] RIP: 0010:ice_clean_tx_ring+0xa/0xd0 [ice]
[...]
[ +0.000013] Call Trace:
[ +0.000016] <TASK>
[ +0.000014] ? __die+0x1f/0x70
[ +0.000029] ? page_fault_oops+0x171/0x4f0
[ +0.000029] ? schedule+0x3b/0xd0
[ +0.000027] ? exc_page_fault+0x7b/0x180
[ +0.000022] ? asm_exc_page_fault+0x22/0x30
[ +0.000031] ? ice_clean_tx_ring+0xa/0xd0 [ice]
[ +0.000194] ice_free_tx_ring+0xe/0x60 [ice]
[ +0.000186] ice_destroy_xdp_rings+0x157/0x310 [ice]
[ +0.000151] ice_vsi_decfg+0x53/0xe0 [ice]
[ +0.000180] ice_vsi_rebuild+0x239/0x540 [ice]
[ +0.000186] ice_vsi_rebuild_by_type+0x76/0x180 [ice]
[ +0.000145] ice_rebuild+0x18c/0x840 [ice]
[ +0.000145] ? delay_tsc+0x4a/0xc0
[ +0.000022] ? delay_tsc+0x92/0xc0
[ +0.000020] ice_do_reset+0x140/0x180 [ice]
[ +0.000886] ice_service_task+0x404/0x1030 [ice]
[ +0.000824] process_one_work+0x171/0x340
[ +0.000685] worker_thread+0x277/0x3a0
[ +0.000675] ? preempt_count_add+0x6a/0xa0
[ +0.000677] ? _raw_spin_lock_irqsave+0x23/0x50
[ +0.000679] ? __pfx_worker_thread+0x10/0x10
[ +0.000653] kthread+0xf0/0x120
[ +0.000635] ? __pfx_kthread+0x10/0x10
[ +0.000616] ret_from_fork+0x2d/0x50
[ +0.000612] ? __pfx_kthread+0x10/0x10
[ +0.000604] ret_from_fork_asm+0x1b/0x30
[ +0.000604] </TASK>

The previous way of handling this through returning -EBUSY is not viable,
particularly when destroying AF_XDP socket, because the kernel proceeds
with removal anyway.

There is plenty of code between those calls and there is no need to create
a large critical section that covers all of them, same as there is no need
to protect ice_vsi_rebuild() with rtnl_lock().

Add xdp_state_lock mutex to protect ice_vsi_rebuild() and ice_xdp().

Leaving unprotected sections in between would result in two states that
have to be considered:
1. when the VSI is closed, but not yet rebuild
2. when VSI is already rebuild, but not yet open

The latter case is actually already handled through !netif_running() case,
we just need to adjust flag checking a little. The former one is not as
trivial, because between ice_vsi_close() and ice_vsi_rebuild(), a lot of
hardware interaction happens, this can make adding/deleting rings exit
with an error. Luckily, VSI rebuild is pending and can apply new
configuration for us in a managed fashion.

Therefore, add an additional VSI state flag ICE_VSI_REBUILD_PENDING to
indicate that ice_xdp() can just hot-swap the program.

Also, as ice_vsi_rebuild() flow is touched in this patch, make it more
consistent by deconfiguring VSI when coalesce allocation fails.

Fixes: 2d4238f ("ice: Add support for AF_XDP")
Fixes: efc2214 ("ice: Add support for XDP")
Reviewed-by: Wojciech Drewek <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Tested-by: Chandan Kumar Rout <[email protected]>
Signed-off-by: Larysa Zaremba <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:

1) Attempt a fsync without holding the inode's lock, triggering an
   assertion failures when assertions are enabled;

2) Do an invalid memory access from the fsync task because the file private
   points to memory allocated on stack by the direct IO task and it may be
   used by the fsync task after the stack was destroyed.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes,
   while the other calls fsync using the same file descriptor.

4) Call task A the thread doing direct IO writes and task B the thread
   doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the
   file's private to an on stack allocated private with the member
   'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private
   structure associated to the file which has 'fsync_skip_inode_lock' set
   to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to
   NULL since it had no prior private and our private was stack allocated.
   Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
   assertion that checks the inode's VFS lock is held fails, since task B
   never locked it and task A has already unlocked it.

The stack trace produced is the following:

   assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/ordered-data.c:983!
   Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
   CPU: 9 PID: 5072 Comm: worker Tainted: G     U     OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
   Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
   RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
   Code: 50 d6 86 c0 e8 (...)
   RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
   RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
   RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
   R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
   R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
   FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
   Call Trace:
    <TASK>
    ? __die_body.cold+0x14/0x24
    ? die+0x2e/0x50
    ? do_trap+0xca/0x110
    ? do_error_trap+0x6a/0x90
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? exc_invalid_op+0x50/0x70
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? asm_exc_invalid_op+0x1a/0x20
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? __seccomp_filter+0x31d/0x4f0
    __x64_sys_fdatasync+0x4f/0x90
    do_syscall_64+0x82/0x160
    ? do_futex+0xcb/0x190
    ? __x64_sys_futex+0x10e/0x1d0
    ? switch_fpu_return+0x4f/0xd0
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.

This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.

The following C program triggers the issue:

   $ cat repro.c
   /* Get the O_DIRECT definition. */
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>
   #include <stdint.h>
   #include <fcntl.h>
   #include <errno.h>
   #include <string.h>
   #include <pthread.h>

   static int fd;

   static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
   {
       while (count > 0) {
           ssize_t ret;

           ret = pwrite(fd, buf, count, offset);
           if (ret < 0) {
               if (errno == EINTR)
                   continue;
               return ret;
           }
           count -= ret;
           buf += ret;
       }
       return 0;
   }

   static void *fsync_loop(void *arg)
   {
       while (1) {
           int ret;

           ret = fsync(fd);
           if (ret != 0) {
               perror("Fsync failed");
               exit(6);
           }
       }
   }

   int main(int argc, char *argv[])
   {
       long pagesize;
       void *write_buf;
       pthread_t fsyncer;
       int ret;

       if (argc != 2) {
           fprintf(stderr, "Use: %s <file path>\n", argv[0]);
           return 1;
       }

       fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
       if (fd == -1) {
           perror("Failed to open/create file");
           return 1;
       }

       pagesize = sysconf(_SC_PAGE_SIZE);
       if (pagesize == -1) {
           perror("Failed to get page size");
           return 2;
       }

       ret = posix_memalign(&write_buf, pagesize, pagesize);
       if (ret) {
           perror("Failed to allocate buffer");
           return 3;
       }

       ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
       if (ret != 0) {
           fprintf(stderr, "Failed to create writer thread: %d\n", ret);
           return 4;
       }

       while (1) {
           ret = do_write(fd, write_buf, pagesize, 0);
           if (ret != 0) {
               perror("Write failed");
               exit(5);
           }
       }

       return 0;
   }

   $ mkfs.btrfs -f /dev/sdi
   $ mount /dev/sdi /mnt/sdi
   $ timeout 10 ./repro /mnt/sdi/foo

Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.

Reported-by: Paulo Dias <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: 939b656 ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: [email protected] # 5.15+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Fix circular locking dependency on runtime suspend.

<4> [74.952215] ======================================================
<4> [74.952217] WARNING: possible circular locking dependency detected
<4> [74.952219] 6.10.0-rc7-xe #1 Not tainted
<4> [74.952221] ------------------------------------------------------
<4> [74.952223] kworker/7:1/82 is trying to acquire lock:
<4> [74.952226] ffff888120548488 (&dev->mode_config.mutex){+.+.}-{3:3}, at: drm_modeset_lock_all+0x40/0x1e0 [drm]
<4> [74.952260]
but task is already holding lock:
<4> [74.952262] ffffffffa0ae59c0 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}, at: xe_pm_runtime_suspend+0x2f/0x340 [xe]
<4> [74.952322]
which lock already depends on the new lock.

The commit 'b1d90a86 ("drm/xe: Use the encoder suspend helper also used
by the i915 driver")' didn't do anything wrong. It actually fixed a
critical bug, because the encoder_suspend was never getting actually
called because it was returning if (has_display(xe)) instead of
if (!has_display(xe)). However, this ended up introducing the encoder
suspend calls in the runtime routines as well, causing the circular
locking dependency.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2304
Fixes: b1d90a8 ("drm/xe: Use the encoder suspend helper also used by the i915 driver")
Cc: Imre Deak <[email protected]>
Reviewed-by: Jonathan Cavitt <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit 8da1944)
Signed-off-by: Rodrigo Vivi <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
We observed a null-ptr-deref in fou_gro_receive() while shutting down
a host.  [0]

The NULL pointer is sk->sk_user_data, and the offset 8 is of protocol
in struct fou.

When fou_release() is called due to netns dismantle or explicit tunnel
teardown, udp_tunnel_sock_release() sets NULL to sk->sk_user_data.
Then, the tunnel socket is destroyed after a single RCU grace period.

So, in-flight udp4_gro_receive() could find the socket and execute the
FOU GRO handler, where sk->sk_user_data could be NULL.

Let's use rcu_dereference_sk_user_data() in fou_from_sock() and add NULL
checks in FOU GRO handlers.

[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000008
 PF: supervisor read access in kernel mode
 PF: error_code(0x0000) - not-present page
PGD 80000001032f4067 P4D 80000001032f4067 PUD 103240067 PMD 0
SMP PTI
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.216-204.855.amzn2.x86_64 #1
Hardware name: Amazon EC2 c5.large/, BIOS 1.0 10/16/2017
RIP: 0010:fou_gro_receive (net/ipv4/fou.c:233) [fou]
Code: 41 5f c3 cc cc cc cc e8 e7 2e 69 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 49 89 f8 41 54 48 89 f7 48 89 d6 49 8b 80 88 02 00 00 <0f> b6 48 08 0f b7 42 4a 66 25 fd fd 80 cc 02 66 89 42 4a 0f b6 42
RSP: 0018:ffffa330c0003d08 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff93d9e3a6b900 RCX: 0000000000000010
RDX: ffff93d9e3a6b900 RSI: ffff93d9e3a6b900 RDI: ffff93dac2e24d08
RBP: ffff93d9e3a6b900 R08: ffff93dacbce6400 R09: 0000000000000002
R10: 0000000000000000 R11: ffffffffb5f369b0 R12: ffff93dacbce6400
R13: ffff93dac2e24d08 R14: 0000000000000000 R15: ffffffffb4edd1c0
FS:  0000000000000000(0000) GS:ffff93daee800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000102140001 CR4: 00000000007706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <IRQ>
 ? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259)
 ? __die_body.cold (arch/x86/kernel/dumpstack.c:478 arch/x86/kernel/dumpstack.c:420)
 ? no_context (arch/x86/mm/fault.c:752)
 ? exc_page_fault (arch/x86/include/asm/irqflags.h:49 arch/x86/include/asm/irqflags.h:89 arch/x86/mm/fault.c:1435 arch/x86/mm/fault.c:1483)
 ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:571)
 ? fou_gro_receive (net/ipv4/fou.c:233) [fou]
 udp_gro_receive (include/linux/netdevice.h:2552 net/ipv4/udp_offload.c:559)
 udp4_gro_receive (net/ipv4/udp_offload.c:604)
 inet_gro_receive (net/ipv4/af_inet.c:1549 (discriminator 7))
 dev_gro_receive (net/core/dev.c:6035 (discriminator 4))
 napi_gro_receive (net/core/dev.c:6170)
 ena_clean_rx_irq (drivers/amazon/net/ena/ena_netdev.c:1558) [ena]
 ena_io_poll (drivers/amazon/net/ena/ena_netdev.c:1742) [ena]
 napi_poll (net/core/dev.c:6847)
 net_rx_action (net/core/dev.c:6917)
 __do_softirq (arch/x86/include/asm/jump_label.h:25 include/linux/jump_label.h:200 include/trace/events/irq.h:142 kernel/softirq.c:299)
 asm_call_irq_on_stack (arch/x86/entry/entry_64.S:809)
</IRQ>
 do_softirq_own_stack (arch/x86/include/asm/irq_stack.h:27 arch/x86/include/asm/irq_stack.h:77 arch/x86/kernel/irq_64.c:77)
 irq_exit_rcu (kernel/softirq.c:393 kernel/softirq.c:423 kernel/softirq.c:435)
 common_interrupt (arch/x86/kernel/irq.c:239)
 asm_common_interrupt (arch/x86/include/asm/idtentry.h:626)
RIP: 0010:acpi_idle_do_entry (arch/x86/include/asm/irqflags.h:49 arch/x86/include/asm/irqflags.h:89 drivers/acpi/processor_idle.c:114 drivers/acpi/processor_idle.c:575)
Code: 8b 15 d1 3c c4 02 ed c3 cc cc cc cc 65 48 8b 04 25 40 ef 01 00 48 8b 00 a8 08 75 eb 0f 1f 44 00 00 0f 00 2d d5 09 55 00 fb f4 <fa> c3 cc cc cc cc e9 be fc ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
RSP: 0018:ffffffffb5603e58 EFLAGS: 00000246
RAX: 0000000000004000 RBX: ffff93dac0929c00 RCX: ffff93daee833900
RDX: ffff93daee800000 RSI: ffff93daee87dc00 RDI: ffff93daee87dc64
RBP: 0000000000000001 R08: ffffffffb5e7b6c0 R09: 0000000000000044
R10: ffff93daee831b04 R11: 00000000000001cd R12: 0000000000000001
R13: ffffffffb5e7b740 R14: 0000000000000001 R15: 0000000000000000
 ? sched_clock_cpu (kernel/sched/clock.c:371)
 acpi_idle_enter (drivers/acpi/processor_idle.c:712 (discriminator 3))
 cpuidle_enter_state (drivers/cpuidle/cpuidle.c:237)
 cpuidle_enter (drivers/cpuidle/cpuidle.c:353)
 cpuidle_idle_call (kernel/sched/idle.c:158 kernel/sched/idle.c:239)
 do_idle (kernel/sched/idle.c:302)
 cpu_startup_entry (kernel/sched/idle.c:395 (discriminator 1))
 start_kernel (init/main.c:1048)
 secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:310)
Modules linked in: udp_diag tcp_diag inet_diag nft_nat ipip tunnel4 dummy fou ip_tunnel nft_masq nft_chain_nat nf_nat wireguard nft_ct curve25519_x86_64 libcurve25519_generic nf_conntrack libchacha20poly1305 nf_defrag_ipv6 nf_defrag_ipv4 nft_objref chacha_x86_64 nft_counter nf_tables nfnetlink poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper mousedev psmouse button ena ptp pps_core crc32c_intel
CR2: 0000000000000008

Fixes: d92283e ("fou: change to use UDP socket GRO")
Reported-by: Alphonse Kurian <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Chi Zhiling reported:

  We found a null pointer accessing in tracefs[1], the reason is that the
  variable 'ei_child' is set to LIST_POISON1, that means the list was
  removed in eventfs_remove_rec. so when access the ei_child->is_freed, the
  panic triggered.

  by the way, the following script can reproduce this panic

  loop1 (){
      while true
      do
          echo "p:kp submit_bio" > /sys/kernel/debug/tracing/kprobe_events
          echo "" > /sys/kernel/debug/tracing/kprobe_events
      done
  }
  loop2 (){
      while true
      do
          tree /sys/kernel/debug/tracing/events/kprobes/
      done
  }
  loop1 &
  loop2

  [1]:
  [ 1147.959632][T17331] Unable to handle kernel paging request at virtual address dead000000000150
  [ 1147.968239][T17331] Mem abort info:
  [ 1147.971739][T17331]   ESR = 0x0000000096000004
  [ 1147.976172][T17331]   EC = 0x25: DABT (current EL), IL = 32 bits
  [ 1147.982171][T17331]   SET = 0, FnV = 0
  [ 1147.985906][T17331]   EA = 0, S1PTW = 0
  [ 1147.989734][T17331]   FSC = 0x04: level 0 translation fault
  [ 1147.995292][T17331] Data abort info:
  [ 1147.998858][T17331]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
  [ 1148.005023][T17331]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  [ 1148.010759][T17331]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
  [ 1148.016752][T17331] [dead000000000150] address between user and kernel address ranges
  [ 1148.024571][T17331] Internal error: Oops: 0000000096000004 [#1] SMP
  [ 1148.030825][T17331] Modules linked in: team_mode_loadbalance team nlmon act_gact cls_flower sch_ingress bonding tls macvlan dummy ib_core bridge stp llc veth amdgpu amdxcp mfd_core gpu_sched drm_exec drm_buddy radeon crct10dif_ce video drm_suballoc_helper ghash_ce drm_ttm_helper sha2_ce ttm sha256_arm64 i2c_algo_bit sha1_ce sbsa_gwdt cp210x drm_display_helper cec sr_mod cdrom drm_kms_helper binfmt_misc sg loop fuse drm dm_mod nfnetlink ip_tables autofs4 [last unloaded: tls]
  [ 1148.072808][T17331] CPU: 3 PID: 17331 Comm: ls Tainted: G        W         ------- ----  6.6.43 #2
  [ 1148.081751][T17331] Source Version: 21b3b386e948bedd29369af66f3e98ab01b1c650
  [ 1148.088783][T17331] Hardware name: Greatwall GW-001M1A-FTF/GW-001M1A-FTF, BIOS KunLun BIOS V4.0 07/16/2020
  [ 1148.098419][T17331] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [ 1148.106060][T17331] pc : eventfs_iterate+0x2c0/0x398
  [ 1148.111017][T17331] lr : eventfs_iterate+0x2fc/0x398
  [ 1148.115969][T17331] sp : ffff80008d56bbd0
  [ 1148.119964][T17331] x29: ffff80008d56bbf0 x28: ffff001ff5be2600 x27: 0000000000000000
  [ 1148.127781][T17331] x26: ffff001ff52ca4e0 x25: 0000000000009977 x24: dead000000000100
  [ 1148.135598][T17331] x23: 0000000000000000 x22: 000000000000000b x21: ffff800082645f10
  [ 1148.143415][T17331] x20: ffff001fddf87c70 x19: ffff80008d56bc90 x18: 0000000000000000
  [ 1148.151231][T17331] x17: 0000000000000000 x16: 0000000000000000 x15: ffff001ff52ca4e0
  [ 1148.159048][T17331] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
  [ 1148.166864][T17331] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000804391d0
  [ 1148.174680][T17331] x8 : 0000000180000000 x7 : 0000000000000018 x6 : 0000aaab04b92862
  [ 1148.182498][T17331] x5 : 0000aaab04b92862 x4 : 0000000080000000 x3 : 0000000000000068
  [ 1148.190314][T17331] x2 : 000000000000000f x1 : 0000000000007ea8 x0 : 0000000000000001
  [ 1148.198131][T17331] Call trace:
  [ 1148.201259][T17331]  eventfs_iterate+0x2c0/0x398
  [ 1148.205864][T17331]  iterate_dir+0x98/0x188
  [ 1148.210036][T17331]  __arm64_sys_getdents64+0x78/0x160
  [ 1148.215161][T17331]  invoke_syscall+0x78/0x108
  [ 1148.219593][T17331]  el0_svc_common.constprop.0+0x48/0xf0
  [ 1148.224977][T17331]  do_el0_svc+0x24/0x38
  [ 1148.228974][T17331]  el0_svc+0x40/0x168
  [ 1148.232798][T17331]  el0t_64_sync_handler+0x120/0x130
  [ 1148.237836][T17331]  el0t_64_sync+0x1a4/0x1a8
  [ 1148.242182][T17331] Code: 54ffff6c f9400676 910006d6 f900067 (b9405300)
  [ 1148.248955][T17331] ---[ end trace 0000000000000000 ]---

The issue is that list_del() is used on an SRCU protected list variable
before the synchronization occurs. This can poison the list pointers while
there is a reader iterating the list.

This is simply fixed by using list_del_rcu() that is specifically made for
this purpose.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]/

Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: 43aa6f9 ("eventfs: Get rid of dentry pointers without refcounts")
Reported-by: Chi Zhiling <[email protected]>
Tested-by: Chi Zhiling <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
The start_kthread() and stop_thread() code was not always called with the
interface_lock held. This means that the kthread variable could be
unexpectedly changed causing the kthread_stop() to be called on it when it
should not have been, leading to:

 while true; do
   rtla timerlat top -u -q & PID=$!;
   sleep 5;
   kill -INT $PID;
   sleep 0.001;
   kill -TERM $PID;
   wait $PID;
  done

Causing the following OOPS:

 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
 KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
 CPU: 5 UID: 0 PID: 885 Comm: timerlatu/5 Not tainted 6.11.0-rc4-test-00002-gbc754cc76d1b-dirty torvalds#125 a533010b71dab205ad2f507188ce8c82203b0254
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:hrtimer_active+0x58/0x300
 Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f
 RSP: 0018:ffff88811d97f940 EFLAGS: 00010202
 RAX: 0000000000000002 RBX: ffff88823c6b5b28 RCX: ffffed10478d6b6b
 RDX: dffffc0000000000 RSI: ffffed10478d6b6c RDI: ffff88823c6b5b28
 RBP: 0000000000000000 R08: ffff88823c6b5b58 R09: ffff88823c6b5b60
 R10: ffff88811d97f957 R11: 0000000000000010 R12: 00000000000a801d
 R13: ffff88810d8b35d8 R14: 0000000000000010 R15: ffff88823c6b5b28
 FS:  0000000000000000(0000) GS:ffff88823c680000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000561858ad7258 CR3: 000000007729e001 CR4: 0000000000170ef0
 Call Trace:
  <TASK>
  ? die_addr+0x40/0xa0
  ? exc_general_protection+0x154/0x230
  ? asm_exc_general_protection+0x26/0x30
  ? hrtimer_active+0x58/0x300
  ? __pfx_mutex_lock+0x10/0x10
  ? __pfx_locks_remove_file+0x10/0x10
  hrtimer_cancel+0x15/0x40
  timerlat_fd_release+0x8e/0x1f0
  ? security_file_release+0x43/0x80
  __fput+0x372/0xb10
  task_work_run+0x11e/0x1f0
  ? _raw_spin_lock+0x85/0xe0
  ? __pfx_task_work_run+0x10/0x10
  ? poison_slab_object+0x109/0x170
  ? do_exit+0x7a0/0x24b0
  do_exit+0x7bd/0x24b0
  ? __pfx_migrate_enable+0x10/0x10
  ? __pfx_do_exit+0x10/0x10
  ? __pfx_read_tsc+0x10/0x10
  ? ktime_get+0x64/0x140
  ? _raw_spin_lock_irq+0x86/0xe0
  do_group_exit+0xb0/0x220
  get_signal+0x17ba/0x1b50
  ? vfs_read+0x179/0xa40
  ? timerlat_fd_read+0x30b/0x9d0
  ? __pfx_get_signal+0x10/0x10
  ? __pfx_timerlat_fd_read+0x10/0x10
  arch_do_signal_or_restart+0x8c/0x570
  ? __pfx_arch_do_signal_or_restart+0x10/0x10
  ? vfs_read+0x179/0xa40
  ? ksys_read+0xfe/0x1d0
  ? __pfx_ksys_read+0x10/0x10
  syscall_exit_to_user_mode+0xbc/0x130
  do_syscall_64+0x74/0x110
  ? __pfx___rseq_handle_notify_resume+0x10/0x10
  ? __pfx_ksys_read+0x10/0x10
  ? fpregs_restore_userregs+0xdb/0x1e0
  ? fpregs_restore_userregs+0xdb/0x1e0
  ? syscall_exit_to_user_mode+0x116/0x130
  ? do_syscall_64+0x74/0x110
  ? do_syscall_64+0x74/0x110
  ? do_syscall_64+0x74/0x110
  entry_SYSCALL_64_after_hwframe+0x71/0x79
 RIP: 0033:0x7ff0070eca9c
 Code: Unable to access opcode bytes at 0x7ff0070eca72.
 RSP: 002b:00007ff006dff8c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
 RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007ff0070eca9c
 RDX: 0000000000000400 RSI: 00007ff006dff9a0 RDI: 0000000000000003
 RBP: 00007ff006dffde0 R08: 0000000000000000 R09: 00007ff000000ba0
 R10: 00007ff007004b08 R11: 0000000000000246 R12: 0000000000000003
 R13: 00007ff006dff9a0 R14: 0000000000000007 R15: 0000000000000008
  </TASK>
 Modules linked in: snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core
 ---[ end trace 0000000000000000 ]---

This is because it would mistakenly call kthread_stop() on a user space
thread making it "exit" before it actually exits.

Since kthreads are created based on global behavior, use a cpumask to know
when kthreads are running and that they need to be shutdown before
proceeding to do new work.

Link: https://lore.kernel.org/all/[email protected]/

This was debugged by using the persistent ring buffer:

Link: https://lore.kernel.org/all/[email protected]/

Note, locking was originally used to fix this, but that proved to cause too
many deadlocks to work around:

  https://lore.kernel.org/linux-trace-kernel/[email protected]/

Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: "Luis Claudio R. Goncalves" <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: e88ed22 ("tracing/timerlat: Add user-space interface")
Reported-by: Tomas Glozar <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Buffer 'card->dai_link' is reallocated in 'meson_card_reallocate_links()',
so move 'pad' pointer initialization after this function when memory is
already reallocated.

Kasan bug report:

==================================================================
BUG: KASAN: slab-use-after-free in axg_card_add_link+0x76c/0x9bc
Read of size 8 at addr ffff000000e8b260 by task modprobe/356

CPU: 0 PID: 356 Comm: modprobe Tainted: G O 6.9.12-sdkernel #1
Call trace:
 dump_backtrace+0x94/0xec
 show_stack+0x18/0x24
 dump_stack_lvl+0x78/0x90
 print_report+0xfc/0x5c0
 kasan_report+0xb8/0xfc
 __asan_load8+0x9c/0xb8
 axg_card_add_link+0x76c/0x9bc [snd_soc_meson_axg_sound_card]
 meson_card_probe+0x344/0x3b8 [snd_soc_meson_card_utils]
 platform_probe+0x8c/0xf4
 really_probe+0x110/0x39c
 __driver_probe_device+0xb8/0x18c
 driver_probe_device+0x108/0x1d8
 __driver_attach+0xd0/0x25c
 bus_for_each_dev+0xe0/0x154
 driver_attach+0x34/0x44
 bus_add_driver+0x134/0x294
 driver_register+0xa8/0x1e8
 __platform_driver_register+0x44/0x54
 axg_card_pdrv_init+0x20/0x1000 [snd_soc_meson_axg_sound_card]
 do_one_initcall+0xdc/0x25c
 do_init_module+0x10c/0x334
 load_module+0x24c4/0x26cc
 init_module_from_file+0xd4/0x128
 __arm64_sys_finit_module+0x1f4/0x41c
 invoke_syscall+0x60/0x188
 el0_svc_common.constprop.0+0x78/0x13c
 do_el0_svc+0x30/0x40
 el0_svc+0x38/0x78
 el0t_64_sync_handler+0x100/0x12c
 el0t_64_sync+0x190/0x194

Fixes: 7864a79 ("ASoC: meson: add axg sound card support")
Cc: [email protected]
Signed-off-by: Arseniy Krasnov <[email protected]>
Reviewed-by: Jerome Brunet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Mark Brown <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
Marc Hartmayer reported:
        [   23.133876] Unable to handle kernel pointer dereference in virtual kernel address space
        [   23.133950] Failing address: 0000000000000000 TEID: 0000000000000483
        [   23.133954] Fault in home space mode while using kernel ASCE.
        [   23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d
        [   23.134207] Oops: 0004 ilc:2 [#1] SMP
	(snip)
        [   23.134516] Call Trace:
        [   23.134520]  [<0000024e326caf28>] worker_thread+0x48/0x430
        [   23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430)
        [   23.134528]  [<0000024e326d3a3e>] kthread+0x11e/0x130
        [   23.134533]  [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60
        [   23.134536]  [<0000024e333fb37a>] ret_from_fork+0xa/0x38
        [   23.134552] Last Breaking-Event-Address:
        [   23.134553]  [<0000024e333f4c04>] mutex_unlock+0x24/0x30
        [   23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops

With debuging and analysis, worker_thread() accesses to the nullified
worker->pool when the newly created worker is destroyed before being
waken-up, in which case worker_thread() can see the result detach_worker()
reseting worker->pool to NULL at the begining.

Move the code "worker->pool = NULL;" out from detach_worker() to fix the
problem.

worker->pool had been designed to be constant for regular workers and
changeable for rescuer. To share attaching/detaching code for regular
and rescuer workers and to avoid worker->pool being accessed inadvertently
when the worker has been detached, worker->pool is reset to NULL when
detached no matter the worker is rescuer or not.

To maintain worker->pool being reset after detached, move the code
"worker->pool = NULL;" in the worker thread context after detached.

It is either be in the regular worker thread context after PF_WQ_WORKER
is cleared or in rescuer worker thread context with wq_pool_attach_mutex
held. So it is safe to do so.

Cc: Marc Hartmayer <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Reported-by: Marc Hartmayer <[email protected]>
Fixes: f4b7b53 ("workqueue: Detach workers directly in idle_cull_fn()")
Cc: [email protected] # v6.11+
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains two fixes from Florian Westphal:

Patch #1 fixes a sk refcount leak in nft_socket on mismatch.

Patch #2 fixes cgroupsv2 matching from containers due to incorrect
	 level in subtree.

netfilter pull request 24-09-12

* tag 'nf-24-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_socket: make cgroupsv2 matching work with namespaces
  netfilter: nft_socket: fix sk refcount leaks
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
esmil pushed a commit that referenced this pull request Sep 25, 2024
…_module_init()

Move nf flowtable bpf initialization in nf_flow_table module load
routine since nf_flow_table_bpf is part of nf_flow_table module and not
nf_flow_table_inet one. This patch allows to avoid the following kernel
warning running the reproducer below:

$modprobe nf_flow_table_inet
$rmmod nf_flow_table_inet
$modprobe nf_flow_table_inet
modprobe: ERROR: could not insert 'nf_flow_table_inet': Invalid argument

[  184.081501] ------------[ cut here ]------------
[  184.081527] WARNING: CPU: 0 PID: 1362 at kernel/bpf/btf.c:8206 btf_populate_kfunc_set+0x23c/0x330
[  184.081550] CPU: 0 UID: 0 PID: 1362 Comm: modprobe Kdump: loaded Not tainted 6.11.0-0.rc5.22.el10.x86_64 #1
[  184.081553] Hardware name: Red Hat OpenStack Compute, BIOS 1.14.0-1.module+el8.4.0+8855+a9e237a9 04/01/2014
[  184.081554] RIP: 0010:btf_populate_kfunc_set+0x23c/0x330
[  184.081558] RSP: 0018:ff22cfb38071fc90 EFLAGS: 00010202
[  184.081559] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[  184.081560] RDX: 000000000000006e RSI: ffffffff95c00000 RDI: ff13805543436350
[  184.081561] RBP: ffffffffc0e22180 R08: ff13805543410808 R09: 000000000001ec00
[  184.081562] R10: ff13805541c8113c R11: 0000000000000010 R12: ff13805541b83c00
[  184.081563] R13: ff13805543410800 R14: 0000000000000001 R15: ffffffffc0e2259a
[  184.081564] FS:  00007fa436c46740(0000) GS:ff1380557ba00000(0000) knlGS:0000000000000000
[  184.081569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  184.081570] CR2: 000055e7b3187000 CR3: 0000000100c48003 CR4: 0000000000771ef0
[  184.081571] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  184.081572] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  184.081572] PKRU: 55555554
[  184.081574] Call Trace:
[  184.081575]  <TASK>
[  184.081578]  ? show_trace_log_lvl+0x1b0/0x2f0
[  184.081580]  ? show_trace_log_lvl+0x1b0/0x2f0
[  184.081582]  ? __register_btf_kfunc_id_set+0x199/0x200
[  184.081585]  ? btf_populate_kfunc_set+0x23c/0x330
[  184.081586]  ? __warn.cold+0x93/0xed
[  184.081590]  ? btf_populate_kfunc_set+0x23c/0x330
[  184.081592]  ? report_bug+0xff/0x140
[  184.081594]  ? handle_bug+0x3a/0x70
[  184.081596]  ? exc_invalid_op+0x17/0x70
[  184.081597]  ? asm_exc_invalid_op+0x1a/0x20
[  184.081601]  ? btf_populate_kfunc_set+0x23c/0x330
[  184.081602]  __register_btf_kfunc_id_set+0x199/0x200
[  184.081605]  ? __pfx_nf_flow_inet_module_init+0x10/0x10 [nf_flow_table_inet]
[  184.081607]  do_one_initcall+0x58/0x300
[  184.081611]  do_init_module+0x60/0x230
[  184.081614]  __do_sys_init_module+0x17a/0x1b0
[  184.081617]  do_syscall_64+0x7d/0x160
[  184.081620]  ? __count_memcg_events+0x58/0xf0
[  184.081623]  ? handle_mm_fault+0x234/0x350
[  184.081626]  ? do_user_addr_fault+0x347/0x640
[  184.081630]  ? clear_bhb_loop+0x25/0x80
[  184.081633]  ? clear_bhb_loop+0x25/0x80
[  184.081634]  ? clear_bhb_loop+0x25/0x80
[  184.081637]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  184.081639] RIP: 0033:0x7fa43652e4ce
[  184.081647] RSP: 002b:00007ffe8213be18 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[  184.081649] RAX: ffffffffffffffda RBX: 000055e7b3176c20 RCX: 00007fa43652e4ce
[  184.081650] RDX: 000055e7737fde79 RSI: 0000000000003990 RDI: 000055e7b3185380
[  184.081651] RBP: 000055e7737fde79 R08: 0000000000000007 R09: 000055e7b3179bd0
[  184.081651] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000040000
[  184.081652] R13: 000055e7b3176fa0 R14: 0000000000000000 R15: 000055e7b3179b80

Fixes: 391bb65 ("netfilter: Add bpf_xdp_flow_lookup kfunc")
Signed-off-by: Lorenzo Bianconi <[email protected]>
Acked-by: Florian Westphal <[email protected]>
Acked-by: Pablo Neira Ayuso <[email protected]>
Link: https://patch.msgid.link/20240911-nf-flowtable-bpf-modprob-fix-v1-1-f9fc075aafc3@kernel.org
Signed-off-by: Paolo Abeni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants