Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test] master_test #2

Closed
wants to merge 1 commit into from
Closed

[test] master_test #2

wants to merge 1 commit into from

Conversation

kernel-patches-bot
Copy link

branch: master_test
base:bpf
version: 3df9d80

kernel-patches-bot pushed a commit that referenced this pull request Feb 18, 2022
…egulator

The interrupt pin of the external ethernet phy is used, instead of the
enable-gpio pin of the tf-io regulator. The GPIOE_2 pin is located in
the gpio_ao bank.

This causes phy interrupt problems at system startup.
[   76.645190] irq 36: nobody cared (try booting with the "irqpoll" option)
[   76.649617] CPU: 0 PID: 1416 Comm: irq/36-0.0:00 Not tainted 5.16.0 #2
[   76.649629] Hardware name: Hardkernel ODROID-HC4 (DT)
[   76.649635] Call trace:
[   76.649638]  dump_backtrace+0x0/0x1c8
[   76.649658]  show_stack+0x14/0x60
[   76.649667]  dump_stack_lvl+0x64/0x7c
[   76.649676]  dump_stack+0x14/0x2c
[   76.649683]  __report_bad_irq+0x38/0xe8
[   76.649695]  note_interrupt+0x220/0x3a0
[   76.649704]  handle_irq_event_percpu+0x58/0x88
[   76.649713]  handle_irq_event+0x44/0xd8
[   76.649721]  handle_fasteoi_irq+0xa8/0x130
[   76.649730]  generic_handle_domain_irq+0x38/0x58
[   76.649738]  gic_handle_irq+0x9c/0xb8
[   76.649747]  call_on_irq_stack+0x28/0x38
[   76.649755]  do_interrupt_handler+0x7c/0x80
[   76.649763]  el1_interrupt+0x34/0x80
[   76.649772]  el1h_64_irq_handler+0x14/0x20
[   76.649781]  el1h_64_irq+0x74/0x78
[   76.649788]  irq_finalize_oneshot.part.56+0x68/0xf8
[   76.649796]  irq_thread_fn+0x5c/0x98
[   76.649804]  irq_thread+0x13c/0x260
[   76.649812]  kthread+0x144/0x178
[   76.649822]  ret_from_fork+0x10/0x20
[   76.649830] handlers:
[   76.653170] [<0000000025a6cd31>] irq_default_primary_handler threaded [<0000000093580eb7>] phy_interrupt
[   76.661256] Disabling IRQ #36

Fixes: 1f80a5c ("arm64: dts: meson-sm1-odroid: add missing enable gpio and supply for tf_io regulator")
Signed-off-by: Lutz Koschorreck <[email protected]>
Reviewed-by: Neil Armstrong <[email protected]>
Signed-off-by: Neil Armstrong <[email protected]>
[narmstrong: removed spurious invalid & blank lines from commit message]
Link: https://lore.kernel.org/r/20220127130537.GA187347@odroid-VirtualBox
kernel-patches-bot pushed a commit that referenced this pull request Feb 18, 2022
When using the flushoncommit mount option, during almost every transaction
commit we trigger a warning from __writeback_inodes_sb_nr():

  $ cat fs/fs-writeback.c:
  (...)
  static void __writeback_inodes_sb_nr(struct super_block *sb, ...
  {
        (...)
        WARN_ON(!rwsem_is_locked(&sb->s_umount));
        (...)
  }
  (...)

The trace produced in dmesg looks like the following:

  [947.473890] WARNING: CPU: 5 PID: 930 at fs/fs-writeback.c:2610 __writeback_inodes_sb_nr+0x7e/0xb3
  [947.481623] Modules linked in: nfsd nls_cp437 cifs asn1_decoder cifs_arc4 fscache cifs_md4 ipmi_ssif
  [947.489571] CPU: 5 PID: 930 Comm: btrfs-transacti Not tainted 95.16.3-srb-asrock-00001-g36437ad63879 #186
  [947.497969] RIP: 0010:__writeback_inodes_sb_nr+0x7e/0xb3
  [947.502097] Code: 24 10 4c 89 44 24 18 c6 (...)
  [947.519760] RSP: 0018:ffffc90000777e10 EFLAGS: 00010246
  [947.523818] RAX: 0000000000000000 RBX: 0000000000963300 RCX: 0000000000000000
  [947.529765] RDX: 0000000000000000 RSI: 000000000000fa51 RDI: ffffc90000777e50
  [947.535740] RBP: ffff888101628a90 R08: ffff888100955800 R09: ffff888100956000
  [947.541701] R10: 0000000000000002 R11: 0000000000000001 R12: ffff888100963488
  [947.547645] R13: ffff888100963000 R14: ffff888112fb7200 R15: ffff888100963460
  [947.553621] FS:  0000000000000000(0000) GS:ffff88841fd40000(0000) knlGS:0000000000000000
  [947.560537] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [947.565122] CR2: 0000000008be50c4 CR3: 000000000220c000 CR4: 00000000001006e0
  [947.571072] Call Trace:
  [947.572354]  <TASK>
  [947.573266]  btrfs_commit_transaction+0x1f1/0x998
  [947.576785]  ? start_transaction+0x3ab/0x44e
  [947.579867]  ? schedule_timeout+0x8a/0xdd
  [947.582716]  transaction_kthread+0xe9/0x156
  [947.585721]  ? btrfs_cleanup_transaction.isra.0+0x407/0x407
  [947.590104]  kthread+0x131/0x139
  [947.592168]  ? set_kthread_struct+0x32/0x32
  [947.595174]  ret_from_fork+0x22/0x30
  [947.597561]  </TASK>
  [947.598553] ---[ end trace 644721052755541c ]---

This is because we started using writeback_inodes_sb() to flush delalloc
when committing a transaction (when using -o flushoncommit), in order to
avoid deadlocks with filesystem freeze operations. This change was made
by commit ce8ea7c ("btrfs: don't call btrfs_start_delalloc_roots
in flushoncommit"). After that change we started producing that warning,
and every now and then a user reports this since the warning happens too
often, it spams dmesg/syslog, and a user is unsure if this reflects any
problem that might compromise the filesystem's reliability.

We can not just lock the sb->s_umount semaphore before calling
writeback_inodes_sb(), because that would at least deadlock with
filesystem freezing, since at fs/super.c:freeze_super() sync_filesystem()
is called while we are holding that semaphore in write mode, and that can
trigger a transaction commit, resulting in a deadlock. It would also
trigger the same type of deadlock in the unmount path. Possibly, it could
also introduce some other locking dependencies that lockdep would report.

To fix this call try_to_writeback_inodes_sb() instead of
writeback_inodes_sb(), because that will try to read lock sb->s_umount
and then will only call writeback_inodes_sb() if it was able to lock it.
This is fine because the cases where it can't read lock sb->s_umount
are during a filesystem unmount or during a filesystem freeze - in those
cases sb->s_umount is write locked and sync_filesystem() is called, which
calls writeback_inodes_sb(). In other words, in all cases where we can't
take a read lock on sb->s_umount, writeback is already being triggered
elsewhere.

An alternative would be to call btrfs_start_delalloc_roots() with a
number of pages different from LONG_MAX, for example matching the number
of delalloc bytes we currently have, in which case we would end up
starting all delalloc with filemap_fdatawrite_wbc() and not with an
async flush via filemap_flush() - that is only possible after the rather
recent commit e076ab2 ("btrfs: shrink delalloc pages instead of
full inodes"). However that creates a whole new can of worms due to new
lock dependencies, which lockdep complains, like for example:

[ 8948.247280] ======================================================
[ 8948.247823] WARNING: possible circular locking dependency detected
[ 8948.248353] 5.17.0-rc1-btrfs-next-111 #1 Not tainted
[ 8948.248786] ------------------------------------------------------
[ 8948.249320] kworker/u16:18/933570 is trying to acquire lock:
[ 8948.249812] ffff9b3de1591690 (sb_internal#2){.+.+}-{0:0}, at: find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.250638]
               but task is already holding lock:
[ 8948.251140] ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.252018]
               which lock already depends on the new lock.

[ 8948.252710]
               the existing dependency chain (in reverse order) is:
[ 8948.253343]
               -> #2 (&root->delalloc_mutex){+.+.}-{3:3}:
[ 8948.253950]        __mutex_lock+0x90/0x900
[ 8948.254354]        start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.254859]        btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.255408]        btrfs_commit_transaction+0x32f/0xc00 [btrfs]
[ 8948.255942]        btrfs_mksubvol+0x380/0x570 [btrfs]
[ 8948.256406]        btrfs_mksnapshot+0x81/0xb0 [btrfs]
[ 8948.256870]        __btrfs_ioctl_snap_create+0x17f/0x190 [btrfs]
[ 8948.257413]        btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[ 8948.257961]        btrfs_ioctl+0x1196/0x3630 [btrfs]
[ 8948.258418]        __x64_sys_ioctl+0x83/0xb0
[ 8948.258793]        do_syscall_64+0x3b/0xc0
[ 8948.259146]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 8948.259709]
               -> #1 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}:
[ 8948.260330]        __mutex_lock+0x90/0x900
[ 8948.260692]        btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
[ 8948.261234]        btrfs_commit_transaction+0x32f/0xc00 [btrfs]
[ 8948.261766]        btrfs_set_free_space_cache_v1_active+0x38/0x60 [btrfs]
[ 8948.262379]        btrfs_start_pre_rw_mount+0x119/0x180 [btrfs]
[ 8948.262909]        open_ctree+0x1511/0x171e [btrfs]
[ 8948.263359]        btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 8948.263863]        legacy_get_tree+0x30/0x50
[ 8948.264242]        vfs_get_tree+0x28/0xc0
[ 8948.264594]        vfs_kern_mount.part.0+0x71/0xb0
[ 8948.265017]        btrfs_mount+0x11d/0x3a0 [btrfs]
[ 8948.265462]        legacy_get_tree+0x30/0x50
[ 8948.265851]        vfs_get_tree+0x28/0xc0
[ 8948.266203]        path_mount+0x2d4/0xbe0
[ 8948.266554]        __x64_sys_mount+0x103/0x140
[ 8948.266940]        do_syscall_64+0x3b/0xc0
[ 8948.267300]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 8948.267790]
               -> #0 (sb_internal#2){.+.+}-{0:0}:
[ 8948.268322]        __lock_acquire+0x12e8/0x2260
[ 8948.268733]        lock_acquire+0xd7/0x310
[ 8948.269092]        start_transaction+0x44c/0x6e0 [btrfs]
[ 8948.269591]        find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.270087]        btrfs_reserve_extent+0x14b/0x280 [btrfs]
[ 8948.270588]        cow_file_range+0x17e/0x490 [btrfs]
[ 8948.271051]        btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
[ 8948.271586]        writepage_delalloc+0xb5/0x170 [btrfs]
[ 8948.272071]        __extent_writepage+0x156/0x3c0 [btrfs]
[ 8948.272579]        extent_write_cache_pages+0x263/0x460 [btrfs]
[ 8948.273113]        extent_writepages+0x76/0x130 [btrfs]
[ 8948.273573]        do_writepages+0xd2/0x1c0
[ 8948.273942]        filemap_fdatawrite_wbc+0x68/0x90
[ 8948.274371]        start_delalloc_inodes+0x17f/0x400 [btrfs]
[ 8948.274876]        btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.275417]        flush_space+0x1f2/0x630 [btrfs]
[ 8948.275863]        btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[ 8948.276438]        process_one_work+0x252/0x5a0
[ 8948.276829]        worker_thread+0x55/0x3b0
[ 8948.277189]        kthread+0xf2/0x120
[ 8948.277506]        ret_from_fork+0x22/0x30
[ 8948.277868]
               other info that might help us debug this:

[ 8948.278548] Chain exists of:
                 sb_internal#2 --> &fs_info->delalloc_root_mutex --> &root->delalloc_mutex

[ 8948.279601]  Possible unsafe locking scenario:

[ 8948.280102]        CPU0                    CPU1
[ 8948.280508]        ----                    ----
[ 8948.280915]   lock(&root->delalloc_mutex);
[ 8948.281271]                                lock(&fs_info->delalloc_root_mutex);
[ 8948.281915]                                lock(&root->delalloc_mutex);
[ 8948.282487]   lock(sb_internal#2);
[ 8948.282800]
                *** DEADLOCK ***

[ 8948.283333] 4 locks held by kworker/u16:18/933570:
[ 8948.283750]  #0: ffff9b3dc00a9d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
[ 8948.284609]  #1: ffffa90349dafe70 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
[ 8948.285637]  #2: ffff9b3e14db5040 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}, at: btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
[ 8948.286674]  #3: ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.287596]
              stack backtrace:
[ 8948.287975] CPU: 3 PID: 933570 Comm: kworker/u16:18 Not tainted 5.17.0-rc1-btrfs-next-111 #1
[ 8948.288677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 8948.289649] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
[ 8948.290298] Call Trace:
[ 8948.290517]  <TASK>
[ 8948.290700]  dump_stack_lvl+0x59/0x73
[ 8948.291026]  check_noncircular+0xf3/0x110
[ 8948.291375]  ? start_transaction+0x228/0x6e0 [btrfs]
[ 8948.291826]  __lock_acquire+0x12e8/0x2260
[ 8948.292241]  lock_acquire+0xd7/0x310
[ 8948.292714]  ? find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.293241]  ? lock_is_held_type+0xea/0x140
[ 8948.293601]  start_transaction+0x44c/0x6e0 [btrfs]
[ 8948.294055]  ? find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.294518]  find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.294957]  ? _raw_spin_unlock+0x29/0x40
[ 8948.295312]  ? btrfs_get_alloc_profile+0x124/0x290 [btrfs]
[ 8948.295813]  btrfs_reserve_extent+0x14b/0x280 [btrfs]
[ 8948.296270]  cow_file_range+0x17e/0x490 [btrfs]
[ 8948.296691]  btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
[ 8948.297175]  ? find_lock_delalloc_range+0x247/0x270 [btrfs]
[ 8948.297678]  writepage_delalloc+0xb5/0x170 [btrfs]
[ 8948.298123]  __extent_writepage+0x156/0x3c0 [btrfs]
[ 8948.298570]  extent_write_cache_pages+0x263/0x460 [btrfs]
[ 8948.299061]  extent_writepages+0x76/0x130 [btrfs]
[ 8948.299495]  do_writepages+0xd2/0x1c0
[ 8948.299817]  ? sched_clock_cpu+0xd/0x110
[ 8948.300160]  ? lock_release+0x155/0x4a0
[ 8948.300494]  filemap_fdatawrite_wbc+0x68/0x90
[ 8948.300874]  ? do_raw_spin_unlock+0x4b/0xa0
[ 8948.301243]  start_delalloc_inodes+0x17f/0x400 [btrfs]
[ 8948.301706]  ? lock_release+0x155/0x4a0
[ 8948.302055]  btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.302564]  flush_space+0x1f2/0x630 [btrfs]
[ 8948.302970]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[ 8948.303510]  process_one_work+0x252/0x5a0
[ 8948.303860]  ? process_one_work+0x5a0/0x5a0
[ 8948.304221]  worker_thread+0x55/0x3b0
[ 8948.304543]  ? process_one_work+0x5a0/0x5a0
[ 8948.304904]  kthread+0xf2/0x120
[ 8948.305184]  ? kthread_complete_and_exit+0x20/0x20
[ 8948.305598]  ret_from_fork+0x22/0x30
[ 8948.305921]  </TASK>

It all comes from the fact that btrfs_start_delalloc_roots() takes the
delalloc_root_mutex, in the transaction commit path we are holding a
read lock on one of the superblock's freeze semaphores (via
sb_start_intwrite()), the async reclaim task can also do a call to
btrfs_start_delalloc_roots(), which ends up triggering writeback with
calls to filemap_fdatawrite_wbc(), resulting in extent allocation which
in turn can call btrfs_start_transaction(), which will result in taking
the freeze semaphore via sb_start_intwrite(), forming a nasty dependency
on all those locks which can be taken in different orders by different
code paths.

So just adopt the simple approach of calling try_to_writeback_inodes_sb()
at btrfs_start_delalloc_flush().

Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
[ add more link reports ]
Signed-off-by: David Sterba <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Feb 18, 2022
Quota disable ioctl starts a transaction before waiting for the qgroup
rescan worker completes. However, this wait can be infinite and results
in deadlock because of circular dependency among the quota disable
ioctl, the qgroup rescan worker and the other task with transaction such
as block group relocation task.

The deadlock happens with the steps following:

1) Task A calls ioctl to disable quota. It starts a transaction and
   waits for qgroup rescan worker completes.
2) Task B such as block group relocation task starts a transaction and
   joins to the transaction that task A started. Then task B commits to
   the transaction. In this commit, task B waits for a commit by task A.
3) Task C as the qgroup rescan worker starts its job and starts a
   transaction. In this transaction start, task C waits for completion
   of the transaction that task A started and task B committed.

This deadlock was found with fstests test case btrfs/115 and a zoned
null_blk device. The test case enables and disables quota, and the
block group reclaim was triggered during the quota disable by chance.
The deadlock was also observed by running quota enable and disable in
parallel with 'btrfs balance' command on regular null_blk devices.

An example report of the deadlock:

  [372.469894] INFO: task kworker/u16:6:103 blocked for more than 122 seconds.
  [372.479944]       Not tainted 5.16.0-rc8 #7
  [372.485067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [372.493898] task:kworker/u16:6   state:D stack:    0 pid:  103 ppid:     2 flags:0x00004000
  [372.503285] Workqueue: btrfs-qgroup-rescan btrfs_work_helper [btrfs]
  [372.510782] Call Trace:
  [372.514092]  <TASK>
  [372.521684]  __schedule+0xb56/0x4850
  [372.530104]  ? io_schedule_timeout+0x190/0x190
  [372.538842]  ? lockdep_hardirqs_on+0x7e/0x100
  [372.547092]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
  [372.555591]  schedule+0xe0/0x270
  [372.561894]  btrfs_commit_transaction+0x18bb/0x2610 [btrfs]
  [372.570506]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
  [372.578875]  ? free_unref_page+0x3f2/0x650
  [372.585484]  ? finish_wait+0x270/0x270
  [372.591594]  ? release_extent_buffer+0x224/0x420 [btrfs]
  [372.599264]  btrfs_qgroup_rescan_worker+0xc13/0x10c0 [btrfs]
  [372.607157]  ? lock_release+0x3a9/0x6d0
  [372.613054]  ? btrfs_qgroup_account_extent+0xda0/0xda0 [btrfs]
  [372.620960]  ? do_raw_spin_lock+0x11e/0x250
  [372.627137]  ? rwlock_bug.part.0+0x90/0x90
  [372.633215]  ? lock_is_held_type+0xe4/0x140
  [372.639404]  btrfs_work_helper+0x1ae/0xa90 [btrfs]
  [372.646268]  process_one_work+0x7e9/0x1320
  [372.652321]  ? lock_release+0x6d0/0x6d0
  [372.658081]  ? pwq_dec_nr_in_flight+0x230/0x230
  [372.664513]  ? rwlock_bug.part.0+0x90/0x90
  [372.670529]  worker_thread+0x59e/0xf90
  [372.676172]  ? process_one_work+0x1320/0x1320
  [372.682440]  kthread+0x3b9/0x490
  [372.687550]  ? _raw_spin_unlock_irq+0x24/0x50
  [372.693811]  ? set_kthread_struct+0x100/0x100
  [372.700052]  ret_from_fork+0x22/0x30
  [372.705517]  </TASK>
  [372.709747] INFO: task btrfs-transacti:2347 blocked for more than 123 seconds.
  [372.729827]       Not tainted 5.16.0-rc8 #7
  [372.745907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [372.767106] task:btrfs-transacti state:D stack:    0 pid: 2347 ppid:     2 flags:0x00004000
  [372.787776] Call Trace:
  [372.801652]  <TASK>
  [372.812961]  __schedule+0xb56/0x4850
  [372.830011]  ? io_schedule_timeout+0x190/0x190
  [372.852547]  ? lockdep_hardirqs_on+0x7e/0x100
  [372.871761]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
  [372.886792]  schedule+0xe0/0x270
  [372.901685]  wait_current_trans+0x22c/0x310 [btrfs]
  [372.919743]  ? btrfs_put_transaction+0x3d0/0x3d0 [btrfs]
  [372.938923]  ? finish_wait+0x270/0x270
  [372.959085]  ? join_transaction+0xc75/0xe30 [btrfs]
  [372.977706]  start_transaction+0x938/0x10a0 [btrfs]
  [372.997168]  transaction_kthread+0x19d/0x3c0 [btrfs]
  [373.013021]  ? btrfs_cleanup_transaction.isra.0+0xfc0/0xfc0 [btrfs]
  [373.031678]  kthread+0x3b9/0x490
  [373.047420]  ? _raw_spin_unlock_irq+0x24/0x50
  [373.064645]  ? set_kthread_struct+0x100/0x100
  [373.078571]  ret_from_fork+0x22/0x30
  [373.091197]  </TASK>
  [373.105611] INFO: task btrfs:3145 blocked for more than 123 seconds.
  [373.114147]       Not tainted 5.16.0-rc8 #7
  [373.120401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [373.130393] task:btrfs           state:D stack:    0 pid: 3145 ppid:  3141 flags:0x00004000
  [373.140998] Call Trace:
  [373.145501]  <TASK>
  [373.149654]  __schedule+0xb56/0x4850
  [373.155306]  ? io_schedule_timeout+0x190/0x190
  [373.161965]  ? lockdep_hardirqs_on+0x7e/0x100
  [373.168469]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
  [373.175468]  schedule+0xe0/0x270
  [373.180814]  wait_for_commit+0x104/0x150 [btrfs]
  [373.187643]  ? test_and_set_bit+0x20/0x20 [btrfs]
  [373.194772]  ? kmem_cache_free+0x124/0x550
  [373.201191]  ? btrfs_put_transaction+0x69/0x3d0 [btrfs]
  [373.208738]  ? finish_wait+0x270/0x270
  [373.214704]  ? __btrfs_end_transaction+0x347/0x7b0 [btrfs]
  [373.222342]  btrfs_commit_transaction+0x44d/0x2610 [btrfs]
  [373.230233]  ? join_transaction+0x255/0xe30 [btrfs]
  [373.237334]  ? btrfs_record_root_in_trans+0x4d/0x170 [btrfs]
  [373.245251]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
  [373.253296]  relocate_block_group+0x105/0xc20 [btrfs]
  [373.260533]  ? mutex_lock_io_nested+0x1270/0x1270
  [373.267516]  ? btrfs_wait_nocow_writers+0x85/0x180 [btrfs]
  [373.275155]  ? merge_reloc_roots+0x710/0x710 [btrfs]
  [373.283602]  ? btrfs_wait_ordered_extents+0xd30/0xd30 [btrfs]
  [373.291934]  ? kmem_cache_free+0x124/0x550
  [373.298180]  btrfs_relocate_block_group+0x35c/0x930 [btrfs]
  [373.306047]  btrfs_relocate_chunk+0x85/0x210 [btrfs]
  [373.313229]  btrfs_balance+0x12f4/0x2d20 [btrfs]
  [373.320227]  ? lock_release+0x3a9/0x6d0
  [373.326206]  ? btrfs_relocate_chunk+0x210/0x210 [btrfs]
  [373.333591]  ? lock_is_held_type+0xe4/0x140
  [373.340031]  ? rcu_read_lock_sched_held+0x3f/0x70
  [373.346910]  btrfs_ioctl_balance+0x548/0x700 [btrfs]
  [373.354207]  btrfs_ioctl+0x7f2/0x71b0 [btrfs]
  [373.360774]  ? lockdep_hardirqs_on_prepare+0x410/0x410
  [373.367957]  ? lockdep_hardirqs_on_prepare+0x410/0x410
  [373.375327]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
  [373.383841]  ? find_held_lock+0x2c/0x110
  [373.389993]  ? lock_release+0x3a9/0x6d0
  [373.395828]  ? mntput_no_expire+0xf7/0xad0
  [373.402083]  ? lock_is_held_type+0xe4/0x140
  [373.408249]  ? vfs_fileattr_set+0x9f0/0x9f0
  [373.414486]  ? selinux_file_ioctl+0x349/0x4e0
  [373.420938]  ? trace_raw_output_lock+0xb4/0xe0
  [373.427442]  ? selinux_inode_getsecctx+0x80/0x80
  [373.434224]  ? lockdep_hardirqs_on+0x7e/0x100
  [373.440660]  ? force_qs_rnp+0x2a0/0x6b0
  [373.446534]  ? lock_is_held_type+0x9b/0x140
  [373.452763]  ? __blkcg_punt_bio_submit+0x1b0/0x1b0
  [373.459732]  ? security_file_ioctl+0x50/0x90
  [373.466089]  __x64_sys_ioctl+0x127/0x190
  [373.472022]  do_syscall_64+0x3b/0x90
  [373.477513]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [373.484823] RIP: 0033:0x7f8f4af7e2bb
  [373.490493] RSP: 002b:00007ffcbf936178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [373.500197] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8f4af7e2bb
  [373.509451] RDX: 00007ffcbf936220 RSI: 00000000c4009420 RDI: 0000000000000003
  [373.518659] RBP: 00007ffcbf93774a R08: 0000000000000013 R09: 00007f8f4b02d4e0
  [373.527872] R10: 00007f8f4ae87740 R11: 0000000000000246 R12: 0000000000000001
  [373.537222] R13: 00007ffcbf936220 R14: 0000000000000000 R15: 0000000000000002
  [373.546506]  </TASK>
  [373.550878] INFO: task btrfs:3146 blocked for more than 123 seconds.
  [373.559383]       Not tainted 5.16.0-rc8 #7
  [373.565748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [373.575748] task:btrfs           state:D stack:    0 pid: 3146 ppid:  2168 flags:0x00000000
  [373.586314] Call Trace:
  [373.590846]  <TASK>
  [373.595121]  __schedule+0xb56/0x4850
  [373.600901]  ? __lock_acquire+0x23db/0x5030
  [373.607176]  ? io_schedule_timeout+0x190/0x190
  [373.613954]  schedule+0xe0/0x270
  [373.619157]  schedule_timeout+0x168/0x220
  [373.625170]  ? usleep_range_state+0x150/0x150
  [373.631653]  ? mark_held_locks+0x9e/0xe0
  [373.637767]  ? do_raw_spin_lock+0x11e/0x250
  [373.643993]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
  [373.651267]  ? _raw_spin_unlock_irq+0x24/0x50
  [373.657677]  ? lockdep_hardirqs_on+0x7e/0x100
  [373.664103]  wait_for_completion+0x163/0x250
  [373.670437]  ? bit_wait_timeout+0x160/0x160
  [373.676585]  btrfs_quota_disable+0x176/0x9a0 [btrfs]
  [373.683979]  ? btrfs_quota_enable+0x12f0/0x12f0 [btrfs]
  [373.691340]  ? down_write+0xd0/0x130
  [373.696880]  ? down_write_killable+0x150/0x150
  [373.703352]  btrfs_ioctl+0x3945/0x71b0 [btrfs]
  [373.710061]  ? find_held_lock+0x2c/0x110
  [373.716192]  ? lock_release+0x3a9/0x6d0
  [373.722047]  ? __handle_mm_fault+0x23cd/0x3050
  [373.728486]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
  [373.737032]  ? set_pte+0x6a/0x90
  [373.742271]  ? do_raw_spin_unlock+0x55/0x1f0
  [373.748506]  ? lock_is_held_type+0xe4/0x140
  [373.754792]  ? vfs_fileattr_set+0x9f0/0x9f0
  [373.761083]  ? selinux_file_ioctl+0x349/0x4e0
  [373.767521]  ? selinux_inode_getsecctx+0x80/0x80
  [373.774247]  ? __up_read+0x182/0x6e0
  [373.780026]  ? count_memcg_events.constprop.0+0x46/0x60
  [373.787281]  ? up_write+0x460/0x460
  [373.792932]  ? security_file_ioctl+0x50/0x90
  [373.799232]  __x64_sys_ioctl+0x127/0x190
  [373.805237]  do_syscall_64+0x3b/0x90
  [373.810947]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [373.818102] RIP: 0033:0x7f1383ea02bb
  [373.823847] RSP: 002b:00007fffeb4d71f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  [373.833641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1383ea02bb
  [373.842961] RDX: 00007fffeb4d7210 RSI: 00000000c0109428 RDI: 0000000000000003
  [373.852179] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
  [373.861408] R10: 00007f1383daec78 R11: 0000000000000202 R12: 00007fffeb4d874a
  [373.870647] R13: 0000000000493099 R14: 0000000000000001 R15: 0000000000000000
  [373.879838]  </TASK>
  [373.884018]
               Showing all locks held in the system:
  [373.894250] 3 locks held by kworker/4:1/58:
  [373.900356] 1 lock held by khungtaskd/63:
  [373.906333]  #0: ffffffff8945ff60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260
  [373.917307] 3 locks held by kworker/u16:6/103:
  [373.923938]  #0: ffff888127b4f138 ((wq_completion)btrfs-qgroup-rescan){+.+.}-{0:0}, at: process_one_work+0x712/0x1320
  [373.936555]  #1: ffff88810b817dd8 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x73f/0x1320
  [373.951109]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_qgroup_rescan_worker+0x1f6/0x10c0 [btrfs]
  [373.964027] 2 locks held by less/1803:
  [373.969982]  #0: ffff88813ed56098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x80
  [373.981295]  #1: ffffc90000b3b2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x9e2/0x1060
  [373.992969] 1 lock held by btrfs-transacti/2347:
  [373.999893]  #0: ffff88813d4887a8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0xe3/0x3c0 [btrfs]
  [374.015872] 3 locks held by btrfs/3145:
  [374.022298]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl_balance+0xc3/0x700 [btrfs]
  [374.034456]  #1: ffff88813d48a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0xfe5/0x2d20 [btrfs]
  [374.047646]  #2: ffff88813d488838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x354/0x930 [btrfs]
  [374.063295] 4 locks held by btrfs/3146:
  [374.069647]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl+0x38b1/0x71b0 [btrfs]
  [374.081601]  #1: ffff88813d488bb8 (&fs_info->subvol_sem){+.+.}-{3:3}, at: btrfs_ioctl+0x38fd/0x71b0 [btrfs]
  [374.094283]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_disable+0xc8/0x9a0 [btrfs]
  [374.106885]  #3: ffff88813d489800 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_disable+0xd5/0x9a0 [btrfs]

  [374.126780] =============================================

To avoid the deadlock, wait for the qgroup rescan worker to complete
before starting the transaction for the quota disable ioctl. Clear
BTRFS_FS_QUOTA_ENABLE flag before the wait and the transaction to
request the worker to complete. On transaction start failure, set the
BTRFS_FS_QUOTA_ENABLE flag again. These BTRFS_FS_QUOTA_ENABLE flag
changes can be done safely since the function btrfs_quota_disable is not
called concurrently because of fs_info->subvol_sem.

Also check the BTRFS_FS_QUOTA_ENABLE flag in qgroup_rescan_init to avoid
another qgroup rescan worker to start after the previous qgroup worker
completed.

CC: [email protected] # 5.4+
Suggested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Shin'ichiro Kawasaki <[email protected]>
Signed-off-by: David Sterba <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Feb 18, 2022
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.17, take #2

- A couple of fixes when handling an exception while a SError has been
  delivered

- Workaround for Cortex-A510's single-step[ erratum
kernel-patches-bot pushed a commit that referenced this pull request Feb 18, 2022
Yonghong Song says:

====================

The patch [1] exposed a bpf_timer initialization bug in function
check_and_init_map_value(). With bug fix here, the patch [1]
can be applied with all selftests passed. Please see individual
patches for fix details.

  [1] https://lore.kernel.org/bpf/[email protected]/

Changelog:
  v3 -> v4:
    . move header file in patch #1 to avoid bpf-next merge conflict
  v2 -> v3:
    . switch patch #1 and patch #2 for better bisecting
  v1 -> v2:
    . add Fixes tag for patch #1
    . rebase against bpf tree
====================

Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Feb 18, 2022
Sebastian Andrzej Siewior says:

====================
net: dev: PREEMPT_RT fixups.

this series removes or replaces preempt_disable() and local_irq_save()
sections which are problematic on PREEMPT_RT.
Patch 2 makes netif_rx() work from any context after I found suggestions
for it in an old thread. Should that work, then the context-specific
variants could be removed.

v2…v3:
   - #2
     - Export __netif_rx() so it can be used by everyone.
     - Add a lockdep assert to check for interrupt context.
     - Update the kernel doc and mention that the skb is posted to
       backlog NAPI.
     - Use __netif_rx() also in drivers/net/*.c.
     - Added Toke''s review tag and kept Eric's desptite the changes
       made.

v1…v2:
  - #1 and #2
    - merge patch 1 und 2 from the series (as per Toke).
    - updated patch description and corrected the first commit number (as
      per Eric).
   - #2
     - Provide netif_rx() as in v1 and additionally __netif_rx() without
       local_bh disable()+enable() for the loopback driver. __netif_rx() is
       not exported (loopback is built-in only) so it won't be used
       drivers. If this doesn't work then we can still export/ define a
       wrapper as Eric suggested.
     - Added a comment that netif_rx() considered legacy.
   - #3
     - Moved ____napi_schedule() into rps_ipi_queued() and
       renamed it napi_schedule_rps().
   https://lore.kernel.org/all/[email protected]/

v1:
   https://lore.kernel.org/all/[email protected]
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Feb 22, 2022
Fix a race in the xsk socket teardown code that can lead to a null
pointer dereference splat. The current xsk unbind code in
xsk_unbind_dev() starts by setting xs->state to XSK_UNBOUND, sets
xs->dev to NULL and then waits for any NAPI processing to terminate
using synchronize_net(). After that, the release code starts to tear
down the socket state and free allocated memory.

BUG: kernel NULL pointer dereference, address: 00000000000000c0
PGD 8000000932469067 P4D 8000000932469067 PUD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G          I       5.16.0+ #2
Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015
RIP: 0010:__xsk_sendmsg+0x2c/0x690
Code: 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 38 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 87 08 03 00 00 <f6> 80 c0 00 00 00 01 >
RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258
RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800
RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800
R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000
FS:  00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0
Call Trace:
<TASK>
? aa_sk_perm+0x43/0x1b0
xsk_sendmsg+0xf0/0x110
sock_sendmsg+0x65/0x70
__sys_sendto+0x113/0x190
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x23/0x50
? exit_to_user_mode_prepare+0xa5/0x1d0
__x64_sys_sendto+0x29/0x30
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

There are two problems with the current code. First, setting xs->dev
to NULL before waiting for all users to stop using the socket is not
correct. The entry to the data plane functions xsk_poll(),
xsk_sendmsg(), and xsk_recvmsg() are all guarded by a test that
xs->state is in the state XSK_BOUND and if not, it returns right
away. But one process might have passed this test but still have not
gotten to the point in which it uses xs->dev in the code. In this
interim, a second process executing xsk_unbind_dev() might have set
xs->dev to NULL which will lead to a crash for the first process. The
solution here is just to get rid of this NULL assignment since it is
not used anymore. Before commit 42fddcc ("xsk: use state member
for socket synchronization"), xs->dev was the gatekeeper to admit
processes into the data plane functions, but it was replaced with the
state variable xs->state in the aforementioned commit.

The second problem is that synchronize_net() does not wait for any
process in xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete,
which means that the state they rely on might be cleaned up
prematurely. This can happen when the notifier gets called (at driver
unload for example) as it uses xsk_unbind_dev(). Solve this by
extending the RCU critical region from just the ndo_xsk_wakeup to the
whole functions mentioned above, so that both the test of xs->state ==
XSK_BOUND and the last use of any member of xs is covered by the RCU
critical section. This will guarantee that when synchronize_net()
completes, there will be no processes left executing xsk_poll(),
xsk_sendmsg(), or xsk_recvmsg() and state can be cleaned up
safely. Note that we need to drop the RCU lock for the SKB xmit path
as it uses functions that might sleep. Due to this, we have to retest
the xs->state after we grab the mutex that protects the SKB xmit code
from, among a number of things, an xsk_unbind_dev() being executed
from the notifier at the same time.

Fixes: 42fddcc ("xsk: use state member for socket synchronization")
Reported-by: Elza Mathew <[email protected]>
Signed-off-by: Magnus Karlsson <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Feb 28, 2022
Fix a race in the xsk socket teardown code that can lead to a null
pointer dereference splat. The current xsk unbind code in
xsk_unbind_dev() starts by setting xs->state to XSK_UNBOUND, sets
xs->dev to NULL and then waits for any NAPI processing to terminate
using synchronize_net(). After that, the release code starts to tear
down the socket state and free allocated memory.

BUG: kernel NULL pointer dereference, address: 00000000000000c0
PGD 8000000932469067 P4D 8000000932469067 PUD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G          I       5.16.0+ #2
Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015
RIP: 0010:__xsk_sendmsg+0x2c/0x690
Code: 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 38 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 87 08 03 00 00 <f6> 80 c0 00 00 00 01 >
RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258
RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800
RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800
R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000
FS:  00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0
Call Trace:
<TASK>
? aa_sk_perm+0x43/0x1b0
xsk_sendmsg+0xf0/0x110
sock_sendmsg+0x65/0x70
__sys_sendto+0x113/0x190
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x23/0x50
? exit_to_user_mode_prepare+0xa5/0x1d0
__x64_sys_sendto+0x29/0x30
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

There are two problems with the current code. First, setting xs->dev
to NULL before waiting for all users to stop using the socket is not
correct. The entry to the data plane functions xsk_poll(),
xsk_sendmsg(), and xsk_recvmsg() are all guarded by a test that
xs->state is in the state XSK_BOUND and if not, it returns right
away. But one process might have passed this test but still have not
gotten to the point in which it uses xs->dev in the code. In this
interim, a second process executing xsk_unbind_dev() might have set
xs->dev to NULL which will lead to a crash for the first process. The
solution here is just to get rid of this NULL assignment since it is
not used anymore. Before commit 42fddcc ("xsk: use state member
for socket synchronization"), xs->dev was the gatekeeper to admit
processes into the data plane functions, but it was replaced with the
state variable xs->state in the aforementioned commit.

The second problem is that synchronize_net() does not wait for any
process in xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete,
which means that the state they rely on might be cleaned up
prematurely. This can happen when the notifier gets called (at driver
unload for example) as it uses xsk_unbind_dev(). Solve this by
extending the RCU critical region from just the ndo_xsk_wakeup to the
whole functions mentioned above, so that both the test of xs->state ==
XSK_BOUND and the last use of any member of xs is covered by the RCU
critical section. This will guarantee that when synchronize_net()
completes, there will be no processes left executing xsk_poll(),
xsk_sendmsg(), or xsk_recvmsg() and state can be cleaned up
safely. Note that we need to drop the RCU lock for the SKB xmit path
as it uses functions that might sleep. Due to this, we have to retest
the xs->state after we grab the mutex that protects the SKB xmit code
from, among a number of things, an xsk_unbind_dev() being executed
from the notifier at the same time.

v1 -> v2:
* Naming xsk_zc_xmit() -> xsk_wakeup() [Maciej]

Fixes: 42fddcc ("xsk: use state member for socket synchronization")
Reported-by: Elza Mathew <[email protected]>
Signed-off-by: Magnus Karlsson <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Feb 28, 2022
Fix a race in the xsk socket teardown code that can lead to a null
pointer dereference splat. The current xsk unbind code in
xsk_unbind_dev() starts by setting xs->state to XSK_UNBOUND, sets
xs->dev to NULL and then waits for any NAPI processing to terminate
using synchronize_net(). After that, the release code starts to tear
down the socket state and free allocated memory.

BUG: kernel NULL pointer dereference, address: 00000000000000c0
PGD 8000000932469067 P4D 8000000932469067 PUD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G          I       5.16.0+ #2
Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015
RIP: 0010:__xsk_sendmsg+0x2c/0x690
Code: 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 38 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 87 08 03 00 00 <f6> 80 c0 00 00 00 01 >
RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258
RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800
RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800
R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000
FS:  00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0
Call Trace:
<TASK>
? aa_sk_perm+0x43/0x1b0
xsk_sendmsg+0xf0/0x110
sock_sendmsg+0x65/0x70
__sys_sendto+0x113/0x190
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x23/0x50
? exit_to_user_mode_prepare+0xa5/0x1d0
__x64_sys_sendto+0x29/0x30
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

There are two problems with the current code. First, setting xs->dev
to NULL before waiting for all users to stop using the socket is not
correct. The entry to the data plane functions xsk_poll(),
xsk_sendmsg(), and xsk_recvmsg() are all guarded by a test that
xs->state is in the state XSK_BOUND and if not, it returns right
away. But one process might have passed this test but still have not
gotten to the point in which it uses xs->dev in the code. In this
interim, a second process executing xsk_unbind_dev() might have set
xs->dev to NULL which will lead to a crash for the first process. The
solution here is just to get rid of this NULL assignment since it is
not used anymore. Before commit 42fddcc ("xsk: use state member
for socket synchronization"), xs->dev was the gatekeeper to admit
processes into the data plane functions, but it was replaced with the
state variable xs->state in the aforementioned commit.

The second problem is that synchronize_net() does not wait for any
process in xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete,
which means that the state they rely on might be cleaned up
prematurely. This can happen when the notifier gets called (at driver
unload for example) as it uses xsk_unbind_dev(). Solve this by
extending the RCU critical region from just the ndo_xsk_wakeup to the
whole functions mentioned above, so that both the test of xs->state ==
XSK_BOUND and the last use of any member of xs is covered by the RCU
critical section. This will guarantee that when synchronize_net()
completes, there will be no processes left executing xsk_poll(),
xsk_sendmsg(), or xsk_recvmsg() and state can be cleaned up
safely. Note that we need to drop the RCU lock for the SKB xmit path
as it uses functions that might sleep. Due to this, we have to retest
the xs->state after we grab the mutex that protects the SKB xmit code
from, among a number of things, an xsk_unbind_dev() being executed
from the notifier at the same time.

v1 -> v2:
* Naming xsk_zc_xmit() -> xsk_wakeup() [Maciej]

Fixes: 42fddcc ("xsk: use state member for socket synchronization")
Reported-by: Elza Mathew <[email protected]>
Signed-off-by: Magnus Karlsson <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Feb 28, 2022
Fix a race in the xsk socket teardown code that can lead to a null
pointer dereference splat. The current xsk unbind code in
xsk_unbind_dev() starts by setting xs->state to XSK_UNBOUND, sets
xs->dev to NULL and then waits for any NAPI processing to terminate
using synchronize_net(). After that, the release code starts to tear
down the socket state and free allocated memory.

BUG: kernel NULL pointer dereference, address: 00000000000000c0
PGD 8000000932469067 P4D 8000000932469067 PUD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G          I       5.16.0+ #2
Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015
RIP: 0010:__xsk_sendmsg+0x2c/0x690
Code: 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 38 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 87 08 03 00 00 <f6> 80 c0 00 00 00 01 >
RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258
RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800
RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800
R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000
FS:  00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0
Call Trace:
<TASK>
? aa_sk_perm+0x43/0x1b0
xsk_sendmsg+0xf0/0x110
sock_sendmsg+0x65/0x70
__sys_sendto+0x113/0x190
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x23/0x50
? exit_to_user_mode_prepare+0xa5/0x1d0
__x64_sys_sendto+0x29/0x30
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

There are two problems with the current code. First, setting xs->dev
to NULL before waiting for all users to stop using the socket is not
correct. The entry to the data plane functions xsk_poll(),
xsk_sendmsg(), and xsk_recvmsg() are all guarded by a test that
xs->state is in the state XSK_BOUND and if not, it returns right
away. But one process might have passed this test but still have not
gotten to the point in which it uses xs->dev in the code. In this
interim, a second process executing xsk_unbind_dev() might have set
xs->dev to NULL which will lead to a crash for the first process. The
solution here is just to get rid of this NULL assignment since it is
not used anymore. Before commit 42fddcc ("xsk: use state member
for socket synchronization"), xs->dev was the gatekeeper to admit
processes into the data plane functions, but it was replaced with the
state variable xs->state in the aforementioned commit.

The second problem is that synchronize_net() does not wait for any
process in xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete,
which means that the state they rely on might be cleaned up
prematurely. This can happen when the notifier gets called (at driver
unload for example) as it uses xsk_unbind_dev(). Solve this by
extending the RCU critical region from just the ndo_xsk_wakeup to the
whole functions mentioned above, so that both the test of xs->state ==
XSK_BOUND and the last use of any member of xs is covered by the RCU
critical section. This will guarantee that when synchronize_net()
completes, there will be no processes left executing xsk_poll(),
xsk_sendmsg(), or xsk_recvmsg() and state can be cleaned up
safely. Note that we need to drop the RCU lock for the SKB xmit path
as it uses functions that might sleep. Due to this, we have to retest
the xs->state after we grab the mutex that protects the SKB xmit code
from, among a number of things, an xsk_unbind_dev() being executed
from the notifier at the same time.

v1 -> v2:
* Naming xsk_zc_xmit() -> xsk_wakeup() [Maciej]

Fixes: 42fddcc ("xsk: use state member for socket synchronization")
Reported-by: Elza Mathew <[email protected]>
Signed-off-by: Magnus Karlsson <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Feb 28, 2022
Fix a race in the xsk socket teardown code that can lead to a NULL pointer
dereference splat. The current xsk unbind code in xsk_unbind_dev() starts by
setting xs->state to XSK_UNBOUND, sets xs->dev to NULL and then waits for any
NAPI processing to terminate using synchronize_net(). After that, the release
code starts to tear down the socket state and free allocated memory.

  BUG: kernel NULL pointer dereference, address: 00000000000000c0
  PGD 8000000932469067 P4D 8000000932469067 PUD 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G          I       5.16.0+ #2
  Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015
  RIP: 0010:__xsk_sendmsg+0x2c/0x690
  [...]
  RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258
  RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800
  RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800
  R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000
  FS:  00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0
  Call Trace:
  <TASK>
  ? aa_sk_perm+0x43/0x1b0
  xsk_sendmsg+0xf0/0x110
  sock_sendmsg+0x65/0x70
  __sys_sendto+0x113/0x190
  ? debug_smp_processor_id+0x17/0x20
  ? fpregs_assert_state_consistent+0x23/0x50
  ? exit_to_user_mode_prepare+0xa5/0x1d0
  __x64_sys_sendto+0x29/0x30
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x44/0xae

There are two problems with the current code. First, setting xs->dev to NULL
before waiting for all users to stop using the socket is not correct. The
entry to the data plane functions xsk_poll(), xsk_sendmsg(), and xsk_recvmsg()
are all guarded by a test that xs->state is in the state XSK_BOUND and if not,
it returns right away. But one process might have passed this test but still
have not gotten to the point in which it uses xs->dev in the code. In this
interim, a second process executing xsk_unbind_dev() might have set xs->dev to
NULL which will lead to a crash for the first process. The solution here is
just to get rid of this NULL assignment since it is not used anymore. Before
commit 42fddcc ("xsk: use state member for socket synchronization"),
xs->dev was the gatekeeper to admit processes into the data plane functions,
but it was replaced with the state variable xs->state in the aforementioned
commit.

The second problem is that synchronize_net() does not wait for any process in
xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete, which means that the
state they rely on might be cleaned up prematurely. This can happen when the
notifier gets called (at driver unload for example) as it uses xsk_unbind_dev().
Solve this by extending the RCU critical region from just the ndo_xsk_wakeup
to the whole functions mentioned above, so that both the test of xs->state ==
XSK_BOUND and the last use of any member of xs is covered by the RCU critical
section. This will guarantee that when synchronize_net() completes, there will
be no processes left executing xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() and
state can be cleaned up safely. Note that we need to drop the RCU lock for the
skb xmit path as it uses functions that might sleep. Due to this, we have to
retest the xs->state after we grab the mutex that protects the skb xmit code
from, among a number of things, an xsk_unbind_dev() being executed from the
notifier at the same time.

Fixes: 42fddcc ("xsk: use state member for socket synchronization")
Reported-by: Elza Mathew <[email protected]>
Signed-off-by: Magnus Karlsson <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
kernel-patches-bot pushed a commit that referenced this pull request Feb 28, 2022
…ux/kernel/git/arm64/linux

Will Deacon says:

====================
On Tue, Feb 22, 2022 at 10:38:02PM +0000, Will Deacon wrote:
> On Thu, 17 Feb 2022 15:22:28 +0800, Hou Tao wrote:
> > Atomics support in bpf has already been done by "Atomics for eBPF"
> > patch series [1], but it only adds support for x86, and this patchset
> > adds support for arm64.
> >
> > Patch #1 & patch #2 are arm64 related. Patch #1 moves the common used
> > macro AARCH64_BREAK_FAULT into insn-def.h for insn.h. Patch #2 adds
> > necessary encoder helpers for atomic operations.
> >
> > [...]
>
> Applied to arm64 (for-next/insn), thanks!
>
> [1/4] arm64: move AARCH64_BREAK_FAULT into insn-def.h
>       https://git.kernel.org/arm64/c/97e58e395e9c
> [2/4] arm64: insn: add encoders for atomic operations
>       https://git.kernel.org/arm64/c/fa1114d9eba5

Daniel -- let's give this a day or so in -next, then if nothing catches
fire you're more than welcome to pull this branch as a base for the rest
of the series.
====================

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/20220222224211.GB16976@willie-the-truck
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
Ido Schimmel says:

====================
ipv4: Invalidate neighbour for broadcast address upon address addition

Patch #1 solves a recently reported issue [1]. See detailed description
in the changelog.

Patch #2 adds a matching test case.

Targeting at net-next since as far as I can tell this use case never
worked.

There are no regressions in fib_tests.sh with this change:

 # ./fib_tests.sh
 ...
 Tests passed: 186
 Tests failed:   0

[1] https://lore.kernel.org/netdev/[email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
This patch adds workaround for PTP errata given below.

1. At the time of 1 sec rollover of nano-second counter,
   the nano-second counter is set to 0. However, it should
   be set to (existing counter_value - 10^9). This leads to
   an accumulating error in the timestamp value with each sec
   rollover.
2. Additionally, the nano-second counter currently is rolling
   over at 'h3B9A_C9FF. It should roll over at 'h3B9A_CA00.

The workaround for issue #1 is to speed up the ptp clock by
adjusting PTP_CLOCK_COMP register to the desired value to
compensate for the nanoseconds lost per each second.

The workaround for issue #2 is to slow down the ptp clock
such that the rollover occurs at ~1sec.

Signed-off-by: Naveen Mamindlapalli <[email protected]>
Signed-off-by: Sunil Kovvuri Goutham <[email protected]>
Signed-off-by: Rakesh Babu Saladi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
Ido Schimmel says:

====================
mlxsw: Various updates

This patchset contains miscellaneous updates to mlxsw gathered over
time.

Patches #1-#2 fix recent regressions present in net-next.

Patches #3-#11 are small cleanups performed while adding line card
support in mlxsw.

Patch #12 adds the SFF-8024 Identifier Value of OSFP transceiver in
order to be able to dump their EEPROM contents over the ethtool IOCTL
interface.
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
Dust Li says:

====================
net/smc: some datapath performance optimizations

This series tries to improve the performance of SMC in datapath.

- patch #1, add sysctl interface to support tuning the behaviour of
  SMC in container environment.

- patch #2/#3, add autocorking support which is very efficient for small
  messages without trade-off for latency.

- patch #4, send directly on setting TCP_NODELAY, without wake up the
  TX worker, this make it consistent with clearing TCP_CORK.

- patch #5, this correct the setting of RMB window update limit, so
  we don't send CDC messages to update peer's RMB window too frequently
  in some cases.

- patch #6, implemented something like NAPI in SMC, decrease the number
  of hardirq when busy.

- patch #7, this moves TX work doing in the BH to the user context when
  sock_lock is hold by user.

With this patchset applied, we can get a good performance gain:
- qperf tcp_bw test has shown a great improvement. Other benchmarks like
  'netperf TCP_STREAM' or 'sockperf throughput' has similar result.
- In my testing environment, running qperf tcp_bw and tcp_lat, SMC behaves
  better then TCP in most all message size.

Here are some test results with the following testing command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
		-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf

==== Bandwidth ====
 MsgSize        Origin SMC              TCP                SMC with patches
       1         0.578 MB/s      2.392 MB/s(313.57%)      2.561 MB/s(342.83%)
       2         1.159 MB/s      4.780 MB/s(312.53%)      5.162 MB/s(345.46%)
       4         2.283 MB/s     10.266 MB/s(349.77%)     10.122 MB/s(343.46%)
       8         4.668 MB/s     19.040 MB/s(307.86%)     20.521 MB/s(339.59%)
      16         9.147 MB/s     38.904 MB/s(325.31%)     40.823 MB/s(346.29%)
      32        18.369 MB/s     79.587 MB/s(333.25%)     80.535 MB/s(338.42%)
      64        36.562 MB/s    148.668 MB/s(306.61%)    158.170 MB/s(332.60%)
     128        72.961 MB/s    274.913 MB/s(276.80%)    316.217 MB/s(333.41%)
     256       144.705 MB/s    512.059 MB/s(253.86%)    626.019 MB/s(332.62%)
     512       288.873 MB/s    884.977 MB/s(206.35%)   1221.596 MB/s(322.88%)
    1024       574.180 MB/s   1337.736 MB/s(132.98%)   2203.156 MB/s(283.70%)
    2048      1095.192 MB/s   1865.952 MB/s( 70.38%)   3036.448 MB/s(177.25%)
    4096      2066.157 MB/s   2380.337 MB/s( 15.21%)   3834.271 MB/s( 85.58%)
    8192      3717.198 MB/s   2733.073 MB/s(-26.47%)   4904.910 MB/s( 31.95%)
   16384      4742.221 MB/s   2958.693 MB/s(-37.61%)   5220.272 MB/s( 10.08%)
   32768      5349.550 MB/s   3061.285 MB/s(-42.77%)   5321.865 MB/s( -0.52%)
   65536      5162.919 MB/s   3731.408 MB/s(-27.73%)   5245.021 MB/s(  1.59%)
==== Latency ====
 MsgSize        Origin SMC              TCP                SMC with patches
       1        10.540 us     11.938 us( 13.26%)         10.356 us( -1.75%)
       2        10.996 us     11.992 us(  9.06%)         10.073 us( -8.39%)
       4        10.229 us     11.687 us( 14.25%)          9.996 us( -2.28%)
       8        10.203 us     11.653 us( 14.21%)         10.063 us( -1.37%)
      16        10.530 us     11.313 us(  7.44%)         10.013 us( -4.91%)
      32        10.241 us     11.586 us( 13.13%)         10.081 us( -1.56%)
      64        10.693 us     11.652 us(  8.97%)          9.986 us( -6.61%)
     128        10.597 us     11.579 us(  9.27%)         10.262 us( -3.16%)
     256        10.409 us     11.957 us( 14.87%)         10.148 us( -2.51%)
     512        11.088 us     12.505 us( 12.78%)         10.206 us( -7.95%)
    1024        11.240 us     12.255 us(  9.03%)         10.631 us( -5.42%)
    2048        11.485 us     16.970 us( 47.76%)         10.981 us( -4.39%)
    4096        12.077 us     13.948 us( 15.49%)         11.847 us( -1.90%)
    8192        13.683 us     16.693 us( 22.00%)         13.336 us( -2.54%)
   16384        16.470 us     23.615 us( 43.38%)         16.519 us(  0.30%)
   32768        22.540 us     40.966 us( 81.75%)         22.452 us( -0.39%)
   65536        34.192 us     73.003 us(113.51%)         33.916 us( -0.81%)

------------
Test environment notes:
1. Testing is run on 2 VMs within the same physical host
2. The NIC is ConnectX-4Lx, using SRIOV, and passing through 2 VFs to the
   2 VMs respectively.
3. To decrease jitter, VM's vCPU are binded to each physical CPU, and those
   physical CPUs are all isolated using boot parameter `isolcpus=xxx`
4. The queue number are set to 1, and interrupt from the queue is binded to
   CPU0 in the guest
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
Ido Schimmel says:

====================
HW counters for soft devices

Petr says:

Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as a VLAN. In this patch set, add the necessary
infrastructure to allow exposing these statistics to the offloaded
netdevice in question, and add mlxsw offload.

Across HW platforms, the counter itself very likely constitutes a limited
resource, and the act of counting may have a performance impact. Therefore
this patch set makes the HW statistics collection opt-in and togglable from
userspace on a per-netdevice basis.

Additionally, HW devices may have various limiting conditions under which
they can realize the counter. Therefore it is also possible to query
whether the requested counter is realized by any driver. In TC parlance,
which is to a degree reused in this patch set, two values are recognized:
"request" tracks whether the user enabled collecting HW statistics, and
"used" tracks whether any HW statistics are actually collected.

In the past, this author has expressed the opinion that `a typical user
doing "ip -s l sh", including various scripts, wants to see the full
picture and not worry what's going on where'. While that would be nice,
unfortunately it cannot work:

- Packets that trap from the HW datapath to the SW datapath would be
  double counted.

  For a given netdevice, some traffic can be purely a SW artifact, and some
  may flow through the HW object corresponding to the netdevice. But some
  traffic can also get trapped to the SW datapath after bumping the HW
  counter. It is not clear how to make sure double-counting does not occur
  in the SW datapath in that case, while still making sure that possibly
  divergent SW forwarding path gets bumped as appropriate.

  So simply adding HW and SW stats may work roughly, most of the time, but
  there are scenarios where the result is nonsensical.

- HW devices will have limitations as to what type of traffic they can
  count.

  In case of mlxsw, which is part of this patch set, there is no reasonable
  way to count all traffic going through a certain netdevice, such as a
  VLAN netdevice enslaved to a bridge. It is however very simple to count
  traffic flowing through an L3 object, such as a VLAN netdevice with an IP
  address.

  Similarly for physical netdevices, the L3 object at which the counter is
  installed is the subport carrying untagged traffic.

  These are not "just counters". It is important that the user understands
  what is being counted. It would be incorrect to conflate these statistics
  with another existing statistics suite.

To that end, this patch set introduces a statistics suite called "L3
stats". This label should make it easy to understand what is being counted,
and to decide whether a given device can or cannot implement this suite for
some type of netdevice. At the same time, the code is written to make
future extensions easy, should a device pop up that can implement a
different flavor of statistics suite (say L2, or an address-family-specific
suite).

For example, using a work-in-progress iproute2[1], to turn on and then list
the counters on a VLAN netdevice:

    # ip stats set dev swp1.200 l3_stats on
    # ip stats show dev swp1.200 group offload subgroup l3_stats
    56: swp1.200: group offload subgroup l3_stats on used on
	RX:  bytes packets errors dropped  missed   mcast
		0       0      0       0       0       0
	TX:  bytes packets errors dropped carrier collsns
		0       0      0       0       0       0

The patchset progresses as follows:

- Patch #1 is a cleanup.

- In patch #2, remove the assumption that all LINK_OFFLOAD_XSTATS are
  dev-backed.

  The only attribute defined under the nest is currently
  IFLA_OFFLOAD_XSTATS_CPU_HIT. L3_STATS differs from CPU_HIT in that the
  driver that supplies the statistics is not the same as the driver that
  implements the netdevice. Make the code compatible with this in patch #2.

- In patch #3, add the possibility to filter inside nests.

  The filter_mask field of RTM_GETSTATS header determines which
  top-level attributes should be included in the netlink response. This
  saves processing time by only including the bits that the user cares
  about instead of always dumping everything. This is doubly important
  for HW-backed statistics that would typically require a trip to the
  device to fetch the stats. In this patch, the UAPI is extended to
  allow filtering inside IFLA_STATS_LINK_OFFLOAD_XSTATS in particular,
  but the scheme is easily extensible to other nests as well.

- In patch #4, propagate extack where we need it.
  In patch #5, make it possible to propagate errors from drivers to the
  user.

- In patch #6, add the in-kernel APIs for keeping track of the new stats
  suite, and the notifiers that the core uses to communicate with the
  drivers.

- In patch #7, add UAPI for obtaining the new stats suite.

- In patch #8, add a new UAPI message, RTM_SETSTATS, which will carry
  the message to toggle the newly-added stats suite.
  In patch #9, add the toggle itself.

At this point the core is ready for drivers to add support for the new
stats suite.

- In patches #10, #11 and #12, apply small tweaks to mlxsw code.

- In patch #13, add support for L3 stats, which are realized as RIF
  counters.

- Finally in patch #14, a selftest is added to the net/forwarding
  directory. Technically this is a HW-specific test, in that without a HW
  implementing the counters, it just will not pass. But devices that
  support L3 statistics at all are likely to be able to reuse this
  selftest, so it seems appropriate to put it in the general forwarding
  directory.

We also have a netdevsim implementation, and a corresponding selftest that
verifies specifically some of the core code. We intend to contribute these
later. Interested parties can take a look at the raw code at [2].

[1] https://github.com/pmachata/iproute2/commits/soft_counters
[2] https://github.com/pmachata/linux_mlxsw/commits/petrm_soft_counters_2

v2:
- Patch #3:
    - Do not declare strict_start_type at the new policies, since they are
      used with nla_parse_nested() (sans _deprecated).
    - Use NLA_POLICY_NESTED to declare what the nest contents should be
    - Use NLA_POLICY_MASK instead of BITFIELD32 for the filtering
      attribute.
- Patch #6:
    - s/monotonous/monotonic/ in commit message
    - Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #7:
    - Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #8:
    - Do not declare strict_start_type at the new policies, since they are
      used with nla_parse_nested() (sans _deprecated).
- Patch #13:
    - Use a newly-added struct rtnl_hw_stats64 for stats transfer
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
Ido Schimmel says:

====================
selftests: mlxsw: A couple of fixes

Patch #1 fixes a breakage due to a change in iproute2 output. The real
problem is not iproute2, but the fact that the check was not strict
enough. Fixed by using JSON output instead. Targeting at net so that the
test will pass as part of old and new kernels regardless of iproute2
version.

Patch #2 fixes an issue uncovered by the first one.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
Andrii Nakryiko says:

====================

Add ability for user applications and libraries to register custom BPF program
SEC() handlers. See patch #2 for examples where this is useful.

Patch #1 does some preliminary refactoring to allow exponsing program
init, preload, and attach callbacks as public API. It also establishes
a protocol to allow optional auto-attach behavior. This will also help the
case of sometimes auto-attachable uprobes.

v4->v5:
  - API documentation improvements (Daniel);
v3->v4:
  - init_fn -> prog_setup_fn, preload_fn -> prog_prepare_load_fn (Alexei);
v2->v3:
  - moved callbacks and cookie into OPTS struct (Alan);
  - added more test scenarios (Alan);
  - address most of Alan's feedback, but kept API name;
v1->v2:
  - resubmitting due to git send-email screw up.

Cc: Alan Maguire <[email protected]>
====================

Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 18, 2022
This driver, like several others, uses a chained IRQ for each GPIO bank,
and forwards .irq_set_wake to the GPIO bank's upstream IRQ. As a result,
a call to irq_set_irq_wake() needs to lock both the upstream and
downstream irq_desc's. Lockdep considers this to be a possible deadlock
when the irq_desc's share lockdep classes, which they do by default:

 ============================================
 WARNING: possible recursive locking detected
 5.17.0-rc3-00394-gc849047c2473 #1 Not tainted
 --------------------------------------------
 init/307 is trying to acquire lock:
 c2dfe27c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0

 but task is already holding lock:
 c3c0ac7c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&irq_desc_lock_class);
   lock(&irq_desc_lock_class);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by init/307:
  #0: c1f29f18 (system_transition_mutex){+.+.}-{3:3}, at: __do_sys_reboot+0x90/0x23c
  #1: c20f7760 (&dev->mutex){....}-{3:3}, at: device_shutdown+0xf4/0x224
  #2: c2e804d8 (&dev->mutex){....}-{3:3}, at: device_shutdown+0x104/0x224
  #3: c3c0ac7c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0

 stack backtrace:
 CPU: 0 PID: 307 Comm: init Not tainted 5.17.0-rc3-00394-gc849047c2473 #1
 Hardware name: Allwinner sun8i Family
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x68/0x90
  dump_stack_lvl from __lock_acquire+0x1680/0x31a0
  __lock_acquire from lock_acquire+0x148/0x3dc
  lock_acquire from _raw_spin_lock_irqsave+0x50/0x6c
  _raw_spin_lock_irqsave from __irq_get_desc_lock+0x58/0xa0
  __irq_get_desc_lock from irq_set_irq_wake+0x2c/0x19c
  irq_set_irq_wake from irq_set_irq_wake+0x13c/0x19c
    [tail call from sunxi_pinctrl_irq_set_wake]
  irq_set_irq_wake from gpio_keys_suspend+0x80/0x1a4
  gpio_keys_suspend from gpio_keys_shutdown+0x10/0x2c
  gpio_keys_shutdown from device_shutdown+0x180/0x224
  device_shutdown from __do_sys_reboot+0x134/0x23c
  __do_sys_reboot from ret_fast_syscall+0x0/0x1c

However, this can never deadlock because the upstream and downstream
IRQs are never the same (nor do they even involve the same irqchip).

Silence this erroneous lockdep splat by applying what appears to be the
usual fix of moving the GPIO IRQs to separate lockdep classes.

Fixes: a59c99d ("pinctrl: sunxi: Forward calls to irq_set_irq_wake")
Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
Reviewed-by: Jernej Skrabec <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Linus Walleij <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 18, 2022
in tunnel mode, if outer interface(ipv4) is less, it is easily to let
inner IPV6 mtu be less than 1280. If so, a Packet Too Big ICMPV6 message
is received. When send again, packets are fragmentized with 1280, they
are still rejected with ICMPV6(Packet Too Big) by xfrmi_xmit2().

According to RFC4213 Section3.2.2:
if (IPv4 path MTU - 20) is less than 1280
	if packet is larger than 1280 bytes
		Send ICMPv6 "packet too big" with MTU=1280
                Drop packet
        else
		Encapsulate but do not set the Don't Fragment
                flag in the IPv4 header.  The resulting IPv4
                packet might be fragmented by the IPv4 layer
                on the encapsulator or by some router along
                the IPv4 path.
	endif
else
	if packet is larger than (IPv4 path MTU - 20)
        	Send ICMPv6 "packet too big" with
                MTU = (IPv4 path MTU - 20).
                Drop packet.
        else
                Encapsulate and set the Don't Fragment flag
                in the IPv4 header.
        endif
endif
Packets should be fragmentized with ipv4 outer interface, so change it.

After it is fragemtized with ipv4, there will be double fragmenation.
No.48 & No.51 are ipv6 fragment packets, No.48 is double fragmentized,
then tunneled with IPv4(No.49& No.50), which obey spec. And received peer
cannot decrypt it rightly.

48              2002::10        2002::11 1296(length) IPv6 fragment (off=0 more=y ident=0xa20da5bc nxt=50)
49   0x0000 (0) 2002::10        2002::11 1304         IPv6 fragment (off=0 more=y ident=0x7448042c nxt=44)
50   0x0000 (0) 2002::10        2002::11 200          ESP (SPI=0x00035000)
51              2002::10        2002::11 180          Echo (ping) request
52   0x56dc     2002::10        2002::11 248          IPv6 fragment (off=1232 more=n ident=0xa20da5bc nxt=50)

xfrm6_noneed_fragment has fixed above issues. Finally, it acted like below:
1   0x6206 192.168.1.138   192.168.1.1 1316 Fragmented IP protocol (proto=Encap Security Payload 50, off=0, ID=6206) [Reassembled in #2]
2   0x6206 2002::10        2002::11    88   IPv6 fragment (off=0 more=y ident=0x1f440778 nxt=50)
3   0x0000 2002::10        2002::11    248  ICMPv6    Echo (ping) request

Signed-off-by: Lina Wang <[email protected]>
Signed-off-by: Steffen Klassert <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 23, 2022
Petr Machata says:

====================
netdevsim: Support for L3 HW stats

"L3 stats" is a suite of interface statistics aimed at reflecting traffic
taking place in a HW device, on an object corresponding to some software
netdevice. Support for this stats suite has been added recently, in commit
ca0a53d ("Merge branch 'net-hw-counters-for-soft-devices'").

In this patch set:

- Patch #1 adds support for L3 stats to netdevsim.

  Real devices can have various conditions for when an L3 counter is
  available. To simulate this, netdevsim maintains a list of devices
  suitable for HW stats collection. Only when l3_stats is enabled on both a
  netdevice itself, and in netdevsim, will netdevsim contribute values to
  L3 stats.

  This enablement and disablement is done via debugfs:

    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/enable_ifindex
    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/disable_ifindex

  Besides this, there is a third toggle to mark a device for future failure:

    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/fail_next_enable

- This allows HW-independent testing of stats reporting and in-kernel APIs,
  as well as a test for enablement rollback, which is difficult to do
  otherwise. This netdevsim-specific selftest is added in patch #2.

- Patch #3 adds another driver-specific selftest, namely a test aimed at
  checking mlxsw-induced stats monitoring events.

====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 24, 2022
…k_under_node()

Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.

I remember talking to Michal in the past about removing
test_pages_in_a_zone(), which we use for:
* verifying that a memory block we intend to offline is really only managed
  by a single zone. We don't support offlining of memory blocks that are
  managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
* exposing that zone to user space via
  /sys/devices/system/memory/memory*/valid_zones

Now that I identified some more cases where test_pages_in_a_zone() might
go wrong, and we received an UBSAN report (see patch #3), let's get rid of
this PFN walker.

So instead of detecting the zone at runtime with test_pages_in_a_zone() by
scanning the memmap, let's determine and remember for each memory block if
it's managed by a single zone.  The stored zone can then be used for the
above two cases, avoiding a manual lookup using test_pages_in_a_zone().

This avoids eventually stumbling over uninitialized memmaps in corner
cases, especially when ZONE_DEVICE ranges partly fall into memory block
(that are responsible for managing System RAM).

Handling memory onlining is easy, because we online to exactly one zone.
Handling boot memory is more tricky, because we want to avoid scanning all
zones of all nodes to detect possible zones that overlap with the physical
memory region of interest.  Fortunately, we already have code that
determines the applicable nodes for a memory block, to create sysfs links
-- we'll hook into that.

Patch #1 is a simple cleanup I had laying around for a longer time.
Patch #2 contains the main logic to remove test_pages_in_a_zone() and
further details.

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]

This patch (of 2):

Let's adjust the stale terminology, making it match
unregister_memory_block_under_nodes() and
do_register_memory_block_under_node().  We're dealing with memory block
devices, which span 1..X memory sections.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Rafael Parra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 29, 2022
A missing bounds check in vm_access() can lead to an out-of-bounds read
or write in the adjacent memory area, since the len attribute is not
validated before the memcpy later in the function, potentially hitting:

[  183.637831] BUG: unable to handle page fault for address: ffffc90000c86000
[  183.637934] #PF: supervisor read access in kernel mode
[  183.637997] #PF: error_code(0x0000) - not-present page
[  183.638059] PGD 100000067 P4D 100000067 PUD 100258067 PMD 106341067 PTE 0
[  183.638144] Oops: 0000 [#2] PREEMPT SMP NOPTI
[  183.638201] CPU: 3 PID: 1790 Comm: poc Tainted: G      D           5.17.0-rc6-ci-drm-11296+ #1
[  183.638298] Hardware name: Intel Corporation CoffeeLake Client Platform/CoffeeLake H DDR4 RVP, BIOS CNLSFWR1.R00.X208.B00.1905301319 05/30/2019
[  183.638430] RIP: 0010:memcpy_erms+0x6/0x10
[  183.640213] RSP: 0018:ffffc90001763d48 EFLAGS: 00010246
[  183.641117] RAX: ffff888109c14000 RBX: ffff888111bece40 RCX: 0000000000000ffc
[  183.642029] RDX: 0000000000001000 RSI: ffffc90000c86000 RDI: ffff888109c14004
[  183.642946] RBP: 0000000000000ffc R08: 800000000000016b R09: 0000000000000000
[  183.643848] R10: ffffc90000c85000 R11: 0000000000000048 R12: 0000000000001000
[  183.644742] R13: ffff888111bed190 R14: ffff888109c14000 R15: 0000000000001000
[  183.645653] FS:  00007fe5ef807540(0000) GS:ffff88845b380000(0000) knlGS:0000000000000000
[  183.646570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  183.647481] CR2: ffffc90000c86000 CR3: 000000010ff02006 CR4: 00000000003706e0
[  183.648384] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  183.649271] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  183.650142] Call Trace:
[  183.650988]  <TASK>
[  183.651793]  vm_access+0x1f0/0x2a0 [i915]
[  183.652726]  __access_remote_vm+0x224/0x380
[  183.653561]  mem_rw.isra.0+0xf9/0x190
[  183.654402]  vfs_read+0x9d/0x1b0
[  183.655238]  ksys_read+0x63/0xe0
[  183.656065]  do_syscall_64+0x38/0xc0
[  183.656882]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  183.657663] RIP: 0033:0x7fe5ef725142
[  183.659351] RSP: 002b:00007ffe1e81c7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  183.660227] RAX: ffffffffffffffda RBX: 0000557055dfb780 RCX: 00007fe5ef725142
[  183.661104] RDX: 0000000000001000 RSI: 00007ffe1e81d880 RDI: 0000000000000005
[  183.661972] RBP: 00007ffe1e81e890 R08: 0000000000000030 R09: 0000000000000046
[  183.662832] R10: 0000557055dfc2e0 R11: 0000000000000246 R12: 0000557055dfb1c0
[  183.663691] R13: 00007ffe1e81e980 R14: 0000000000000000 R15: 0000000000000000

Changes since v1:
     - Updated if condition with range_overflows_t [Chris Wilson]

Fixes: 9f909e2 ("drm/i915: Implement vm_ops->access for gdb access into mmaps")
Signed-off-by: Mastan Katragadda <[email protected]>
Suggested-by: Adam Zabrocki <[email protected]>
Reported-by: Jackson Cody <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Jon Bloomfield <[email protected]>
Cc: Sudeep Dutt <[email protected]>
Cc: <[email protected]> # v5.8+
Reviewed-by: Matthew Auld <[email protected]>
[mauld: tidy up the commit message and add Cc: stable]
Signed-off-by: Matthew Auld <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 661412e)
Signed-off-by: Joonas Lahtinen <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 29, 2022
In remove_phb_dynamic() we use &phb->io_resource, after we've called
device_unregister(&host_bridge->dev). But the unregister may have freed
phb, because pcibios_free_controller_deferred() is the release function
for the host_bridge.

If there are no outstanding references when we call device_unregister()
then phb will be freed out from under us.

This has gone mainly unnoticed, but with slub_debug and page_poison
enabled it can lead to a crash:

  PID: 7574   TASK: c0000000d492cb80  CPU: 13  COMMAND: "drmgr"
   #0 [c0000000e4f075a0] crash_kexec at c00000000027d7dc
   #1 [c0000000e4f075d0] oops_end at c000000000029608
   #2 [c0000000e4f07650] __bad_page_fault at c0000000000904b4
   #3 [c0000000e4f076c0] do_bad_slb_fault at c00000000009a5a8
   #4 [c0000000e4f076f0] data_access_slb_common_virt at c000000000008b30
   Data SLB Access [380] exception frame:
   R0:  c000000000167250    R1:  c0000000e4f07a00    R2:  c000000002a46100
   R3:  c000000002b39ce8    R4:  00000000000000c0    R5:  00000000000000a9
   R6:  3894674d000000c0    R7:  0000000000000000    R8:  00000000000000ff
   R9:  0000000000000100    R10: 6b6b6b6b6b6b6b6b    R11: 0000000000008000
   R12: c00000000023da80    R13: c0000009ffd38b00    R14: 0000000000000000
   R15: 000000011c87f0f0    R16: 0000000000000006    R17: 0000000000000003
   R18: 0000000000000002    R19: 0000000000000004    R20: 0000000000000005
   R21: 000000011c87ede8    R22: 000000011c87c5a8    R23: 000000011c87d3a0
   R24: 0000000000000000    R25: 0000000000000001    R26: c0000000e4f07cc8
   R27: c00000004d1cc400    R28: c0080000031d00e8    R29: c00000004d23d800
   R30: c00000004d1d2400    R31: c00000004d1d2540
   NIP: c000000000167258    MSR: 8000000000009033    OR3: c000000000e9f474
   CTR: 0000000000000000    LR:  c000000000167250    XER: 0000000020040003
   CCR: 0000000024088420    MQ:  0000000000000000    DAR: 6b6b6b6b6b6b6ba3
   DSISR: c0000000e4f07920     Syscall Result: fffffffffffffff2
   [NIP  : release_resource+56]
   [LR   : release_resource+48]
   #5 [c0000000e4f07a00] release_resource at c000000000167258  (unreliable)
   #6 [c0000000e4f07a30] remove_phb_dynamic at c000000000105648
   #7 [c0000000e4f07ab0] dlpar_remove_slot at c0080000031a09e8 [rpadlpar_io]
   #8 [c0000000e4f07b50] remove_slot_store at c0080000031a0b9c [rpadlpar_io]
   #9 [c0000000e4f07be0] kobj_attr_store at c000000000817d8c
  #10 [c0000000e4f07c00] sysfs_kf_write at c00000000063e504
  #11 [c0000000e4f07c20] kernfs_fop_write_iter at c00000000063d868
  #12 [c0000000e4f07c70] new_sync_write at c00000000054339c
  #13 [c0000000e4f07d10] vfs_write at c000000000546624
  #14 [c0000000e4f07d60] ksys_write at c0000000005469f4
  #15 [c0000000e4f07db0] system_call_exception at c000000000030840
  #16 [c0000000e4f07e10] system_call_vectored_common at c00000000000c168

To avoid it, we can take a reference to the host_bridge->dev until we're
done using phb. Then when we drop the reference the phb will be freed.

Fixes: 2dd9c11 ("powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)")
Reported-by: David Dai <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
kernel-patches-bot pushed a commit that referenced this pull request Mar 29, 2022
The res is initialized here only if there's no errors so passing it to
ttm_resource_fini in the error paths results in a kernel oops. In the
error paths, instead of the unitialized res, we have to use to use
node->base on which ttm_resource_init was called.

Sample affected backtrace:
Unable to handle kernel NULL pointer dereference at virtual address 00000000000000d8
 Mem abort info:
   ESR = 0x96000004
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x04: level 0 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000106ac0000
 [00000000000000d8] pgd=0000000000000000, p4d=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 Modules linked in: bnep vsock_loopback vmw_vsock_virtio_transport_common
 vsock snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec
 snd_hwdep >
 CPU: 0 PID: 1197 Comm: gnome-shell Tainted: G    U  5.17.0-rc2-vmwgfx #2
 Hardware name: VMware, Inc. VBSA/VBSA, BIOS VEFI 12/31/2020
 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : ttm_resource_fini+0x5c/0xac [ttm]
 lr : ttm_range_man_alloc+0x128/0x1e0 [ttm]
 sp : ffff80000d783510
 x29: ffff80000d783510 x28: 0000000000000000 x27: ffff000086514400
 x26: 0000000000000300 x25: ffff0000809f9e78 x24: 0000000000000000
 x23: ffff80000d783680 x22: ffff000086514400 x21: 00000000ffffffe4
 x20: ffff80000d7836a0 x19: ffff0000809f9e00 x18: 0000000000000000
 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
 x14: 0000000000000000 x13: 0000000000000800 x12: ffff0000f2600a00
 x11: 000000000000fc96 x10: 0000000000000000 x9 : ffff800001295c18
 x8 : 0000000000000000 x7 : 0000000000000300 x6 : 0000000000000000
 x5 : 0000000000000000 x4 : ffff0000f1034e20 x3 : ffff0000f1034600
 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000600000
 Call trace:
  ttm_resource_fini+0x5c/0xac [ttm]
  ttm_range_man_alloc+0x128/0x1e0 [ttm]
  ttm_resource_alloc+0x58/0x90 [ttm]
  ttm_bo_mem_space+0xc8/0x3e4 [ttm]
  ttm_bo_validate+0xb4/0x134 [ttm]
  vmw_bo_pin_in_start_of_vram+0xbc/0x200 [vmwgfx]
  vmw_framebuffer_pin+0xc0/0x154 [vmwgfx]
  vmw_ldu_primary_plane_atomic_update+0x8c/0x6e0 [vmwgfx]
  drm_atomic_helper_commit_planes+0x11c/0x2e0
  drm_atomic_helper_commit_tail+0x60/0xb0
  commit_tail+0x1b0/0x210
  drm_atomic_helper_commit+0x168/0x400
  drm_atomic_commit+0x64/0x74
  drm_atomic_helper_set_config+0xdc/0x11c
  drm_mode_setcrtc+0x1c4/0x780
  drm_ioctl_kernel+0xd0/0x1a0
  drm_ioctl+0x2c4/0x690
  vmw_generic_ioctl+0xe0/0x174 [vmwgfx]
  vmw_unlocked_ioctl+0x24/0x30 [vmwgfx]
  __arm64_sys_ioctl+0xb4/0x100
  invoke_syscall+0x78/0x100
  el0_svc_common.constprop.0+0x54/0x184
  do_el0_svc+0x34/0x9c
  el0_svc+0x48/0x1b0
  el0t_64_sync_handler+0xa4/0x130
  el0t_64_sync+0x1a4/0x1a8
 Code: 35000260 f9401a81 52800002 f9403a60 (f9406c23)
 ---[ end trace 0000000000000000 ]---

Signed-off-by: Zack Rusin <[email protected]>
Fixes: de3688e ("drm/ttm: add ttm_resource_fini v2")
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Reviewed-by: Martin Krastev <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
kernel-patches-bot pushed a commit that referenced this pull request Mar 31, 2022
The per-channel data is available directly in the driver data struct. So
use it without making use of pwm_[gs]et_chip_data().

The relevant change introduced by this patch to lpc18xx_pwm_disable() at
the assembler level (for an arm lpc18xx_defconfig build) is:

	push    {r3, r4, r5, lr}
	mov     r4, r0
	mov     r0, r1
	mov     r5, r1
	bl      0 <pwm_get_chip_data>
	ldr     r3, [r0, #0]

changes to

	ldr     r3, [r1, #8]
	push    {r4, lr}
	add.w   r3, r0, r3, lsl #2
	ldr     r3, [r3, #92]   ; 0x5c

So this reduces stack usage, has an improved runtime behavior because of
better pipeline usage, doesn't branch to an external function and the
generated code is a bit smaller occupying less memory.

The codesize of lpc18xx_pwm_probe() is reduced by 32 bytes.

Signed-off-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Thierry Reding <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Apr 5, 2022
There are some issues in parse_num_list():

First, the end variable is assigned twice when parsing_end is true, it is
unnecessary.

Second, the function does not check that parsing_end is false after parsing
argument. Thus, if the final part of the argument is something like '4-',
parse_num_list() will discard it instead of returning -EINVAL.

Clean up parse_num_list() and fix these issues.

Before:

 $ ./test_progs -n 2,4-
 #2 atomic_bounds:OK
 Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

After:

 $ ./test_progs -n 2,4-
 Failed to parse test numbers.

Signed-off-by: Yuntao Wang <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 25, 2024
Case #1:
SQLite App		GC Thread		Kworker		Shrinker
- f2fs_ioc_start_atomic_write

- f2fs_ioc_commit_atomic_write
 - f2fs_commit_atomic_write
  - filemap_write_and_wait_range
  : write atomic_file's data to cow_inode
								echo 3 > drop_caches
								to drop atomic_file's
								cache.
			- f2fs_gc
			 - gc_data_segment
			  - move_data_page
			   - set_page_dirty

						- writepages
						 - f2fs_do_write_data_page
						 : overwrite atomic_file's data
						   to cow_inode
  - f2fs_down_write(&fi->i_gc_rwsem[WRITE])
  - __f2fs_commit_atomic_write
  - f2fs_up_write(&fi->i_gc_rwsem[WRITE])

Case #2:
SQLite App		GC Thread		Kworker
- f2fs_ioc_start_atomic_write

						- __writeback_single_inode
						 - do_writepages
						  - f2fs_write_cache_pages
						   - f2fs_write_single_data_page
						    - f2fs_do_write_data_page
						    : write atomic_file's data to cow_inode
			- f2fs_gc
			 - gc_data_segment
			  - move_data_page
			   - set_page_dirty

						- writepages
						 - f2fs_do_write_data_page
						 : overwrite atomic_file's data to cow_inode
- f2fs_ioc_commit_atomic_write

In above cases racing in between atomic_write and GC, previous
data in atomic_file may be overwrited to cow_file, result in
data corruption.

This patch introduces PAGE_PRIVATE_ATOMIC_WRITE bit flag in page.private,
and use it to indicate that there is last dirty data in atomic file,
and the data should be writebacked into cow_file, if the flag is not
tagged in page, we should never write data across files.

Fixes: 3db1de0 ("f2fs: change the current atomic write way")
Cc: Daeho Jeong <[email protected]>
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 30, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

v2: with kdoc fixes per Paolo Abeni.

The following patchset contains Netfilter fixes for net:

Patch #1 and #2 handle an esoteric scenario: Given two tasks sending UDP
packets to one another, two packets of the same flow in each direction
handled by different CPUs that result in two conntrack objects in NEW
state, where reply packet loses race. Then, patch #3 adds a testcase for
this scenario. Series from Florian Westphal.

1) NAT engine can falsely detect a port collision if it happens to pick
   up a reply packet as NEW rather than ESTABLISHED. Add extra code to
   detect this and suppress port reallocation in this case.

2) To complete the clash resolution in the reply direction, extend conntrack
   logic to detect clashing conntrack in the reply direction to existing entry.

3) Adds a test case.

Then, an assorted list of fixes follow:

4) Add a selftest for tproxy, from Antonio Ojea.

5) Guard ctnetlink_*_size() functions under
   #if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || defined(CONFIG_NF_CONNTRACK_EVENTS)
   From Andy Shevchenko.

6) Use -m socket --transparent in iptables tproxy documentation.
   From XIE Zhibang.

7) Call kfree_rcu() when releasing flowtable hooks to address race with
   netlink dump path, from Phil Sutter.

8) Fix compilation warning in nf_reject with CONFIG_BRIDGE_NETFILTER=n.
   From Simon Horman.

9) Guard ctnetlink_label_size() under CONFIG_NF_CONNTRACK_EVENTS which
   is its only user, to address a compilation warning. From Simon Horman.

10) Use rcu-protected list iteration over basechain hooks from netlink
    dump path.

11) Fix memcg for nf_tables, use GFP_KERNEL_ACCOUNT is not complete.

12) Remove old nfqueue conntrack clash resolution. Instead trying to
    use same destination address consistently which requires double DNAT,
    use the existing clash resolution which allows clashing packets
    go through with different destination. Antonio Ojea originally
    reported an issue from the postrouting chain, I proposed a fix:
    https://lore.kernel.org/netfilter-devel/ZuwSwAqKgCB2a51-@calendula/T/
    which he reported it did not work for him.

13) Adds a selftest for patch 12.

14) Fixes ipvs.sh selftest.

netfilter pull request 24-09-26

* tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: Avoid hanging ipvs.sh
  kselftest: add test for nfqueue induced conntrack race
  netfilter: nfnetlink_queue: remove old clash resolution logic
  netfilter: nf_tables: missing objects with no memcg accounting
  netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
  netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
  netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
  netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
  docs: tproxy: ignore non-transparent sockets in iptables
  netfilter: ctnetlink: Guard possible unused functions
  selftests: netfilter: nft_tproxy.sh: add tcp tests
  selftests: netfilter: add reverse-clash resolution test case
  netfilter: conntrack: add clash resolution for reverse collisions
  netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 4, 2024
The following calculation used in coalesced_mmio_has_room() to check
whether the ring buffer is full is wrong and results in premature exits if
the start of the valid entries is in the first half of the ring buffer.

  avail = (ring->first - last - 1) % KVM_COALESCED_MMIO_MAX;
  if (avail == 0)
	  /* full */

Because negative values are handled using two's complement, and KVM
computes the result as an unsigned value, the above will get a false
positive if "first < last" and the ring is half-full.

The above might have worked as expected in python for example:
  >>> (-86) % 170
  84

However it doesn't work the same way in C.

  printf("avail: %d\n", (-86) % 170);
  printf("avail: %u\n", (-86) % 170);
  printf("avail: %u\n", (-86u) % 170u);

Using gcc-11 these print:

  avail: -86
  avail: 4294967210
  avail: 0

For illustration purposes, given a 4-bit integer and a ring size of 0xA
(unsigned), 0xA == 0x1010 == -6, and thus (-6u % 0xA) == 0.

Fix the calculation and allow all but one entries in the buffer to be
used as originally intended.

Note, KVM's behavior is self-healing to some extent, as KVM will allow the
entire buffer to be used if ring->first is beyond the halfway point.  In
other words, in the unlikely scenario that a use case benefits from being
able to coalesce more than 86 entries at once, KVM will still provide such
behavior, sometimes.

Note #2, the % operator in C is not the modulo operator but the remainder
operator. Modulo and remainder operators differ with respect to negative
values.  But, the relevant values in KVM are all unsigned, so it's a moot
point in this case anyway.

Note #3, this is almost a pure revert of the buggy commit, plus a
READ_ONCE() to provide additional safety.  Thue buggy commit justified the
change with "it paves the way for making this function lockless", but it's
not at all clear what was intended, nor is there any evidence that the
buggy code was somehow safer.  (a) the fields in question were already
accessed locklessly, from the perspective that they could be modified by
userspace at any time, and (b) the lock guarding the ring itself was
changed, but never dropped, i.e. whatever lockless scheme (SRCU?) was
planned never landed.

Fixes: 105f8d4 ("KVM: Calculate available entries in coalesced mmio ring")
Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Paul Durrant <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rework changelog to clarify behavior, call out weirdness of buggy commit]
Signed-off-by: Sean Christopherson <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 4, 2024
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
on x86 due to a chain of locks and SRCU synchronizations.  Translating the
below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
fairness of r/w semaphores).

    CPU0                     CPU1                     CPU2
1   lock(&kvm->slots_lock);
2                                                     lock(&vcpu->mutex);
3                                                     lock(&kvm->srcu);
4                            lock(cpu_hotplug_lock);
5                            lock(kvm_lock);
6                            lock(&kvm->slots_lock);
7                                                     lock(cpu_hotplug_lock);
8   sync(&kvm->srcu);

Note, there are likely more potential deadlocks in KVM x86, e.g. the same
pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
__kvmclock_cpufreq_notifier():

  cpuhp_cpufreq_online()
  |
  -> cpufreq_online()
     |
     -> cpufreq_gov_performance_limits()
        |
        -> __cpufreq_driver_target()
           |
           -> __target_index()
              |
              -> cpufreq_freq_transition_begin()
                 |
                 -> cpufreq_notify_transition()
                    |
                    -> ... __kvmclock_cpufreq_notifier()

But, actually triggering such deadlocks is beyond rare due to the
combination of dependencies and timings involved.  E.g. the cpufreq
notifier is only used on older CPUs without a constant TSC, mucking with
the NX hugepage mitigation while VMs are running is very uncommon, and
doing so while also onlining/offlining a CPU (necessary to generate
contention on cpu_hotplug_lock) would be even more unusual.

The most robust solution to the general cpu_hotplug_lock issue is likely
to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
notifier doesn't to take kvm_lock.  For now, settle for fixing the most
blatant deadlock, as switching to an RCU-protected list is a much more
involved change, but add a comment in locking.rst to call out that care
needs to be taken when walking holding kvm_lock and walking vm_list.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
  ------------------------------------------------------
  tee/35048 is trying to acquire lock:
  ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]

  but task is already holding lock:
  ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]

  which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

  -> #3 (kvm_lock){+.+.}-{3:3}:
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         kvm_dev_ioctl+0x4fb/0xe50 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #2 (cpu_hotplug_lock){++++}-{0:0}:
         cpus_read_lock+0x2e/0xb0
         static_key_slow_inc+0x16/0x30
         kvm_lapic_set_base+0x6a/0x1c0 [kvm]
         kvm_set_apic_base+0x8f/0xe0 [kvm]
         kvm_set_msr_common+0x9ae/0xf80 [kvm]
         vmx_set_msr+0xa54/0xbe0 [kvm_intel]
         __kvm_set_msr+0xb6/0x1a0 [kvm]
         kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
         kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #1 (&kvm->srcu){.+.+}-{0:0}:
         __synchronize_srcu+0x44/0x1a0
         synchronize_srcu_expedited+0x21/0x30
         kvm_swap_active_memslots+0x110/0x1c0 [kvm]
         kvm_set_memslot+0x360/0x620 [kvm]
         __kvm_set_memory_region+0x27b/0x300 [kvm]
         kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
         kvm_vm_ioctl+0x295/0x650 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
         __lock_acquire+0x15ef/0x2e30
         lock_acquire+0xe0/0x260
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         set_nx_huge_pages+0x179/0x1e0 [kvm]
         param_attr_store+0x93/0x100
         module_attr_store+0x22/0x40
         sysfs_kf_write+0x81/0xb0
         kernfs_fop_write_iter+0x133/0x1d0
         vfs_write+0x28d/0x380
         ksys_write+0x70/0xe0
         __x64_sys_write+0x1f/0x30
         x64_sys_call+0x281b/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: Chao Gao <[email protected]>
Fixes: 0bf5049 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Cc: [email protected]
Reviewed-by: Kai Huang <[email protected]>
Acked-by: Kai Huang <[email protected]>
Tested-by: Farrah Chen <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 9, 2024
Hou Tao says:

====================
Check the remaining info_cnt before repeating btf fields

From: Hou Tao <[email protected]>

Hi,

The patch set adds the missed check again info_cnt when flattening the
array of nested struct. The problem was spotted when developing dynptr
key support for hash map. Patch #1 adds the missed check and patch #2
adds three success test cases and one failure test case for the problem.

Comments are always welcome.

Change Log:
v2:
 * patch #1: check info_cnt in btf_repeat_fields()
 * patch #2: use a hard-coded number instead of BTF_FIELDS_MAX, because
             BTF_FIELDS_MAX is not always available in vmlinux.h (e.g.,
	     for llvm 17/18)

v1: https://lore.kernel.org/bpf/[email protected]/T/#t
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Tariq Toukan says:

====================
net/mlx5: hw counters refactor

This is a patchset re-post, see:
https://lore.kernel.org/[email protected]

In this patchset, Cosmin refactors hw counters and solves perf scaling
issue.

Series generated against:
commit c824deb ("cxgb4: clip_tbl: Fix spelling mistake "wont" -> "won't"")

HW counters are central to mlx5 driver operations. They are hardware
objects created and used alongside most steering operations, and queried
from a variety of places. Most counters are queried in bulk from a
periodic task in fs_counters.c.

Counter performance is important and as such, a variety of improvements
have been done over the years. Currently, counters are allocated from
pools, which are bulk allocated to amortize the cost of firmware
commands. Counters are managed through an IDR, a doubly linked list and
two atomic single linked lists. Adding/removing counters is a complex
dance between user contexts requesting it and the mlx5_fc_stats_work
task which does most of the work.

Under high load (e.g. from connection tracking flow insertion/deletion),
the counter code becomes a bottleneck, as seen on flame graphs. Whenever
a counter is deleted, it gets added to a list and the wq task is
scheduled to run immediately to actually delete it. This is done via
mod_delayed_work which uses an internal spinlock. In some tests, waiting
for this spinlock took up to 66% of all samples.

This series refactors the counter code to use a more straight-forward
approach, avoiding the mod_delayed_work problem and making the code
easier to understand. For that:

- patch #1 moves counters data structs to a more appropriate place.
- patch #2 simplifies the bulk query allocation scheme by using vmalloc.
- patch #3 replaces the IDR+3 lists with an xarray. This is the main
  patch of the series, solving the spinlock congestion issue.
- patch #4 removes an unnecessary cacheline alignment causing a lot of
  memory to be wasted.
- patches #5 and #6 are small cleanups enabled by the refactoring.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Fix a kernel panic in the br_netfilter module when sending untagged
traffic via a VxLAN device.
This happens during the check for fragmentation in br_nf_dev_queue_xmit.

It is dependent on:
1) the br_netfilter module being loaded;
2) net.bridge.bridge-nf-call-iptables set to 1;
3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port;
4) untagged frames with size higher than the VxLAN MTU forwarded/flooded

When forwarding the untagged packet to the VxLAN bridge port, before
the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and
changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type
of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL.

Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check
for frames that needs to be fragmented: frames with higher MTU than the
VxLAN device end up calling br_nf_ip_fragment, which in turns call
ip_skb_dst_mtu.

The ip_dst_mtu tries to use the skb_dst(skb) as if it was a valid dst
with valid dst->dev, thus the crash.

This case was never supported in the first place, so drop the packet
instead.

PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data.
[  176.291791] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000110
[  176.292101] Mem abort info:
[  176.292184]   ESR = 0x0000000096000004
[  176.292322]   EC = 0x25: DABT (current EL), IL = 32 bits
[  176.292530]   SET = 0, FnV = 0
[  176.292709]   EA = 0, S1PTW = 0
[  176.292862]   FSC = 0x04: level 0 translation fault
[  176.293013] Data abort info:
[  176.293104]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  176.293488]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  176.293787]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000
[  176.294166] [0000000000000110] pgd=0000000000000000,
p4d=0000000000000000
[  176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth
br_netfilter bridge stp llc ipv6 crct10dif_ce
[  176.295923] CPU: 0 PID: 188 Comm: ping Not tainted
6.8.0-rc3-g5b3fbd61b9d1 #2
[  176.296314] Hardware name: linux,dummy-virt (DT)
[  176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[  176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter]
[  176.297636] sp : ffff800080003630
[  176.297743] x29: ffff800080003630 x28: 0000000000000008 x27:
ffff6828c49ad9f8
[  176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24:
00000000000003e8
[  176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21:
ffff6828c3b16d28
[  176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18:
0000000000000014
[  176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15:
0000000095744632
[  176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12:
ffffb7e137926a70
[  176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 :
0000000000000000
[  176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 :
f20e0100bebafeca
[  176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 :
0000000000000000
[  176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 :
ffff6828c7f918f0
[  176.300889] Call trace:
[  176.301123]  br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.301411]  br_nf_post_routing+0x2a8/0x3e4 [br_netfilter]
[  176.301703]  nf_hook_slow+0x48/0x124
[  176.302060]  br_forward_finish+0xc8/0xe8 [bridge]
[  176.302371]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.302605]  br_nf_forward_finish+0x118/0x22c [br_netfilter]
[  176.302824]  br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter]
[  176.303136]  br_nf_forward+0x2b8/0x4e0 [br_netfilter]
[  176.303359]  nf_hook_slow+0x48/0x124
[  176.303803]  __br_forward+0xc4/0x194 [bridge]
[  176.304013]  br_flood+0xd4/0x168 [bridge]
[  176.304300]  br_handle_frame_finish+0x1d4/0x5c4 [bridge]
[  176.304536]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.304978]  br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter]
[  176.305188]  br_nf_pre_routing+0x250/0x524 [br_netfilter]
[  176.305428]  br_handle_frame+0x244/0x3cc [bridge]
[  176.305695]  __netif_receive_skb_core.constprop.0+0x33c/0xecc
[  176.306080]  __netif_receive_skb_one_core+0x40/0x8c
[  176.306197]  __netif_receive_skb+0x18/0x64
[  176.306369]  process_backlog+0x80/0x124
[  176.306540]  __napi_poll+0x38/0x17c
[  176.306636]  net_rx_action+0x124/0x26c
[  176.306758]  __do_softirq+0x100/0x26c
[  176.307051]  ____do_softirq+0x10/0x1c
[  176.307162]  call_on_irq_stack+0x24/0x4c
[  176.307289]  do_softirq_own_stack+0x1c/0x2c
[  176.307396]  do_softirq+0x54/0x6c
[  176.307485]  __local_bh_enable_ip+0x8c/0x98
[  176.307637]  __dev_queue_xmit+0x22c/0xd28
[  176.307775]  neigh_resolve_output+0xf4/0x1a0
[  176.308018]  ip_finish_output2+0x1c8/0x628
[  176.308137]  ip_do_fragment+0x5b4/0x658
[  176.308279]  ip_fragment.constprop.0+0x48/0xec
[  176.308420]  __ip_finish_output+0xa4/0x254
[  176.308593]  ip_finish_output+0x34/0x130
[  176.308814]  ip_output+0x6c/0x108
[  176.308929]  ip_send_skb+0x50/0xf0
[  176.309095]  ip_push_pending_frames+0x30/0x54
[  176.309254]  raw_sendmsg+0x758/0xaec
[  176.309568]  inet_sendmsg+0x44/0x70
[  176.309667]  __sys_sendto+0x110/0x178
[  176.309758]  __arm64_sys_sendto+0x28/0x38
[  176.309918]  invoke_syscall+0x48/0x110
[  176.310211]  el0_svc_common.constprop.0+0x40/0xe0
[  176.310353]  do_el0_svc+0x1c/0x28
[  176.310434]  el0_svc+0x34/0xb4
[  176.310551]  el0t_64_sync_handler+0x120/0x12c
[  176.310690]  el0t_64_sync+0x190/0x194
[  176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860)
[  176.315743] ---[ end trace 0000000000000000 ]---
[  176.316060] Kernel panic - not syncing: Oops: Fatal exception in
interrupt
[  176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000
[  176.316564] PHYS_OFFSET: 0xffff97d780000000
[  176.316782] CPU features: 0x0,88000203,3c020000,0100421b
[  176.317210] Memory Limit: none
[  176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal
Exception in interrupt ]---\

Fixes: 11538d0 ("bridge: vlan dst_metadata hooks in ingress and egress paths")
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Andy Roulin <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Andy Roulin says:

====================
netfilter: br_netfilter: fix panic with metadata_dst skb

There's a kernel panic possible in the br_netfilter module when sending
untagged traffic via a VxLAN device. Traceback is included below.
This happens during the check for fragmentation in br_nf_dev_queue_xmit
if the MTU on the VxLAN device is not big enough.

It is dependent on:
1) the br_netfilter module being loaded;
2) net.bridge.bridge-nf-call-iptables set to 1;
3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port;
4) untagged frames with size higher than the VxLAN MTU forwarded/flooded

This case was never supported in the first place, so the first patch drops
such packets.

A regression selftest is added as part of the second patch.

PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data.
[  176.291791] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000110
[  176.292101] Mem abort info:
[  176.292184]   ESR = 0x0000000096000004
[  176.292322]   EC = 0x25: DABT (current EL), IL = 32 bits
[  176.292530]   SET = 0, FnV = 0
[  176.292709]   EA = 0, S1PTW = 0
[  176.292862]   FSC = 0x04: level 0 translation fault
[  176.293013] Data abort info:
[  176.293104]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  176.293488]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  176.293787]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000
[  176.294166] [0000000000000110] pgd=0000000000000000,
p4d=0000000000000000
[  176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth
br_netfilter bridge stp llc ipv6 crct10dif_ce
[  176.295923] CPU: 0 PID: 188 Comm: ping Not tainted
6.8.0-rc3-g5b3fbd61b9d1 #2
[  176.296314] Hardware name: linux,dummy-virt (DT)
[  176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[  176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter]
[  176.297636] sp : ffff800080003630
[  176.297743] x29: ffff800080003630 x28: 0000000000000008 x27:
ffff6828c49ad9f8
[  176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24:
00000000000003e8
[  176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21:
ffff6828c3b16d28
[  176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18:
0000000000000014
[  176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15:
0000000095744632
[  176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12:
ffffb7e137926a70
[  176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 :
0000000000000000
[  176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 :
f20e0100bebafeca
[  176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 :
0000000000000000
[  176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 :
ffff6828c7f918f0
[  176.300889] Call trace:
[  176.301123]  br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.301411]  br_nf_post_routing+0x2a8/0x3e4 [br_netfilter]
[  176.301703]  nf_hook_slow+0x48/0x124
[  176.302060]  br_forward_finish+0xc8/0xe8 [bridge]
[  176.302371]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.302605]  br_nf_forward_finish+0x118/0x22c [br_netfilter]
[  176.302824]  br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter]
[  176.303136]  br_nf_forward+0x2b8/0x4e0 [br_netfilter]
[  176.303359]  nf_hook_slow+0x48/0x124
[  176.303803]  __br_forward+0xc4/0x194 [bridge]
[  176.304013]  br_flood+0xd4/0x168 [bridge]
[  176.304300]  br_handle_frame_finish+0x1d4/0x5c4 [bridge]
[  176.304536]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.304978]  br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter]
[  176.305188]  br_nf_pre_routing+0x250/0x524 [br_netfilter]
[  176.305428]  br_handle_frame+0x244/0x3cc [bridge]
[  176.305695]  __netif_receive_skb_core.constprop.0+0x33c/0xecc
[  176.306080]  __netif_receive_skb_one_core+0x40/0x8c
[  176.306197]  __netif_receive_skb+0x18/0x64
[  176.306369]  process_backlog+0x80/0x124
[  176.306540]  __napi_poll+0x38/0x17c
[  176.306636]  net_rx_action+0x124/0x26c
[  176.306758]  __do_softirq+0x100/0x26c
[  176.307051]  ____do_softirq+0x10/0x1c
[  176.307162]  call_on_irq_stack+0x24/0x4c
[  176.307289]  do_softirq_own_stack+0x1c/0x2c
[  176.307396]  do_softirq+0x54/0x6c
[  176.307485]  __local_bh_enable_ip+0x8c/0x98
[  176.307637]  __dev_queue_xmit+0x22c/0xd28
[  176.307775]  neigh_resolve_output+0xf4/0x1a0
[  176.308018]  ip_finish_output2+0x1c8/0x628
[  176.308137]  ip_do_fragment+0x5b4/0x658
[  176.308279]  ip_fragment.constprop.0+0x48/0xec
[  176.308420]  __ip_finish_output+0xa4/0x254
[  176.308593]  ip_finish_output+0x34/0x130
[  176.308814]  ip_output+0x6c/0x108
[  176.308929]  ip_send_skb+0x50/0xf0
[  176.309095]  ip_push_pending_frames+0x30/0x54
[  176.309254]  raw_sendmsg+0x758/0xaec
[  176.309568]  inet_sendmsg+0x44/0x70
[  176.309667]  __sys_sendto+0x110/0x178
[  176.309758]  __arm64_sys_sendto+0x28/0x38
[  176.309918]  invoke_syscall+0x48/0x110
[  176.310211]  el0_svc_common.constprop.0+0x40/0xe0
[  176.310353]  do_el0_svc+0x1c/0x28
[  176.310434]  el0_svc+0x34/0xb4
[  176.310551]  el0t_64_sync_handler+0x120/0x12c
[  176.310690]  el0t_64_sync+0x190/0x194
[  176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860)
[  176.315743] ---[ end trace 0000000000000000 ]---
[  176.316060] Kernel panic - not syncing: Oops: Fatal exception in
interrupt
[  176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000
[  176.316564] PHYS_OFFSET: 0xffff97d780000000
[  176.316782] CPU features: 0x0,88000203,3c020000,0100421b
[  176.317210] Memory Limit: none
[  176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal
Exception in interrupt ]---\
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Daniel Machon says:

====================
net: sparx5: prepare for lan969x switch driver

== Description:

This series is the first of a multi-part series, that prepares and adds
support for the new lan969x switch driver.

The upstreaming efforts is split into multiple series (might change a
bit as we go along):

    1) Prepare the Sparx5 driver for lan969x (this series)
    2) Add support lan969x (same basic features as Sparx5 provides +
       RGMII, excl.  FDMA and VCAP)
    3) Add support for lan969x FDMA
    4) Add support for lan969x VCAP

== Lan969x in short:

The lan969x Ethernet switch family [1] provides a rich set of
switching features and port configurations (up to 30 ports) from 10Mbps
to 10Gbps, with support for RGMII, SGMII, QSGMII, USGMII, and USXGMII,
ideal for industrial & process automation infrastructure applications,
transport, grid automation, power substation automation, and ring &
intra-ring topologies. The LAN969x family is hardware and software
compatible and scalable supporting 46Gbps to 102Gbps switch bandwidths.

== Preparing Sparx5 for lan969x:

The lan969x switch chip reuses many of the IP's of the Sparx5 switch
chip, therefore it has been decided to add support through the existing
Sparx5 driver, in order to avoid a bunch of duplicate code. However, in
order to reuse the Sparx5 switch driver, we have to introduce some
mechanisms to handle the chip differences that are there.  These
mechanisms are:

    - Platform match data to contain all the differences that needs to
      be handled (constants, ops etc.)

    - Register macro indirection layer so that we can reuse the existing
      register macros.

    - Function for branching out on platform type where required.

In some places we ops out functions and in other places we branch on the
chip type. Exactly when we choose one over the other, is an estimate in
each case.

After this series is applied, the Sparx5 driver will be prepared for
lan969x and still function exactly as before.

== Patch breakdown:

Patch #1        adds private match data

Patch #2        adds register macro indirection layer

Patch #3-#4     does some preparation work

Patch #5-#7     adds chip constants and updates the code to use them

Patch #8-#13    adds and uses ops for handling functions differently on the
                two platforms.

Patch #14       adds and uses a macro for branching out on the chip type.

Patch #15 (NEW) redefines macros for internal ports and PGID's.

[1] https://www.microchip.com/en-us/product/lan9698

To: David S. Miller <[email protected]>
To: Eric Dumazet <[email protected]>
To: Jakub Kicinski <[email protected]>
To: Paolo Abeni <[email protected]>
To: Lars Povlsen <[email protected]>
To: Steen Hegelund <[email protected]>
To: [email protected]
To: [email protected]
To: [email protected]
To: Richard Cochran <[email protected]>
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Signed-off-by: Daniel Machon <[email protected]>
====================

Link: https://patch.msgid.link/20241004-b4-sparx5-lan969x-switch-driver-v2-0-d3290f581663@microchip.com
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Eric Dumazet says:

====================
net: remove RTNL from fib_seq_sum()

This series is inspired by a syzbot report showing
rtnl contention and one thread blocked in:

7 locks held by syz-executor/10835:
  #0: ffff888033390420 (sb_writers#8){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2931 [inline]
  #0: ffff888033390420 (sb_writers#8){.+.+}-{0:0}, at: vfs_write+0x224/0xc90 fs/read_write.c:679
  #1: ffff88806df6bc88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x1ea/0x500 fs/kernfs/file.c:325
  #2: ffff888026fcf3c8 (kn->active#50){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x20e/0x500 fs/kernfs/file.c:326
  #3: ffffffff8f56f848 (nsim_bus_dev_list_lock){+.+.}-{3:3}, at: new_device_store+0x1b4/0x890 drivers/net/netdevsim/bus.c:166
  #4: ffff88805e0140e8 (&dev->mutex){....}-{3:3}, at: device_lock include/linux/device.h:1014 [inline]
  #4: ffff88805e0140e8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x8e/0x520 drivers/base/dd.c:1005
  #5: ffff88805c5fb250 (&devlink->lock_key#55){+.+.}-{3:3}, at: nsim_drv_probe+0xcb/0xb80 drivers/net/netdevsim/dev.c:1534
  #6: ffffffff8fcd1748 (rtnl_mutex){+.+.}-{3:3}, at: fib_seq_sum+0x31/0x290 net/core/fib_notifier.c:46
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Danielle Ratson says:

====================
ethtool: Add support for writing firmware

In the CMIS specification for pluggable modules, LPL (Local Payload) and
EPL (Extended Payload) are two types of data payloads used for managing
various functions and features of the module.

EPL payloads are used for more complex and extensive management functions
that require a larger amount of data, so writing firmware blocks using EPL
is much more efficient.

Currently, only LPL payload is supported for writing firmware blocks to
the module.

Add support for writing firmware block using EPL payload, both to support
modules that support only EPL write mechanism, and to optimize the flashing
process of modules that support LPL and EPL.

Running the flashing command on the same sample module using EPL vs. LPL
showed an improvement of 84%.

Patchset overview:
Patch #1: preparations
Patch #2: Add EPL support

v5: Resending- no changes.

v4: Resending the right version after wrong v3.
    No changes from v2.

v2:
	* Fix the commit meassges to align the cover letter about the
	  right meaning of LPL and EPL.
	Patch #2:
	* Initialize the variable 'bytes_written' before the first
	  iteration.
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 21, 2024
On the node of an NFS client, some files saved in the mountpoint of the
NFS server were copied to another location of the same NFS server.
Accidentally, the nfs42_complete_copies() got a NULL-pointer dereference
crash with the following syslog:

[232064.838881] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232064.839360] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232066.588183] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[232066.588586] Mem abort info:
[232066.588701]   ESR = 0x0000000096000007
[232066.588862]   EC = 0x25: DABT (current EL), IL = 32 bits
[232066.589084]   SET = 0, FnV = 0
[232066.589216]   EA = 0, S1PTW = 0
[232066.589340]   FSC = 0x07: level 3 translation fault
[232066.589559] Data abort info:
[232066.589683]   ISV = 0, ISS = 0x00000007
[232066.589842]   CM = 0, WnR = 0
[232066.589967] user pgtable: 64k pages, 48-bit VAs, pgdp=00002000956ff400
[232066.590231] [0000000000000058] pgd=08001100ae100003, p4d=08001100ae100003, pud=08001100ae100003, pmd=08001100b3c00003, pte=0000000000000000
[232066.590757] Internal error: Oops: 96000007 [#1] SMP
[232066.590958] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm vhost_net vhost vhost_iotlb tap tun ipt_rpfilter xt_multiport ip_set_hash_ip ip_set_hash_net xfrm_interface xfrm6_tunnel tunnel4 tunnel6 esp4 ah4 wireguard libcurve25519_generic veth xt_addrtype xt_set nf_conntrack_netlink ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_filter sch_ingress nfnetlink_cttimeout vport_gre ip_gre ip_tunnel gre vport_geneve geneve vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conncount dm_round_robin dm_service_time dm_multipath xt_nat xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink ocfs2 ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif nbd overlay 8021q garp mrp bonding tls rfkill sunrpc ext4 mbcache jbd2
[232066.591052]  vfat fat cas_cache cas_disk ses enclosure scsi_transport_sas sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ip_tables vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse xfs libcrc32c ast drm_vram_helper qla2xxx drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops cec sha256_arm64 sha1_ce drm_ttm_helper ttm nvme_fc igb sbsa_gwdt nvme_fabrics drm nvme_core i2c_algo_bit i40e scsi_transport_fc megaraid_sas aes_neon_bs
[232066.596953] CPU: 6 PID: 4124696 Comm: 10.253.166.125- Kdump: loaded Not tainted 5.15.131-9.cl9_ocfs2.aarch64 #1
[232066.597356] Hardware name: Great Wall .\x93\x8e...RF6260 V5/GWMSSE2GL1T, BIOS T656FBE_V3.0.18 2024-01-06
[232066.597721] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[232066.598034] pc : nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.598327] lr : nfs4_reclaim_open_state+0x12c/0x800 [nfsv4]
[232066.598595] sp : ffff8000f568fc70
[232066.598731] x29: ffff8000f568fc70 x28: 0000000000001000 x27: ffff21003db33000
[232066.599030] x26: ffff800005521ae0 x25: ffff0100f98fa3f0 x24: 0000000000000001
[232066.599319] x23: ffff800009920008 x22: ffff21003db33040 x21: ffff21003db33050
[232066.599628] x20: ffff410172fe9e40 x19: ffff410172fe9e00 x18: 0000000000000000
[232066.599914] x17: 0000000000000000 x16: 0000000000000004 x15: 0000000000000000
[232066.600195] x14: 0000000000000000 x13: ffff800008e685a8 x12: 00000000eac0c6e6
[232066.600498] x11: 0000000000000000 x10: 0000000000000008 x9 : ffff8000054e5828
[232066.600784] x8 : 00000000ffffffbf x7 : 0000000000000001 x6 : 000000000a9eb14a
[232066.601062] x5 : 0000000000000000 x4 : ffff70ff8a14a800 x3 : 0000000000000058
[232066.601348] x2 : 0000000000000001 x1 : 54dce46366daa6c6 x0 : 0000000000000000
[232066.601636] Call trace:
[232066.601749]  nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.601998]  nfs4_do_reclaim+0x1b8/0x28c [nfsv4]
[232066.602218]  nfs4_state_manager+0x928/0x10f0 [nfsv4]
[232066.602455]  nfs4_run_state_manager+0x78/0x1b0 [nfsv4]
[232066.602690]  kthread+0x110/0x114
[232066.602830]  ret_from_fork+0x10/0x20
[232066.602985] Code: 1400000d f9403f20 f9402e61 91016003 (f9402c00)
[232066.603284] SMP: stopping secondary CPUs
[232066.606936] Starting crashdump kernel...
[232066.607146] Bye!

Analysing the vmcore, we know that nfs4_copy_state listed by destination
nfs_server->ss_copies was added by the field copies in handle_async_copy(),
and we found a waiting copy process with the stack as:
PID: 3511963  TASK: ffff710028b47e00  CPU: 0   COMMAND: "cp"
 #0 [ffff8001116ef740] __switch_to at ffff8000081b92f4
 #1 [ffff8001116ef760] __schedule at ffff800008dd0650
 #2 [ffff8001116ef7c0] schedule at ffff800008dd0a00
 #3 [ffff8001116ef7e0] schedule_timeout at ffff800008dd6aa0
 #4 [ffff8001116ef860] __wait_for_common at ffff800008dd166c
 #5 [ffff8001116ef8e0] wait_for_completion_interruptible at ffff800008dd1898
 #6 [ffff8001116ef8f0] handle_async_copy at ffff8000055142f4 [nfsv4]
 #7 [ffff8001116ef970] _nfs42_proc_copy at ffff8000055147c8 [nfsv4]
 #8 [ffff8001116efa80] nfs42_proc_copy at ffff800005514cf0 [nfsv4]
 #9 [ffff8001116efc50] __nfs4_copy_file_range.constprop.0 at ffff8000054ed694 [nfsv4]

The NULL-pointer dereference was due to nfs42_complete_copies() listed
the nfs_server->ss_copies by the field ss_copies of nfs4_copy_state.
So the nfs4_copy_state address ffff0100f98fa3f0 was offset by 0x10 and
the data accessed through this pointer was also incorrect. Generally,
the ordered list nfs4_state_owner->so_states indicate open(O_RDWR) or
open(O_WRITE) states are reclaimed firstly by nfs4_reclaim_open_state().
When destination state reclaim is failed with NFS_STATE_RECOVERY_FAILED
and copies are not deleted in nfs_server->ss_copies, the source state
may be passed to the nfs42_complete_copies() process earlier, resulting
in this crash scene finally. To solve this issue, we add a list_head
nfs_server->ss_src_copies for a server-to-server copy specially.

Fixes: 0e65a32 ("NFS: handle source server reboot")
Signed-off-by: Yanjun Zhang <[email protected]>
Reviewed-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 21, 2024
Syzkaller reported a lockdep splat:

  ============================================
  WARNING: possible recursive locking detected
  6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
  --------------------------------------------
  syz-executor364/5113 is trying to acquire lock:
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  but task is already holding lock:
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(k-slock-AF_INET);
    lock(k-slock-AF_INET);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  7 locks held by syz-executor364/5113:
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  stack backtrace:
  CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:93 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
   check_deadlock kernel/locking/lockdep.c:3061 [inline]
   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
   __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
   mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
   subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
   tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
   tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
   ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
   __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
   process_backlog+0x662/0x15b0 net/core/dev.c:6108
   __napi_poll+0xcb/0x490 net/core/dev.c:6772
   napi_poll net/core/dev.c:6841 [inline]
   net_rx_action+0x89b/0x1240 net/core/dev.c:6963
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   do_softirq+0x11b/0x1e0 kernel/softirq.c:455
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
   local_bh_enable include/linux/bottom_half.h:33 [inline]
   rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
   __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
   dev_queue_xmit include/linux/netdevice.h:3105 [inline]
   neigh_hh_output include/net/neighbour.h:526 [inline]
   neigh_output include/net/neighbour.h:540 [inline]
   ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
   ip_local_out net/ipv4/ip_output.c:129 [inline]
   __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
   __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
   tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
   tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
   tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
   sk_backlog_rcv include/net/sock.h:1111 [inline]
   __release_sock+0x214/0x350 net/core/sock.c:3004
   release_sock+0x61/0x1f0 net/core/sock.c:3558
   mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
   mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x1a6/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
   ___sys_sendmsg net/socket.c:2651 [inline]
   __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
   __do_sys_sendmmsg net/socket.c:2766 [inline]
   __se_sys_sendmmsg net/socket.c:2763 [inline]
   __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f04fb13a6b9
  Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
  RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
  RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port based signal endpoint.

Such connection will be never accepted; many of them can make the
listener queue full and preventing the creation of MPJ subflow via
such listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Cc: [email protected]
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 21, 2024
Viktor Malik says:

====================
selftests/bpf: Improve building with extra

When trying to build BPF selftests with additional compiler and linker
flags, we're running into multiple problems. This series addresses all
of them:

- CFLAGS are not passed to sub-makes of bpftool and libbpf. This is a
  problem when compiling with PIE as libbpf.a ends up being non-PIE and
  cannot be linked with other binaries (patch #1).

- bpftool Makefile runs `llvm-config --cflags` and appends the result to
  CFLAGS. The result typically contains `-D_GNU_SOURCE` which may be
  already set in CFLAGS. That causes a compilation error (patch #2).

- Some GCC flags are not supported by Clang but there are binaries which
  are always built with Clang but reuse user-defined CFLAGS. When CFLAGS
  contain such flags, compilation fails (patch #3).

Changelog:
----------
v1 -> v2:
- cover forgotten case in patch#1 (noted by Eduard)
- remove -D_GNU_SOURCE unconditionally in patch#2 (suggested by Andrii)
- rewrite patch#3 to just add -Wno-unused-command-line-argument
  (suggested by Andrii)
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 24, 2024
Andrii Nakryiko says:

====================
Fix libbpf's bpf_object and BPF subskel interoperability

Fix libbpf's global data map mmap()'ing logic to make BPF objects loaded
through generic bpf_object__load() API interoperable with BPF subskeleton
instantiated from such BPF object. The issue is in re-mmap()'ing of global
data maps after BPF object is loaded into kernel, which is currently done in
BPF skeleton-specific code, and should instead be done in generic and common
bpf_object_load() logic.

See patch #2 for the fix, patch #3 for the selftests.  Patch #1 is preliminary
fix for existing spin_lock selftests which currently works by accident.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 24, 2024
Hou Tao says:

====================
Add the missing BPF_LINK_TYPE invocation for sockmap

From: Hou Tao <[email protected]>

Hi,

The tiny patch set fixes the out-of-bound read problem when reading the
fdinfo of sock map link fd. And in order to spot such omission early for
the newly-added link type in the future, it also checks the validity of
the link->type and adds a WARN_ONCE() for missed invocation.

Please see individual patches for more details. And comments are always
welcome.

v3:
  * patch #2: check and warn the validity of link->type instead of
    adding a static assertion for bpf_link_type_strs array.

v2: http://lore.kernel.org/bpf/[email protected]
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 25, 2024
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.12, take #2

- Fix the guest view of the ID registers, making the relevant fields
  writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1)

- Correcly expose S1PIE to guests, fixing a regression introduced
  in 6.12-rc1 with the S1POE support

- Fix the recycling of stage-2 shadow MMUs by tracking the context
  (are we allowed to block or not) as well as the recycling state

- Address a couple of issues with the vgic when userspace misconfigures
  the emulation, resulting in various splats. Headaches courtesy
  of our Syzkaller friends
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 25, 2024
Fix possible use-after-free in 'taprio_dump()' by adding RCU
read-side critical section there. Never seen on x86 but
found on a KASAN-enabled arm64 system when investigating
https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa:

[T15862] BUG: KASAN: slab-use-after-free in taprio_dump+0xa0c/0xbb0
[T15862] Read of size 4 at addr ffff0000d4bb88f8 by task repro/15862
[T15862]
[T15862] CPU: 0 UID: 0 PID: 15862 Comm: repro Not tainted 6.11.0-rc1-00293-gdefaf1a2113a-dirty #2
[T15862] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-20240524-5.fc40 05/24/2024
[T15862] Call trace:
[T15862]  dump_backtrace+0x20c/0x220
[T15862]  show_stack+0x2c/0x40
[T15862]  dump_stack_lvl+0xf8/0x174
[T15862]  print_report+0x170/0x4d8
[T15862]  kasan_report+0xb8/0x1d4
[T15862]  __asan_report_load4_noabort+0x20/0x2c
[T15862]  taprio_dump+0xa0c/0xbb0
[T15862]  tc_fill_qdisc+0x540/0x1020
[T15862]  qdisc_notify.isra.0+0x330/0x3a0
[T15862]  tc_modify_qdisc+0x7b8/0x1838
[T15862]  rtnetlink_rcv_msg+0x3c8/0xc20
[T15862]  netlink_rcv_skb+0x1f8/0x3d4
[T15862]  rtnetlink_rcv+0x28/0x40
[T15862]  netlink_unicast+0x51c/0x790
[T15862]  netlink_sendmsg+0x79c/0xc20
[T15862]  __sock_sendmsg+0xe0/0x1a0
[T15862]  ____sys_sendmsg+0x6c0/0x840
[T15862]  ___sys_sendmsg+0x1ac/0x1f0
[T15862]  __sys_sendmsg+0x110/0x1d0
[T15862]  __arm64_sys_sendmsg+0x74/0xb0
[T15862]  invoke_syscall+0x88/0x2e0
[T15862]  el0_svc_common.constprop.0+0xe4/0x2a0
[T15862]  do_el0_svc+0x44/0x60
[T15862]  el0_svc+0x50/0x184
[T15862]  el0t_64_sync_handler+0x120/0x12c
[T15862]  el0t_64_sync+0x190/0x194
[T15862]
[T15862] Allocated by task 15857:
[T15862]  kasan_save_stack+0x3c/0x70
[T15862]  kasan_save_track+0x20/0x3c
[T15862]  kasan_save_alloc_info+0x40/0x60
[T15862]  __kasan_kmalloc+0xd4/0xe0
[T15862]  __kmalloc_cache_noprof+0x194/0x334
[T15862]  taprio_change+0x45c/0x2fe0
[T15862]  tc_modify_qdisc+0x6a8/0x1838
[T15862]  rtnetlink_rcv_msg+0x3c8/0xc20
[T15862]  netlink_rcv_skb+0x1f8/0x3d4
[T15862]  rtnetlink_rcv+0x28/0x40
[T15862]  netlink_unicast+0x51c/0x790
[T15862]  netlink_sendmsg+0x79c/0xc20
[T15862]  __sock_sendmsg+0xe0/0x1a0
[T15862]  ____sys_sendmsg+0x6c0/0x840
[T15862]  ___sys_sendmsg+0x1ac/0x1f0
[T15862]  __sys_sendmsg+0x110/0x1d0
[T15862]  __arm64_sys_sendmsg+0x74/0xb0
[T15862]  invoke_syscall+0x88/0x2e0
[T15862]  el0_svc_common.constprop.0+0xe4/0x2a0
[T15862]  do_el0_svc+0x44/0x60
[T15862]  el0_svc+0x50/0x184
[T15862]  el0t_64_sync_handler+0x120/0x12c
[T15862]  el0t_64_sync+0x190/0x194
[T15862]
[T15862] Freed by task 6192:
[T15862]  kasan_save_stack+0x3c/0x70
[T15862]  kasan_save_track+0x20/0x3c
[T15862]  kasan_save_free_info+0x4c/0x80
[T15862]  poison_slab_object+0x110/0x160
[T15862]  __kasan_slab_free+0x3c/0x74
[T15862]  kfree+0x134/0x3c0
[T15862]  taprio_free_sched_cb+0x18c/0x220
[T15862]  rcu_core+0x920/0x1b7c
[T15862]  rcu_core_si+0x10/0x1c
[T15862]  handle_softirqs+0x2e8/0xd64
[T15862]  __do_softirq+0x14/0x20

Fixes: 18cdd2f ("net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex")
Acked-by: Vinicius Costa Gomes <[email protected]>
Signed-off-by: Dmitry Antipov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 30, 2024
Hou Tao says:

====================
The patch set fixes several issues in bits iterator. Patch #1 fixes the
kmemleak problem of bits iterator. Patch #2~#3 fix the overflow problem
of nr_bits. Patch #4 fixes the potential stack corruption when bits
iterator is used on 32-bit host. Patch #5 adds more test cases for bits
iterator.

Please see the individual patches for more details. And comments are
always welcome.
---
v4:
 * patch #1: add ack from Yafang
 * patch #3: revert code-churn like changes:
   (1) compute nr_bytes and nr_bits before the check of nr_words.
   (2) use nr_bits == 64 to check for single u64, preventing build
       warning on 32-bit hosts.
 * patch #4: use "BITS_PER_LONG == 32" instead of "!defined(CONFIG_64BIT)"

v3: https://lore.kernel.org/bpf/[email protected]/T/#t
  * split the bits-iterator related patches from "Misc fixes for bpf"
    patch set
  * patch #1: use "!nr_bits || bits >= nr_bits" to stop the iteration
  * patch #2: add a new helper for the overflow problem
  * patch #3: decrease the limitation from 512 to 511 and check whether
    nr_bytes is too large for bpf memory allocator explicitly
  * patch #5: add two more test cases for bit iterator

v2: http://lore.kernel.org/bpf/[email protected]
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 1, 2024
Petr Machata says:

====================
mlxsw: Fixes

In this patchset:

- Tx header should be pushed for each packet which is transmitted via
  Spectrum ASICs. Patch #1 adds a missing call to skb_cow_head() to make
  sure that there is both enough room to push the Tx header and that the
  SKB header is not cloned and can be modified.

- Commit b5b60bb ("mlxsw: pci: Use page pool for Rx buffers
  allocation") converted mlxsw to use page pool for Rx buffers allocation.
  Sync for CPU and for device should be done for Rx pages. In patches #2
  and #3, add the missing calls to sync pages for, respectively, CPU and
  the device.

- Patch #4 then fixes a bug to IPv6 GRE forwarding offload. Patch #5 adds
  a generic forwarding test that fails with mlxsw ports prior to the fix.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 4, 2024
Petr Machata says:

====================
selftests: net: Introduce deferred commands

Recently, a defer helper was added to Python selftests. The idea is to keep
cleanup commands close to their dirtying counterparts, thereby making it
more transparent what is cleaning up what, making it harder to miss a
cleanup, and make the whole cleanup business exception safe. All these
benefits are applicable to bash as well, exception safety can be
interpreted in terms of safety vs. a SIGINT.

This patchset therefore introduces a framework of several helpers that
serve to schedule cleanups in bash selftests.

- Patch #1 has more details about the primitives being introduced.
  Patch #2 adds a fallback cleanup() function to lib.sh, because ideally
  selftests wouldn't need to introduce a dedicated cleanup function at all.

- Patch #3 adds a parameter to stop_traffic(), which makes it possible to
  start other background processes after the traffic is started without
  confusing the cleanup.

- Patches #4 to #10 convert a number of selftests.

  The goal was to convert all tests that use start_traffic / stop_traffic
  to the defer framework. Leftover traffic generators are a particularly
  painful sort of a missed cleanup. Normal unfinished cleanups can usually
  be cleaned up simply by rerunning the test and interrupting it early to
  let the cleanups run again / in full. This does not work with
  stop_traffic, because it is only issued at the end of the test case that
  starts the traffic. At the same time, leftover traffic generators
  influence follow-up test runs, and are hard to notice.

  The tests were however converted whole-sale, not just their traffic bits.
  Thus they form a proof of concept of the defer framework.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 4, 2024
Yafang Shao says:

====================
Add noinline_for_tracing and apply it to tcp_drop_reason

This patchset introduces a new compiler annotation, noinline_for_tracing,
designed to prevent specific functions from being inlined to facilitate
tracing. In Patch #2, this annotation is applied to the tcp_drop_reason().
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 5, 2024
Currently, when configuring TMU (Time Management Unit) mode of a given
router, we take into account only its own TMU requirements ignoring
other routers in the domain. This is problematic if the router we are
configuring has lower TMU requirements than what is already configured
in the domain.

In the scenario below, we have a host router with two USB4 ports: A and
B. Port A connected to device router #1 (which supports CL states) and
existing DisplayPort tunnel, thus, the TMU mode is HiFi uni-directional.

1. Initial topology

          [Host]
         A/
         /
 [Device #1]
   /
Monitor

2. Plug in device #2 (that supports CL states) to downstream port B of
   the host router

         [Host]
        A/    B\
        /       \
 [Device #1]    [Device #2]
   /
Monitor

The TMU mode on port B and port A will be configured to LowRes which is
not what we want and will cause monitor to start flickering.

To address this we first scan the domain and search for any router
configured to HiFi uni-directional mode, and if found, configure TMU
mode of the given router to HiFi uni-directional as well.

Cc: [email protected]
Signed-off-by: Gil Fine <[email protected]>
Signed-off-by: Mika Westerberg <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 5, 2024
Running rcutorture scenario TREE05, the below warning is triggered.

[   32.604594] WARNING: suspicious RCU usage
[   32.605928] 6.11.0-rc5-00040-g4ba4f1afb6a9 #55238 Not tainted
[   32.607812] -----------------------------
[   32.609140] kernel/events/core.c:13946 RCU-list traversed in non-reader section!!
[   32.611595] other info that might help us debug this:
[   32.614247] rcu_scheduler_active = 2, debug_locks = 1
[   32.616392] 3 locks held by cpuhp/4/35:
[   32.617687]  #0: ffffffffb666a650 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[   32.620563]  #1: ffffffffb666cd20 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[   32.623412]  #2: ffffffffb677c288 (pmus_lock){+.+.}-{3:3}, at: perf_event_exit_cpu_context+0x32/0x2f0

In perf_event_clear_cpumask(), uses list_for_each_entry_rcu() without an
obvious RCU read-side critical section.

Either pmus_srcu or pmus_lock is good enough to protect the pmus list.
In the current context, pmus_lock is already held. The
list_for_each_entry_rcu() is not required.

Fixes: 4ba4f1a ("perf: Generic hotplug support for a PMU with a scope")
Closes: https://lore.kernel.org/lkml/2b66dff8-b827-494b-b151-1ad8d56f13e6@paulmck-laptop/
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Reported-by: "Paul E. McKenney" <[email protected]>
Reported-by: kernel test robot <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: "Paul E. McKenney" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 5, 2024
generic/077 on x86_32 CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y with highmem,
on huge=always tmpfs, issues a warning and then hangs (interruptibly):

WARNING: CPU: 5 PID: 3517 at mm/highmem.c:622 kunmap_local_indexed+0x62/0xc9
CPU: 5 UID: 0 PID: 3517 Comm: cp Not tainted 6.12.0-rc4 #2
...
copy_page_from_iter_atomic+0xa6/0x5ec
generic_perform_write+0xf6/0x1b4
shmem_file_write_iter+0x54/0x67

Fix copy_page_from_iter_atomic() by limiting it in that case
(include/linux/skbuff.h skb_frag_must_loop() does similar).

But going forward, perhaps CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is too
surprising, has outlived its usefulness, and should just be removed?

Fixes: 908a1ad ("iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()")
Signed-off-by: Hugh Dickins <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 5, 2024
When we compile and load lib/slub_kunit.c,it will cause a panic.

The root cause is that __kmalloc_cache_noprof was directly called instead
of kmem_cache_alloc,which resulted in no alloc_tag being allocated.This
caused current->alloc_tag to be null,leading to a null pointer dereference
in alloc_tag_ref_set.

Despite the fact that my colleague Pei Xiao will later fix the code in
slub_kunit.c,we still need fix null pointer check logic for ref and tag to
avoid panic caused by a null pointer dereference.

Here is the log for the panic:

[   74.779373][ T2158] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
[   74.780130][ T2158] Mem abort info:
[   74.780406][ T2158]   ESR = 0x0000000096000004
[   74.780756][ T2158]   EC = 0x25: DABT (current EL), IL = 32 bits
[   74.781225][ T2158]   SET = 0, FnV = 0
[   74.781529][ T2158]   EA = 0, S1PTW = 0
[   74.781836][ T2158]   FSC = 0x04: level 0 translation fault
[   74.782288][ T2158] Data abort info:
[   74.782577][ T2158]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[   74.783068][ T2158]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   74.783533][ T2158]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   74.784010][ T2158] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000105f34000
[   74.784586][ T2158] [0000000000000020] pgd=0000000000000000, p4d=0000000000000000
[   74.785293][ T2158] Internal error: Oops: 0000000096000004 [#1] SMP
[   74.785805][ T2158] Modules linked in: slub_kunit kunit ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle 4
[   74.790661][ T2158] CPU: 0 UID: 0 PID: 2158 Comm: kunit_try_catch Kdump: loaded Tainted: G        W        N 6.12.0-rc3+ #2
[   74.791535][ T2158] Tainted: [W]=WARN, [N]=TEST
[   74.791889][ T2158] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[   74.792479][ T2158] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   74.793101][ T2158] pc : alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.793607][ T2158] lr : alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.794095][ T2158] sp : ffff800084d33cd0
[   74.794418][ T2158] x29: ffff800084d33cd0 x28: 0000000000000000 x27: 0000000000000000
[   74.795095][ T2158] x26: 0000000000000000 x25: 0000000000000012 x24: ffff80007b30e314
[   74.795822][ T2158] x23: ffff000390ff6f10 x22: 0000000000000000 x21: 0000000000000088
[   74.796555][ T2158] x20: ffff000390285840 x19: fffffd7fc3ef7830 x18: ffffffffffffffff
[   74.797283][ T2158] x17: ffff8000800e63b4 x16: ffff80007b33afc4 x15: ffff800081654c00
[   74.798011][ T2158] x14: 0000000000000000 x13: 205d383531325420 x12: 5b5d383734363537
[   74.798744][ T2158] x11: ffff800084d337e0 x10: 000000000000005d x9 : 00000000ffffffd0
[   74.799476][ T2158] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008219d188 x6 : c0000000ffff7fff
[   74.800206][ T2158] x5 : ffff0003fdbc9208 x4 : ffff800081edd188 x3 : 0000000000000001
[   74.800932][ T2158] x2 : 0beaa6dee1ac5a00 x1 : 0beaa6dee1ac5a00 x0 : ffff80037c2cb000
[   74.801656][ T2158] Call trace:
[   74.801954][ T2158]  alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.802494][ T2158]  __kmalloc_cache_noprof+0x148/0x33c
[   74.802976][ T2158]  test_kmalloc_redzone_access+0x4c/0x104 [slub_kunit]
[   74.803607][ T2158]  kunit_try_run_case+0x70/0x17c [kunit]
[   74.804124][ T2158]  kunit_generic_run_threadfn_adapter+0x2c/0x4c [kunit]
[   74.804768][ T2158]  kthread+0x10c/0x118
[   74.805141][ T2158]  ret_from_fork+0x10/0x20
[   74.805540][ T2158] Code: b9400a80 11000400 b9000a80 97ffd858 (f94012d3)
[   74.806176][ T2158] SMP: stopping secondary CPUs
[   74.808130][ T2158] Starting crashdump kernel...

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e0a955b ("mm/codetag: add pgalloc_tag_copy()")
Signed-off-by: Hao Ge <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
Suggested-by: Suren Baghdasaryan <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 8, 2024
When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
and bpf program has owned a raw_spinlock under a interrupt handler,
which results in invalid lock acquire context.

[ BUG: Invalid wait context ]
6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
ffff8880261e7a00 (&trie->lock){....}-{3:3},
at: trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
other info that might help us debug this:
context-{3:3}
5 locks held by swapper/0/0:
 #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
at: vp_vring_interrupt drivers/virtio/virtio_pci_common.c:80 [inline]
 #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
at: vp_interrupt+0x142/0x200 drivers/virtio/virtio_pci_common.c:113
 #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
at: spin_lock include/linux/spinlock.h:351 [inline]
 #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
at: stats_request+0x6f/0x230 drivers/virtio/virtio_balloon.c:438
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: __queue_work+0x199/0xf50 kernel/workqueue.c:2259
 #3: ffff8880b863dd18 (&pool->lock){-.-.}-{2:2},
at: __queue_work+0x759/0xf50
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: bpf_trace_run1+0x1d6/0x520 kernel/trace/bpf_trace.c:2380
stack backtrace:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
6.12.0-rc5-next-20241031-syzkaller #0
Hardware name: Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
 check_wait_context kernel/locking/lockdep.c:4898 [inline]
 __lock_acquire+0x15a8/0x2100 kernel/locking/lockdep.c:5176
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
 trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
 bpf_prog_2c29ac5cdc6b1842+0x43/0x47
 bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
 __bpf_prog_run include/linux/filter.h:701 [inline]
 bpf_prog_run include/linux/filter.h:708 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
 bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
 trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
 __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
 queue_work_on+0x1c2/0x380 kernel/workqueue.c:2390
 queue_work include/linux/workqueue.h:662 [inline]
 stats_request+0x1a3/0x230 drivers/virtio/virtio_balloon.c:441
 vring_interrupt+0x21d/0x380 drivers/virtio/virtio_ring.c:2595
 vp_vring_interrupt drivers/virtio/virtio_pci_common.c:82 [inline]
 vp_interrupt+0x192/0x200 drivers/virtio/virtio_pci_common.c:113
 __handle_irq_event_percpu+0x29a/0xa80 kernel/irq/handle.c:158
 handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
 handle_irq_event+0x89/0x1f0 kernel/irq/handle.c:210
 handle_fasteoi_irq+0x48a/0xae0 kernel/irq/chip.c:720
 generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
 handle_irq arch/x86/kernel/irq.c:247 [inline]
 call_irq_handler arch/x86/kernel/irq.c:259 [inline]
 __common_interrupt+0x136/0x230 arch/x86/kernel/irq.c:285
 common_interrupt+0xb4/0xd0 arch/x86/kernel/irq.c:278
 </IRQ>

Reported-by: [email protected]
Closes: https://lore.kernel.org/bpf/[email protected]/
Fixes: 66150d0 ("bpf, lpm: Make locking RT friendly")
Signed-off-by: Kunwu Chan <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 8, 2024
When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
and bpf program has owned a raw_spinlock under a interrupt handler,
which results in invalid lock acquire context.

[ BUG: Invalid wait context ]
6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
ffff8880261e7a00 (&trie->lock){....}-{3:3},
at: trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
other info that might help us debug this:
context-{3:3}
5 locks held by swapper/0/0:
 #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
at: vp_vring_interrupt drivers/virtio/virtio_pci_common.c:80 [inline]
 #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
at: vp_interrupt+0x142/0x200 drivers/virtio/virtio_pci_common.c:113
 #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
at: spin_lock include/linux/spinlock.h:351 [inline]
 #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
at: stats_request+0x6f/0x230 drivers/virtio/virtio_balloon.c:438
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: __queue_work+0x199/0xf50 kernel/workqueue.c:2259
 #3: ffff8880b863dd18 (&pool->lock){-.-.}-{2:2},
at: __queue_work+0x759/0xf50
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: bpf_trace_run1+0x1d6/0x520 kernel/trace/bpf_trace.c:2380
stack backtrace:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
6.12.0-rc5-next-20241031-syzkaller #0
Hardware name: Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
 check_wait_context kernel/locking/lockdep.c:4898 [inline]
 __lock_acquire+0x15a8/0x2100 kernel/locking/lockdep.c:5176
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
 trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
 bpf_prog_2c29ac5cdc6b1842+0x43/0x47
 bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
 __bpf_prog_run include/linux/filter.h:701 [inline]
 bpf_prog_run include/linux/filter.h:708 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
 bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
 trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
 __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
 queue_work_on+0x1c2/0x380 kernel/workqueue.c:2390
 queue_work include/linux/workqueue.h:662 [inline]
 stats_request+0x1a3/0x230 drivers/virtio/virtio_balloon.c:441
 vring_interrupt+0x21d/0x380 drivers/virtio/virtio_ring.c:2595
 vp_vring_interrupt drivers/virtio/virtio_pci_common.c:82 [inline]
 vp_interrupt+0x192/0x200 drivers/virtio/virtio_pci_common.c:113
 __handle_irq_event_percpu+0x29a/0xa80 kernel/irq/handle.c:158
 handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
 handle_irq_event+0x89/0x1f0 kernel/irq/handle.c:210
 handle_fasteoi_irq+0x48a/0xae0 kernel/irq/chip.c:720
 generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
 handle_irq arch/x86/kernel/irq.c:247 [inline]
 call_irq_handler arch/x86/kernel/irq.c:259 [inline]
 __common_interrupt+0x136/0x230 arch/x86/kernel/irq.c:285
 common_interrupt+0xb4/0xd0 arch/x86/kernel/irq.c:278
 </IRQ>

Reported-by: [email protected]
Closes: https://lore.kernel.org/bpf/[email protected]/
Fixes: 66150d0 ("bpf, lpm: Make locking RT friendly")
Signed-off-by: Kunwu Chan <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 8, 2024
When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
and bpf program has owned a raw_spinlock under a interrupt handler,
which results in invalid lock acquire context.

[ BUG: Invalid wait context ]
6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
ffff8880261e7a00 (&trie->lock){....}-{3:3},
at: trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
other info that might help us debug this:
context-{3:3}
5 locks held by swapper/0/0:
 #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
at: vp_vring_interrupt drivers/virtio/virtio_pci_common.c:80 [inline]
 #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
at: vp_interrupt+0x142/0x200 drivers/virtio/virtio_pci_common.c:113
 #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
at: spin_lock include/linux/spinlock.h:351 [inline]
 #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
at: stats_request+0x6f/0x230 drivers/virtio/virtio_balloon.c:438
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: __queue_work+0x199/0xf50 kernel/workqueue.c:2259
 #3: ffff8880b863dd18 (&pool->lock){-.-.}-{2:2},
at: __queue_work+0x759/0xf50
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: bpf_trace_run1+0x1d6/0x520 kernel/trace/bpf_trace.c:2380
stack backtrace:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
6.12.0-rc5-next-20241031-syzkaller #0
Hardware name: Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
 check_wait_context kernel/locking/lockdep.c:4898 [inline]
 __lock_acquire+0x15a8/0x2100 kernel/locking/lockdep.c:5176
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
 trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
 bpf_prog_2c29ac5cdc6b1842+0x43/0x47
 bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
 __bpf_prog_run include/linux/filter.h:701 [inline]
 bpf_prog_run include/linux/filter.h:708 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
 bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
 trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
 __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
 queue_work_on+0x1c2/0x380 kernel/workqueue.c:2390
 queue_work include/linux/workqueue.h:662 [inline]
 stats_request+0x1a3/0x230 drivers/virtio/virtio_balloon.c:441
 vring_interrupt+0x21d/0x380 drivers/virtio/virtio_ring.c:2595
 vp_vring_interrupt drivers/virtio/virtio_pci_common.c:82 [inline]
 vp_interrupt+0x192/0x200 drivers/virtio/virtio_pci_common.c:113
 __handle_irq_event_percpu+0x29a/0xa80 kernel/irq/handle.c:158
 handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
 handle_irq_event+0x89/0x1f0 kernel/irq/handle.c:210
 handle_fasteoi_irq+0x48a/0xae0 kernel/irq/chip.c:720
 generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
 handle_irq arch/x86/kernel/irq.c:247 [inline]
 call_irq_handler arch/x86/kernel/irq.c:259 [inline]
 __common_interrupt+0x136/0x230 arch/x86/kernel/irq.c:285
 common_interrupt+0xb4/0xd0 arch/x86/kernel/irq.c:278
 </IRQ>

Reported-by: [email protected]
Closes: https://lore.kernel.org/bpf/[email protected]/
Fixes: 66150d0 ("bpf, lpm: Make locking RT friendly")
Signed-off-by: Kunwu Chan <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 13, 2024
The scope of the TX skb is wider than just mse102x_tx_frame_spi(),
so in case the TX skb room needs to be expanded, we should free the
the temporary skb instead of the original skb. Otherwise the original
TX skb pointer would be freed again in mse102x_tx_work(), which leads
to crashes:

  Internal error: Oops: 0000000096000004 [#2] PREEMPT SMP
  CPU: 0 PID: 712 Comm: kworker/0:1 Tainted: G      D            6.6.23
  Hardware name: chargebyte Charge SOM DC-ONE (DT)
  Workqueue: events mse102x_tx_work [mse102x]
  pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : skb_release_data+0xb8/0x1d8
  lr : skb_release_data+0x1ac/0x1d8
  sp : ffff8000819a3cc0
  x29: ffff8000819a3cc0 x28: ffff0000046daa60 x27: ffff0000057f2dc0
  x26: ffff000005386c00 x25: 0000000000000002 x24: 00000000ffffffff
  x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000057f2e50
  x20: 0000000000000006 x19: 0000000000000000 x18: ffff00003fdacfcc
  x17: e69ad452d0c49def x16: 84a005feff870102 x15: 0000000000000000
  x14: 000000000000024a x13: 0000000000000002 x12: 0000000000000000
  x11: 0000000000000400 x10: 0000000000000930 x9 : ffff00003fd913e8
  x8 : fffffc00001bc008
  x7 : 0000000000000000 x6 : 0000000000000008
  x5 : ffff00003fd91340 x4 : 0000000000000000 x3 : 0000000000000009
  x2 : 00000000fffffffe x1 : 0000000000000000 x0 : 0000000000000000
  Call trace:
   skb_release_data+0xb8/0x1d8
   kfree_skb_reason+0x48/0xb0
   mse102x_tx_work+0x164/0x35c [mse102x]
   process_one_work+0x138/0x260
   worker_thread+0x32c/0x438
   kthread+0x118/0x11c
   ret_from_fork+0x10/0x20
  Code: aa1303e0 97fffab6 72001c1f 54000141 (f9400660)

Cc: [email protected]
Fixes: 2f207cb ("net: vertexcom: Add MSE102x SPI support")
Signed-off-by: Stefan Wahren <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants