Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sdw rework #3

Closed

Conversation

RanderWang
Copy link

In this PR, some sdw special operations are moved to sdw dai driver, such as prepare, trigger and stream free, so that we don't need to check dai index or get sdw stream data form hda data.

Now set soundwire stream in prepare function to follow
hda codec driver. It makes easier for soundwire dai driver
to prepare soundwire stream data for codec driver for
the soundwire stream data is not available in hw_params
function

Signed-off-by: Rander Wang <[email protected]>
Now move these functions to sdw dai driver. And there is
no need to check dai index and store sdw stream data in
privated_data in runtime.

Signed-off-by: Rander Wang <[email protected]>
Support prepare and trigger functions.
Free more resource in hw_free function.
Setup sdw stream in hw_params function.

Signed-off-by: Rander Wang <[email protected]>
sdw_free_stream release dma usage in FW when sdw stream is
free.

Signed-off-by: Rander Wang <[email protected]>
Copy link
Owner

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RanderWang this looks very interesting.
I didn't fully get the code for the RT700, can you please comment more, and the error handling needs to be improved, but this would be good if we can avoid changing the SOF parts.
Nice work.
btw when you re-submit this can you add it on top of sof/integration/soundwire?

}
port_config = &rt700->port_config;
port_config->ch_mask = (1 << (num_channels)) - 1;
port_config->num = port;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port is not initialized?

}
port_config = &rt700->port_config;
port_config->ch_mask = (1 << (num_channels)) - 1;
port_config->num = port;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why you need those two structures? Is this to memorize information that will be used in prepare? If yes, maybe you can add comments to help the reviewer, there are none in your changes...


retval = sdw_stream_add_slave(rt700->slave, &rt700->stream_config,
&rt700->port_config, 1,
stream->sdw_stream);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully get this code, is this going to initialize the 'stream'?
Again please add comments to explain what you are doing, the code is not self-explanatory.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RanderWang yeah same here. I dont understand why we need to move this to prepare. AFAICT, sdw_stream_add_slave() need the stream_config to be set, particularly the frame_rate.


return 0;
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be separate patch

error:
kfree(name);
sdw_release_stream(sdw_stream);
return ret;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not done in the right order, you need to revert the kfree and sdw_release_stream

"alloc stream failed for DAI %s: %d",
dai->name, ret);
ret = -ENOMEM;
goto error;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you only kfree(name) here

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is wrong. The concept of 'stream' means we can use multiple DAIs (cpu and codec), so doing a sdw_alloc_stream here is just not quite right. see comments

/**
 * sdw_alloc_stream() - Allocate and return stream runtime
 *
 * @stream_name: SoundWire stream name
 *
 * Allocates a SoundWire stream runtime instance.
 * sdw_alloc_stream should be called only once per stream. Typically
 * invoked from ALSA/ASoC machine/platform driver.
 */

if (ret < 0) {
dev_err(dai->dev, "failed to set stream pointer on cpu dai %s",
dai->name);
return ret;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory leak, use goto error

/* Set stream pointer on CPU DAI */
ret = snd_soc_dai_set_sdw_stream(dai,
sdw_stream,
substream->stream);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RanderWang please use one line if possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!I get it.

struct snd_soc_pcm_runtime *rtd = substream->private_data;
struct sdw_stream_runtime *sdw_stream = NULL;
char *name;
int i, ret = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superfluous initialisation?

if (ret < 0) {
dev_err(dai->dev, "failed to set stream pointer on codec dai %s",
rtd->codec_dais[i]->name);
return ret;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory leak here too?

@@ -750,6 +817,12 @@ static int intel_hw_params(struct snd_pcm_substream *substream,
int ret, i, ch, dir;
bool pcm = true;

ret = sdw_stream_setup(substream, dai);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we do sdw_stream_step() in hw_params().I think it should be done in startup() to keep it in line with what is there already

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should be done in a dai operation, the notion of stream is larger than a dai.
For example the same stereo 'stream' can be rendered by using link0-dai0 and link1-dai4, going to two separate amplifiers. This setup should be done in a dailink operation where we will have the list of all cpu and codec dais handled.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still struggling here, in the initial proposal there was no notion of DAI,

Initially we did all the soundwire stream handling in pcm_ops callbacks.

Here everything is done as part of dai_ops callbacks.

That's very very different!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart but all the dai_ops are called from the pcm ops context anyway isnt it? like for example:
https://github.com/thesofproject/linux/blob/b45315d57e2ab6fce1de8bd6391d925143b4fbda/sound/soc/soc-pcm.c#L806

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @ranj063, I don't get your point. I never fully understood how DPCM works so not sure if the two models are was referring to are equivalent.

Does anyone have a sequence diagram showing how those rtd, pcm_ops and cpu_dai and codec_dai ops are called?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check the ASOC and give a process flow picture

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should be done in a dai operation, the notion of stream is larger than a dai.
For example the same stereo 'stream' can be rendered by using link0-dai0 and link1-dai4, going to two separate amplifiers. This setup should be done in a dailink operation where we will have the list of all cpu and codec dais handled.

I agree with @plbossart the operation can be done in dailink. Actually I think intel_hw_params() will be invoked twice if there are two sdw components in the same dailink. @RanderWang Can we allocate and release stream in machine driver? Maybe we can use startup and shutdown ops?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drawing1

Copy link
Author

@RanderWang RanderWang Aug 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are component pcm driver and dai driver in ASoC. they all have prepare, hw_params, and trigger functions. In the following picture, dai hw_params is for dai driver , and the only hw_params is for component pcm driver. Now there is only component pcm driver in SOF for SDW, and there is only dai driver in SDW component and codec comopnent.

All the components pcm driver are invoked in both FE and BE dai, so you will find some functions are invoke twice. Although there is no component pcm driver in SDW and codec component, I still marked some functions in in picture to reflect the process flow in ASoC

Drawing2

stream->sdw_stream);
if (retval) {
dev_err(dai->dev, "Unable to configure port\n");
return retval;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RanderWang just return retval outside the if. No need for 2 separate return statements here

@ranj063
Copy link

ranj063 commented Aug 14, 2019

@RanderWang the second and third commits need to be squashed

static const struct sdw_intel_ops sdw_callback = {
.config_stream = sdw_config_stream,
.free_stream = sdw_free_stream,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RanderWang where will this get called? Would it be in the hw_free dai op

@RanderWang
Copy link
Author

@plbossart @ranj063 Thanks for your review ! I know there are many defects in the PR, I just want to show my idea about SDW. Now sdw runtime data is stored in dai and shared with sdw dai driver and codec dai driver. I want to move these functions to dai driver to make SOF simple. And also it is easy for CAVS driver to support SDW, although maybe there is no such requirement.

@plbossart
Copy link
Owner

@RanderWang I did a bit more thinking and I think your idea to implement new DAI callbacks is the right direction. There is one issue with the PR, which I mentioned above, that the sdw_alloc_stream() is not a DAI operation and should be done at the DAI link level. But we can do this later.
The additional simplification would be to remove the callbacks for config and free (using the dai index instead) but again we can do this in an update.
Can you try and provide an updated version on top of sof/integration and we'll work from there?
Thank you.

@RanderWang
Copy link
Author

@RanderWang I did a bit more thinking and I think your idea to implement new DAI callbacks is the right direction. There is one issue with the PR, which I mentioned above, that the sdw_alloc_stream() is not a DAI operation and should be done at the DAI link level. But we can do this later.
The additional simplification would be to remove the callbacks for config and free (using the dai index instead) but again we can do this in an update.
Can you try and provide an updated version on top of sof/integration and we'll work from there?
Thank you.

yes, I will refine my PR and submit it.

plbossart pushed a commit that referenced this pull request Aug 21, 2019
…OL_MF_STRICT were specified

When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should
try best to migrate misplaced pages, if some of the pages could not be
migrated, then return -EIO.

There are three different sub-cases:
 1. vma is not migratable
 2. vma is migratable, but there are unmovable pages
 3. vma is migratable, pages are movable, but migrate_pages() fails

If #1 happens, kernel would just abort immediately, then return -EIO,
after a7f40cf ("mm: mempolicy: make mbind() return -EIO when
MPOL_MF_STRICT is specified").

If #3 happens, kernel would set policy and migrate pages with
best-effort, but won't rollback the migrated pages and reset the policy
back.

Before that commit, they behaves in the same way.  It'd better to keep
their behavior consistent.  But, rolling back the migrated pages and
resetting the policy back sounds not feasible, so just make #1 behave as
same as #3.

Userspace will know that not everything was successfully migrated (via
-EIO), and can take whatever steps it deems necessary - attempt
rollback, determine which exact page(s) are violating the policy, etc.

Make queue_pages_range() return 1 to indicate there are unmovable pages
or vma is not migratable.

The #2 is not handled correctly in the current kernel, the following
patch will fix it.

[[email protected]: fix review comments from Vlastimil]
  Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
@RanderWang
Copy link
Author

done

@RanderWang RanderWang closed this Sep 1, 2019
plbossart pushed a commit that referenced this pull request Sep 3, 2019
Revert the commit bd293d0. The proper
fix has been made available with commit d0a255e ("loop: set
PF_MEMALLOC_NOIO for the worker thread").

Note that the fix offered by commit bd293d0 doesn't really prevent
the deadlock from occuring - if we look at the stacktrace reported by
Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex -
i.e. it has already successfully taken the mutex. Changing the mutex
from mutex_lock to mutex_trylock won't help with deadlocks that happen
afterwards.

PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
   #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
   #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
   #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
   #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
   #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
   #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
   #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
   #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
   #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
   #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
  #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
  #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
  #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
  #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
  #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Fixes: bd293d0 ("dm bufio: fix deadlock with loop device")
Depends-on: d0a255e ("loop: set PF_MEMALLOC_NOIO for the worker thread")
Signed-off-by: Mike Snitzer <[email protected]>
plbossart pushed a commit that referenced this pull request Sep 3, 2019
reply_cache_stats uses wrong parameter as seq file private structure and
thus causes the following kernel crash when users read
/proc/fs/nfsd/reply_cache_stats

BUG: kernel NULL pointer dereference, address: 00000000000001f9
PGD 0 P4D 0
Oops: 0000 [#3] SMP PTI
CPU: 6 PID: 1502 Comm: cat Tainted: G      D           5.3.0-rc3+ #1
Hardware name: Intel Corporation Broadwell Client platform/Basking Ridge, BIOS BDW-E2R1.86C.0118.R01.1503110618 03/11/2015
RIP: 0010:nfsd_reply_cache_stats_show+0x3b/0x2d0
Code: 41 54 49 89 f4 48 89 fe 48 c7 c7 b3 10 33 88 53 bb e8 03 00 00 e8 88 82 d1 ff bf 58 89 41 00 e8 eb c5 85 00 48 83 eb 01 75 f0 <41> 8b 94 24 f8 01 00 00 48 c7 c6 be 10 33 88 4c 89 ef bb e8 03 00
RSP: 0018:ffffaa520106fe08 EFLAGS: 00010246
RAX: 000000cfe1a77123 RBX: 0000000000000000 RCX: 0000000000291b46
RDX: 000000cf00000000 RSI: 0000000000000006 RDI: 0000000000291b28
RBP: ffffaa520106fe20 R08: 0000000000000006 R09: 000000cfe17e55dd
R10: ffffa424e47c0000 R11: 000000000000030b R12: 0000000000000001
R13: ffffa424e5697000 R14: 0000000000000001 R15: ffffa424e5697000
FS:  00007f805735f580(0000) GS:ffffa424f8f80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001f9 CR3: 00000000655ce005 CR4: 00000000003606e0
Call Trace:
 seq_read+0x194/0x3e0
 __vfs_read+0x1b/0x40
 vfs_read+0x95/0x140
 ksys_read+0x61/0xe0
 __x64_sys_read+0x1a/0x20
 do_syscall_64+0x4d/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f805728b861
Code: fe ff ff 50 48 8d 3d 86 b4 09 00 e8 79 e0 01 00 66 0f 1f 84 00 00 00 00 00 48 8d 05 d9 19 0d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 48 83 ec 28 48 89 54
RSP: 002b:00007ffea1ce3c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f805728b861
RDX: 0000000000020000 RSI: 00007f8057183000 RDI: 0000000000000003
RBP: 00007f8057183000 R08: 00007f8057182010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000559a60e8ff10
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
Modules linked in:
CR2: 00000000000001f9
---[ end trace 01613595153f0cba ]---
RIP: 0010:nfsd_reply_cache_stats_show+0x3b/0x2d0
Code: 41 54 49 89 f4 48 89 fe 48 c7 c7 b3 10 33 88 53 bb e8 03 00 00 e8 88 82 d1 ff bf 58 89 41 00 e8 eb c5 85 00 48 83 eb 01 75 f0 <41> 8b 94 24 f8 01 00 00 48 c7 c6 be 10 33 88 4c 89 ef bb e8 03 00
RSP: 0018:ffffaa52004b3e08 EFLAGS: 00010246
RAX: 0000002bab45a7c6 RBX: 0000000000000000 RCX: 0000000000291b4c
RDX: 0000002b00000000 RSI: 0000000000000004 RDI: 0000000000291b28
RBP: ffffaa52004b3e20 R08: 0000000000000004 R09: 0000002bab1c8c7a
R10: ffffa424e5500000 R11: 00000000000002a9 R12: 0000000000000001
R13: ffffa424e4475000 R14: 0000000000000001 R15: ffffa424e4475000
FS:  00007f805735f580(0000) GS:ffffa424f8f80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001f9 CR3: 00000000655ce005 CR4: 00000000003606e0
Killed

Fixes: 3ba7583 ("nfsd4: drc containerization")
Signed-off-by: He Zhe <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
plbossart pushed a commit that referenced this pull request Sep 3, 2019
…s_blob()

Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
freeing the i_xattrs.blob buffer while holding the i_ceph_lock.  This can
be fixed by having this function returning the old blob buffer and have
the callers of this function freeing it when the lock is released.

The following backtrace was triggered by fstests generic/117.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
  4 locks held by fsstress/649:
   #0: 00000000a7478e7e (&type->s_umount_key#19){++++}, at: iterate_supers+0x77/0xf0
   #1: 00000000f8de1423 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60
   #2: 00000000562f2b27 (&s->s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
   #3: 00000000f83ce16a (&mdsc->snap_rwsem){++++}, at: ceph_check_caps+0x3ed/0xc60
  CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ torvalds#439
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   __ceph_build_xattrs_blob+0x12b/0x170
   __send_cap+0x302/0x540
   ? __lock_acquire+0x23c/0x1e40
   ? __mark_caps_flushing+0x15c/0x280
   ? _raw_spin_unlock+0x24/0x30
   ceph_check_caps+0x5f0/0xc60
   ceph_flush_dirty_caps+0x7c/0x150
   ? __ia32_sys_fdatasync+0x20/0x20
   ceph_sync_fs+0x5a/0x130
   iterate_supers+0x8f/0xf0
   ksys_sync+0x4f/0xb0
   __ia32_sys_sync+0xa/0x10
   do_syscall_64+0x50/0x1c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7fc6409ab617

Signed-off-by: Luis Henriques <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
plbossart pushed a commit that referenced this pull request Sep 3, 2019
Calling ceph_buffer_put() in fill_inode() may result in freeing the
i_xattrs.blob buffer while holding the i_ceph_lock.  This can be fixed by
postponing the call until later, when the lock is released.

The following backtrace was triggered by fstests generic/070.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4
  6 locks held by kworker/0:4/3852:
   #0: 000000004270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0
   #1: 00000000eb420803 ((work_completion)(&(&con->work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0
   #2: 00000000be1c53a4 (&s->s_mutex){+.+.}, at: dispatch+0x288/0x1476
   #3: 00000000559cb958 (&mdsc->snap_rwsem){++++}, at: dispatch+0x2eb/0x1476
   #4: 000000000d5ebbae (&req->r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476
   #5: 00000000a83d0514 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70
  CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ torvalds#441
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Workqueue: ceph-msgr ceph_con_workfn
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   fill_inode.isra.0+0xa9b/0xf70
   ceph_fill_trace+0x13b/0xc70
   ? dispatch+0x2eb/0x1476
   dispatch+0x320/0x1476
   ? __mutex_unlock_slowpath+0x4d/0x2a0
   ceph_con_workfn+0xc97/0x2ec0
   ? process_one_work+0x1b8/0x5f0
   process_one_work+0x244/0x5f0
   worker_thread+0x4d/0x3e0
   kthread+0x105/0x140
   ? process_one_work+0x5f0/0x5f0
   ? kthread_park+0x90/0x90
   ret_from_fork+0x3a/0x50

Signed-off-by: Luis Henriques <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
plbossart pushed a commit that referenced this pull request Oct 18, 2019
This patch fixes the lock inversion complaint:

============================================
WARNING: possible recursive locking detected
5.3.0-rc7-dbg+ #1 Not tainted
--------------------------------------------
kworker/u16:6/171 is trying to acquire lock:
00000000035c6e6c (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x78/0x4a0 [rdma_cm]

but task is already holding lock:
00000000bc7c307d (&id_priv->handler_mutex){+.+.}, at: iw_conn_req_handler+0x151/0x680 [rdma_cm]

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&id_priv->handler_mutex);
  lock(&id_priv->handler_mutex);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/u16:6/171:
 #0: 00000000e2eaa773 ((wq_completion)iw_cm_wq){+.+.}, at: process_one_work+0x472/0xac0
 #1: 000000001efd357b ((work_completion)(&work->work)#3){+.+.}, at: process_one_work+0x476/0xac0
 #2: 00000000bc7c307d (&id_priv->handler_mutex){+.+.}, at: iw_conn_req_handler+0x151/0x680 [rdma_cm]

stack backtrace:
CPU: 3 PID: 171 Comm: kworker/u16:6 Not tainted 5.3.0-rc7-dbg+ #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: iw_cm_wq cm_work_handler [iw_cm]
Call Trace:
 dump_stack+0x8a/0xd6
 __lock_acquire.cold+0xe1/0x24d
 lock_acquire+0x106/0x240
 __mutex_lock+0x12e/0xcb0
 mutex_lock_nested+0x1f/0x30
 rdma_destroy_id+0x78/0x4a0 [rdma_cm]
 iw_conn_req_handler+0x5c9/0x680 [rdma_cm]
 cm_work_handler+0xe62/0x1100 [iw_cm]
 process_one_work+0x56d/0xac0
 worker_thread+0x7a/0x5d0
 kthread+0x1bc/0x210
 ret_from_fork+0x24/0x30

This is not a bug as there are actually two lock classes here.

Link: https://lore.kernel.org/r/[email protected]
Fixes: de910bd ("RDMA/cma: Simplify locking needed for serialization of callbacks")
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
plbossart pushed a commit that referenced this pull request Oct 25, 2019
Daniel Vetter uncovered a nasty cycle in using the mmu-notifiers to
invalidate userptr objects which also happen to be pulled into GGTT
mmaps. That is when we unbind the userptr object (on mmu invalidation),
we revoke all CPU mmaps, which may then recurse into mmu invalidation.

We looked for ways of breaking the cycle, but the revocation on
invalidation is required and cannot be avoided. The only solution we
could see was to not allow such GGTT bindings of userptr objects in the
first place. In practice, no one really wants to use a GGTT mmapping of
a CPU pointer...

Just before Daniel's explosive lockdep patches land in v5.4-rc1, we got
a genuine blip from CI:

<4>[  246.793958] ======================================================
<4>[  246.793972] WARNING: possible circular locking dependency detected
<4>[  246.793989] 5.3.0-gbd6c56f50d15-drmtip_372+ #1 Tainted: G     U
<4>[  246.794003] ------------------------------------------------------
<4>[  246.794017] kswapd0/145 is trying to acquire lock:
<4>[  246.794030] 000000003f565be6 (&dev->struct_mutex/1){+.+.}, at: userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[  246.794250]
                  but task is already holding lock:
<4>[  246.794263] 000000001799cef9 (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe6/0x2a0
<4>[  246.794291]
                  which lock already depends on the new lock.

<4>[  246.794307]
                  the existing dependency chain (in reverse order) is:
<4>[  246.794322]
                  -> #3 (&anon_vma->rwsem){++++}:
<4>[  246.794344]        down_write+0x33/0x70
<4>[  246.794357]        __vma_adjust+0x3d9/0x7b0
<4>[  246.794370]        __split_vma+0x16a/0x180
<4>[  246.794385]        mprotect_fixup+0x2a5/0x320
<4>[  246.794399]        do_mprotect_pkey+0x208/0x2e0
<4>[  246.794413]        __x64_sys_mprotect+0x16/0x20
<4>[  246.794429]        do_syscall_64+0x55/0x1c0
<4>[  246.794443]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794456]
                  -> #2 (&mapping->i_mmap_rwsem){++++}:
<4>[  246.794478]        down_write+0x33/0x70
<4>[  246.794493]        unmap_mapping_pages+0x48/0x130
<4>[  246.794519]        i915_vma_revoke_mmap+0x81/0x1b0 [i915]
<4>[  246.794519]        i915_vma_unbind+0x11d/0x4a0 [i915]
<4>[  246.794519]        i915_vma_destroy+0x31/0x300 [i915]
<4>[  246.794519]        __i915_gem_free_objects+0xb8/0x4b0 [i915]
<4>[  246.794519]        drm_file_free.part.0+0x1e6/0x290
<4>[  246.794519]        drm_release+0xa6/0xe0
<4>[  246.794519]        __fput+0xc2/0x250
<4>[  246.794519]        task_work_run+0x82/0xb0
<4>[  246.794519]        do_exit+0x35b/0xdb0
<4>[  246.794519]        do_group_exit+0x34/0xb0
<4>[  246.794519]        __x64_sys_exit_group+0xf/0x10
<4>[  246.794519]        do_syscall_64+0x55/0x1c0
<4>[  246.794519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794519]
                  -> #1 (&vm->mutex){+.+.}:
<4>[  246.794519]        i915_gem_shrinker_taints_mutex+0x6d/0xe0 [i915]
<4>[  246.794519]        i915_address_space_init+0x9f/0x160 [i915]
<4>[  246.794519]        i915_ggtt_init_hw+0x55/0x170 [i915]
<4>[  246.794519]        i915_driver_probe+0xc9f/0x1620 [i915]
<4>[  246.794519]        i915_pci_probe+0x43/0x1b0 [i915]
<4>[  246.794519]        pci_device_probe+0x9e/0x120
<4>[  246.794519]        really_probe+0xea/0x3d0
<4>[  246.794519]        driver_probe_device+0x10b/0x120
<4>[  246.794519]        device_driver_attach+0x4a/0x50
<4>[  246.794519]        __driver_attach+0x97/0x130
<4>[  246.794519]        bus_for_each_dev+0x74/0xc0
<4>[  246.794519]        bus_add_driver+0x13f/0x210
<4>[  246.794519]        driver_register+0x56/0xe0
<4>[  246.794519]        do_one_initcall+0x58/0x300
<4>[  246.794519]        do_init_module+0x56/0x1f6
<4>[  246.794519]        load_module+0x25bd/0x2a40
<4>[  246.794519]        __se_sys_finit_module+0xd3/0xf0
<4>[  246.794519]        do_syscall_64+0x55/0x1c0
<4>[  246.794519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794519]
                  -> #0 (&dev->struct_mutex/1){+.+.}:
<4>[  246.794519]        __lock_acquire+0x15d8/0x1e90
<4>[  246.794519]        lock_acquire+0xa6/0x1c0
<4>[  246.794519]        __mutex_lock+0x9d/0x9b0
<4>[  246.794519]        userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[  246.794519]        __mmu_notifier_invalidate_range_start+0x85/0x110
<4>[  246.794519]        try_to_unmap_one+0x76b/0x860
<4>[  246.794519]        rmap_walk_anon+0x104/0x280
<4>[  246.794519]        try_to_unmap+0xc0/0xf0
<4>[  246.794519]        shrink_page_list+0x561/0xc10
<4>[  246.794519]        shrink_inactive_list+0x220/0x440
<4>[  246.794519]        shrink_node_memcg+0x36e/0x740
<4>[  246.794519]        shrink_node+0xcb/0x490
<4>[  246.794519]        balance_pgdat+0x241/0x580
<4>[  246.794519]        kswapd+0x16c/0x530
<4>[  246.794519]        kthread+0x119/0x130
<4>[  246.794519]        ret_from_fork+0x24/0x50
<4>[  246.794519]
                  other info that might help us debug this:

<4>[  246.794519] Chain exists of:
                    &dev->struct_mutex/1 --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

<4>[  246.794519]  Possible unsafe locking scenario:

<4>[  246.794519]        CPU0                    CPU1
<4>[  246.794519]        ----                    ----
<4>[  246.794519]   lock(&anon_vma->rwsem);
<4>[  246.794519]                                lock(&mapping->i_mmap_rwsem);
<4>[  246.794519]                                lock(&anon_vma->rwsem);
<4>[  246.794519]   lock(&dev->struct_mutex/1);
<4>[  246.794519]
                   *** DEADLOCK ***

v2: Say no to mmap_ioctl

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111744
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111870
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit a431174)
Signed-off-by: Rodrigo Vivi <[email protected]>
plbossart pushed a commit that referenced this pull request Oct 25, 2019
Both multi_cpu_stop() and set_state() access multi_stop_data::state
racily using plain accesses. These are subject to compiler
transformations which could break the intended behaviour of the code,
and this situation is detected by KCSAN on both arm64 and x86 (splats
below).

Improve matters by using READ_ONCE() and WRITE_ONCE() to ensure that the
compiler cannot elide, replay, or tear loads and stores.

In multi_cpu_stop() the two loads of multi_stop_data::state are expected to
be a consistent value, so snapshot the value into a temporary variable to
ensure this.

The state transitions are serialized by atomic manipulation of
multi_stop_data::num_threads, and other fields in multi_stop_data are not
modified while subject to concurrent reads.

KCSAN splat on arm64:

| BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0
|
| write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3:
|  set_state+0x80/0xb0
|  multi_cpu_stop+0x16c/0x198
|  cpu_stopper_thread+0x170/0x298
|  smpboot_thread_fn+0x40c/0x560
|  kthread+0x1a8/0x1b0
|  ret_from_fork+0x10/0x18
|
| read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1:
|  multi_cpu_stop+0xa8/0x198
|  cpu_stopper_thread+0x170/0x298
|  smpboot_thread_fn+0x40c/0x560
|  kthread+0x1a8/0x1b0
|  ret_from_fork+0x10/0x18
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3
| Hardware name: linux,dummy-virt (DT)

KCSAN splat on x86:

| write to 0xffffb0bac0013e18 of 4 bytes by task 19 on cpu 2:
|  set_state kernel/stop_machine.c:170 [inline]
|  ack_state kernel/stop_machine.c:177 [inline]
|  multi_cpu_stop+0x1a4/0x220 kernel/stop_machine.c:227
|  cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
|  smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
|  kthread+0x1b5/0x200 kernel/kthread.c:255
|  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
|
| read to 0xffffb0bac0013e18 of 4 bytes by task 44 on cpu 7:
|  multi_cpu_stop+0xb4/0x220 kernel/stop_machine.c:213
|  cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
|  smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
|  kthread+0x1b5/0x200 kernel/kthread.c:255
|  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 7 PID: 44 Comm: migration/7 Not tainted 5.3.0+ #1
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014

Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Marco Elver <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
plbossart pushed a commit that referenced this pull request Nov 11, 2019
When lockdep is enabled, plugging Thunderbolt dock on Dominik's laptop
triggers following splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.3.0-rc6+ #1 Tainted: G                T
  ------------------------------------------------------
  pool-/usr/lib/b/1258 is trying to acquire lock:
  000000005ab0ad43 (pci_rescan_remove_lock){+.+.}, at: authorized_store+0xe8/0x210

  but task is already holding lock:
  00000000bfb796b5 (&tb->lock){+.+.}, at: authorized_store+0x7c/0x210

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&tb->lock){+.+.}:
         __mutex_lock+0xac/0x9a0
         tb_domain_add+0x2d/0x130
         nhi_probe+0x1dd/0x330
         pci_device_probe+0xd2/0x150
         really_probe+0xee/0x280
         driver_probe_device+0x50/0xc0
         bus_for_each_drv+0x84/0xd0
         __device_attach+0xe4/0x150
         pci_bus_add_device+0x4e/0x70
         pci_bus_add_devices+0x2e/0x66
         pci_bus_add_devices+0x59/0x66
         pci_bus_add_devices+0x59/0x66
         enable_slot+0x344/0x450
         acpiphp_check_bridge.part.0+0x119/0x150
         acpiphp_hotplug_notify+0xaa/0x140
         acpi_device_hotplug+0xa2/0x3f0
         acpi_hotplug_work_fn+0x1a/0x30
         process_one_work+0x234/0x580
         worker_thread+0x50/0x3b0
         kthread+0x10a/0x140
         ret_from_fork+0x3a/0x50

  -> #0 (pci_rescan_remove_lock){+.+.}:
         __lock_acquire+0xe54/0x1ac0
         lock_acquire+0xb8/0x1b0
         __mutex_lock+0xac/0x9a0
         authorized_store+0xe8/0x210
         kernfs_fop_write+0x125/0x1b0
         vfs_write+0xc2/0x1d0
         ksys_write+0x6c/0xf0
         do_syscall_64+0x50/0x180
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  other info that might help us debug this:
   Possible unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(&tb->lock);
                                 lock(pci_rescan_remove_lock);
                                 lock(&tb->lock);
    lock(pci_rescan_remove_lock);

   *** DEADLOCK ***
  5 locks held by pool-/usr/lib/b/1258:
   #0: 000000003df1a1ad (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60
   #1: 0000000095a40b02 (sb_writers#6){.+.+}, at: vfs_write+0x185/0x1d0
   #2: 0000000017a7d714 (&of->mutex){+.+.}, at: kernfs_fop_write+0xf2/0x1b0
   #3: 000000004f262981 (kn->count#208){.+.+}, at: kernfs_fop_write+0xfa/0x1b0
   #4: 00000000bfb796b5 (&tb->lock){+.+.}, at: authorized_store+0x7c/0x210

  stack backtrace:
  CPU: 0 PID: 1258 Comm: pool-/usr/lib/b Tainted: G                T 5.3.0-rc6+ #1

On an system using ACPI hotplug the host router gets hotplugged first and then
the firmware starts sending notifications about connected devices so the above
scenario should not happen in reality. However, after taking a second
look at commit a03e828 ("thunderbolt: Serialize PCIe tunnel
creation with PCI rescan") that introduced the locking, I don't think it
is actually correct. It may have cured the symptom but probably the real
root cause was somewhere closer to PCI stack and possibly is already
fixed with recent kernels. I also tried to reproduce the original issue
with the commit reverted but could not.

So to keep lockdep happy and the code bit less complex drop calls to
pci_lock_rescan_remove()/pci_unlock_rescan_remove() in
tb_switch_set_authorized() effectively reverting a03e828.

Link: https://lkml.org/lkml/2019/8/30/513
Fixes: a03e828 ("thunderbolt: Serialize PCIe tunnel creation with PCI rescan")
Reported-by: Dominik Brodowski <[email protected]>
Signed-off-by: Mika Westerberg <[email protected]>
plbossart pushed a commit that referenced this pull request Nov 11, 2019
8-letter strings representing ARC perf events are stores in two
32-bit registers as ASCII characters like that: "IJMP", "IALL", "IJMPTAK" etc.

And the same order of bytes in the word is used regardless CPU endianness.

Which means in case of big-endian CPU core we need to swap bytes to get
the same order as if it was on little-endian CPU.

Otherwise we're seeing the following error message on boot:
------------------------->8----------------------
ARC perf        : 8 counters (32 bits), 40 conditions, [overflow IRQ support]
sysfs: cannot create duplicate filename '/devices/arc_pct/events/pmji'
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.18 #3
Stack Trace:
  arc_unwind_core+0xd4/0xfc
  dump_stack+0x64/0x80
  sysfs_warn_dup+0x46/0x58
  sysfs_add_file_mode_ns+0xb2/0x168
  create_files+0x70/0x2a0
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at kernel/events/core.c:12144 perf_event_sysfs_init+0x70/0xa0
Failed to register pmu: arc_pct, reason -17
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.18 #3
Stack Trace:
  arc_unwind_core+0xd4/0xfc
  dump_stack+0x64/0x80
  __warn+0x9c/0xd4
  warn_slowpath_fmt+0x22/0x2c
  perf_event_sysfs_init+0x70/0xa0
---[ end trace a75fb9a9837bd1ec ]---
------------------------->8----------------------

What happens here we're trying to register more than one raw perf event
with the same name "PMJI". Why? Because ARC perf events are 4 to 8 letters
and encoded into two 32-bit words. In this particular case we deal with 2
events:
 * "IJMP____" which counts all jump & branch instructions
 * "IJMPC___" which counts only conditional jumps & branches

Those strings are split in two 32-bit words this way "IJMP" + "____" &
"IJMP" + "C___" correspondingly. Now if we read them swapped due to CPU core
being big-endian then we read "PMJI" + "____" & "PMJI" + "___C".

And since we interpret read array of ASCII letters as a null-terminated string
on big-endian CPU we end up with 2 events of the same name "PMJI".

Signed-off-by: Alexey Brodkin <[email protected]>
Cc: [email protected]
Signed-off-by: Vineet Gupta <[email protected]>
plbossart pushed a commit that referenced this pull request Mar 11, 2024
Hou Tao says:

====================
Fix the read of vsyscall page through bpf

From: Hou Tao <[email protected]>

Hi,

As reported by syzboot [1] and [2], when trying to read vsyscall page
by using bpf_probe_read_kernel() or bpf_probe_read(), oops may happen.

Thomas Gleixner had proposed a test patch [3], but it seems that no
formal patch is posted after about one month [4], so I post it instead
and add an Originally-by tag in patch #2.

Patch #1 makes is_vsyscall_vaddr() being a common helper. Patch #2 fixes
the problem by disallowing vsyscall page read for
copy_from_kernel_nofault(). Patch #3 adds one test case to ensure the
read of vsyscall page through bpf is rejected. Please see individual
patches for more details.

Comments are always welcome.

[1]: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com/
[2]: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com/
[3]: https://lore.kernel.org/bpf/87r0jwquhv.ffs@tglx/
[4]: https://lore.kernel.org/bpf/[email protected]/

Change Log:
v3:
 * rephrase commit message for patch #1 & #2 (Sohil)
 * reword comments in copy_from_kernel_nofault_allowed() (Sohil)
 * add Rvb tag for patch #1 and Acked-by tag for patch #3 (Sohil, Yonghong)

v2: https://lore.kernel.org/bpf/[email protected]/
  * move is_vsyscall_vaddr to asm/vsyscall.h instead (Sohil)
  * elaborate on the reason for disallowing of vsyscall page read in
    copy_from_kernel_nofault_allowed() (Sohil)
  * update the commit message of patch #2 to more clearly explain how
    the oops occurs. (Sohil)
  * update the commit message of patch #3 to explain the expected return
    values of various bpf helpers (Yonghong)

v1: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
plbossart pushed a commit that referenced this pull request Mar 11, 2024
  ...
  cdns3_gadget_ep_free_request(&priv_ep->endpoint, &priv_req->request);
  list_del_init(&priv_req->list);
  ...

'priv_req' actually free at cdns3_gadget_ep_free_request(). But
list_del_init() use priv_req->list after it.

[ 1542.642868][  T534] BUG: KFENCE: use-after-free read in __list_del_entry_valid+0x10/0xd4
[ 1542.642868][  T534]
[ 1542.653162][  T534] Use-after-free read at 0x000000009ed0ba99 (in kfence-#3):
[ 1542.660311][  T534]  __list_del_entry_valid+0x10/0xd4
[ 1542.665375][  T534]  cdns3_gadget_ep_disable+0x1f8/0x388 [cdns3]
[ 1542.671571][  T534]  usb_ep_disable+0x44/0xe4
[ 1542.675948][  T534]  ffs_func_eps_disable+0x64/0xc8
[ 1542.680839][  T534]  ffs_func_set_alt+0x74/0x368
[ 1542.685478][  T534]  ffs_func_disable+0x18/0x28

Move list_del_init() before cdns3_gadget_ep_free_request() to resolve this
problem.

Cc: [email protected]
Fixes: 7733f6c ("usb: cdns3: Add Cadence USB3 DRD Driver")
Signed-off-by: Frank Li <[email protected]>
Reviewed-by: Roger Quadros <[email protected]>
Acked-by: Peter Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
plbossart pushed a commit that referenced this pull request Mar 11, 2024
…kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.8, take #3

- Check for the validity of interrupts handled by a MOVALL
  command

- Check for the validity of interrupts while reading the
  pending state on enabling LPIs.
plbossart pushed a commit that referenced this pull request Mar 11, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

Patch #1 restores NFPROTO_INET with nft_compat, from Ignat Korchagin.

Patch #2 fixes an issue with bridge netfilter and broadcast/multicast
packets.

There is a day 0 bug in br_netfilter when used with connection tracking.

Conntrack assumes that an nf_conn structure that is not yet added to
hash table ("unconfirmed"), is only visible by the current cpu that is
processing the sk_buff.

For bridge this isn't true, sk_buff can get cloned in between, and
clones can be processed in parallel on different cpu.

This patch disables NAT and conntrack helpers for multicast packets.

Patch #3 adds a selftest to cover for the br_netfilter bug.

netfilter pull request 24-02-29

* tag 'nf-24-02-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: add bridge conntrack + multicast test case
  netfilter: bridge: confirm multicast packets before passing them up the stack
  netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate()
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
plbossart pushed a commit that referenced this pull request Mar 11, 2024
The pf->dplls.lock mutex is initialized too late, after its first use.
Move it to the top of ice_dpll_init.
Note that the "err_exit" error path destroys the mutex. And the mutex is
the last thing destroyed in ice_dpll_deinit.
This fixes the following warning with CONFIG_DEBUG_MUTEXES:

 ice 0000:10:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.36.0
 ice 0000:10:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
 ice 0000:10:00.0: PTP init successful
 ------------[ cut here ]------------
 DEBUG_LOCKS_WARN_ON(lock->magic != lock)
 WARNING: CPU: 0 PID: 410 at kernel/locking/mutex.c:587 __mutex_lock+0x773/0xd40
 Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ice(+) nvme nvme_c>
 CPU: 0 PID: 410 Comm: kworker/0:4 Not tainted 6.8.0-rc5+ #3
 Hardware name: HPE ProLiant DL110 Gen10 Plus/ProLiant DL110 Gen10 Plus, BIOS U56 10/19/2023
 Workqueue: events work_for_cpu_fn
 RIP: 0010:__mutex_lock+0x773/0xd40
 Code: c0 0f 84 1d f9 ff ff 44 8b 35 0d 9c 69 01 45 85 f6 0f 85 0d f9 ff ff 48 c7 c6 12 a2 a9 85 48 c7 c7 12 f1 a>
 RSP: 0018:ff7eb1a3417a7ae0 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
 RDX: 0000000000000002 RSI: ffffffff85ac2bff RDI: 00000000ffffffff
 RBP: ff7eb1a3417a7b80 R08: 0000000000000000 R09: 00000000ffffbfff
 R10: ff7eb1a3417a7978 R11: ff32b80f7fd2e568 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: ff32b7f02c50e0d8
 FS:  0000000000000000(0000) GS:ff32b80efe800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055b5852cc000 CR3: 000000003c43a004 CR4: 0000000000771ef0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
  ? __warn+0x84/0x170
  ? __mutex_lock+0x773/0xd40
  ? report_bug+0x1c7/0x1d0
  ? prb_read_valid+0x1b/0x30
  ? handle_bug+0x42/0x70
  ? exc_invalid_op+0x18/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? __mutex_lock+0x773/0xd40
  ? rcu_is_watching+0x11/0x50
  ? __kmalloc_node_track_caller+0x346/0x490
  ? ice_dpll_lock_status_get+0x28/0x50 [ice]
  ? __pfx_ice_dpll_lock_status_get+0x10/0x10 [ice]
  ? ice_dpll_lock_status_get+0x28/0x50 [ice]
  ice_dpll_lock_status_get+0x28/0x50 [ice]
  dpll_device_get_one+0x14f/0x2e0
  dpll_device_event_send+0x7d/0x150
  dpll_device_register+0x124/0x180
  ice_dpll_init_dpll+0x7b/0xd0 [ice]
  ice_dpll_init+0x224/0xa40 [ice]
  ? _dev_info+0x70/0x90
  ice_load+0x468/0x690 [ice]
  ice_probe+0x75b/0xa10 [ice]
  ? _raw_spin_unlock_irqrestore+0x4f/0x80
  ? process_one_work+0x1a3/0x500
  local_pci_probe+0x47/0xa0
  work_for_cpu_fn+0x17/0x30
  process_one_work+0x20d/0x500
  worker_thread+0x1df/0x3e0
  ? __pfx_worker_thread+0x10/0x10
  kthread+0x103/0x140
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>
 irq event stamp: 125197
 hardirqs last  enabled at (125197): [<ffffffff8416409d>] finish_task_switch.isra.0+0x12d/0x3d0
 hardirqs last disabled at (125196): [<ffffffff85134044>] __schedule+0xea4/0x19f0
 softirqs last  enabled at (105334): [<ffffffff84e1e65a>] napi_get_frags_check+0x1a/0x60
 softirqs last disabled at (105332): [<ffffffff84e1e65a>] napi_get_frags_check+0x1a/0x60
 ---[ end trace 0000000000000000 ]---

Fixes: d7999f5 ("ice: implement dpll interface to control cgu")
Signed-off-by: Michal Schmidt <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
plbossart pushed a commit that referenced this pull request Mar 11, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains fixes for net:

Patch #1 disallows anonymous sets with timeout, except for dynamic sets.
         Anonymous sets with timeouts using the pipapo set backend makes
         no sense from userspace perspective.

Patch #2 rejects constant sets with timeout which has no practical usecase.
         This kind of set, once bound, contains elements that expire but
         no new elements can be added.

Patch #3 restores custom conntrack expectations with NFPROTO_INET,
         from Florian Westphal.

Patch #4 marks rhashtable anonymous set with timeout as dead from the
         commit path to avoid that async GC collects these elements. Rules
         that refers to the anonymous set get released with no mutex held
         from the commit path.

Patch #5 fixes a UBSAN shift overflow in H.323 conntrack helper,
         from Lena Wang.

netfilter pull request 24-03-07

* tag 'nf-24-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_conntrack_h323: Add protection for bmp length out of range
  netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout
  netfilter: nft_ct: fix l3num expectations with inet pseudo family
  netfilter: nf_tables: reject constant set with timeout
  netfilter: nf_tables: disallow anonymous set with timeout flag
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
plbossart pushed a commit that referenced this pull request Apr 8, 2024
As for ice bug fixed by commit b7306b4 ("ice: manage interrupts
during poll exit") followed by commit 23be707 ("ice: fix software
generating extra interrupts") I'm seeing the similar issue also with
i40e driver.

In certain situation when busy-loop is enabled together with adaptive
coalescing, the driver occasionally misses that there are outstanding
descriptors to clean when exiting busy poll.

Try to catch the remaining work by triggering a software interrupt
when exiting busy poll. No extra interrupts will be generated when
busy polling is not used.

The issue was found when running sockperf ping-pong tcp test with
adaptive coalescing and busy poll enabled (50 as value busy_pool
and busy_read sysctl knobs) and results in huge latency spikes
with more than 100000us.

The fix is inspired from the ice driver and do the following:
1) During napi poll exit in case of busy-poll (napo_complete_done()
   returns false) this is recorded to q_vector that we were in busy
   loop.
2) Extends i40e_buildreg_itr() to be able to add an enforced software
   interrupt into built value
2) In i40e_update_enable_itr() enforces a software interrupt trigger
   if we are exiting busy poll to catch any pending clean-ups
3) Reuses unused 3rd ITR (interrupt throttle) index and set it to
   20K interrupts per second to limit the number of these sw interrupts.

Test results
============
Prior:
[root@dell-per640-07 net]# sockperf ping-pong -i 10.9.9.1 --tcp -m 1000 --mps=max -t 120
sockperf: == version #3.10-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.9.9.1        PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=119.999 sec; Warm up time=400 msec; SentMessages=2438563; ReceivedMessages=2438562
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=119.549 sec; SentMessages=2429473; ReceivedMessages=2429473
sockperf: ====> avg-latency=24.571 (std-dev=93.297, mean-ad=4.904, median-ad=1.510, siqr=1.063, cv=3.797, std-error=0.060, 99.0% ci=[24.417, 24.725])
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 24.571 usec
sockperf: Total 2429473 observations; each percentile contains 24294.73 observations
sockperf: ---> <MAX> observation = 103294.331
sockperf: ---> percentile 99.999 =   45.633
sockperf: ---> percentile 99.990 =   37.013
sockperf: ---> percentile 99.900 =   35.910
sockperf: ---> percentile 99.000 =   33.390
sockperf: ---> percentile 90.000 =   28.626
sockperf: ---> percentile 75.000 =   27.741
sockperf: ---> percentile 50.000 =   26.743
sockperf: ---> percentile 25.000 =   25.614
sockperf: ---> <MIN> observation =   12.220

After:
[root@dell-per640-07 net]# sockperf ping-pong -i 10.9.9.1 --tcp -m 1000 --mps=max -t 120
sockperf: == version #3.10-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.9.9.1        PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=119.999 sec; Warm up time=400 msec; SentMessages=2400055; ReceivedMessages=2400054
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=119.549 sec; SentMessages=2391186; ReceivedMessages=2391186
sockperf: ====> avg-latency=24.965 (std-dev=5.934, mean-ad=4.642, median-ad=1.485, siqr=1.067, cv=0.238, std-error=0.004, 99.0% ci=[24.955, 24.975])
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 24.965 usec
sockperf: Total 2391186 observations; each percentile contains 23911.86 observations
sockperf: ---> <MAX> observation =  195.841
sockperf: ---> percentile 99.999 =   45.026
sockperf: ---> percentile 99.990 =   39.009
sockperf: ---> percentile 99.900 =   35.922
sockperf: ---> percentile 99.000 =   33.482
sockperf: ---> percentile 90.000 =   28.902
sockperf: ---> percentile 75.000 =   27.821
sockperf: ---> percentile 50.000 =   26.860
sockperf: ---> percentile 25.000 =   25.685
sockperf: ---> <MIN> observation =   12.277

Fixes: 0bcd952 ("ethernet/intel: consolidate NAPI and NAPI exit")
Reported-by: Hugo Ferreira <[email protected]>
Reviewed-by: Michal Schmidt <[email protected]>
Signed-off-by: Ivan Vecera <[email protected]>
Reviewed-by: Jesse Brandeburg <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
plbossart pushed a commit that referenced this pull request Apr 8, 2024
The driver creates /sys/kernel/debug/dri/0/mob_ttm even when the
corresponding ttm_resource_manager is not allocated.
This leads to a crash when trying to read from this file.

Add a check to create mob_ttm, system_mob_ttm, and gmr_ttm debug file
only when the corresponding ttm_resource_manager is allocated.

crash> bt
PID: 3133409  TASK: ffff8fe4834a5000  CPU: 3    COMMAND: "grep"
 #0 [ffffb954506b3b20] machine_kexec at ffffffffb2a6bec3
 #1 [ffffb954506b3b78] __crash_kexec at ffffffffb2bb598a
 #2 [ffffb954506b3c38] crash_kexec at ffffffffb2bb68c1
 #3 [ffffb954506b3c50] oops_end at ffffffffb2a2a9b1
 #4 [ffffb954506b3c70] no_context at ffffffffb2a7e913
 #5 [ffffb954506b3cc8] __bad_area_nosemaphore at ffffffffb2a7ec8c
 #6 [ffffb954506b3d10] do_page_fault at ffffffffb2a7f887
 #7 [ffffb954506b3d40] page_fault at ffffffffb360116e
    [exception RIP: ttm_resource_manager_debug+0x11]
    RIP: ffffffffc04afd11  RSP: ffffb954506b3df0  RFLAGS: 00010246
    RAX: ffff8fe41a6d1200  RBX: 0000000000000000  RCX: 0000000000000940
    RDX: 0000000000000000  RSI: ffffffffc04b4338  RDI: 0000000000000000
    RBP: ffffb954506b3e08   R8: ffff8fee3ffad000   R9: 0000000000000000
    R10: ffff8fe41a76a000  R11: 0000000000000001  R12: 00000000ffffffff
    R13: 0000000000000001  R14: ffff8fe5bb6f3900  R15: ffff8fe41a6d1200
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffb954506b3e00] ttm_resource_manager_show at ffffffffc04afde7 [ttm]
 #9 [ffffb954506b3e30] seq_read at ffffffffb2d8f9f3
    RIP: 00007f4c4eda8985  RSP: 00007ffdbba9e9f8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 000000000037e000  RCX: 00007f4c4eda8985
    RDX: 000000000037e000  RSI: 00007f4c41573000  RDI: 0000000000000003
    RBP: 000000000037e000   R8: 0000000000000000   R9: 000000000037fe30
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f4c41573000
    R13: 0000000000000003  R14: 00007f4c41572010  R15: 0000000000000003
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

Signed-off-by: Jocelyn Falempe <[email protected]>
Fixes: af4a25b ("drm/vmwgfx: Add debugfs entries for various ttm resource managers")
Cc: <[email protected]>
Reviewed-by: Zack Rusin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
plbossart pushed a commit that referenced this pull request Apr 8, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 reject destroy chain command to delete device hooks in netdev
         family, hence, only delchain commands are allowed.

Patch #2 reject table flag update interference with netdev basechain
	 hook updates, this can leave hooks in inconsistent
	 registration/unregistration state.

Patch #3 do not unregister netdev basechain hooks if table is dormant.
	 Otherwise, splat with double unregistration is possible.

Patch #4 fixes Kconfig to allow to restore IP_NF_ARPTABLES,
	 from Kuniyuki Iwashima.

There are a more fixes still in progress on my side that need more work.

* tag 'nf-24-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c
  netfilter: nf_tables: skip netdev hook unregistration if table is dormant
  netfilter: nf_tables: reject table flag and netdev basechain updates
  netfilter: nf_tables: reject destroy command to remove basechain hooks
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
plbossart pushed a commit that referenced this pull request Apr 8, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 unlike early commit path stage which triggers a call to abort,
         an explicit release of the batch is required on abort, otherwise
         mutex is released and commit_list remains in place.

Patch #2 release mutex after nft_gc_seq_end() in commit path, otherwise
         async GC worker could collect expired objects.

Patch #3 flush pending destroy work in module removal path, otherwise UaF
         is possible.

Patch #4 and #6 restrict the table dormant flag with basechain updates
	 to fix state inconsistency in the hook registration.

Patch #5 adds missing RCU read side lock to flowtable type to avoid races
	 with module removal.

* tag 'nf-24-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
At current x1e80100 interface table, interface #3 is wrongly
connected to DP controller #0 and interface #4 wrongly connected
to DP controller #2. Fix this problem by connect Interface #3 to
DP controller #0 and interface #4 connect to DP controller #1.
Also add interface #6, #7 and #8 connections to DP controller to
complete x1e80100 interface table.

Changs in V3:
-- add v2 changes log

Changs in V2:
-- add x1e80100 to subject
-- add Fixes

Fixes: e3b1f36 ("drm/msm/dpu: Add X1E80100 support")
Signed-off-by: Kuogee Hsieh <[email protected]>
Reviewed-by: Abhinav Kumar <[email protected]>
Reviewed-by: Abel Vesa <[email protected]>
Patchwork: https://patchwork.freedesktop.org/patch/585549/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Abhinav Kumar <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
Drop support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken without an obvious/easy path forward, and because
exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
host kernel addresses to the guest.

Bug #1 is that KVM doesn't account for the upper 32 bits of
IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g
fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters()
stores local variables as u8s and truncates the upper bits too, etc.

Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value
for PEBS events, perf will _always_ generate an adaptive record, even if
the guest requested a basic record.  Note, KVM will also enable adaptive
PEBS in individual *counter*, even if adaptive PEBS isn't exposed to the
guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero,
i.e. the guest will only ever see Basic records.

Bug #3 is in perf.  intel_pmu_disable_fixed() doesn't clear the upper
bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and
intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE
either.  I.e. perf _always_ enables ADAPTIVE counters, regardless of what
KVM requests.

Bug #4 is that adaptive PEBS *might* effectively bypass event filters set
by the host, as "Updated Memory Access Info Group" records information
that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER.

Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least
zeros) when entering a vCPU with adaptive PEBS, which allows the guest
to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries"
records.

Disable adaptive PEBS support as an immediate fix due to the severity of
the LBR leak in particular, and because fixing all of the bugs will be
non-trivial, e.g. not suitable for backporting to stable kernels.

Note!  This will break live migration, but trying to make KVM play nice
with live migration would be quite complicated, wouldn't be guaranteed to
work (i.e. KVM might still kill/confuse the guest), and it's not clear
that there are any publicly available VMMs that support adaptive PEBS,
let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't
support PEBS in any capacity.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Fixes: c59a1f1 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: [email protected]
Cc: Like Xu <[email protected]>
Cc: Mingwei Zhang <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Cc: Zhang Xiong <[email protected]>
Cc: Lv Zhiyuan <[email protected]>
Cc: Dapeng Mi <[email protected]>
Cc: Jim Mattson <[email protected]>
Acked-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
…git/netfilter/nf

netfilter pull request 24-04-11

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patches #1 and #2 add missing rcu read side lock when iterating over
expression and object type list which could race with module removal.

Patch #3 prevents promisc packet from visiting the bridge/input hook
	 to amend a recent fix to address conntrack confirmation race
	 in br_netfilter and nf_conntrack_bridge.

Patch #4 adds and uses iterate decorator type to fetch the current
	 pipapo set backend datastructure view when netlink dumps the
	 set elements.

Patch #5 fixes removal of duplicate elements in the pipapo set backend.

Patch #6 flowtable validates pppoe header before accessing it.

Patch #7 fixes flowtable datapath for pppoe packets, otherwise lookup
         fails and pppoe packets follow classic path.
====================

Signed-off-by: David S. Miller <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
When I did hard offline test with hugetlb pages, below deadlock occurs:

======================================================
WARNING: possible circular locking dependency detected
6.8.0-11409-gf6cef5f8c37f #1 Not tainted
------------------------------------------------------
bash/46904 is trying to acquire lock:
ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60

but task is already holding lock:
ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
       __mutex_lock+0x6c/0x770
       page_alloc_cpu_online+0x3c/0x70
       cpuhp_invoke_callback+0x397/0x5f0
       __cpuhp_invoke_callback_range+0x71/0xe0
       _cpu_up+0xeb/0x210
       cpu_up+0x91/0xe0
       cpuhp_bringup_mask+0x49/0xb0
       bringup_nonboot_cpus+0xb7/0xe0
       smp_init+0x25/0xa0
       kernel_init_freeable+0x15f/0x3e0
       kernel_init+0x15/0x1b0
       ret_from_fork+0x2f/0x50
       ret_from_fork_asm+0x1a/0x30

-> #0 (cpu_hotplug_lock){++++}-{0:0}:
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pcp_batch_high_lock);
                               lock(cpu_hotplug_lock);
                               lock(pcp_batch_high_lock);
  rlock(cpu_hotplug_lock);

 *** DEADLOCK ***

5 locks held by bash/46904:
 #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
 #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
 #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
 #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

stack backtrace:
CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0xa0
 check_noncircular+0x129/0x140
 __lock_acquire+0x1298/0x1cd0
 lock_acquire+0xc0/0x2b0
 cpus_read_lock+0x2a/0xc0
 static_key_slow_dec+0x16/0x60
 __hugetlb_vmemmap_restore_folio+0x1b9/0x200
 dissolve_free_huge_page+0x211/0x260
 __page_handle_poison+0x45/0xc0
 memory_failure+0x65e/0xc70
 hard_offline_page_store+0x55/0xa0
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x387/0x550
 ksys_write+0x64/0xe0
 do_syscall_64+0xca/0x1e0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7fc862314887
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00

In short, below scene breaks the lock dependency chain:

 memory_failure
  __page_handle_poison
   zone_pcp_disable -- lock(pcp_batch_high_lock)
   dissolve_free_huge_page
    __hugetlb_vmemmap_restore_folio
     static_key_slow_dec
      cpus_read_lock -- rlock(cpu_hotplug_lock)

Fix this by calling drain_all_pages() instead.

This issue won't occur until commit a6b4085 ("mm: hugetlb: replace
hugetlb_free_vmemmap_enabled with a static_key").  As it introduced
rlock(cpu_hotplug_lock) in dissolve_free_huge_page() code path while
lock(pcp_batch_high_lock) is already in the __page_handle_poison().

[[email protected]: extend comment per Oscar]
[[email protected]: reflow block comment]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: a6b4085 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.

net_ratelimit mechanism can be used to limit the dumping rate.

PID: 33036    TASK: ffff949da6f20000  CPU: 23   COMMAND: "vhost-32980"
 #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
 #3 [fffffe00003fced0] do_nmi at ffffffff8922660d
 #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
    [exception RIP: io_serial_in+20]
    RIP: ffffffff89792594  RSP: ffffa655314979e8  RFLAGS: 00000002
    RAX: ffffffff89792500  RBX: ffffffff8af428a0  RCX: 0000000000000000
    RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff8af428a0
    RBP: 0000000000002710   R8: 0000000000000004   R9: 000000000000000f
    R10: 0000000000000000  R11: ffffffff8acbf64f  R12: 0000000000000020
    R13: ffffffff8acbf698  R14: 0000000000000058  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594
 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124
 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
 #12 [ffffa65531497b68] printk at ffffffff89318306
 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
 #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
 #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
 #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
 #18 [ffffa65531497f10] kthread at ffffffff892d2e72
 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f

Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 amends a missing spot where the set iterator type is unset.
	 This is fixing a issue in the previous pull request.

Patch #2 fixes the delete set command abort path by restoring state
         of the elements. Reverse logic for the activate (abort) case
	 otherwise element state is not restored, this requires to move
	 the check for active/inactive elements to the set iterator
	 callback. From the deactivate path, toggle the next generation
	 bit and from the activate (abort) path, clear the next generation
	 bitmask.

Patch #3 skips elements already restored by delete set command from the
	 abort path in case there is a previous delete element command in
	 the batch. Check for the next generation bit just like it is done
	 via set iteration to restore maps.

netfilter pull request 24-04-18

* tag 'nf-24-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: fix memleak in map from abort path
  netfilter: nf_tables: restore set elements when delete set fails
  netfilter: nf_tables: missing iterator type in lookup walk
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
Petr Machata says:

====================
mlxsw: Fixes

This patchset fixes the following issues:

- During driver de-initialization the driver unregisters the EMAD
  response trap by setting its action to DISCARD. However the manual
  only permits TRAP and FORWARD, and future firmware versions will
  enforce this.

  In patch #1, suppress the error message by aligning the driver to the
  manual and use a FORWARD (NOP) action when unregistering the trap.

- The driver queries the Management Capabilities Mask (MCAM) register
  during initialization to understand if certain features are supported.

  However, not all firmware versions support this register, leading to
  the driver failing to load.

  Patches #2 and #3 fix this issue by treating an error in the register
  query as an indication that the feature is not supported.

v2:
- Patch #2:
    - Make mlxsw_env_max_module_eeprom_len_query() void
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
9f74a3d ("ice: Fix VF Reset paths when interface in a failed over
aggregate"), the ice driver has acquired the LAG mutex in ice_reset_vf().
The commit placed this lock acquisition just prior to the acquisition of
the VF configuration lock.

If ice_reset_vf() acquires the configuration lock via the ICE_VF_RESET_LOCK
flag, this could deadlock with ice_vc_cfg_qs_msg() because it always
acquires the locks in the order of the VF configuration lock and then the
LAG mutex.

Lockdep reports this violation almost immediately on creating and then
removing 2 VF:

======================================================
WARNING: possible circular locking dependency detected
6.8.0-rc6 torvalds#54 Tainted: G        W  O
------------------------------------------------------
kworker/60:3/6771 is trying to acquire lock:
ff40d43e099380a0 (&vf->cfg_lock){+.+.}-{3:3}, at: ice_reset_vf+0x22f/0x4d0 [ice]

but task is already holding lock:
ff40d43ea1961210 (&pf->lag_mutex){+.+.}-{3:3}, at: ice_reset_vf+0xb7/0x4d0 [ice]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&pf->lag_mutex){+.+.}-{3:3}:
       __lock_acquire+0x4f8/0xb40
       lock_acquire+0xd4/0x2d0
       __mutex_lock+0x9b/0xbf0
       ice_vc_cfg_qs_msg+0x45/0x690 [ice]
       ice_vc_process_vf_msg+0x4f5/0x870 [ice]
       __ice_clean_ctrlq+0x2b5/0x600 [ice]
       ice_service_task+0x2c9/0x480 [ice]
       process_one_work+0x1e9/0x4d0
       worker_thread+0x1e1/0x3d0
       kthread+0x104/0x140
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x1b/0x30

-> #0 (&vf->cfg_lock){+.+.}-{3:3}:
       check_prev_add+0xe2/0xc50
       validate_chain+0x558/0x800
       __lock_acquire+0x4f8/0xb40
       lock_acquire+0xd4/0x2d0
       __mutex_lock+0x9b/0xbf0
       ice_reset_vf+0x22f/0x4d0 [ice]
       ice_process_vflr_event+0x98/0xd0 [ice]
       ice_service_task+0x1cc/0x480 [ice]
       process_one_work+0x1e9/0x4d0
       worker_thread+0x1e1/0x3d0
       kthread+0x104/0x140
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x1b/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&pf->lag_mutex);
                               lock(&vf->cfg_lock);
                               lock(&pf->lag_mutex);
  lock(&vf->cfg_lock);

 *** DEADLOCK ***
4 locks held by kworker/60:3/6771:
 #0: ff40d43e05428b38 ((wq_completion)ice){+.+.}-{0:0}, at: process_one_work+0x176/0x4d0
 #1: ff50d06e05197e58 ((work_completion)(&pf->serv_task)){+.+.}-{0:0}, at: process_one_work+0x176/0x4d0
 #2: ff40d43ea1960e50 (&pf->vfs.table_lock){+.+.}-{3:3}, at: ice_process_vflr_event+0x48/0xd0 [ice]
 #3: ff40d43ea1961210 (&pf->lag_mutex){+.+.}-{3:3}, at: ice_reset_vf+0xb7/0x4d0 [ice]

stack backtrace:
CPU: 60 PID: 6771 Comm: kworker/60:3 Tainted: G        W  O       6.8.0-rc6 torvalds#54
Hardware name:
Workqueue: ice ice_service_task [ice]
Call Trace:
 <TASK>
 dump_stack_lvl+0x4a/0x80
 check_noncircular+0x12d/0x150
 check_prev_add+0xe2/0xc50
 ? save_trace+0x59/0x230
 ? add_chain_cache+0x109/0x450
 validate_chain+0x558/0x800
 __lock_acquire+0x4f8/0xb40
 ? lockdep_hardirqs_on+0x7d/0x100
 lock_acquire+0xd4/0x2d0
 ? ice_reset_vf+0x22f/0x4d0 [ice]
 ? lock_is_held_type+0xc7/0x120
 __mutex_lock+0x9b/0xbf0
 ? ice_reset_vf+0x22f/0x4d0 [ice]
 ? ice_reset_vf+0x22f/0x4d0 [ice]
 ? rcu_is_watching+0x11/0x50
 ? ice_reset_vf+0x22f/0x4d0 [ice]
 ice_reset_vf+0x22f/0x4d0 [ice]
 ? process_one_work+0x176/0x4d0
 ice_process_vflr_event+0x98/0xd0 [ice]
 ice_service_task+0x1cc/0x480 [ice]
 process_one_work+0x1e9/0x4d0
 worker_thread+0x1e1/0x3d0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x104/0x140
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

To avoid deadlock, we must acquire the LAG mutex only after acquiring the
VF configuration lock. Fix the ice_reset_vf() to acquire the LAG mutex only
after we either acquire or check that the VF configuration lock is held.

Fixes: 9f74a3d ("ice: Fix VF Reset paths when interface in a failed over aggregate")
Signed-off-by: Jacob Keller <[email protected]>
Reviewed-by: Dave Ertman <[email protected]>
Reviewed-by: Mateusz Polchlopek <[email protected]>
Tested-by: Przemek Kitszel <[email protected]>
Tested-by: Rafal Romanowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
plbossart pushed a commit that referenced this pull request May 3, 2024
Merge series from Jerome Brunet <[email protected]>:

This patchset fixes 2 problems on TDM which both find a solution
by properly implementing the .trigger() callback for the TDM backend.

ATM, enabling the TDM formatters is done by the .prepare() callback
because handling the formatter is slow due to necessary calls to CCF.

The first problem affects the TDMIN. Because .prepare() is called on DPCM
backend first, the formatter are started before the FIFOs and this may
cause a random channel shifts if the TDMIN use multiple lanes with more
than 2 slots per lanes. Using trigger() allows to set the FE/BE order,
solving the problem.

There has already been an attempt to fix this 3y ago [1] and reverted [2]
It triggered a 'sleep in irq' error on the period IRQ. The solution is
to just use the bottom half of threaded IRQ. This is patch #1. Patch #2
and #3 remain mostly the same as 3y ago.

For TDMOUT, the problem is on pause. ATM pause only stops the FIFO and
the TDMOUT just starves. When it does, it will actually repeat the last
sample continuously. Depending on the platform, if there is no high-pass
filter on the analog path, this may translate to a constant position of
the speaker membrane. There is no audible glitch but it may damage the
speaker coil.

Properly stopping the TDMOUT in pause solves the problem. There is
behaviour change associated with that fix. Clocks used to be continuous
on pause because of the problem above. They will now be gated on pause by
default, as they should. The last change introduce the proper support for
continuous clocks, if needed.

[1]: https://lore.kernel.org/linux-amlogic/[email protected]
[2]: https://lore.kernel.org/linux-amlogic/[email protected]
plbossart pushed a commit that referenced this pull request Jun 12, 2024
Andrii Nakryiko says:

====================
Fix BPF multi-uprobe PID filtering logic

It turns out that current implementation of multi-uprobe PID filtering logic
is broken. It filters by thread, while the promise is filtering by process.
Patch #1 fixes the logic trivially. The rest is testing and mitigations that
are necessary for libbpf to not break users of USDT programs.

v1->v2:
  - fix selftest in last patch (CI);
  - use semicolon in patch #3 (Jiri).
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
plbossart pushed a commit that referenced this pull request Jun 12, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 syzbot reports that nf_reinject() could be called without
         rcu_read_lock() when flushing pending packets at nfnetlink
         queue removal, from Eric Dumazet.

Patch #2 flushes ipset list:set when canceling garbage collection to
         reference to other lists to fix a race, from Jozsef Kadlecsik.

Patch #3 restores q-in-q matching with nft_payload by reverting
         f6ae9f1 ("netfilter: nft_payload: add C-VLAN support").

Patch #4 fixes vlan mangling in skbuff when vlan offload is present
         in skbuff, without this patch nft_payload corrupts packets
         in this case.

Patch #5 fixes possible nul-deref in tproxy no IP address is found in
         netdevice, reported by syzbot and patch from Florian Westphal.

Patch #6 removes a superfluous restriction which prevents loose fib
         lookups from input and forward hooks, from Eric Garver.

My assessment is that patches #1, #2 and #5 address possible kernel
crash, anything else in this batch fixes broken features.

netfilter pull request 24-05-29

* tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_fib: allow from forward/input without iif selector
  netfilter: tproxy: bail out if IP has been disabled on the device
  netfilter: nft_payload: skbuff vlan metadata mangle support
  netfilter: nft_payload: restore vlan q-in-q match support
  netfilter: ipset: Add list flush to cancel_gc
  netfilter: nfnetlink_queue: acquire rcu_read_lock() in instance_destroy_rcu()
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
plbossart pushed a commit that referenced this pull request Jun 24, 2024
With commit c4cb231 ("iommu/amd: Add support for enable/disable IOPF")
we are hitting below issue. This happens because in IOPF enablement path
it holds spin lock with irq disable and then tries to take mutex lock.

dmesg:
-----
[    0.938739] =============================
[    0.938740] [ BUG: Invalid wait context ]
[    0.938742] 6.10.0-rc1+ #1 Not tainted
[    0.938745] -----------------------------
[    0.938746] swapper/0/1 is trying to lock:
[    0.938748] ffffffff8c9f01d8 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x78/0x4a0
[    0.938767] other info that might help us debug this:
[    0.938768] context-{5:5}
[    0.938769] 7 locks held by swapper/0/1:
[    0.938772]  #0: ffff888101a91310 (&group->mutex){+.+.}-{4:4}, at: bus_iommu_probe+0x70/0x160
[    0.938790]  #1: ffff888101d1f1b8 (&domain->lock){....}-{3:3}, at: amd_iommu_attach_device+0xa5/0x700
[    0.938799]  #2: ffff888101cc3d18 (&dev_data->lock){....}-{3:3}, at: amd_iommu_attach_device+0xc5/0x700
[    0.938806]  #3: ffff888100052830 (&iommu->lock){....}-{2:2}, at: amd_iommu_iopf_add_device+0x3f/0xa0
[    0.938813]  #4: ffffffff8945a340 (console_lock){+.+.}-{0:0}, at: _printk+0x48/0x50
[    0.938822]  #5: ffffffff8945a390 (console_srcu){....}-{0:0}, at: console_flush_all+0x58/0x4e0
[    0.938867]  #6: ffffffff82459f80 (console_owner){....}-{0:0}, at: console_flush_all+0x1f0/0x4e0
[    0.938872] stack backtrace:
[    0.938874] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.10.0-rc1+ #1
[    0.938877] Hardware name: HP HP EliteBook 745 G3/807E, BIOS N73 Ver. 01.39 04/16/2019

Fix above issue by re-arranging code in attach device path:
  - move device PASID/IOPF enablement outside lock in AMD IOMMU driver.
    This is safe as core layer holds group->mutex lock before calling
    iommu_ops->attach_dev.

Reported-by: Borislav Petkov <[email protected]>
Reported-by: Mikhail Gavrilov <[email protected]>
Reported-by: Chris Bainbridge <[email protected]>
Fixes: c4cb231 ("iommu/amd: Add support for enable/disable IOPF")
Tested-by: Borislav Petkov <[email protected]>
Tested-by: Chris Bainbridge <[email protected]>
Tested-by: Mikhail Gavrilov <[email protected]>
Signed-off-by: Vasant Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>
plbossart pushed a commit that referenced this pull request Jun 24, 2024
…PLES event"

This reverts commit 7d1405c.

This causes segfaults in some cases, as reported by Milian:

  ```
  sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e
  raw_syscalls:sys_enter ls
  ...
  [ perf record: Woken up 3 times to write data ]
  malloc(): invalid next size (unsorted)
  Aborted
  ```

  Backtrace with GDB + debuginfod:

  ```
  malloc(): invalid next size (unsorted)

  Thread 1 "perf" received signal SIGABRT, Aborted.
  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
  no_tid=no_tid@entry=0) at pthread_kill.c:44
  Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c
  44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO
  (ret) : 0;
  (gdb) bt
  #0  __pthread_kill_implementation (threadid=<optimized out>,
  signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
  #1  0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>,
  signo=6) at pthread_kill.c:78
  #2  0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/
  raise.c:26
  #3  0x00007ffff6e384c3 in __GI_abort () at abort.c:79
  #4  0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea
  "%s\n") at ../sysdeps/posix/libc_fatal.c:132
  #5  0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850
  "malloc(): invalid next size (unsorted)") at malloc.c:5772
  #6  0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0
  <main_arena>, bytes=bytes@entry=368) at malloc.c:4081
  #7  0x00007ffff6eb877e in __libc_calloc (n=<optimized out>,
  elem_size=<optimized out>) at malloc.c:3754
  #8  0x000055555569bdb6 in perf_session.do_write_header ()
  #9  0x00005555555a373a in __cmd_record.constprop.0 ()
  #10 0x00005555555a6846 in cmd_record ()
  #11 0x000055555564db7f in run_builtin ()
  #12 0x000055555558ed77 in main ()
  ```

  Valgrind memcheck:
  ```
  ==45136== Invalid write of size 8
  ==45136==    at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf)
  ==45136==    by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
  ==45136== Syscall param write(buf) points to unaddressable byte(s)
  ==45136==    at 0x575953D: __libc_write (write.c:26)
  ==45136==    by 0x575953D: write (write.c:24)
  ==45136==    by 0x35761F: ion (in /usr/bin/perf)
  ==45136==    by 0x357778: writen (in /usr/bin/perf)
  ==45136==    by 0x1548F7: record__write (in /usr/bin/perf)
  ==45136==    by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
 -----

Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/
Reported-by: Milian Wolff <[email protected]>
Tested-by: Milian Wolff <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected] # 6.8+
Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
plbossart pushed a commit that referenced this pull request Jun 24, 2024
We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

CC: [email protected] # 6.1+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: David Sterba <[email protected]>
plbossart pushed a commit that referenced this pull request Jun 24, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 fixes insufficient sanitization of netlink attributes for the
	 inner expression which can trigger nul-pointer dereference,
	 from Davide Ornaghi.

Patch #2 address a report that there is a race condition between
         namespace cleanup and the garbage collection of the list:set
         type. This patch resolves this issue with other minor issues
	 as well, from Jozsef Kadlecsik.

Patch #3 ip6_route_me_harder() ignores flowlabel/dsfield when ip dscp
	 has been mangled, this unbreaks ip6 dscp set $v,
	 from Florian Westphal.

All of these patches address issues that are present in several releases.

* tag 'nf-24-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: Use flowlabel flow key when re-routing mangled packets
  netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type
  netfilter: nft_inner: validate mandatory meta and payload
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
plbossart pushed a commit that referenced this pull request Jun 24, 2024
Nikolay Aleksandrov says:

====================
net: bridge: mst: fix suspicious rcu usage warning

This set fixes a suspicious RCU usage warning triggered by syzbot[1] in
the bridge's MST code. After I converted br_mst_set_state to RCU, I
forgot to update the vlan group dereference helper. Fix it by using
the proper helper, in order to do that we need to pass the vlan group
which is already obtained correctly by the callers for their respective
context. Patch 01 is a requirement for the fix in patch 02.

Note I did consider rcu_dereference_rtnl() but the churn is much bigger
and in every part of the bridge. We can do that as a cleanup in
net-next.

[1] https://syzkaller.appspot.com/bug?extid=9bbe2de1bc9d470eb5fe
 =============================
 WARNING: suspicious RCU usage
 6.10.0-rc2-syzkaller-00235-g8a92980606e3 #0 Not tainted
 -----------------------------
 net/bridge/br_private.h:1599 suspicious rcu_dereference_protected() usage!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 4 locks held by syz-executor.1/5374:
  #0: ffff888022d50b18 (&mm->mmap_lock){++++}-{3:3}, at: mmap_read_lock include/linux/mmap_lock.h:144 [inline]
  #0: ffff888022d50b18 (&mm->mmap_lock){++++}-{3:3}, at: __mm_populate+0x1b0/0x460 mm/gup.c:2111
  #1: ffffc90000a18c00 ((&p->forward_delay_timer)){+.-.}-{0:0}, at: call_timer_fn+0xc0/0x650 kernel/time/timer.c:1789
  #2: ffff88805fb2ccb8 (&br->lock){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  #2: ffff88805fb2ccb8 (&br->lock){+.-.}-{2:2}, at: br_forward_delay_timer_expired+0x50/0x440 net/bridge/br_stp_timer.c:86
  #3: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
  #3: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
  #3: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: br_mst_set_state+0x171/0x7a0 net/bridge/br_mst.c:105

 stack backtrace:
 CPU: 1 PID: 5374 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00235-g8a92980606e3 #0
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
 Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
  lockdep_rcu_suspicious+0x221/0x340 kernel/locking/lockdep.c:6712
  nbp_vlan_group net/bridge/br_private.h:1599 [inline]
  br_mst_set_state+0x29e/0x7a0 net/bridge/br_mst.c:106
  br_set_state+0x28a/0x7b0 net/bridge/br_stp.c:47
  br_forward_delay_timer_expired+0x176/0x440 net/bridge/br_stp_timer.c:88
  call_timer_fn+0x18e/0x650 kernel/time/timer.c:1792
  expire_timers kernel/time/timer.c:1843 [inline]
  __run_timers kernel/time/timer.c:2417 [inline]
  __run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2428
  run_timer_base kernel/time/timer.c:2437 [inline]
  run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2447
  handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
  __do_softirq kernel/softirq.c:588 [inline]
  invoke_softirq kernel/softirq.c:428 [inline]
  __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
  irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
  instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
  sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
  </IRQ>
  <TASK>
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
plbossart pushed a commit that referenced this pull request Jun 24, 2024
The syzbot fuzzer found that the interrupt-URB completion callback in
the cdc-wdm driver was taking too long, and the driver's immediate
resubmission of interrupt URBs with -EPROTO status combined with the
dummy-hcd emulation to cause a CPU lockup:

cdc_wdm 1-1:1.0: nonzero urb status received: -71
cdc_wdm 1-1:1.0: wdm_int_callback - 0 bytes
watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [syz-executor782:6625]
CPU#0 Utilization every 4s during lockup:
	#1:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#2:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#3:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#4:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#5:  98% system,	  1% softirq,	  3% hardirq,	  0% idle
Modules linked in:
irq event stamp: 73096
hardirqs last  enabled at (73095): [<ffff80008037bc00>] console_emit_next_record kernel/printk/printk.c:2935 [inline]
hardirqs last  enabled at (73095): [<ffff80008037bc00>] console_flush_all+0x650/0xb74 kernel/printk/printk.c:2994
hardirqs last disabled at (73096): [<ffff80008af10b00>] __el1_irq arch/arm64/kernel/entry-common.c:533 [inline]
hardirqs last disabled at (73096): [<ffff80008af10b00>] el1_interrupt+0x24/0x68 arch/arm64/kernel/entry-common.c:551
softirqs last  enabled at (73048): [<ffff8000801ea530>] softirq_handle_end kernel/softirq.c:400 [inline]
softirqs last  enabled at (73048): [<ffff8000801ea530>] handle_softirqs+0xa60/0xc34 kernel/softirq.c:582
softirqs last disabled at (73043): [<ffff800080020de8>] __do_softirq+0x14/0x20 kernel/softirq.c:588
CPU: 0 PID: 6625 Comm: syz-executor782 Tainted: G        W          6.10.0-rc2-syzkaller-g8867bbd4a056 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024

Testing showed that the problem did not occur if the two error
messages -- the first two lines above -- were removed; apparently adding
material to the kernel log takes a surprisingly large amount of time.

In any case, the best approach for preventing these lockups and to
avoid spamming the log with thousands of error messages per second is
to ratelimit the two dev_err() calls.  Therefore we replace them with
dev_err_ratelimited().

Signed-off-by: Alan Stern <[email protected]>
Suggested-by: Greg KH <[email protected]>
Reported-and-tested-by: [email protected]
Closes: https://lore.kernel.org/linux-usb/[email protected]/
Reported-and-tested-by: [email protected]
Closes: https://lore.kernel.org/linux-usb/[email protected]/
Fixes: 9908a32 ("USB: remove err() macro from usb class drivers")
Link: https://lore.kernel.org/linux-usb/[email protected]/
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
plbossart pushed a commit that referenced this pull request Jul 15, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 fixes the suspicious RCU usage warning that resulted from the
	 recent fix for the race between namespace cleanup and gc in
	 ipset left out checking the pernet exit phase when calling
	 rcu_dereference_protected(), from Jozsef Kadlecsik.

Patch #2 fixes incorrect input and output netdevice in SRv6 prerouting
	 hooks, from Jianguo Wu.

Patch #3 moves nf_hooks_lwtunnel sysctl toggle to the netfilter core.
	 The connection tracking system is loaded on-demand, this
	 ensures availability of this knob regardless.

Patch #4-#5 adds selftests for SRv6 netfilter hooks also from Jianguo Wu.

netfilter pull request 24-06-19

* tag 'nf-24-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: add selftest for the SRv6 End.DX6 behavior with netfilter
  selftests: add selftest for the SRv6 End.DX4 behavior with netfilter
  netfilter: move the sysctl nf_hooks_lwtunnel into the netfilter core
  seg6: fix parameter passing when calling NF_HOOK() in End.DX4 and End.DX6 behaviors
  netfilter: ipset: Fix suspicious rcu_dereference_protected()
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
plbossart pushed a commit that referenced this pull request Sep 12, 2024
When l2tp tunnels use a socket provided by userspace, we can hit
lockdep splats like the below when data is transmitted through another
(unrelated) userspace socket which then gets routed over l2tp.

This issue was previously discussed here:
https://lore.kernel.org/netdev/[email protected]/

The solution is to have lockdep treat socket locks of l2tp tunnel
sockets separately than those of standard INET sockets. To do so, use
a different lockdep subclass where lock nesting is possible.

  ============================================
  WARNING: possible recursive locking detected
  6.10.0+ torvalds#34 Not tainted
  --------------------------------------------
  iperf3/771 is trying to acquire lock:
  ffff8881027601d8 (slock-AF_INET/1){+.-.}-{2:2}, at: l2tp_xmit_skb+0x243/0x9d0

  but task is already holding lock:
  ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(slock-AF_INET/1);
    lock(slock-AF_INET/1);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  10 locks held by iperf3/771:
   #0: ffff888102650258 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendmsg+0x1a/0x40
   #1: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
   #2: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
   #3: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x28b/0x9f0
   #4: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0xf9/0x260
   #5: ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10
   #6: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
   #7: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
   #8: ffffffff822ac1e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0xcc/0x1450
   #9: ffff888101f33258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock#2){+...}-{2:2}, at: __dev_queue_xmit+0x513/0x1450

  stack backtrace:
  CPU: 2 UID: 0 PID: 771 Comm: iperf3 Not tainted 6.10.0+ torvalds#34
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x69/0xa0
   dump_stack+0xc/0x20
   __lock_acquire+0x135d/0x2600
   ? srso_alias_return_thunk+0x5/0xfbef5
   lock_acquire+0xc4/0x2a0
   ? l2tp_xmit_skb+0x243/0x9d0
   ? __skb_checksum+0xa3/0x540
   _raw_spin_lock_nested+0x35/0x50
   ? l2tp_xmit_skb+0x243/0x9d0
   l2tp_xmit_skb+0x243/0x9d0
   l2tp_eth_dev_xmit+0x3c/0xc0
   dev_hard_start_xmit+0x11e/0x420
   sch_direct_xmit+0xc3/0x640
   __dev_queue_xmit+0x61c/0x1450
   ? ip_finish_output2+0xf4c/0x1130
   ip_finish_output2+0x6b6/0x1130
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __ip_finish_output+0x217/0x380
   ? srso_alias_return_thunk+0x5/0xfbef5
   __ip_finish_output+0x217/0x380
   ip_output+0x99/0x120
   __ip_queue_xmit+0xae4/0xbc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? tcp_options_write.constprop.0+0xcb/0x3e0
   ip_queue_xmit+0x34/0x40
   __tcp_transmit_skb+0x1625/0x1890
   __tcp_send_ack+0x1b8/0x340
   tcp_send_ack+0x23/0x30
   __tcp_ack_snd_check+0xa8/0x530
   ? srso_alias_return_thunk+0x5/0xfbef5
   tcp_rcv_established+0x412/0xd70
   tcp_v4_do_rcv+0x299/0x420
   tcp_v4_rcv+0x1991/0x1e10
   ip_protocol_deliver_rcu+0x50/0x220
   ip_local_deliver_finish+0x158/0x260
   ip_local_deliver+0xc8/0xe0
   ip_rcv+0xe5/0x1d0
   ? __pfx_ip_rcv+0x10/0x10
   __netif_receive_skb_one_core+0xce/0xe0
   ? process_backlog+0x28b/0x9f0
   __netif_receive_skb+0x34/0xd0
   ? process_backlog+0x28b/0x9f0
   process_backlog+0x2cb/0x9f0
   __napi_poll.constprop.0+0x61/0x280
   net_rx_action+0x332/0x670
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? find_held_lock+0x2b/0x80
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   handle_softirqs+0xda/0x480
   ? __dev_queue_xmit+0xa2c/0x1450
   do_softirq+0xa1/0xd0
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0xc8/0xe0
   ? __dev_queue_xmit+0xa2c/0x1450
   __dev_queue_xmit+0xa48/0x1450
   ? ip_finish_output2+0xf4c/0x1130
   ip_finish_output2+0x6b6/0x1130
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __ip_finish_output+0x217/0x380
   ? srso_alias_return_thunk+0x5/0xfbef5
   __ip_finish_output+0x217/0x380
   ip_output+0x99/0x120
   __ip_queue_xmit+0xae4/0xbc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? tcp_options_write.constprop.0+0xcb/0x3e0
   ip_queue_xmit+0x34/0x40
   __tcp_transmit_skb+0x1625/0x1890
   tcp_write_xmit+0x766/0x2fb0
   ? __entry_text_end+0x102ba9/0x102bad
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __might_fault+0x74/0xc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   __tcp_push_pending_frames+0x56/0x190
   tcp_push+0x117/0x310
   tcp_sendmsg_locked+0x14c1/0x1740
   tcp_sendmsg+0x28/0x40
   inet_sendmsg+0x5d/0x90
   sock_write_iter+0x242/0x2b0
   vfs_write+0x68d/0x800
   ? __pfx_sock_write_iter+0x10/0x10
   ksys_write+0xc8/0xf0
   __x64_sys_write+0x3d/0x50
   x64_sys_call+0xfaf/0x1f50
   do_syscall_64+0x6d/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f4d143af992
  Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 01 cc ff ff 41 54 b8 02 00 00 0
  RSP: 002b:00007ffd65032058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f4d143af992
  RDX: 0000000000000025 RSI: 00007f4d143f3bcc RDI: 0000000000000005
  RBP: 00007f4d143f2b28 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4d143f3bcc
  R13: 0000000000000005 R14: 0000000000000000 R15: 00007ffd650323f0
   </TASK>

Fixes: 0b2c597 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Suggested-by: Eric Dumazet <[email protected]>
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=6acef9e0a4d1f46c83d4
CC: [email protected]
CC: [email protected]
Signed-off-by: James Chapman <[email protected]>
Signed-off-by: Tom Parkin <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
plbossart pushed a commit that referenced this pull request Sep 12, 2024
Lockdep reported a warning in Linux version 6.6:

[  414.344659] ================================
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.356863]   scsi_io_completion+0x177/0x1610
[  414.357379]   scsi_complete+0x12f/0x260
[  414.357856]   blk_complete_reqs+0xba/0xf0
[  414.358338]   __do_softirq+0x1b0/0x7a2
[  414.358796]   irq_exit_rcu+0x14b/0x1a0
[  414.359262]   sysvec_call_function_single+0xaf/0xc0
[  414.359828]   asm_sysvec_call_function_single+0x1a/0x20
[  414.360426]   default_idle+0x1e/0x30
[  414.360873]   default_idle_call+0x9b/0x1f0
[  414.361390]   do_idle+0x2d2/0x3e0
[  414.361819]   cpu_startup_entry+0x55/0x60
[  414.362314]   start_secondary+0x235/0x2b0
[  414.362809]   secondary_startup_64_no_verify+0x18f/0x19b
[  414.363413] irq event stamp: 428794
[  414.363825] hardirqs last  enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[  414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[  414.365629] softirqs last  enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[  414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[  414.367425]
               other info that might help us debug this:
[  414.368194]  Possible unsafe locking scenario:
[  414.368900]        CPU0
[  414.369225]        ----
[  414.369548]   lock(&sbq->ws[i].wait);
[  414.370000]   <Interrupt>
[  414.370342]     lock(&sbq->ws[i].wait);
[  414.370802]
                *** DEADLOCK ***
[  414.371569] 5 locks held by kworker/u10:3/1152:
[  414.372088]  #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[  414.373180]  #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[  414.374384]  #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[  414.375342]  #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.376377]  #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[  414.378607]
               stack backtrace:
[  414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[  414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[  414.381805] Call Trace:
[  414.382136]  <TASK>
[  414.382429]  dump_stack_lvl+0x91/0xf0
[  414.382884]  mark_lock_irq+0xb3b/0x1260
[  414.383367]  ? __pfx_mark_lock_irq+0x10/0x10
[  414.383889]  ? stack_trace_save+0x8e/0xc0
[  414.384373]  ? __pfx_stack_trace_save+0x10/0x10
[  414.384903]  ? graph_lock+0xcf/0x410
[  414.385350]  ? save_trace+0x3d/0xc70
[  414.385808]  mark_lock.part.20+0x56d/0xa90
[  414.386317]  mark_held_locks+0xb0/0x110
[  414.386791]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.387320]  lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.387901]  ? _raw_spin_unlock_irq+0x28/0x50
[  414.388422]  trace_hardirqs_on+0x58/0x100
[  414.388917]  _raw_spin_unlock_irq+0x28/0x50
[  414.389422]  __blk_mq_tag_busy+0x1d6/0x2a0
[  414.389920]  __blk_mq_get_driver_tag+0x761/0x9f0
[  414.390899]  blk_mq_dispatch_rq_list+0x1780/0x1ee0
[  414.391473]  ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[  414.392070]  ? sbitmap_get+0x2b8/0x450
[  414.392533]  ? __blk_mq_get_driver_tag+0x210/0x9f0
[  414.393095]  __blk_mq_sched_dispatch_requests+0xd99/0x1690
[  414.393730]  ? elv_attempt_insert_merge+0x1b1/0x420
[  414.394302]  ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[  414.394970]  ? lock_acquire+0x18d/0x460
[  414.395456]  ? blk_mq_run_hw_queue+0x637/0xa00
[  414.395986]  ? __pfx_lock_acquire+0x10/0x10
[  414.396499]  blk_mq_sched_dispatch_requests+0x109/0x190
[  414.397100]  blk_mq_run_hw_queue+0x66e/0xa00
[  414.397616]  blk_mq_flush_plug_list.part.17+0x614/0x2030
[  414.398244]  ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[  414.398897]  ? writeback_sb_inodes+0x241/0xcc0
[  414.399429]  blk_mq_flush_plug_list+0x65/0x80
[  414.399957]  __blk_flush_plug+0x2f1/0x530
[  414.400458]  ? __pfx___blk_flush_plug+0x10/0x10
[  414.400999]  blk_finish_plug+0x59/0xa0
[  414.401467]  wb_writeback+0x7cc/0x920
[  414.401935]  ? __pfx_wb_writeback+0x10/0x10
[  414.402442]  ? mark_held_locks+0xb0/0x110
[  414.402931]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.403462]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.404062]  wb_workfn+0x2b3/0xcf0
[  414.404500]  ? __pfx_wb_workfn+0x10/0x10
[  414.404989]  process_scheduled_works+0x432/0x13f0
[  414.405546]  ? __pfx_process_scheduled_works+0x10/0x10
[  414.406139]  ? do_raw_spin_lock+0x101/0x2a0
[  414.406641]  ? assign_work+0x19b/0x240
[  414.407106]  ? lock_is_held_type+0x9d/0x110
[  414.407604]  worker_thread+0x6f2/0x1160
[  414.408075]  ? __kthread_parkme+0x62/0x210
[  414.408572]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.409168]  ? __kthread_parkme+0x13c/0x210
[  414.409678]  ? __pfx_worker_thread+0x10/0x10
[  414.410191]  kthread+0x33c/0x440
[  414.410602]  ? __pfx_kthread+0x10/0x10
[  414.411068]  ret_from_fork+0x4d/0x80
[  414.411526]  ? __pfx_kthread+0x10/0x10
[  414.411993]  ret_from_fork_asm+0x1b/0x30
[  414.412489]  </TASK>

When interrupt is turned on while a lock holding by spin_lock_irq it
throws a warning because of potential deadlock.

blk_mq_prep_dispatch_rq
 blk_mq_get_driver_tag
  __blk_mq_get_driver_tag
   __blk_mq_alloc_driver_tag
    blk_mq_tag_busy -> tag is already busy
    // failed to get driver tag
 blk_mq_mark_tag_wait
  spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
  __add_wait_queue(wq, wait) -> wait queue active
  blk_mq_get_driver_tag
  __blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
   spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
   spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally
-> 2) context must be preempt by IO interrupt to trigger deadlock.

As shown above, the deadlock is not possible in theory, but the warning
still need to be fixed.

Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.

Fixes: 4f1731d ("blk-mq: fix potential io hang by wrong 'wake_batch'")
Signed-off-by: Li Lingfeng <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants