Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cifsd fixes #8

Merged
merged 15 commits into from
Jun 26, 2020
Merged

Cifsd fixes #8

merged 15 commits into from
Jun 26, 2020

Conversation

namjaejeon
Copy link
Collaborator

No description provided.

namjaejeon and others added 2 commits June 1, 2020 14:35
When mounting with signing(sec=ntlmv2i), The following error messages
are coming.

[ 2045.267558] CIFS: Attempting to mount //10.88.103.87/homes2
[ 2045.278424] CIFS VFS: sign fail cmd 0x3 message id 0x3
[ 2045.278429] CIFS VFS: \10.88.103.87 SMB signature verification
returned error = -13
[ 2045.278916] CIFS VFS: sign fail cmd 0x3 message id 0x4
[ 2045.278920] CIFS VFS: \10.88.103.87 SMB signature verification
returned error = -13
[ 2045.279266] CIFS VFS: sign fail cmd 0x5 message id 0x5

The problem is that cifsd doesn't set and check signature as per compound
request. This patch make cifsd correctly working signing verification and
setting with compounding and fix the file deletion issue with windows 10
client.

Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
When signing is mandatory by server configuration, include required
signing also in first negotiation response when upgrading from SMB1 to
SMB2 in auto-negotiation.

Signed-off-by: Fredrik Ternerot <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
Build error in cifsd occurred as get rid of kernel_getsockopt and
kernel_setsockopt in order to eliminate set_fs/get_fs calls.
This patch use the replacements of kernel_setcoketopt().

Signed-off-by: Namjae Jeon <[email protected]>
Use wake_up_all to wake up every tasks waiting for oplock break ack.

Signed-off-by: Namjae Jeon <[email protected]>
If there is some issue in ksmbd.mountd(hang by out of memory),
trans cache buffer exhausted and ksmbd can be hung to allocate cache
buffer. And no need to use trans cache buffer here. ipc doesn't need
peroformance gain.

Signed-off-by: Namjae Jeon <[email protected]>
Add SMB2_MAX_BUFFER_SIZE macro, Use it instead of constant value.

Signed-off-by: Namjae Jeon <[email protected]>
Use DIV_ROUND_UP() for credit charge calculation.

Signed-off-by: Namjae Jeon <[email protected]>
When mounting with smb3 encryption, can see the following message from
cifs.

  CIFS: VFS: \10.88.103.87 unknown negcontext of type 16 ignored

There is a problem that the cifs cannot recognize the posix negotiate
context because cifsd doesn't make encryption context as 8-byte alignment.
use round_up() instead of +2 to make 8byte alignment about all neg context.

Signed-off-by: Namjae Jeon <[email protected]>
When request/response buffer is allocated by vmalloc(), kernel oops
caused on mounting with encryption enable.

1-285.0731] Backtrace:
[1-285.0731] [<c00259d4>] (kmap_atomic) from [<c02d6128>]
(scatterwalk_map+0x34/0x48)
[1-285.0731]  r10:00080070 r9:dc75f100 r8:dc75f100 r7:00000fc8
r6:00000000 r5:00000000
[1-285.0731]  r4:e2863cd4
[1-285.0731] [<c02d60f4>] (scatterwalk_map) from [<c02d86a4>]
(blkcipher_walk_next+0x15c/0x380)
[1-285.0731]  r5:00000000 r4:e2863cc4
[1-285.0731] [<c02d8548>] (blkcipher_walk_next) from [<c02d897c>]
(blkcipher_walk_done+0xb4/0x204)
[1-285.0731]  r10:00000000 r9:c0028200 r8:e2863ce0 r7:e2863d44
r6:00000010 r5:00080070
[1-285.0731]  r4:e2863cc4
[1-285.0731] [<c02d88c8>] (blkcipher_walk_done) from [<c02e3940>]
(crypto_ctr_crypt+0xf0/0x230)
[1-285.0731]  r8:dc75f780 r7:e2863ca0 r5:c2b36d80 r4:cfa2a930
[1-285.0731] [<c02e3850>] (crypto_ctr_crypt) from [<c02d83f0>]
(async_decrypt+0x50/0x58)
[1-285.0732]  r10:cfa2a920 r9:c0a86040 r8:d2b46240 r7:00000010
r6:dc75f780 r5:cfa2a8c0
[1-285.0732]  r4:c02e3850
[1-285.0732] [<c02d83a0>] (async_decrypt) from [<c02e6edc>]
(crypto_ccm_decrypt+0x210/0x268)
[1-285.0732]  r4:cfa2a900
[1-285.0732] [<c02e6ccc>] (crypto_ccm_decrypt) from [<bf3a0c74>]
(cifsd_crypt_message+0x400/0x4d4 [cifsd])
[1-285.0732]  r10:e2863e58 r9:c0a86040 r8:cfa2a8c0 r7:00000038
r6:00000001 r5:00000002
[1-285.0732]  r4:dc75f780
[1-285.0733] [<bf3a0874>] (cifsd_crypt_message [cifsd]) from
[<bf3b9730>] (smb3_decrypt_req+0x90/0x128 [cifsd])
[1-285.0733]  r10:00000000 r9:e2862000 r8:c646a1c0 r7:00080070
r6:000800a8 r5:000800a4
[1-285.0733]  r4:e9e00000
[1-285.0733] [<bf3b96a0>] (smb3_decrypt_req [cifsd]) from [<bf3ad070>]
(handle_cifsd_work+0x368/0x418 [cifsd])
[1-285.0733]  r8:ff7db800 r7:c34b3050 r6:00000000 r5:c34b3000
r4:c646a1c0

data address and length should be physically continuous in a sg entry.
but The vmalloced buffer does not guarantee that it is allocated
physically continuously. So cifsd should set sg entry as per page unit
in case of vmalloced buffer. This patch handle sg entries set on
vmalloced request/response buffer.

Signed-off-by: Namjae Jeon <[email protected]>
smfrench pushed a commit that referenced this pull request Jun 22, 2020
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/5
 1 lock held by swapper/5/0:
  #0: ffff80001002bd90 (samples/ftrace/sample-trace-array.c:38){+.-.}-{0:0}, at: call_timer_fn+0x8/0x3e0
 CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.7.0+ #8
 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
  dump_backtrace+0x0/0x1a0
  show_stack+0x20/0x30
  dump_stack+0xe4/0x150
  ___might_sleep+0x160/0x200
  __might_sleep+0x58/0x90
  __mutex_lock+0x64/0x948
  mutex_lock_nested+0x3c/0x58
  __ftrace_set_clr_event+0x44/0x88
  trace_array_set_clr_event+0x24/0x38
  mytimer_handler+0x34/0x40 [sample_trace_array]

mutex_lock() will be called in interrupt context, using workqueue to fix it.

Link: https://lkml.kernel.org/r/[email protected]

Cc: [email protected]
Fixes: 89ed424 ("tracing: Sample module to demonstrate kernel access to Ftrace instances.")
Reviewed-by: Divya Indi <[email protected]>
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
namjaejeon and others added 6 commits June 25, 2020 09:03
We found open failure issue with smb client apps from smartphone when
configure read-only share(i.e. read-only = yes)
cifsd do not allow to write on smb2_write(),  smb2_rename, copy offload
in smb2_ioctl and instead, allow to open with O_RDWR or O_WRONLY flags.

Signed-off-by: Namjae Jeon <[email protected]>
As specification describe, check permission with desired access in
smb2_read/write() and return -EACCES error.

Signed-off-by: Namjae Jeon <[email protected]>
As MS-SMB2 specification describe, return -EINVAL error when length is
bigger than max_read/write_size.

Signed-off-by: Namjae Jeon <[email protected]>
…IZE macro

Replace ksmbd_small_buffer_size() with MAX_CIFS_SMALL_BUFFER_SIZE macro.

Signed-off-by: Namjae Jeon <[email protected]>
work->request_buf for oplock break don't have smb2 header.
i.e. work->request_buf = (char *)br_info;

allocate_rsp_buf() uses the smb header command to select the cache
buffer or normal allocation. oplock break respone buffer can be
allocated to the cache buffer by referring to the garbage value.

Add allocate_oplock_break_buf to allocate the oplock break buffer
using ksmbd_alloc_response directly.

Cc: Sebastian Gottschall <[email protected]>
Reviewed-by: Hyunchul Lee <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
Make sure work->sess is not NULL when calling check_sign_req() to
prevent NULL pointer dereference. This was causing kernel panic when
using signing and pysmb (SMB1) as client.

Signed-off-by: Fredrik Ternerot <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
@smfrench smfrench merged commit ae555d7 into smfrench:cifsd-for-next Jun 26, 2020
smfrench pushed a commit that referenced this pull request Jun 29, 2020
Commit 7e9f5e6 ("arm64: vdso: Add --eh-frame-hdr to ldflags") results
in a .eh_frame_hdr section for the vDSO, which in turn causes the libgcc
unwinder to unwind out of signal handlers using the .eh_frame information
populated by our .cfi directives. In conjunction with a4eb355
("arm64: vdso: Fix CFI directives in sigreturn trampoline"), this has
been shown to cause segmentation faults originating from within the
unwinder during thread cancellation:

 | Thread 14 "virtio-net-rx" received signal SIGSEGV, Segmentation fault.
 | 0x0000000000435e24 in uw_frame_state_for ()
 | (gdb) bt
 | #0  0x0000000000435e24 in uw_frame_state_for ()
 | #1  0x0000000000436e88 in _Unwind_ForcedUnwind_Phase2 ()
 | #2  0x00000000004374d8 in _Unwind_ForcedUnwind ()
 | #3  0x0000000000428400 in __pthread_unwind (buf=<optimized out>) at unwind.c:121
 | #4  0x0000000000429808 in __do_cancel () at ./pthreadP.h:304
 | #5  sigcancel_handler (sig=32, si=0xffff33c743f0, ctx=<optimized out>) at nptl-init.c:200
 | #6  sigcancel_handler (sig=<optimized out>, si=0xffff33c743f0, ctx=<optimized out>) at nptl-init.c:165
 | #7  <signal handler called>
 | #8  futex_wait_cancelable (private=0, expected=0, futex_word=0x3890b708) at ../sysdeps/unix/sysv/linux/futex-internal.h:88

After considerable bashing of heads, it appears that our CFI directives
for unwinding out of the sigreturn trampoline are only processed by libgcc
when both a .eh_frame_hdr section is present *and* the mysterious NOP is
covered by an entry in .eh_frame. With both of these now in place, it has
highlighted that our CFI directives are not comprehensive enough to
restore the stack pointer of the interrupted context. This results in libgcc
falling back to an arm64-specific unwinder after computing a bogus PC value
from the unwind tables. The unwinder promptly dereferences this bogus address
in an attempt to see if the pointed-to instruction sequence looks like
the sigreturn trampoline.

Restore the old unwind behaviour, which relied solely on heuristics in
the unwinder, by removing the .eh_frame_hdr section from the vDSO and
commenting out the insufficient CFI directives for now. Add comments to
explain the current, miserable state of affairs.

Cc: Tamas Zsoldos <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Daniel Kiss <[email protected]>
Acked-by: Dave Martin <[email protected]>
Reviewed-by: Vincenzo Frascino <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Reported-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
@namjaejeon namjaejeon deleted the cifsd-fixes branch July 3, 2020 04:33
smfrench pushed a commit that referenced this pull request Jul 27, 2020
I/O requests may be held in scheduler queue because of resource contention.
The starvation scenario was handled properly in the regular completion
path but we failed to account for it during I/O submission. This lead to
the hang captured below. Make sure we run the queue when resource
contention is encountered in the submission path.

[   39.054963] scsi 13:0:0:0: rejecting I/O to dead device
[   39.058700] scsi 13:0:0:0: rejecting I/O to dead device
[   39.087855] sd 13:0:0:1: [sdd] Synchronizing SCSI cache
[   39.088909] scsi 13:0:0:1: rejecting I/O to dead device
[   39.095351] scsi 13:0:0:1: rejecting I/O to dead device
[   39.096962] scsi 13:0:0:1: rejecting I/O to dead device
[  247.021859] INFO: task scsi-stress-rem:813 blocked for more than 122 seconds.
[  247.023258]       Not tainted 5.8.0-rc2 #8
[  247.024069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.025331] scsi-stress-rem D    0   813    802 0x00004000
[  247.025334] Call Trace:
[  247.025354]  __schedule+0x504/0x55f
[  247.027987]  schedule+0x72/0xa8
[  247.027991]  blk_mq_freeze_queue_wait+0x63/0x8c
[  247.027994]  ? do_wait_intr_irq+0x7a/0x7a
[  247.027996]  blk_cleanup_queue+0x4b/0xc9
[  247.028000]  __scsi_remove_device+0xf6/0x14e
[  247.028002]  scsi_remove_device+0x21/0x2b
[  247.029037]  sdev_store_delete+0x58/0x7c
[  247.029041]  kernfs_fop_write+0x10d/0x14f
[  247.031281]  vfs_write+0xa2/0xdf
[  247.032670]  ksys_write+0x6b/0xb3
[  247.032673]  do_syscall_64+0x56/0x82
[  247.034053]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  247.034059] RIP: 0033:0x7f69f39e9008
[  247.036330] Code: Bad RIP value.
[  247.036331] RSP: 002b:00007ffdd8116498 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  247.037613] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f69f39e9008
[  247.039714] RDX: 0000000000000002 RSI: 000055cde92a0ab0 RDI: 0000000000000001
[  247.039715] RBP: 000055cde92a0ab0 R08: 000000000000000a R09: 00007f69f3a79e80
[  247.039716] R10: 000000000000000a R11: 0000000000000246 R12: 00007f69f3abb780
[  247.039717] R13: 0000000000000002 R14: 00007f69f3ab6740 R15: 0000000000000002

Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
smfrench pushed a commit that referenced this pull request Aug 2, 2020
I compiled with AddressSanitizer and I had these memory leaks while I
was using the tep_parse_format function:

    Direct leak of 28 byte(s) in 4 object(s) allocated from:
        #0 0x7fb07db49ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
        #1 0x7fb07a724228 in extend_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:985
        #2 0x7fb07a724c21 in __read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1140
        #3 0x7fb07a724f78 in read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1206
        #4 0x7fb07a725191 in __read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1291
        #5 0x7fb07a7251df in read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1299
        #6 0x7fb07a72e6c8 in process_dynamic_array_len /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:2849
        #7 0x7fb07a7304b8 in process_function /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3161
        #8 0x7fb07a730900 in process_arg_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3207
        #9 0x7fb07a727c0b in process_arg /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1786
        #10 0x7fb07a731080 in event_read_print_args /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3285
        #11 0x7fb07a731722 in event_read_print /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3369
        #12 0x7fb07a740054 in __tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6335
        #13 0x7fb07a74047a in __parse_event /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6389
        #14 0x7fb07a740536 in tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6431
        #15 0x7fb07a785acf in parse_event ../../../src/fs-src/fs.c:251
        #16 0x7fb07a785ccd in parse_systems ../../../src/fs-src/fs.c:284
        #17 0x7fb07a786fb3 in read_metadata ../../../src/fs-src/fs.c:593
        #18 0x7fb07a78760e in ftrace_fs_source_init ../../../src/fs-src/fs.c:727
        #19 0x7fb07d90c19c in add_component_with_init_method_data ../../../../src/lib/graph/graph.c:1048
        #20 0x7fb07d90c87b in add_source_component_with_initialize_method_data ../../../../src/lib/graph/graph.c:1127
        #21 0x7fb07d90c92a in bt_graph_add_source_component ../../../../src/lib/graph/graph.c:1152
        #22 0x55db11aa632e in cmd_run_ctx_create_components_from_config_components ../../../src/cli/babeltrace2.c:2252
        #23 0x55db11aa6fda in cmd_run_ctx_create_components ../../../src/cli/babeltrace2.c:2347
        #24 0x55db11aa780c in cmd_run ../../../src/cli/babeltrace2.c:2461
        #25 0x55db11aa8a7d in main ../../../src/cli/babeltrace2.c:2673
        #26 0x7fb07d5460b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

The token variable in the process_dynamic_array_len function is
allocated in the read_expect_type function, but is not freed before
calling the read_token function.

Free the token variable before calling read_token in order to plug the
leak.

Signed-off-by: Philippe Duplessis-Guindon <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Link: https://lore.kernel.org/linux-trace-devel/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 29, 2020
This patch addresses an irq free warning and null pointer dereference
error problem when nvme devices got timeout error during initialization.
This problem happens when nvme_timeout() function is called while
nvme_reset_work() is still in execution. This patch fixed the problem by
setting flag of the problematic request to NVME_REQ_CANCELLED before
calling nvme_dev_disable() to make sure __nvme_submit_sync_cmd() returns
an error code and let nvme_submit_sync_cmd() fail gracefully.
The following is console output.

[   62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
[   62.488796] nvme nvme0: could not set timestamp (881)
[   62.494888] ------------[ cut here ]------------
[   62.495142] Trying to free already-free IRQ 11
[   62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
[   62.495742] Modules linked in:
[   62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8
[   62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
[   62.496772] Workqueue: nvme-reset-wq nvme_reset_work
[   62.497019] RIP: 0010:free_irq+0x1f7/0x370
[   62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
[   62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
[   62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
[   62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
[   62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
[   62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
[   62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
[   62.500140] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[   62.500534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[   62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   62.501864] Call Trace:
[   62.501993]  pci_free_irq+0x13/0x20
[   62.502167]  nvme_reset_work+0x5d0/0x12a0
[   62.502369]  ? update_load_avg+0x59/0x580
[   62.502569]  ? ttwu_queue_wakelist+0xa8/0xc0
[   62.502780]  ? try_to_wake_up+0x1a2/0x450
[   62.502979]  process_one_work+0x1d2/0x390
[   62.503179]  worker_thread+0x45/0x3b0
[   62.503361]  ? process_one_work+0x390/0x390
[   62.503568]  kthread+0xf9/0x130
[   62.503726]  ? kthread_park+0x80/0x80
[   62.503911]  ret_from_fork+0x22/0x30
[   62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
[  123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
[  123.914670] nvme nvme0: 1/0/0 default/read/poll queues
[  123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  123.917469] #PF: supervisor write access in kernel mode
[  123.917725] #PF: error_code(0x0002) - not-present page
[  123.917976] PGD 0 P4D 0
[  123.918109] Oops: 0002 [#1] SMP PTI
[  123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G        W         5.8.0+ #8
[  123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
[  123.919219] Workqueue: nvme-reset-wq nvme_reset_work
[  123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
[  123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
[  123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
[  123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
[  123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
[  123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
[  123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
[  123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
[  123.922641] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[  123.923032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[  123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  123.924353] Call Trace:
[  123.924479]  blk_mq_alloc_tag_set+0x137/0x2a0
[  123.924694]  nvme_reset_work+0xed6/0x12a0
[  123.924898]  process_one_work+0x1d2/0x390
[  123.925099]  worker_thread+0x45/0x3b0
[  123.925280]  ? process_one_work+0x390/0x390
[  123.925486]  kthread+0xf9/0x130
[  123.925642]  ? kthread_park+0x80/0x80
[  123.925825]  ret_from_fork+0x22/0x30
[  123.926004] Modules linked in:
[  123.926158] CR2: 0000000000000000
[  123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
[  123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
[  123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
[  123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
[  123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
[  123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
[  123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
[  123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
[  123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
[  123.929715] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[  123.930106] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[  123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Co-developed-by: Keith Busch <[email protected]>
Signed-off-by: Tong Zhang <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 29, 2020
…s metrics" test

Linux 5.9 introduced perf test case "Parse and process metrics" and
on s390 this test case always dumps core:

  [root@t35lp67 perf]# ./perf test -vvvv -F 67
  67: Parse and process metrics                             :
  --- start ---
  metric expr inst_retired.any / cpu_clk_unhalted.thread for IPC
  parsing metric: inst_retired.any / cpu_clk_unhalted.thread
  Segmentation fault (core dumped)
  [root@t35lp67 perf]#

I debugged this core dump and gdb shows this call chain:

  (gdb) where
   #0  0x000003ffabc3192a in __strnlen_c_1 () from /lib64/libc.so.6
   #1  0x000003ffabc293de in strcasestr () from /lib64/libc.so.6
   #2  0x0000000001102ba2 in match_metric(list=0x1e6ea20 "inst_retired.any",
            n=<optimized out>)
       at util/metricgroup.c:368
   #3  find_metric (map=<optimized out>, map=<optimized out>,
           metric=0x1e6ea20 "inst_retired.any")
      at util/metricgroup.c:765
   #4  __resolve_metric (ids=0x0, map=<optimized out>, metric_list=0x0,
           metric_no_group=<optimized out>, m=<optimized out>)
      at util/metricgroup.c:844
   #5  resolve_metric (ids=0x0, map=0x0, metric_list=0x0,
          metric_no_group=<optimized out>)
      at util/metricgroup.c:881
   #6  metricgroup__add_metric (metric=<optimized out>,
        metric_no_group=metric_no_group@entry=false, events=<optimized out>,
        events@entry=0x3ffd84fb878, metric_list=0x0,
        metric_list@entry=0x3ffd84fb868, map=0x0)
      at util/metricgroup.c:943
   #7  0x00000000011034ae in metricgroup__add_metric_list (map=0x13f9828 <map>,
        metric_list=0x3ffd84fb868, events=0x3ffd84fb878,
        metric_no_group=<optimized out>, list=<optimized out>)
      at util/metricgroup.c:988
   #8  parse_groups (perf_evlist=perf_evlist@entry=0x1e70260,
          str=str@entry=0x12f34b2 "IPC", metric_no_group=<optimized out>,
          metric_no_merge=<optimized out>,
          fake_pmu=fake_pmu@entry=0x1462f18 <perf_pmu.fake>,
          metric_events=0x3ffd84fba58, map=0x1)
      at util/metricgroup.c:1040
   #9  0x0000000001103eb2 in metricgroup__parse_groups_test(
  	evlist=evlist@entry=0x1e70260, map=map@entry=0x13f9828 <map>,
  	str=str@entry=0x12f34b2 "IPC",
  	metric_no_group=metric_no_group@entry=false,
  	metric_no_merge=metric_no_merge@entry=false,
  	metric_events=0x3ffd84fba58)
      at util/metricgroup.c:1082
   #10 0x00000000010c84d8 in __compute_metric (ratio2=0x0, name2=0x0,
          ratio1=<synthetic pointer>, name1=0x12f34b2 "IPC",
  	vals=0x3ffd84fbad8, name=0x12f34b2 "IPC")
      at tests/parse-metric.c:159
   #11 compute_metric (ratio=<synthetic pointer>, vals=0x3ffd84fbad8,
  	name=0x12f34b2 "IPC")
      at tests/parse-metric.c:189
   #12 test_ipc () at tests/parse-metric.c:208
.....
..... omitted many more lines

This test case was added with
commit 218ca91 ("perf tests: Add parse metric test for frontend metric").

When I compile with make DEBUG=y it works fine and I do not get a core dump.

It turned out that the above listed function call chain worked on a struct
pmu_event array which requires a trailing element with zeroes which was
missing. The marco map_for_each_event() loops over that array tests for members
metric_expr/metric_name/metric_group being non-NULL. Adding this element fixes
the issue.

Output after:

  [root@t35lp46 perf]# ./perf test 67
  67: Parse and process metrics                             : Ok
  [root@t35lp46 perf]#

Committer notes:

As Ian remarks, this is not s390 specific:

<quote Ian>
  This also shows up with address sanitizer on all architectures
  (perhaps change the patch title) and perhaps add a "Fixes: <commit>"
  tag.

  =================================================================
  ==4718==ERROR: AddressSanitizer: global-buffer-overflow on address
  0x55c93b4d59e8 at pc 0x55c93a1541e2 bp 0x7ffd24327c60 sp
  0x7ffd24327c58
  READ of size 8 at 0x55c93b4d59e8 thread T0
      #0 0x55c93a1541e1 in find_metric tools/perf/util/metricgroup.c:764:2
      #1 0x55c93a153e6c in __resolve_metric tools/perf/util/metricgroup.c:844:9
      #2 0x55c93a152f18 in resolve_metric tools/perf/util/metricgroup.c:881:9
      #3 0x55c93a1528db in metricgroup__add_metric
  tools/perf/util/metricgroup.c:943:9
      #4 0x55c93a151996 in metricgroup__add_metric_list
  tools/perf/util/metricgroup.c:988:9
      #5 0x55c93a1511b9 in parse_groups tools/perf/util/metricgroup.c:1040:8
      #6 0x55c93a1513e1 in metricgroup__parse_groups_test
  tools/perf/util/metricgroup.c:1082:9
      #7 0x55c93a0108ae in __compute_metric tools/perf/tests/parse-metric.c:159:8
      #8 0x55c93a010744 in compute_metric tools/perf/tests/parse-metric.c:189:9
      #9 0x55c93a00f5ee in test_ipc tools/perf/tests/parse-metric.c:208:2
      #10 0x55c93a00f1e8 in test__parse_metric
  tools/perf/tests/parse-metric.c:345:2
      #11 0x55c939fd7202 in run_test tools/perf/tests/builtin-test.c:410:9
      #12 0x55c939fd6736 in test_and_print tools/perf/tests/builtin-test.c:440:9
      #13 0x55c939fd58c3 in __cmd_test tools/perf/tests/builtin-test.c:661:4
      #14 0x55c939fd4e02 in cmd_test tools/perf/tests/builtin-test.c:807:9
      #15 0x55c939e4763d in run_builtin tools/perf/perf.c:313:11
      #16 0x55c939e46475 in handle_internal_command tools/perf/perf.c:365:8
      #17 0x55c939e4737e in run_argv tools/perf/perf.c:409:2
      #18 0x55c939e45f7e in main tools/perf/perf.c:539:3

  0x55c93b4d59e8 is located 0 bytes to the right of global variable
  'pme_test' defined in 'tools/perf/tests/parse-metric.c:17:25'
  (0x55c93b4d54a0) of size 1352
  SUMMARY: AddressSanitizer: global-buffer-overflow
  tools/perf/util/metricgroup.c:764:2 in find_metric
  Shadow bytes around the buggy address:
    0x0ab9a7692ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  =>0x0ab9a7692b30: 00 00 00 00 00 00 00 00 00 00 00 00 00[f9]f9 f9
    0x0ab9a7692b40: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b50: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b60: f9 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
    0x0ab9a7692b70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b80: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable:           00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone:	   fa
    Freed heap region:	   fd
    Stack left redzone:	   f1
    Stack mid redzone:	   f2
    Stack right redzone:     f3
    Stack after return:	   f5
    Stack use after scope:   f8
    Global redzone:          f9
    Global init order:	   f6
    Poisoned by user:        f7
    Container overflow:	   fc
    Array cookie:            ac
    Intra object redzone:    bb
    ASan internal:           fe
    Left alloca redzone:     ca
    Right alloca redzone:    cb
    Shadow gap:              cc
</quote>

I'm also adding the missing "Fixes" tag and setting just .name to NULL,
as doing it that way is more compact (the compiler will zero out
everything else) and the table iterators look for .name being NULL as
the sentinel marking the end of the table.

Fixes: 0a507af ("perf tests: Add parse metric test for ipc metric")
Signed-off-by: Thomas Richter <[email protected]>
Reviewed-by: Sumanth Korikkar <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 29, 2020
The aliases were never released causing the following leaks:

  Indirect leak of 1224 byte(s) in 9 object(s) allocated from:
    #0 0x7feefb830628 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x107628)
    #1 0x56332c8f1b62 in __perf_pmu__new_alias util/pmu.c:322
    #2 0x56332c8f401f in pmu_add_cpu_aliases_map util/pmu.c:778
    #3 0x56332c792ce9 in __test__pmu_event_aliases tests/pmu-events.c:295
    #4 0x56332c792ce9 in test_aliases tests/pmu-events.c:367
    #5 0x56332c76a09b in run_test tests/builtin-test.c:410
    #6 0x56332c76a09b in test_and_print tests/builtin-test.c:440
    #7 0x56332c76ce69 in __cmd_test tests/builtin-test.c:695
    #8 0x56332c76ce69 in cmd_test tests/builtin-test.c:807
    #9 0x56332c7d2214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #10 0x56332c6701a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #11 0x56332c6701a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #12 0x56332c6701a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #13 0x7feefb359cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 956a783 ("perf test: Test pmu-events aliases")
Signed-off-by: Namhyung Kim <[email protected]>
Reviewed-by: John Garry <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 29, 2020
The evsel->unit borrows a pointer of pmu event or alias instead of
owns a string.  But tool event (duration_time) passes a result of
strdup() caused a leak.

It was found by ASAN during metric test:

  Direct leak of 210 byte(s) in 70 object(s) allocated from:
    #0 0x7fe366fca0b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
    #1 0x559fbbcc6ea3 in add_event_tool util/parse-events.c:414
    #2 0x559fbbcc6ea3 in parse_events_add_tool util/parse-events.c:1414
    #3 0x559fbbd8474d in parse_events_parse util/parse-events.y:439
    #4 0x559fbbcc95da in parse_events__scanner util/parse-events.c:2096
    #5 0x559fbbcc95da in __parse_events util/parse-events.c:2141
    #6 0x559fbbc28555 in check_parse_id tests/pmu-events.c:406
    #7 0x559fbbc28555 in check_parse_id tests/pmu-events.c:393
    #8 0x559fbbc28555 in check_parse_cpu tests/pmu-events.c:415
    #9 0x559fbbc28555 in test_parsing tests/pmu-events.c:498
    #10 0x559fbbc0109b in run_test tests/builtin-test.c:410
    #11 0x559fbbc0109b in test_and_print tests/builtin-test.c:440
    #12 0x559fbbc03e69 in __cmd_test tests/builtin-test.c:695
    #13 0x559fbbc03e69 in cmd_test tests/builtin-test.c:807
    #14 0x559fbbc691f4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #15 0x559fbbb071a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #16 0x559fbbb071a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #17 0x559fbbb071a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #18 0x7fe366b68cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: f0fbb11 ("perf stat: Implement duration_time as a proper event")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 29, 2020
The test_generic_metric() missed to release entries in the pctx.  Asan
reported following leak (and more):

  Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c9396980e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x55f7e748cc14 in hashmap_grow (/home/namhyung/project/linux/tools/perf/perf+0x90cc14)
    #2 0x55f7e748d497 in hashmap__insert (/home/namhyung/project/linux/tools/perf/perf+0x90d497)
    #3 0x55f7e7341667 in hashmap__set /home/namhyung/project/linux/tools/perf/util/hashmap.h:111
    #4 0x55f7e7341667 in expr__add_ref util/expr.c:120
    #5 0x55f7e7292436 in prepare_metric util/stat-shadow.c:783
    #6 0x55f7e729556d in test_generic_metric util/stat-shadow.c:858
    #7 0x55f7e712390b in compute_single tests/parse-metric.c:128
    #8 0x55f7e712390b in __compute_metric tests/parse-metric.c:180
    #9 0x55f7e712446d in compute_metric tests/parse-metric.c:196
    #10 0x55f7e712446d in test_dcache_l2 tests/parse-metric.c:295
    #11 0x55f7e712446d in test__parse_metric tests/parse-metric.c:355
    #12 0x55f7e70be09b in run_test tests/builtin-test.c:410
    #13 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    #14 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    #15 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    #16 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #17 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #18 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #19 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #20 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 6d432c4 ("perf tools: Add test_generic_metric function")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 29, 2020
The metricgroup__add_metric() can find multiple match for a metric group
and it's possible to fail.  Also it can fail in the middle like in
resolve_metric() even for single metric.

In those cases, the intermediate list and ids will be leaked like:

  Direct leak of 3 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c938f40b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
    #1 0x55f7e71c1bef in __add_metric util/metricgroup.c:683
    #2 0x55f7e71c31d0 in add_metric util/metricgroup.c:906
    #3 0x55f7e71c3844 in metricgroup__add_metric util/metricgroup.c:940
    #4 0x55f7e71c488d in metricgroup__add_metric_list util/metricgroup.c:993
    #5 0x55f7e71c488d in parse_groups util/metricgroup.c:1045
    #6 0x55f7e71c60a4 in metricgroup__parse_groups_test util/metricgroup.c:1087
    #7 0x55f7e71235ae in __compute_metric tests/parse-metric.c:164
    #8 0x55f7e7124650 in compute_metric tests/parse-metric.c:196
    #9 0x55f7e7124650 in test_recursion_fail tests/parse-metric.c:318
    #10 0x55f7e7124650 in test__parse_metric tests/parse-metric.c:356
    #11 0x55f7e70be09b in run_test tests/builtin-test.c:410
    #12 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    #13 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    #14 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    #15 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #16 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #17 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #18 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #19 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 83de0b7 ("perf metric: Collect referenced metrics in struct metric_ref_node")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 29, 2020
The following leaks were detected by ASAN:

  Indirect leak of 360 byte(s) in 9 object(s) allocated from:
    #0 0x7fecc305180e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x560578f6dce5 in perf_pmu__new_format util/pmu.c:1333
    #2 0x560578f752fc in perf_pmu_parse util/pmu.y:59
    #3 0x560578f6a8b7 in perf_pmu__format_parse util/pmu.c:73
    #4 0x560578e07045 in test__pmu tests/pmu.c:155
    #5 0x560578de109b in run_test tests/builtin-test.c:410
    #6 0x560578de109b in test_and_print tests/builtin-test.c:440
    #7 0x560578de401a in __cmd_test tests/builtin-test.c:661
    #8 0x560578de401a in cmd_test tests/builtin-test.c:807
    #9 0x560578e49354 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #10 0x560578ce71a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #11 0x560578ce71a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #12 0x560578ce71a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #13 0x7fecc2b7acc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: cff7f95 ("perf tests: Move pmu tests into separate object")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request Apr 8, 2024
Petr Machata says:

====================
mlxsw: Maintain candidate RIFs

The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.

As an example, take a front panel port, configure an IP address: it gets a
RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the
port from the bridge again, but the RIF never comes back. There is a number
of similar situations, where changing the configuration there and back
utterly breaks the offload.

The situation is going to be made better by implementing a range of replays
and post-hoc offloads.

This patch set lays the ground for replay of next hops. The particular
issue that it deals with is that currently, driver-specific bookkeeping for
next hops is hooked off RIF objects, which come and go across the lifetime
of a netdevice. We would rather keep these objects at an entity that
mirrors the lifetime of the netdevice itself. That way they are at hand and
can be offloaded when a RIF is eventually created.

To that end, with this patchset, mlxsw keeps a hash table of CRIFs:
candidate RIFs, persistent handles for netdevices that mlxsw deems
potentially interesting. The lifetime of a CRIF matches that of the
underlying netdevice, and thus a RIF can always assume a CRIF exists. A
CRIF is where next hops are kept, and when RIF is created, these next hops
can be easily offloaded. (Previously only the next hops created after the
RIF was created were offloaded.)

- Patches #1 and #2 are minor adjustments.
- In patches #3 and #4, add CRIF bookkeeping.
- In patch #5, link CRIFs to RIFs such that given a netdevice-backed RIF,
  the corresponding CRIF is easy to look up.
- Patch #6 is a clean-up allowed by the previous patches
- Patches #7 and #8 move next hop tracking to CRIFs

No observable effects are intended as of yet. This will be useful once
there is support for RIF creation for netdevices that become mlxsw uppers,
which will come in following patch sets.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
smfrench pushed a commit that referenced this pull request Apr 14, 2024
At current x1e80100 interface table, interface #3 is wrongly
connected to DP controller #0 and interface #4 wrongly connected
to DP controller #2. Fix this problem by connect Interface #3 to
DP controller #0 and interface #4 connect to DP controller #1.
Also add interface #6, #7 and #8 connections to DP controller to
complete x1e80100 interface table.

Changs in V3:
-- add v2 changes log

Changs in V2:
-- add x1e80100 to subject
-- add Fixes

Fixes: e3b1f36 ("drm/msm/dpu: Add X1E80100 support")
Signed-off-by: Kuogee Hsieh <[email protected]>
Reviewed-by: Abhinav Kumar <[email protected]>
Reviewed-by: Abel Vesa <[email protected]>
Patchwork: https://patchwork.freedesktop.org/patch/585549/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Abhinav Kumar <[email protected]>
smfrench pushed a commit that referenced this pull request Apr 23, 2024
vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.

net_ratelimit mechanism can be used to limit the dumping rate.

PID: 33036    TASK: ffff949da6f20000  CPU: 23   COMMAND: "vhost-32980"
 #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
 #3 [fffffe00003fced0] do_nmi at ffffffff8922660d
 #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
    [exception RIP: io_serial_in+20]
    RIP: ffffffff89792594  RSP: ffffa655314979e8  RFLAGS: 00000002
    RAX: ffffffff89792500  RBX: ffffffff8af428a0  RCX: 0000000000000000
    RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff8af428a0
    RBP: 0000000000002710   R8: 0000000000000004   R9: 000000000000000f
    R10: 0000000000000000  R11: ffffffff8acbf64f  R12: 0000000000000020
    R13: ffffffff8acbf698  R14: 0000000000000058  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594
 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124
 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
 #12 [ffffa65531497b68] printk at ffffffff89318306
 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
 #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
 #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
 #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
 #18 [ffffa65531497f10] kthread at ffffffff892d2e72
 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f

Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
smfrench pushed a commit that referenced this pull request Apr 30, 2024
[ Upstream commit f8bbc07 ]

vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.

net_ratelimit mechanism can be used to limit the dumping rate.

PID: 33036    TASK: ffff949da6f20000  CPU: 23   COMMAND: "vhost-32980"
 #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
 #3 [fffffe00003fced0] do_nmi at ffffffff8922660d
 #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
    [exception RIP: io_serial_in+20]
    RIP: ffffffff89792594  RSP: ffffa655314979e8  RFLAGS: 00000002
    RAX: ffffffff89792500  RBX: ffffffff8af428a0  RCX: 0000000000000000
    RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff8af428a0
    RBP: 0000000000002710   R8: 0000000000000004   R9: 000000000000000f
    R10: 0000000000000000  R11: ffffffff8acbf64f  R12: 0000000000000020
    R13: ffffffff8acbf698  R14: 0000000000000058  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594
 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124
 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
 #12 [ffffa65531497b68] printk at ffffffff89318306
 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
 #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
 #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
 #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
 #18 [ffffa65531497f10] kthread at ffffffff892d2e72
 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f

Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
smfrench pushed a commit that referenced this pull request May 26, 2024
ui_browser__show() is capturing the input title that is stack allocated
memory in hist_browser__run().

Avoid a use after return by strdup-ing the string.

Committer notes:

Further explanation from Ian Rogers:

My command line using tui is:
$ sudo bash -c 'rm /tmp/asan.log*; export
ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a
sleep 1; /tmp/perf/perf mem report'
I then go to the perf annotate view and quit. This triggers the asan
error (from the log file):
```
==1254591==ERROR: AddressSanitizer: stack-use-after-return on address
0x7f2813331920 at pc 0x7f28180
65991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10
READ of size 80 at 0x7f2813331920 thread T0
    #0 0x7f2818065990 in __interceptor_strlen
../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461
    #1 0x7f2817698251 in SLsmg_write_wrapped_string
(/lib/x86_64-linux-gnu/libslang.so.2+0x98251)
    #2 0x7f28176984b9 in SLsmg_write_nstring
(/lib/x86_64-linux-gnu/libslang.so.2+0x984b9)
    #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60
    #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266
    #5 0x55c94045c776 in ui_browser__show ui/browser.c:288
    #6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206
    #7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458
    #8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412
    #9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527
    #10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613
    #11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661
    #12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671
    #13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141
    #14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805
    #15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374
    #16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516
    #17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350
    #18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403
    #19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447
    #20 0x55c9400e53ad in main tools/perf/perf.c:561
    #21 0x7f28170456c9 in __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
    #22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360
    #23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId:
84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93)

Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame
    #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746

  This frame has 1 object(s):
    [32, 192) 'title' (line 747) <== Memory access at offset 32 is
inside this variable
HINT: this may be a false positive if your program uses some custom
stack unwind mechanism, swapcontext or vfork
```
hist_browser__run isn't on the stack so the asan error looks legit.
There's no clean init/exit on struct ui_browser so I may be trading a
use-after-return for a memory leak, but that seems look a good trade
anyway.

Fixes: 05e8b08 ("perf ui browser: Stop using 'self'")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ben Gainey <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Li Dong <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paran Lee <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request May 26, 2024
A system crash like this

  Failing address: 200000cb7df6f000 TEID: 200000cb7df6f403
  Fault in home space mode while using kernel ASCE.
  AS:00000002d71bc007 R3:00000003fe5b8007 S:000000011a446000 P:000000015660c13d
  Oops: 0038 ilc:3 [#1] PREEMPT SMP
  Modules linked in: mlx5_ib ...
  CPU: 8 PID: 7556 Comm: bash Not tainted 6.9.0-rc7 #8
  Hardware name: IBM 3931 A01 704 (LPAR)
  Krnl PSW : 0704e00180000000 0000014b75e7b606 (ap_parse_bitmap_str+0x10e/0x1f8)
  R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
  Krnl GPRS: 0000000000000001 ffffffffffffffc0 0000000000000001 00000048f96b75d3
  000000cb00000100 ffffffffffffffff ffffffffffffffff 000000cb7df6fce0
  000000cb7df6fce0 00000000ffffffff 000000000000002b 00000048ffffffff
  000003ff9b2dbc80 200000cb7df6fcd8 0000014bffffffc0 000000cb7df6fbc8
  Krnl Code: 0000014b75e7b5fc: a7840047            brc     8,0000014b75e7b68a
  0000014b75e7b600: 18b2                lr      %r11,%r2
  #0000014b75e7b602: a7f4000a            brc     15,0000014b75e7b616
  >0000014b75e7b606: eb22d00000e6        laog    %r2,%r2,0(%r13)
  0000014b75e7b60c: a7680001            lhi     %r6,1
  0000014b75e7b610: 187b                lr      %r7,%r11
  0000014b75e7b612: 84960021            brxh    %r9,%r6,0000014b75e7b654
  0000014b75e7b616: 18e9                lr      %r14,%r9
  Call Trace:
  [<0000014b75e7b606>] ap_parse_bitmap_str+0x10e/0x1f8
  ([<0000014b75e7b5dc>] ap_parse_bitmap_str+0xe4/0x1f8)
  [<0000014b75e7b758>] apmask_store+0x68/0x140
  [<0000014b75679196>] kernfs_fop_write_iter+0x14e/0x1e8
  [<0000014b75598524>] vfs_write+0x1b4/0x448
  [<0000014b7559894c>] ksys_write+0x74/0x100
  [<0000014b7618a440>] __do_syscall+0x268/0x328
  [<0000014b761a3558>] system_call+0x70/0x98
  INFO: lockdep is turned off.
  Last Breaking-Event-Address:
  [<0000014b75e7b636>] ap_parse_bitmap_str+0x13e/0x1f8
  Kernel panic - not syncing: Fatal exception: panic_on_oops

occured when /sys/bus/ap/a[pq]mask was updated with a relative mask value
(like +0x10-0x12,+60,-90) with one of the numeric values exceeding INT_MAX.

The fix is simple: use unsigned long values for the internal variables. The
correct checks are already in place in the function but a simple int for
the internal variables was used with the possibility to overflow.

Reported-by: Marc Hartmayer <[email protected]>
Signed-off-by: Harald Freudenberger <[email protected]>
Tested-by: Marc Hartmayer <[email protected]>
Reviewed-by: Holger Dengler <[email protected]>
Cc: <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
smfrench pushed a commit that referenced this pull request May 26, 2024
When CONFIG_DEBUG_BUGVERBOSE=n, we fail to add necessary padding bytes
to bug_table entries, and as a result the last entry in a bug table will
be ignored, potentially leading to an unexpected panic(). All prior
entries in the table will be handled correctly.

The arm64 ABI requires that struct fields of up to 8 bytes are
naturally-aligned, with padding added within a struct such that struct
are suitably aligned within arrays.

When CONFIG_DEBUG_BUGVERPOSE=y, the layout of a bug_entry is:

	struct bug_entry {
		signed int      bug_addr_disp;	// 4 bytes
		signed int      file_disp;	// 4 bytes
		unsigned short  line;		// 2 bytes
		unsigned short  flags;		// 2 bytes
	}

... with 12 bytes total, requiring 4-byte alignment.

When CONFIG_DEBUG_BUGVERBOSE=n, the layout of a bug_entry is:

	struct bug_entry {
		signed int      bug_addr_disp;	// 4 bytes
		unsigned short  flags;		// 2 bytes
		< implicit padding >		// 2 bytes
	}

... with 8 bytes total, with 6 bytes of data and 2 bytes of trailing
padding, requiring 4-byte alginment.

When we create a bug_entry in assembly, we align the start of the entry
to 4 bytes, which implicitly handles padding for any prior entries.
However, we do not align the end of the entry, and so when
CONFIG_DEBUG_BUGVERBOSE=n, the final entry lacks the trailing padding
bytes.

For the main kernel image this is not a problem as find_bug() doesn't
depend on the trailing padding bytes when searching for entries:

	for (bug = __start___bug_table; bug < __stop___bug_table; ++bug)
		if (bugaddr == bug_addr(bug))
			return bug;

However for modules, module_bug_finalize() depends on the trailing
bytes when calculating the number of entries:

	mod->num_bugs = sechdrs[i].sh_size / sizeof(struct bug_entry);

... and as the last bug_entry lacks the necessary padding bytes, this entry
will not be counted, e.g. in the case of a single entry:

	sechdrs[i].sh_size == 6
	sizeof(struct bug_entry) == 8;

	sechdrs[i].sh_size / sizeof(struct bug_entry) == 0;

Consequently module_find_bug() will miss the last bug_entry when it does:

	for (i = 0; i < mod->num_bugs; ++i, ++bug)
		if (bugaddr == bug_addr(bug))
			goto out;

... which can lead to a kenrel panic due to an unhandled bug.

This can be demonstrated with the following module:

	static int __init buginit(void)
	{
		WARN(1, "hello\n");
		return 0;
	}

	static void __exit bugexit(void)
	{
	}

	module_init(buginit);
	module_exit(bugexit);
	MODULE_LICENSE("GPL");

... which will trigger a kernel panic when loaded:

	------------[ cut here ]------------
	hello
	Unexpected kernel BRK exception at EL1
	Internal error: BRK handler: 00000000f2000800 [#1] PREEMPT SMP
	Modules linked in: hello(O+)
	CPU: 0 PID: 50 Comm: insmod Tainted: G           O       6.9.1 #8
	Hardware name: linux,dummy-virt (DT)
	pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
	pc : buginit+0x18/0x1000 [hello]
	lr : buginit+0x18/0x1000 [hello]
	sp : ffff800080533ae0
	x29: ffff800080533ae0 x28: 0000000000000000 x27: 0000000000000000
	x26: ffffaba8c4e70510 x25: ffff800080533c30 x24: ffffaba8c4a28a58
	x23: 0000000000000000 x22: 0000000000000000 x21: ffff3947c0eab3c0
	x20: ffffaba8c4e3f000 x19: ffffaba846464000 x18: 0000000000000006
	x17: 0000000000000000 x16: ffffaba8c2492834 x15: 0720072007200720
	x14: 0720072007200720 x13: ffffaba8c49b27c8 x12: 0000000000000312
	x11: 0000000000000106 x10: ffffaba8c4a0a7c8 x9 : ffffaba8c49b27c8
	x8 : 00000000ffffefff x7 : ffffaba8c4a0a7c8 x6 : 80000000fffff000
	x5 : 0000000000000107 x4 : 0000000000000000 x3 : 0000000000000000
	x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff3947c0eab3c0
	Call trace:
	 buginit+0x18/0x1000 [hello]
	 do_one_initcall+0x80/0x1c8
	 do_init_module+0x60/0x218
	 load_module+0x1ba4/0x1d70
	 __do_sys_init_module+0x198/0x1d0
	 __arm64_sys_init_module+0x1c/0x28
	 invoke_syscall+0x48/0x114
	 el0_svc_common.constprop.0+0x40/0xe0
	 do_el0_svc+0x1c/0x28
	 el0_svc+0x34/0xd8
	 el0t_64_sync_handler+0x120/0x12c
	 el0t_64_sync+0x190/0x194
	Code: d0ffffe0 910003fd 91000000 9400000b (d4210000)
	---[ end trace 0000000000000000 ]---
	Kernel panic - not syncing: BRK handler: Fatal exception

Fix this by always aligning the end of a bug_entry to 4 bytes, which is
correct regardless of CONFIG_DEBUG_BUGVERBOSE.

Fixes: 9fb7410 ("arm64/BUG: Use BRK instruction for generic BUG traps")

Signed-off-by: Yuanbin Xie <[email protected]>
Signed-off-by: Jiangfeng Xiao <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
smfrench pushed a commit that referenced this pull request Jun 9, 2024
We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

CC: [email protected] # 6.1+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: David Sterba <[email protected]>
smfrench pushed a commit that referenced this pull request Jun 9, 2024
…PLES event"

This reverts commit 7d1405c.

This causes segfaults in some cases, as reported by Milian:

  ```
  sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e
  raw_syscalls:sys_enter ls
  ...
  [ perf record: Woken up 3 times to write data ]
  malloc(): invalid next size (unsorted)
  Aborted
  ```

  Backtrace with GDB + debuginfod:

  ```
  malloc(): invalid next size (unsorted)

  Thread 1 "perf" received signal SIGABRT, Aborted.
  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
  no_tid=no_tid@entry=0) at pthread_kill.c:44
  Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c
  44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO
  (ret) : 0;
  (gdb) bt
  #0  __pthread_kill_implementation (threadid=<optimized out>,
  signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
  #1  0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>,
  signo=6) at pthread_kill.c:78
  #2  0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/
  raise.c:26
  #3  0x00007ffff6e384c3 in __GI_abort () at abort.c:79
  #4  0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea
  "%s\n") at ../sysdeps/posix/libc_fatal.c:132
  #5  0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850
  "malloc(): invalid next size (unsorted)") at malloc.c:5772
  #6  0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0
  <main_arena>, bytes=bytes@entry=368) at malloc.c:4081
  #7  0x00007ffff6eb877e in __libc_calloc (n=<optimized out>,
  elem_size=<optimized out>) at malloc.c:3754
  #8  0x000055555569bdb6 in perf_session.do_write_header ()
  #9  0x00005555555a373a in __cmd_record.constprop.0 ()
  #10 0x00005555555a6846 in cmd_record ()
  #11 0x000055555564db7f in run_builtin ()
  #12 0x000055555558ed77 in main ()
  ```

  Valgrind memcheck:
  ```
  ==45136== Invalid write of size 8
  ==45136==    at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf)
  ==45136==    by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
  ==45136== Syscall param write(buf) points to unaddressable byte(s)
  ==45136==    at 0x575953D: __libc_write (write.c:26)
  ==45136==    by 0x575953D: write (write.c:24)
  ==45136==    by 0x35761F: ion (in /usr/bin/perf)
  ==45136==    by 0x357778: writen (in /usr/bin/perf)
  ==45136==    by 0x1548F7: record__write (in /usr/bin/perf)
  ==45136==    by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
 -----

Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/
Reported-by: Milian Wolff <[email protected]>
Tested-by: Milian Wolff <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected] # 6.8+
Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
smfrench pushed a commit that referenced this pull request Jun 27, 2024
[ Upstream commit 88ce010 ]

The session has a header in it which contains a perf env with
bpf_progs. The bpf_progs are accessed by the sideband thread and so
the sideband thread must be stopped before the session is deleted, to
avoid a use after free.  This error was detected by AddressSanitizer
in the following:

  ==2054673==ERROR: AddressSanitizer: heap-use-after-free on address 0x61d000161e00 at pc 0x55769289de54 bp 0x7f9df36d4ab0 sp 0x7f9df36d4aa8
  READ of size 8 at 0x61d000161e00 thread T1
      #0 0x55769289de53 in __perf_env__insert_bpf_prog_info util/env.c:42
      #1 0x55769289dbb1 in perf_env__insert_bpf_prog_info util/env.c:29
      #2 0x557692bbae29 in perf_env__add_bpf_info util/bpf-event.c:483
      #3 0x557692bbb01a in bpf_event__sb_cb util/bpf-event.c:512
      #4 0x5576928b75f4 in perf_evlist__poll_thread util/sideband_evlist.c:68
      #5 0x7f9df96a63eb in start_thread nptl/pthread_create.c:444
      #6 0x7f9df9726a4b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

  0x61d000161e00 is located 384 bytes inside of 2136-byte region [0x61d000161c80,0x61d0001624d8)
  freed by thread T0 here:
      #0 0x7f9dfa6d7288 in __interceptor_free libsanitizer/asan/asan_malloc_linux.cpp:52
      #1 0x557692978d50 in perf_session__delete util/session.c:319
      #2 0x557692673959 in __cmd_record tools/perf/builtin-record.c:2884
      #3 0x55769267a9f0 in cmd_record tools/perf/builtin-record.c:4259
      #4 0x55769286710c in run_builtin tools/perf/perf.c:349
      #5 0x557692867678 in handle_internal_command tools/perf/perf.c:402
      #6 0x557692867a40 in run_argv tools/perf/perf.c:446
      #7 0x557692867fae in main tools/perf/perf.c:562
      #8 0x7f9df96456c9 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Fixes: 657ee55 ("perf evlist: Introduce side band thread")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Disha Goel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
smfrench pushed a commit that referenced this pull request Jun 27, 2024
[ Upstream commit 769e6a1 ]

ui_browser__show() is capturing the input title that is stack allocated
memory in hist_browser__run().

Avoid a use after return by strdup-ing the string.

Committer notes:

Further explanation from Ian Rogers:

My command line using tui is:
$ sudo bash -c 'rm /tmp/asan.log*; export
ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a
sleep 1; /tmp/perf/perf mem report'
I then go to the perf annotate view and quit. This triggers the asan
error (from the log file):
```
==1254591==ERROR: AddressSanitizer: stack-use-after-return on address
0x7f2813331920 at pc 0x7f28180
65991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10
READ of size 80 at 0x7f2813331920 thread T0
    #0 0x7f2818065990 in __interceptor_strlen
../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461
    #1 0x7f2817698251 in SLsmg_write_wrapped_string
(/lib/x86_64-linux-gnu/libslang.so.2+0x98251)
    #2 0x7f28176984b9 in SLsmg_write_nstring
(/lib/x86_64-linux-gnu/libslang.so.2+0x984b9)
    #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60
    #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266
    #5 0x55c94045c776 in ui_browser__show ui/browser.c:288
    #6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206
    #7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458
    #8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412
    #9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527
    #10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613
    #11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661
    #12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671
    #13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141
    #14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805
    #15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374
    #16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516
    #17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350
    #18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403
    #19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447
    #20 0x55c9400e53ad in main tools/perf/perf.c:561
    #21 0x7f28170456c9 in __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
    #22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360
    #23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId:
84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93)

Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame
    #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746

  This frame has 1 object(s):
    [32, 192) 'title' (line 747) <== Memory access at offset 32 is
inside this variable
HINT: this may be a false positive if your program uses some custom
stack unwind mechanism, swapcontext or vfork
```
hist_browser__run isn't on the stack so the asan error looks legit.
There's no clean init/exit on struct ui_browser so I may be trading a
use-after-return for a memory leak, but that seems look a good trade
anyway.

Fixes: 05e8b08 ("perf ui browser: Stop using 'self'")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ben Gainey <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Li Dong <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paran Lee <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
smfrench pushed a commit that referenced this pull request Jun 27, 2024
[ Upstream commit ffbf4fb ]

When CONFIG_DEBUG_BUGVERBOSE=n, we fail to add necessary padding bytes
to bug_table entries, and as a result the last entry in a bug table will
be ignored, potentially leading to an unexpected panic(). All prior
entries in the table will be handled correctly.

The arm64 ABI requires that struct fields of up to 8 bytes are
naturally-aligned, with padding added within a struct such that struct
are suitably aligned within arrays.

When CONFIG_DEBUG_BUGVERPOSE=y, the layout of a bug_entry is:

	struct bug_entry {
		signed int      bug_addr_disp;	// 4 bytes
		signed int      file_disp;	// 4 bytes
		unsigned short  line;		// 2 bytes
		unsigned short  flags;		// 2 bytes
	}

... with 12 bytes total, requiring 4-byte alignment.

When CONFIG_DEBUG_BUGVERBOSE=n, the layout of a bug_entry is:

	struct bug_entry {
		signed int      bug_addr_disp;	// 4 bytes
		unsigned short  flags;		// 2 bytes
		< implicit padding >		// 2 bytes
	}

... with 8 bytes total, with 6 bytes of data and 2 bytes of trailing
padding, requiring 4-byte alginment.

When we create a bug_entry in assembly, we align the start of the entry
to 4 bytes, which implicitly handles padding for any prior entries.
However, we do not align the end of the entry, and so when
CONFIG_DEBUG_BUGVERBOSE=n, the final entry lacks the trailing padding
bytes.

For the main kernel image this is not a problem as find_bug() doesn't
depend on the trailing padding bytes when searching for entries:

	for (bug = __start___bug_table; bug < __stop___bug_table; ++bug)
		if (bugaddr == bug_addr(bug))
			return bug;

However for modules, module_bug_finalize() depends on the trailing
bytes when calculating the number of entries:

	mod->num_bugs = sechdrs[i].sh_size / sizeof(struct bug_entry);

... and as the last bug_entry lacks the necessary padding bytes, this entry
will not be counted, e.g. in the case of a single entry:

	sechdrs[i].sh_size == 6
	sizeof(struct bug_entry) == 8;

	sechdrs[i].sh_size / sizeof(struct bug_entry) == 0;

Consequently module_find_bug() will miss the last bug_entry when it does:

	for (i = 0; i < mod->num_bugs; ++i, ++bug)
		if (bugaddr == bug_addr(bug))
			goto out;

... which can lead to a kenrel panic due to an unhandled bug.

This can be demonstrated with the following module:

	static int __init buginit(void)
	{
		WARN(1, "hello\n");
		return 0;
	}

	static void __exit bugexit(void)
	{
	}

	module_init(buginit);
	module_exit(bugexit);
	MODULE_LICENSE("GPL");

... which will trigger a kernel panic when loaded:

	------------[ cut here ]------------
	hello
	Unexpected kernel BRK exception at EL1
	Internal error: BRK handler: 00000000f2000800 [#1] PREEMPT SMP
	Modules linked in: hello(O+)
	CPU: 0 PID: 50 Comm: insmod Tainted: G           O       6.9.1 #8
	Hardware name: linux,dummy-virt (DT)
	pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
	pc : buginit+0x18/0x1000 [hello]
	lr : buginit+0x18/0x1000 [hello]
	sp : ffff800080533ae0
	x29: ffff800080533ae0 x28: 0000000000000000 x27: 0000000000000000
	x26: ffffaba8c4e70510 x25: ffff800080533c30 x24: ffffaba8c4a28a58
	x23: 0000000000000000 x22: 0000000000000000 x21: ffff3947c0eab3c0
	x20: ffffaba8c4e3f000 x19: ffffaba846464000 x18: 0000000000000006
	x17: 0000000000000000 x16: ffffaba8c2492834 x15: 0720072007200720
	x14: 0720072007200720 x13: ffffaba8c49b27c8 x12: 0000000000000312
	x11: 0000000000000106 x10: ffffaba8c4a0a7c8 x9 : ffffaba8c49b27c8
	x8 : 00000000ffffefff x7 : ffffaba8c4a0a7c8 x6 : 80000000fffff000
	x5 : 0000000000000107 x4 : 0000000000000000 x3 : 0000000000000000
	x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff3947c0eab3c0
	Call trace:
	 buginit+0x18/0x1000 [hello]
	 do_one_initcall+0x80/0x1c8
	 do_init_module+0x60/0x218
	 load_module+0x1ba4/0x1d70
	 __do_sys_init_module+0x198/0x1d0
	 __arm64_sys_init_module+0x1c/0x28
	 invoke_syscall+0x48/0x114
	 el0_svc_common.constprop.0+0x40/0xe0
	 do_el0_svc+0x1c/0x28
	 el0_svc+0x34/0xd8
	 el0t_64_sync_handler+0x120/0x12c
	 el0t_64_sync+0x190/0x194
	Code: d0ffffe0 910003fd 91000000 9400000b (d4210000)
	---[ end trace 0000000000000000 ]---
	Kernel panic - not syncing: BRK handler: Fatal exception

Fix this by always aligning the end of a bug_entry to 4 bytes, which is
correct regardless of CONFIG_DEBUG_BUGVERBOSE.

Fixes: 9fb7410 ("arm64/BUG: Use BRK instruction for generic BUG traps")

Signed-off-by: Yuanbin Xie <[email protected]>
Signed-off-by: Jiangfeng Xiao <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
smfrench pushed a commit that referenced this pull request Jun 27, 2024
commit d4f9d5a upstream.

A system crash like this

  Failing address: 200000cb7df6f000 TEID: 200000cb7df6f403
  Fault in home space mode while using kernel ASCE.
  AS:00000002d71bc007 R3:00000003fe5b8007 S:000000011a446000 P:000000015660c13d
  Oops: 0038 ilc:3 [#1] PREEMPT SMP
  Modules linked in: mlx5_ib ...
  CPU: 8 PID: 7556 Comm: bash Not tainted 6.9.0-rc7 #8
  Hardware name: IBM 3931 A01 704 (LPAR)
  Krnl PSW : 0704e00180000000 0000014b75e7b606 (ap_parse_bitmap_str+0x10e/0x1f8)
  R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
  Krnl GPRS: 0000000000000001 ffffffffffffffc0 0000000000000001 00000048f96b75d3
  000000cb00000100 ffffffffffffffff ffffffffffffffff 000000cb7df6fce0
  000000cb7df6fce0 00000000ffffffff 000000000000002b 00000048ffffffff
  000003ff9b2dbc80 200000cb7df6fcd8 0000014bffffffc0 000000cb7df6fbc8
  Krnl Code: 0000014b75e7b5fc: a7840047            brc     8,0000014b75e7b68a
  0000014b75e7b600: 18b2                lr      %r11,%r2
  #0000014b75e7b602: a7f4000a            brc     15,0000014b75e7b616
  >0000014b75e7b606: eb22d00000e6        laog    %r2,%r2,0(%r13)
  0000014b75e7b60c: a7680001            lhi     %r6,1
  0000014b75e7b610: 187b                lr      %r7,%r11
  0000014b75e7b612: 84960021            brxh    %r9,%r6,0000014b75e7b654
  0000014b75e7b616: 18e9                lr      %r14,%r9
  Call Trace:
  [<0000014b75e7b606>] ap_parse_bitmap_str+0x10e/0x1f8
  ([<0000014b75e7b5dc>] ap_parse_bitmap_str+0xe4/0x1f8)
  [<0000014b75e7b758>] apmask_store+0x68/0x140
  [<0000014b75679196>] kernfs_fop_write_iter+0x14e/0x1e8
  [<0000014b75598524>] vfs_write+0x1b4/0x448
  [<0000014b7559894c>] ksys_write+0x74/0x100
  [<0000014b7618a440>] __do_syscall+0x268/0x328
  [<0000014b761a3558>] system_call+0x70/0x98
  INFO: lockdep is turned off.
  Last Breaking-Event-Address:
  [<0000014b75e7b636>] ap_parse_bitmap_str+0x13e/0x1f8
  Kernel panic - not syncing: Fatal exception: panic_on_oops

occured when /sys/bus/ap/a[pq]mask was updated with a relative mask value
(like +0x10-0x12,+60,-90) with one of the numeric values exceeding INT_MAX.

The fix is simple: use unsigned long values for the internal variables. The
correct checks are already in place in the function but a simple int for
the internal variables was used with the possibility to overflow.

Reported-by: Marc Hartmayer <[email protected]>
Signed-off-by: Harald Freudenberger <[email protected]>
Tested-by: Marc Hartmayer <[email protected]>
Reviewed-by: Holger Dengler <[email protected]>
Cc: <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
smfrench pushed a commit that referenced this pull request Jun 27, 2024
commit 9d274c1 upstream.

We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

CC: [email protected] # 6.1+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 1, 2024
The code in ocfs2_dio_end_io_write() estimates number of necessary
transaction credits using ocfs2_calc_extend_credits().  This however does
not take into account that the IO could be arbitrarily large and can
contain arbitrary number of extents.

Extent tree manipulations do often extend the current transaction but not
in all of the cases.  For example if we have only single block extents in
the tree, ocfs2_mark_extent_written() will end up calling
ocfs2_replace_extent_rec() all the time and we will never extend the
current transaction and eventually exhaust all the transaction credits if
the IO contains many single block extents.  Once that happens a
WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in
jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to
this error.  This was actually triggered by one of our customers on a
heavily fragmented OCFS2 filesystem.

To fix the issue make sure the transaction always has enough credits for
one extent insert before each call of ocfs2_mark_extent_written().

Heming Zhao said:

------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"

PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
  #0 machine_kexec at ffffffff8c069932
  #1 __crash_kexec at ffffffff8c1338fa
  #2 panic at ffffffff8c1d69b9
  #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
  #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
  #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
  #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
  #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
  #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
  #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
#11 dio_complete at ffffffff8c2b9fa7
#12 do_blockdev_direct_IO at ffffffff8c2bc09f
#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
#14 generic_file_direct_write at ffffffff8c1dcf14
#15 __generic_file_write_iter at ffffffff8c1dd07b
#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
#17 aio_write at ffffffff8c2cc72e
#18 kmem_cache_alloc at ffffffff8c248dde
#19 do_io_submit at ffffffff8c2ccada
#20 do_syscall_64 at ffffffff8c004984
#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Reviewed-by: Heming Zhao <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 1, 2024
If we're not in a NAPI softirq context, we need to be careful
about how we call napi_consume_skb(), specifically we need to
call it with budget==0 to signal to it that we're not in a
safe context.

This was found while running some configuration stress testing
of traffic and a change queue config loop running, and this
curious note popped out:

[ 4371.402645] BUG: using smp_processor_id() in preemptible [00000000] code: ethtool/20545
[ 4371.402897] caller is napi_skb_cache_put+0x16/0x80
[ 4371.403120] CPU: 25 PID: 20545 Comm: ethtool Kdump: loaded Tainted: G           OE      6.10.0-rc3-netnext+ #8
[ 4371.403302] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 01/23/2021
[ 4371.403460] Call Trace:
[ 4371.403613]  <TASK>
[ 4371.403758]  dump_stack_lvl+0x4f/0x70
[ 4371.403904]  check_preemption_disabled+0xc1/0xe0
[ 4371.404051]  napi_skb_cache_put+0x16/0x80
[ 4371.404199]  ionic_tx_clean+0x18a/0x240 [ionic]
[ 4371.404354]  ionic_tx_cq_service+0xc4/0x200 [ionic]
[ 4371.404505]  ionic_tx_flush+0x15/0x70 [ionic]
[ 4371.404653]  ? ionic_lif_qcq_deinit.isra.23+0x5b/0x70 [ionic]
[ 4371.404805]  ionic_txrx_deinit+0x71/0x190 [ionic]
[ 4371.404956]  ionic_reconfigure_queues+0x5f5/0xff0 [ionic]
[ 4371.405111]  ionic_set_ringparam+0x2e8/0x3e0 [ionic]
[ 4371.405265]  ethnl_set_rings+0x1f1/0x300
[ 4371.405418]  ethnl_default_set_doit+0xbb/0x160
[ 4371.405571]  genl_family_rcv_msg_doit+0xff/0x130
	[...]

I found that ionic_tx_clean() calls napi_consume_skb() which calls
napi_skb_cache_put(), but before that last call is the note
    /* Zero budget indicate non-NAPI context called us, like netpoll */
and
    DEBUG_NET_WARN_ON_ONCE(!in_softirq());

Those are pretty big hints that we're doing it wrong.  We can pass a
context hint down through the calls to let ionic_tx_clean() know what
we're doing so it can call napi_consume_skb() correctly.

Fixes: 386e698 ("ionic: Make use napi_consume_skb")
Signed-off-by: Shannon Nelson <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 12, 2024
[ Upstream commit 88ce010 ]

The session has a header in it which contains a perf env with
bpf_progs. The bpf_progs are accessed by the sideband thread and so
the sideband thread must be stopped before the session is deleted, to
avoid a use after free.  This error was detected by AddressSanitizer
in the following:

  ==2054673==ERROR: AddressSanitizer: heap-use-after-free on address 0x61d000161e00 at pc 0x55769289de54 bp 0x7f9df36d4ab0 sp 0x7f9df36d4aa8
  READ of size 8 at 0x61d000161e00 thread T1
      #0 0x55769289de53 in __perf_env__insert_bpf_prog_info util/env.c:42
      #1 0x55769289dbb1 in perf_env__insert_bpf_prog_info util/env.c:29
      #2 0x557692bbae29 in perf_env__add_bpf_info util/bpf-event.c:483
      #3 0x557692bbb01a in bpf_event__sb_cb util/bpf-event.c:512
      #4 0x5576928b75f4 in perf_evlist__poll_thread util/sideband_evlist.c:68
      #5 0x7f9df96a63eb in start_thread nptl/pthread_create.c:444
      #6 0x7f9df9726a4b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

  0x61d000161e00 is located 384 bytes inside of 2136-byte region [0x61d000161c80,0x61d0001624d8)
  freed by thread T0 here:
      #0 0x7f9dfa6d7288 in __interceptor_free libsanitizer/asan/asan_malloc_linux.cpp:52
      #1 0x557692978d50 in perf_session__delete util/session.c:319
      #2 0x557692673959 in __cmd_record tools/perf/builtin-record.c:2884
      #3 0x55769267a9f0 in cmd_record tools/perf/builtin-record.c:4259
      #4 0x55769286710c in run_builtin tools/perf/perf.c:349
      #5 0x557692867678 in handle_internal_command tools/perf/perf.c:402
      #6 0x557692867a40 in run_argv tools/perf/perf.c:446
      #7 0x557692867fae in main tools/perf/perf.c:562
      #8 0x7f9df96456c9 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Fixes: 657ee55 ("perf evlist: Introduce side band thread")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Disha Goel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 12, 2024
[ Upstream commit 769e6a1 ]

ui_browser__show() is capturing the input title that is stack allocated
memory in hist_browser__run().

Avoid a use after return by strdup-ing the string.

Committer notes:

Further explanation from Ian Rogers:

My command line using tui is:
$ sudo bash -c 'rm /tmp/asan.log*; export
ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a
sleep 1; /tmp/perf/perf mem report'
I then go to the perf annotate view and quit. This triggers the asan
error (from the log file):
```
==1254591==ERROR: AddressSanitizer: stack-use-after-return on address
0x7f2813331920 at pc 0x7f28180
65991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10
READ of size 80 at 0x7f2813331920 thread T0
    #0 0x7f2818065990 in __interceptor_strlen
../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461
    #1 0x7f2817698251 in SLsmg_write_wrapped_string
(/lib/x86_64-linux-gnu/libslang.so.2+0x98251)
    #2 0x7f28176984b9 in SLsmg_write_nstring
(/lib/x86_64-linux-gnu/libslang.so.2+0x984b9)
    #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60
    #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266
    #5 0x55c94045c776 in ui_browser__show ui/browser.c:288
    #6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206
    #7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458
    #8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412
    #9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527
    #10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613
    #11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661
    #12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671
    #13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141
    #14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805
    #15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374
    #16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516
    #17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350
    #18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403
    #19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447
    #20 0x55c9400e53ad in main tools/perf/perf.c:561
    #21 0x7f28170456c9 in __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
    #22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360
    #23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId:
84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93)

Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame
    #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746

  This frame has 1 object(s):
    [32, 192) 'title' (line 747) <== Memory access at offset 32 is
inside this variable
HINT: this may be a false positive if your program uses some custom
stack unwind mechanism, swapcontext or vfork
```
hist_browser__run isn't on the stack so the asan error looks legit.
There's no clean init/exit on struct ui_browser so I may be trading a
use-after-return for a memory leak, but that seems look a good trade
anyway.

Fixes: 05e8b08 ("perf ui browser: Stop using 'self'")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ben Gainey <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Li Dong <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paran Lee <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 12, 2024
[ Upstream commit ffbf4fb ]

When CONFIG_DEBUG_BUGVERBOSE=n, we fail to add necessary padding bytes
to bug_table entries, and as a result the last entry in a bug table will
be ignored, potentially leading to an unexpected panic(). All prior
entries in the table will be handled correctly.

The arm64 ABI requires that struct fields of up to 8 bytes are
naturally-aligned, with padding added within a struct such that struct
are suitably aligned within arrays.

When CONFIG_DEBUG_BUGVERPOSE=y, the layout of a bug_entry is:

	struct bug_entry {
		signed int      bug_addr_disp;	// 4 bytes
		signed int      file_disp;	// 4 bytes
		unsigned short  line;		// 2 bytes
		unsigned short  flags;		// 2 bytes
	}

... with 12 bytes total, requiring 4-byte alignment.

When CONFIG_DEBUG_BUGVERBOSE=n, the layout of a bug_entry is:

	struct bug_entry {
		signed int      bug_addr_disp;	// 4 bytes
		unsigned short  flags;		// 2 bytes
		< implicit padding >		// 2 bytes
	}

... with 8 bytes total, with 6 bytes of data and 2 bytes of trailing
padding, requiring 4-byte alginment.

When we create a bug_entry in assembly, we align the start of the entry
to 4 bytes, which implicitly handles padding for any prior entries.
However, we do not align the end of the entry, and so when
CONFIG_DEBUG_BUGVERBOSE=n, the final entry lacks the trailing padding
bytes.

For the main kernel image this is not a problem as find_bug() doesn't
depend on the trailing padding bytes when searching for entries:

	for (bug = __start___bug_table; bug < __stop___bug_table; ++bug)
		if (bugaddr == bug_addr(bug))
			return bug;

However for modules, module_bug_finalize() depends on the trailing
bytes when calculating the number of entries:

	mod->num_bugs = sechdrs[i].sh_size / sizeof(struct bug_entry);

... and as the last bug_entry lacks the necessary padding bytes, this entry
will not be counted, e.g. in the case of a single entry:

	sechdrs[i].sh_size == 6
	sizeof(struct bug_entry) == 8;

	sechdrs[i].sh_size / sizeof(struct bug_entry) == 0;

Consequently module_find_bug() will miss the last bug_entry when it does:

	for (i = 0; i < mod->num_bugs; ++i, ++bug)
		if (bugaddr == bug_addr(bug))
			goto out;

... which can lead to a kenrel panic due to an unhandled bug.

This can be demonstrated with the following module:

	static int __init buginit(void)
	{
		WARN(1, "hello\n");
		return 0;
	}

	static void __exit bugexit(void)
	{
	}

	module_init(buginit);
	module_exit(bugexit);
	MODULE_LICENSE("GPL");

... which will trigger a kernel panic when loaded:

	------------[ cut here ]------------
	hello
	Unexpected kernel BRK exception at EL1
	Internal error: BRK handler: 00000000f2000800 [#1] PREEMPT SMP
	Modules linked in: hello(O+)
	CPU: 0 PID: 50 Comm: insmod Tainted: G           O       6.9.1 #8
	Hardware name: linux,dummy-virt (DT)
	pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
	pc : buginit+0x18/0x1000 [hello]
	lr : buginit+0x18/0x1000 [hello]
	sp : ffff800080533ae0
	x29: ffff800080533ae0 x28: 0000000000000000 x27: 0000000000000000
	x26: ffffaba8c4e70510 x25: ffff800080533c30 x24: ffffaba8c4a28a58
	x23: 0000000000000000 x22: 0000000000000000 x21: ffff3947c0eab3c0
	x20: ffffaba8c4e3f000 x19: ffffaba846464000 x18: 0000000000000006
	x17: 0000000000000000 x16: ffffaba8c2492834 x15: 0720072007200720
	x14: 0720072007200720 x13: ffffaba8c49b27c8 x12: 0000000000000312
	x11: 0000000000000106 x10: ffffaba8c4a0a7c8 x9 : ffffaba8c49b27c8
	x8 : 00000000ffffefff x7 : ffffaba8c4a0a7c8 x6 : 80000000fffff000
	x5 : 0000000000000107 x4 : 0000000000000000 x3 : 0000000000000000
	x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff3947c0eab3c0
	Call trace:
	 buginit+0x18/0x1000 [hello]
	 do_one_initcall+0x80/0x1c8
	 do_init_module+0x60/0x218
	 load_module+0x1ba4/0x1d70
	 __do_sys_init_module+0x198/0x1d0
	 __arm64_sys_init_module+0x1c/0x28
	 invoke_syscall+0x48/0x114
	 el0_svc_common.constprop.0+0x40/0xe0
	 do_el0_svc+0x1c/0x28
	 el0_svc+0x34/0xd8
	 el0t_64_sync_handler+0x120/0x12c
	 el0t_64_sync+0x190/0x194
	Code: d0ffffe0 910003fd 91000000 9400000b (d4210000)
	---[ end trace 0000000000000000 ]---
	Kernel panic - not syncing: BRK handler: Fatal exception

Fix this by always aligning the end of a bug_entry to 4 bytes, which is
correct regardless of CONFIG_DEBUG_BUGVERBOSE.

Fixes: 9fb7410 ("arm64/BUG: Use BRK instruction for generic BUG traps")

Signed-off-by: Yuanbin Xie <[email protected]>
Signed-off-by: Jiangfeng Xiao <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 12, 2024
…PLES event"

commit 5b3cde1 upstream.

This reverts commit 7d1405c.

This causes segfaults in some cases, as reported by Milian:

  ```
  sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e
  raw_syscalls:sys_enter ls
  ...
  [ perf record: Woken up 3 times to write data ]
  malloc(): invalid next size (unsorted)
  Aborted
  ```

  Backtrace with GDB + debuginfod:

  ```
  malloc(): invalid next size (unsorted)

  Thread 1 "perf" received signal SIGABRT, Aborted.
  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
  no_tid=no_tid@entry=0) at pthread_kill.c:44
  Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c
  44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO
  (ret) : 0;
  (gdb) bt
  #0  __pthread_kill_implementation (threadid=<optimized out>,
  signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
  #1  0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>,
  signo=6) at pthread_kill.c:78
  #2  0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/
  raise.c:26
  #3  0x00007ffff6e384c3 in __GI_abort () at abort.c:79
  #4  0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea
  "%s\n") at ../sysdeps/posix/libc_fatal.c:132
  #5  0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850
  "malloc(): invalid next size (unsorted)") at malloc.c:5772
  #6  0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0
  <main_arena>, bytes=bytes@entry=368) at malloc.c:4081
  #7  0x00007ffff6eb877e in __libc_calloc (n=<optimized out>,
  elem_size=<optimized out>) at malloc.c:3754
  #8  0x000055555569bdb6 in perf_session.do_write_header ()
  #9  0x00005555555a373a in __cmd_record.constprop.0 ()
  #10 0x00005555555a6846 in cmd_record ()
  #11 0x000055555564db7f in run_builtin ()
  #12 0x000055555558ed77 in main ()
  ```

  Valgrind memcheck:
  ```
  ==45136== Invalid write of size 8
  ==45136==    at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf)
  ==45136==    by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
  ==45136== Syscall param write(buf) points to unaddressable byte(s)
  ==45136==    at 0x575953D: __libc_write (write.c:26)
  ==45136==    by 0x575953D: write (write.c:24)
  ==45136==    by 0x35761F: ion (in /usr/bin/perf)
  ==45136==    by 0x357778: writen (in /usr/bin/perf)
  ==45136==    by 0x1548F7: record__write (in /usr/bin/perf)
  ==45136==    by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
 -----

Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/
Reported-by: Milian Wolff <[email protected]>
Tested-by: Milian Wolff <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected] # 6.8+
Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 12, 2024
commit d4f9d5a upstream.

A system crash like this

  Failing address: 200000cb7df6f000 TEID: 200000cb7df6f403
  Fault in home space mode while using kernel ASCE.
  AS:00000002d71bc007 R3:00000003fe5b8007 S:000000011a446000 P:000000015660c13d
  Oops: 0038 ilc:3 [#1] PREEMPT SMP
  Modules linked in: mlx5_ib ...
  CPU: 8 PID: 7556 Comm: bash Not tainted 6.9.0-rc7 #8
  Hardware name: IBM 3931 A01 704 (LPAR)
  Krnl PSW : 0704e00180000000 0000014b75e7b606 (ap_parse_bitmap_str+0x10e/0x1f8)
  R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
  Krnl GPRS: 0000000000000001 ffffffffffffffc0 0000000000000001 00000048f96b75d3
  000000cb00000100 ffffffffffffffff ffffffffffffffff 000000cb7df6fce0
  000000cb7df6fce0 00000000ffffffff 000000000000002b 00000048ffffffff
  000003ff9b2dbc80 200000cb7df6fcd8 0000014bffffffc0 000000cb7df6fbc8
  Krnl Code: 0000014b75e7b5fc: a7840047            brc     8,0000014b75e7b68a
  0000014b75e7b600: 18b2                lr      %r11,%r2
  #0000014b75e7b602: a7f4000a            brc     15,0000014b75e7b616
  >0000014b75e7b606: eb22d00000e6        laog    %r2,%r2,0(%r13)
  0000014b75e7b60c: a7680001            lhi     %r6,1
  0000014b75e7b610: 187b                lr      %r7,%r11
  0000014b75e7b612: 84960021            brxh    %r9,%r6,0000014b75e7b654
  0000014b75e7b616: 18e9                lr      %r14,%r9
  Call Trace:
  [<0000014b75e7b606>] ap_parse_bitmap_str+0x10e/0x1f8
  ([<0000014b75e7b5dc>] ap_parse_bitmap_str+0xe4/0x1f8)
  [<0000014b75e7b758>] apmask_store+0x68/0x140
  [<0000014b75679196>] kernfs_fop_write_iter+0x14e/0x1e8
  [<0000014b75598524>] vfs_write+0x1b4/0x448
  [<0000014b7559894c>] ksys_write+0x74/0x100
  [<0000014b7618a440>] __do_syscall+0x268/0x328
  [<0000014b761a3558>] system_call+0x70/0x98
  INFO: lockdep is turned off.
  Last Breaking-Event-Address:
  [<0000014b75e7b636>] ap_parse_bitmap_str+0x13e/0x1f8
  Kernel panic - not syncing: Fatal exception: panic_on_oops

occured when /sys/bus/ap/a[pq]mask was updated with a relative mask value
(like +0x10-0x12,+60,-90) with one of the numeric values exceeding INT_MAX.

The fix is simple: use unsigned long values for the internal variables. The
correct checks are already in place in the function but a simple int for
the internal variables was used with the possibility to overflow.

Reported-by: Marc Hartmayer <[email protected]>
Signed-off-by: Harald Freudenberger <[email protected]>
Tested-by: Marc Hartmayer <[email protected]>
Reviewed-by: Holger Dengler <[email protected]>
Cc: <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 12, 2024
commit 9d274c1 upstream.

We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

CC: [email protected] # 6.1+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 13, 2024
Currently if we request a feature that is not set in the Kernel config we
fail silently and return all the available features.  However, the man
page indicates we should return an EINVAL.

We need to fix this issue since we can end up with a Kernel warning should
a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel with
the config not set with this feature.

 [  200.812896] WARNING: CPU: 91 PID: 13634 at mm/memory.c:1660 zap_pte_range+0x43d/0x660
 [  200.820738] Modules linked in:
 [  200.869387] CPU: 91 PID: 13634 Comm: userfaultfd Kdump: loaded Not tainted 6.9.0-rc5+ #8
 [  200.877477] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.7.3 03/30/2022
 [  200.885052] RIP: 0010:zap_pte_range+0x43d/0x660

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e06f1e1 ("userfaultfd: wp: enabled write protection in userfaultfd API")
Signed-off-by: Audra Mitchell <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
smfrench pushed a commit that referenced this pull request Jul 16, 2024
commit be346c1 upstream.

The code in ocfs2_dio_end_io_write() estimates number of necessary
transaction credits using ocfs2_calc_extend_credits().  This however does
not take into account that the IO could be arbitrarily large and can
contain arbitrary number of extents.

Extent tree manipulations do often extend the current transaction but not
in all of the cases.  For example if we have only single block extents in
the tree, ocfs2_mark_extent_written() will end up calling
ocfs2_replace_extent_rec() all the time and we will never extend the
current transaction and eventually exhaust all the transaction credits if
the IO contains many single block extents.  Once that happens a
WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in
jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to
this error.  This was actually triggered by one of our customers on a
heavily fragmented OCFS2 filesystem.

To fix the issue make sure the transaction always has enough credits for
one extent insert before each call of ocfs2_mark_extent_written().

Heming Zhao said:

------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"

PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
  #0 machine_kexec at ffffffff8c069932
  #1 __crash_kexec at ffffffff8c1338fa
  #2 panic at ffffffff8c1d69b9
  #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
  #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
  #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
  #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
  #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
  #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
  #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
#11 dio_complete at ffffffff8c2b9fa7
#12 do_blockdev_direct_IO at ffffffff8c2bc09f
#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
#14 generic_file_direct_write at ffffffff8c1dcf14
#15 __generic_file_write_iter at ffffffff8c1dd07b
#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
#17 aio_write at ffffffff8c2cc72e
#18 kmem_cache_alloc at ffffffff8c248dde
#19 do_io_submit at ffffffff8c2ccada
#20 do_syscall_64 at ffffffff8c004984
#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Reviewed-by: Heming Zhao <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
smfrench pushed a commit that referenced this pull request Aug 6, 2024
commit 1723f04 upstream.

Currently if we request a feature that is not set in the Kernel config we
fail silently and return all the available features.  However, the man
page indicates we should return an EINVAL.

We need to fix this issue since we can end up with a Kernel warning should
a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel with
the config not set with this feature.

 [  200.812896] WARNING: CPU: 91 PID: 13634 at mm/memory.c:1660 zap_pte_range+0x43d/0x660
 [  200.820738] Modules linked in:
 [  200.869387] CPU: 91 PID: 13634 Comm: userfaultfd Kdump: loaded Not tainted 6.9.0-rc5+ #8
 [  200.877477] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.7.3 03/30/2022
 [  200.885052] RIP: 0010:zap_pte_range+0x43d/0x660

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e06f1e1 ("userfaultfd: wp: enabled write protection in userfaultfd API")
Signed-off-by: Audra Mitchell <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
smfrench pushed a commit that referenced this pull request Aug 11, 2024
When l2tp tunnels use a socket provided by userspace, we can hit
lockdep splats like the below when data is transmitted through another
(unrelated) userspace socket which then gets routed over l2tp.

This issue was previously discussed here:
https://lore.kernel.org/netdev/[email protected]/

The solution is to have lockdep treat socket locks of l2tp tunnel
sockets separately than those of standard INET sockets. To do so, use
a different lockdep subclass where lock nesting is possible.

  ============================================
  WARNING: possible recursive locking detected
  6.10.0+ #34 Not tainted
  --------------------------------------------
  iperf3/771 is trying to acquire lock:
  ffff8881027601d8 (slock-AF_INET/1){+.-.}-{2:2}, at: l2tp_xmit_skb+0x243/0x9d0

  but task is already holding lock:
  ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(slock-AF_INET/1);
    lock(slock-AF_INET/1);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  10 locks held by iperf3/771:
   #0: ffff888102650258 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendmsg+0x1a/0x40
   #1: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
   #2: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
   #3: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x28b/0x9f0
   #4: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0xf9/0x260
   #5: ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10
   #6: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
   #7: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
   #8: ffffffff822ac1e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0xcc/0x1450
   #9: ffff888101f33258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock#2){+...}-{2:2}, at: __dev_queue_xmit+0x513/0x1450

  stack backtrace:
  CPU: 2 UID: 0 PID: 771 Comm: iperf3 Not tainted 6.10.0+ #34
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x69/0xa0
   dump_stack+0xc/0x20
   __lock_acquire+0x135d/0x2600
   ? srso_alias_return_thunk+0x5/0xfbef5
   lock_acquire+0xc4/0x2a0
   ? l2tp_xmit_skb+0x243/0x9d0
   ? __skb_checksum+0xa3/0x540
   _raw_spin_lock_nested+0x35/0x50
   ? l2tp_xmit_skb+0x243/0x9d0
   l2tp_xmit_skb+0x243/0x9d0
   l2tp_eth_dev_xmit+0x3c/0xc0
   dev_hard_start_xmit+0x11e/0x420
   sch_direct_xmit+0xc3/0x640
   __dev_queue_xmit+0x61c/0x1450
   ? ip_finish_output2+0xf4c/0x1130
   ip_finish_output2+0x6b6/0x1130
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __ip_finish_output+0x217/0x380
   ? srso_alias_return_thunk+0x5/0xfbef5
   __ip_finish_output+0x217/0x380
   ip_output+0x99/0x120
   __ip_queue_xmit+0xae4/0xbc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? tcp_options_write.constprop.0+0xcb/0x3e0
   ip_queue_xmit+0x34/0x40
   __tcp_transmit_skb+0x1625/0x1890
   __tcp_send_ack+0x1b8/0x340
   tcp_send_ack+0x23/0x30
   __tcp_ack_snd_check+0xa8/0x530
   ? srso_alias_return_thunk+0x5/0xfbef5
   tcp_rcv_established+0x412/0xd70
   tcp_v4_do_rcv+0x299/0x420
   tcp_v4_rcv+0x1991/0x1e10
   ip_protocol_deliver_rcu+0x50/0x220
   ip_local_deliver_finish+0x158/0x260
   ip_local_deliver+0xc8/0xe0
   ip_rcv+0xe5/0x1d0
   ? __pfx_ip_rcv+0x10/0x10
   __netif_receive_skb_one_core+0xce/0xe0
   ? process_backlog+0x28b/0x9f0
   __netif_receive_skb+0x34/0xd0
   ? process_backlog+0x28b/0x9f0
   process_backlog+0x2cb/0x9f0
   __napi_poll.constprop.0+0x61/0x280
   net_rx_action+0x332/0x670
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? find_held_lock+0x2b/0x80
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   handle_softirqs+0xda/0x480
   ? __dev_queue_xmit+0xa2c/0x1450
   do_softirq+0xa1/0xd0
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0xc8/0xe0
   ? __dev_queue_xmit+0xa2c/0x1450
   __dev_queue_xmit+0xa48/0x1450
   ? ip_finish_output2+0xf4c/0x1130
   ip_finish_output2+0x6b6/0x1130
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __ip_finish_output+0x217/0x380
   ? srso_alias_return_thunk+0x5/0xfbef5
   __ip_finish_output+0x217/0x380
   ip_output+0x99/0x120
   __ip_queue_xmit+0xae4/0xbc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? tcp_options_write.constprop.0+0xcb/0x3e0
   ip_queue_xmit+0x34/0x40
   __tcp_transmit_skb+0x1625/0x1890
   tcp_write_xmit+0x766/0x2fb0
   ? __entry_text_end+0x102ba9/0x102bad
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __might_fault+0x74/0xc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   __tcp_push_pending_frames+0x56/0x190
   tcp_push+0x117/0x310
   tcp_sendmsg_locked+0x14c1/0x1740
   tcp_sendmsg+0x28/0x40
   inet_sendmsg+0x5d/0x90
   sock_write_iter+0x242/0x2b0
   vfs_write+0x68d/0x800
   ? __pfx_sock_write_iter+0x10/0x10
   ksys_write+0xc8/0xf0
   __x64_sys_write+0x3d/0x50
   x64_sys_call+0xfaf/0x1f50
   do_syscall_64+0x6d/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f4d143af992
  Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 01 cc ff ff 41 54 b8 02 00 00 0
  RSP: 002b:00007ffd65032058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f4d143af992
  RDX: 0000000000000025 RSI: 00007f4d143f3bcc RDI: 0000000000000005
  RBP: 00007f4d143f2b28 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4d143f3bcc
  R13: 0000000000000005 R14: 0000000000000000 R15: 00007ffd650323f0
   </TASK>

Fixes: 0b2c597 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Suggested-by: Eric Dumazet <[email protected]>
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=6acef9e0a4d1f46c83d4
CC: [email protected]
CC: [email protected]
Signed-off-by: James Chapman <[email protected]>
Signed-off-by: Tom Parkin <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 1, 2024
A sysfs reader can race with a device reset or removal, attempting to
read device state when the device is not actually present. eg:

     [exception RIP: qed_get_current_link+17]
  #8 [ffffb9e4f2907c48] qede_get_link_ksettings at ffffffffc07a994a [qede]
  #9 [ffffb9e4f2907cd8] __rh_call_get_link_ksettings at ffffffff992b01a3
 #10 [ffffb9e4f2907d38] __ethtool_get_link_ksettings at ffffffff992b04e4
 #11 [ffffb9e4f2907d90] duplex_show at ffffffff99260300
 #12 [ffffb9e4f2907e38] dev_attr_show at ffffffff9905a01c
 #13 [ffffb9e4f2907e50] sysfs_kf_seq_show at ffffffff98e0145b
 #14 [ffffb9e4f2907e68] seq_read at ffffffff98d902e3
 #15 [ffffb9e4f2907ec8] vfs_read at ffffffff98d657d1
 #16 [ffffb9e4f2907f00] ksys_read at ffffffff98d65c3f
 #17 [ffffb9e4f2907f38] do_syscall_64 at ffffffff98a052fb

 crash> struct net_device.state ffff9a9d21336000
    state = 5,

state 5 is __LINK_STATE_START (0b1) and __LINK_STATE_NOCARRIER (0b100).
The device is not present, note lack of __LINK_STATE_PRESENT (0b10).

This is the same sort of panic as observed in commit 4224cfd
("net-sysfs: add check for netdevice being present to speed_show").

There are many other callers of __ethtool_get_link_ksettings() which
don't have a device presence check.

Move this check into ethtool to protect all callers.

Fixes: d519e17 ("net: export device speed and duplex via sysfs")
Fixes: 4224cfd ("net-sysfs: add check for netdevice being present to speed_show")
Signed-off-by: Jamie Bainbridge <[email protected]>
Link: https://patch.msgid.link/8bae218864beaa44ed01628140475b9bf641c5b0.1724393671.git.jamie.bainbridge@gmail.com
Signed-off-by: Jakub Kicinski <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 19, 2024
Daniel Machon says:

====================
net: lan966x: use the newly introduced FDMA library

This patch series is the second of a 2-part series [1], that adds a new
common FDMA library for Microchip switch chips Sparx5 and lan966x. These
chips share the same FDMA engine, and as such will benefit from a common
library with a common implementation.  This also has the benefit of
removing a lot of open-coded bookkeeping and duplicate code for the two
drivers.

In this second series, the FDMA library will be taken into use by the
lan966x switch driver.

 ###################
 # Example of use: #
 ###################

- Initialize the rx and tx fdma structs with values for: number of
  DCB's, number of DB's, channel ID, DB size (data buffer size), and
  total size of the requested memory. Also provide two callbacks:
  nextptr_cb() and dataptr_cb() for getting the nextptr and dataptr.

- Allocate memory using fdma_alloc_phys() or fdma_alloc_coherent().

- Initialize the DCB's with fdma_dcb_init().

- Add new DCB's with fdma_dcb_add().

- Free memory with fdma_free_phys() or fdma_free_coherent().

 #####################
 # Patch  breakdown: #
 #####################

Patch #1:  select FDMA library for lan966x.

Patch #2:  includes the fdma_api.h header and removes old symbols.

Patch #3:  replaces old rx and tx variables with equivalent ones from the
           fdma struct. Only the variables that can be changed without
           breaking traffic is changed in this patch.

Patch #4:  uses the library for allocation of rx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #5:  uses the library for adding DCB's in the rx path.

Patch #6:  uses the library for freeing rx buffers.

Patch #7:  uses the library for allocation of tx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #8:  uses the library for adding DCB's in the tx path.

Patch #9:  uses the library helpers in the tx path.

Patch #10: ditch last_in_use variable and use library instead.

Patch #11: uses library helpers throughout.

Patch #12: refactor lan966x_fdma_reload() function.

[1] https://lore.kernel.org/netdev/[email protected]/

Signed-off-by: Daniel Machon <[email protected]>
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 22, 2024
Eduard Zingerman says:

====================
__jited test tag to check disassembly after jit

Some of the logic in the BPF jits might be non-trivial.
It might be useful to allow testing this logic by comparing
generated native code with expected code template.
This patch set adds a macro __jited() that could be used for
test_loader based tests in a following manner:

    SEC("tp")
    __arch_x86_64
    __jited("   endbr64")
    __jited("   nopl    (%rax,%rax)")
    __jited("   xorq    %rax, %rax")
    ...
    __naked void some_test(void) { ... }

Also add a test for jit code generated for tail calls handling to
demonstrate the feature.

The feature uses LLVM libraries to do the disassembly.
At selftests compilation time Makefile detects if these libraries are
available. When libraries are not available tests using __jit_x86()
are skipped.
Current CI environment does not include llvm development libraries,
but changes to add these are trivial.

This was previously discussed here:
https://lore.kernel.org/bpf/[email protected]/

Patch-set includes a few auxiliary steps:
- patches #2 and #3 fix a few bugs in test_loader behaviour;
- patch #4 replaces __regex macro with ability to specify regular
  expressions in __msg and __xlated using "{{" "}}" escapes;
- patch #8 updates __xlated to match disassembly lines consequently,
  same way as __jited does.

Changes v2->v3:
- changed macro name from __jit_x86 to __jited with __arch_* to
  specify disassembly arch (Yonghong);
- __jited matches disassembly lines consequently with "..."
  allowing to skip some number of lines (Andrii);
- __xlated matches disassembly lines consequently, same as __jited;
- "{{...}}" regex brackets instead of __regex macro;
- bug fixes for old commits.

Changes v1->v2:
- stylistic changes suggested by Yonghong;
- fix for -Wformat-truncation related warning when compiled with
  llvm15 (Yonghong).

v1: https://lore.kernel.org/bpf/[email protected]/
v2: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
smfrench pushed a commit that referenced this pull request Sep 29, 2024
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
on x86 due to a chain of locks and SRCU synchronizations.  Translating the
below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
fairness of r/w semaphores).

    CPU0                     CPU1                     CPU2
1   lock(&kvm->slots_lock);
2                                                     lock(&vcpu->mutex);
3                                                     lock(&kvm->srcu);
4                            lock(cpu_hotplug_lock);
5                            lock(kvm_lock);
6                            lock(&kvm->slots_lock);
7                                                     lock(cpu_hotplug_lock);
8   sync(&kvm->srcu);

Note, there are likely more potential deadlocks in KVM x86, e.g. the same
pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
__kvmclock_cpufreq_notifier():

  cpuhp_cpufreq_online()
  |
  -> cpufreq_online()
     |
     -> cpufreq_gov_performance_limits()
        |
        -> __cpufreq_driver_target()
           |
           -> __target_index()
              |
              -> cpufreq_freq_transition_begin()
                 |
                 -> cpufreq_notify_transition()
                    |
                    -> ... __kvmclock_cpufreq_notifier()

But, actually triggering such deadlocks is beyond rare due to the
combination of dependencies and timings involved.  E.g. the cpufreq
notifier is only used on older CPUs without a constant TSC, mucking with
the NX hugepage mitigation while VMs are running is very uncommon, and
doing so while also onlining/offlining a CPU (necessary to generate
contention on cpu_hotplug_lock) would be even more unusual.

The most robust solution to the general cpu_hotplug_lock issue is likely
to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
notifier doesn't to take kvm_lock.  For now, settle for fixing the most
blatant deadlock, as switching to an RCU-protected list is a much more
involved change, but add a comment in locking.rst to call out that care
needs to be taken when walking holding kvm_lock and walking vm_list.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
  ------------------------------------------------------
  tee/35048 is trying to acquire lock:
  ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]

  but task is already holding lock:
  ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]

  which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

  -> #3 (kvm_lock){+.+.}-{3:3}:
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         kvm_dev_ioctl+0x4fb/0xe50 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #2 (cpu_hotplug_lock){++++}-{0:0}:
         cpus_read_lock+0x2e/0xb0
         static_key_slow_inc+0x16/0x30
         kvm_lapic_set_base+0x6a/0x1c0 [kvm]
         kvm_set_apic_base+0x8f/0xe0 [kvm]
         kvm_set_msr_common+0x9ae/0xf80 [kvm]
         vmx_set_msr+0xa54/0xbe0 [kvm_intel]
         __kvm_set_msr+0xb6/0x1a0 [kvm]
         kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
         kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #1 (&kvm->srcu){.+.+}-{0:0}:
         __synchronize_srcu+0x44/0x1a0
         synchronize_srcu_expedited+0x21/0x30
         kvm_swap_active_memslots+0x110/0x1c0 [kvm]
         kvm_set_memslot+0x360/0x620 [kvm]
         __kvm_set_memory_region+0x27b/0x300 [kvm]
         kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
         kvm_vm_ioctl+0x295/0x650 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
         __lock_acquire+0x15ef/0x2e30
         lock_acquire+0xe0/0x260
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         set_nx_huge_pages+0x179/0x1e0 [kvm]
         param_attr_store+0x93/0x100
         module_attr_store+0x22/0x40
         sysfs_kf_write+0x81/0xb0
         kernfs_fop_write_iter+0x133/0x1d0
         vfs_write+0x28d/0x380
         ksys_write+0x70/0xe0
         __x64_sys_write+0x1f/0x30
         x64_sys_call+0x281b/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: Chao Gao <[email protected]>
Fixes: 0bf5049 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Cc: [email protected]
Reviewed-by: Kai Huang <[email protected]>
Acked-by: Kai Huang <[email protected]>
Tested-by: Farrah Chen <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants