Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should this kernel boot on a Jetson TX2? #1

Closed
dhess opened this issue Oct 3, 2017 · 2 comments
Closed

Should this kernel boot on a Jetson TX2? #1

dhess opened this issue Oct 3, 2017 · 2 comments

Comments

@dhess
Copy link

dhess commented Oct 3, 2017

Hi, hopefully you don't mind if I post an issue/question here as I'm not sure where the project "lives," per se.

Is this kernel expected to work on a Jetson TX2 development board? I'm trying to get NixOS working on mine and I've had no luck with any kernel so far, including this one.

I can build the grate-driver kernel just fine on NixOS running on my Jetson TX1, and it works great there. It boots from the TX1's built-in eMMC, and then the NixOS root filesystem is mounted from an NVMe drive that's plugged into a PCIe riser board on the TX1. That configuration is very stable and has been working for awhile now under heavy loads.

However, the exact same filesystem won't boot on the TX2. I've copied the /boot fs from the TX1 eMMC to an SDHC card, and installed the NVMe drive/PCIe riser card on the TX2. Here is the boot log:

https://gist.github.com/dhess/d41a7c806a61e0b886a2cd0defed2c6a

I have also tried the "official" Tegra kernel from https://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux.git on the TX2, and the results are effectively the same -- the TX2 can't find the root fs device.

Any ideas?

@kusma
Copy link
Member

kusma commented Oct 3, 2017

Grate is about GPU-enablement on Tegra 2/3/4, not Tegra K1 and later. So the grate-kernel is probably not the right one for your system.

Tegra TX1 development happens at [email protected] (and #tegra on freenode). Perhaps you should direct these questions there instead? You'll be more likely to reach someone with relevant experience there...

@dhess
Copy link
Author

dhess commented Oct 3, 2017

Ack, sorry, I could have sworn I saw Tegra 210 patches in this kernel. Sorry for the noise!

@dhess dhess closed this as completed Oct 3, 2017
digetx pushed a commit that referenced this issue Oct 13, 2017
While line6_probe() may kick off URB for a control MIDI endpoint, the
function doesn't clean up it properly at its error path.  This results
in a leftover URB action that is eventually triggered later and causes
an Oops like:
  general protection fault: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted
  RIP: 0010:usb_fill_bulk_urb ./include/linux/usb.h:1619
  RIP: 0010:line6_start_listen+0x3fe/0x9e0 sound/usb/line6/driver.c:76
  Call Trace:
   <IRQ>
   line6_data_received+0x1f7/0x470 sound/usb/line6/driver.c:326
   __usb_hcd_giveback_urb+0x2e0/0x650 drivers/usb/core/hcd.c:1779
   usb_hcd_giveback_urb+0x337/0x420 drivers/usb/core/hcd.c:1845
   dummy_timer+0xba9/0x39f0 drivers/usb/gadget/udc/dummy_hcd.c:1965
   call_timer_fn+0x2a2/0x940 kernel/time/timer.c:1281
   ....

Since the whole clean-up procedure is done in line6_disconnect()
callback, we can simply call it in the error path instead of
open-coding the whole again.  It'll fix such an issue automagically.

The bug was spotted by syzkaller.

Fixes: eedd0e9 ("ALSA: line6: Don't forget to call driver's destructor at error path")
Reported-by: Andrey Konovalov <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
digetx pushed a commit that referenced this issue Oct 13, 2017
The 0day kbuild robot reports following crash:

  BUG: unable to handle kernel NULL pointer dereference at 00000004
  IP: tb_property_find+0xe/0x41
  *pde = 00000000
  Oops: 0000 [#1]
  CPU: 0 PID: 1 Comm: swapper Not tainted 4.14.0-rc1-00741-ge69b6c0 #412
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
  task: 89c80000 task.stack: 89c7c000
  EIP: tb_property_find+0xe/0x41
  EFLAGS: 00210246 CPU: 0
  EAX: 00000000 EBX: 7a368f47 ECX: 00000044 EDX: 7a368f47
  ESI: 8851d340 EDI: 7a368f47 EBP: 89c7df0c ESP: 89c7defc
   DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
  CR0: 80050033 CR2: 00000004 CR3: 027a2000 CR4: 00000690
  Call Trace:
   tb_register_property_dir+0x49/0xb9
   ? cdc_mbim_driver_init+0x1b/0x1b
   tbnet_init+0x77/0x9f
   ? cdc_mbim_driver_init+0x1b/0x1b
   do_one_initcall+0x7e/0x145
   ? parse_args+0x10c/0x1b3
   ? kernel_init_freeable+0xbe/0x159
   kernel_init_freeable+0xd1/0x159
   ? rest_init+0x110/0x110
   kernel_init+0xd/0xd0
   ret_from_fork+0x19/0x30

The reason is that both Thunderbolt bus and thunderbolt-net are build
into the kernel image, and the latter is linked first because
drivers/net comes before drivers/thunderbolt. Since both use
module_init() thunderbolt-net ends up calling Thunderbolt bus functions
too early triggering the above crash.

Fix this by moving Thunderbolt bus initialization to happen earlier to
make sure all the data structures are ready when Thunderbolt service
drivers are initialized. To be on the safe side also add a check for
properly initialized xdomain_property_dir to tb_register_property_dir().

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Mika Westerberg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this issue Oct 13, 2017
This patch replaces rcu_deference() with rcu_dereference_bh() in
ipv6_route_seq_next() to avoid the following warning:

[   19.431685] WARNING: suspicious RCU usage
[   19.433451] 4.14.0-rc3-00914-g66f5d6c #118 Not tainted
[   19.435509] -----------------------------
[   19.437267] net/ipv6/ip6_fib.c:2259 suspicious
rcu_dereference_check() usage!
[   19.440790]
[   19.440790] other info that might help us debug this:
[   19.440790]
[   19.444734]
[   19.444734] rcu_scheduler_active = 2, debug_locks = 1
[   19.447757] 2 locks held by odhcpd/3720:
[   19.449480]  #0:  (&p->lock){+.+.}, at: [<ffffffffb1231f7d>]
seq_read+0x3c/0x333
[   19.452720]  #1:  (rcu_read_lock_bh){....}, at: [<ffffffffb1d2b984>]
ipv6_route_seq_start+0x5/0xfd
[   19.456323]
[   19.456323] stack backtrace:
[   19.458812] CPU: 0 PID: 3720 Comm: odhcpd Not tainted
4.14.0-rc3-00914-g66f5d6c #118
[   19.462042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[   19.465414] Call Trace:
[   19.466788]  dump_stack+0x86/0xc0
[   19.468358]  lockdep_rcu_suspicious+0xea/0xf3
[   19.470183]  ipv6_route_seq_next+0x71/0x164
[   19.471963]  seq_read+0x244/0x333
[   19.473522]  proc_reg_read+0x48/0x67
[   19.475152]  ? proc_reg_write+0x67/0x67
[   19.476862]  __vfs_read+0x26/0x10b
[   19.478463]  ? __might_fault+0x37/0x84
[   19.480148]  vfs_read+0xba/0x146
[   19.481690]  SyS_read+0x51/0x8e
[   19.483197]  do_int80_syscall_32+0x66/0x15a
[   19.484969]  entry_INT80_compat+0x32/0x50
[   19.486707] RIP: 0023:0xf7f0be8e
[   19.488244] RSP: 002b:00000000ffa75d04 EFLAGS: 00000246 ORIG_RAX:
0000000000000003
[   19.491431] RAX: ffffffffffffffda RBX: 0000000000000009 RCX:
0000000008056068
[   19.493886] RDX: 0000000000001000 RSI: 0000000008056008 RDI:
0000000000001000
[   19.496331] RBP: 00000000000001ff R08: 0000000000000000 R09:
0000000000000000
[   19.498768] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[   19.501217] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000

Fixes: 66f5d6c ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Xiaolong Ye <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this issue Oct 13, 2017
Since commit:

  1fd7e41 ("perf/core: Remove perf_cpu_context::unique_pmu")

... when a PMU is unregistered then its associated ->pmu_cpu_context is
unconditionally freed. Whilst this is fine for dynamically allocated
context types (i.e. those registered using perf_invalid_context), this
causes a problem for sharing of static contexts such as
perf_{sw,hw}_context, which are used by multiple built-in PMUs and
effectively have a global lifetime.

Whilst testing the ARM SPE driver, which must use perf_sw_context to
support per-task AUX tracing, unregistering the driver as a result of a
module unload resulted in:

 Unable to handle kernel NULL pointer dereference at virtual address 00000038
 Internal error: Oops: 96000004 [#1] PREEMPT SMP
 Modules linked in: [last unloaded: arm_spe_pmu]
 PC is at ctx_resched+0x38/0xe8
 LR is at perf_event_exec+0x20c/0x278
 [...]
 ctx_resched+0x38/0xe8
 perf_event_exec+0x20c/0x278
 setup_new_exec+0x88/0x118
 load_elf_binary+0x26c/0x109c
 search_binary_handler+0x90/0x298
 do_execveat_common.isra.14+0x540/0x618
 SyS_execve+0x38/0x48

since the software context has been freed and the ctx.pmu->pmu_disable_count
field has been set to NULL.

This patch fixes the problem by avoiding the freeing of static PMU contexts
altogether. Whilst the sharing of dynamic contexts is questionable, this
actually requires the caller to share their context pointer explicitly
and so the burden is on them to manage the object lifetime.

Reported-by: Kim Phillips <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 1fd7e41 ("perf/core: Remove perf_cpu_context::unique_pmu")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
digetx pushed a commit that referenced this issue Oct 13, 2017
On AMD Family17h-based (EPYC) system, a logical NUMA node can contain
upto 8 cores (16 threads) with the following topology.

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per logical NUMA node.
A socket can contain upto 4 NUMA nodes, and a system can support
upto 2 sockets. With full system configuration, current scheduler
creates 4 sched domains:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA      (span a socket: 4 nodes)
  domain3 NUMA      (span a system: 8 nodes)

Note that there is no domain to represent cpus spaning a logical
NUMA node.  With this hierarchy of sched domains, the scheduler does
not balance properly in the following cases:

Case1:

 When running 8 tasks, a properly balanced system should
 schedule a task per logical NUMA node. This is not the case for
 the current scheduler.

Case2:

 In some cases, threads are scheduled on the same cpu, while other
 cpus are idle. This results in run-to-run inconsistency. For example:

  taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                          --cpu-max-prime=100000 run

Total execution time ranges from 25.1s to 33.5s depending on threads
placement, where 25.1s is when all 8 threads are balanced properly
on 8 cpus.

Introducing NUMA identity node sched domain, which is based on how
SRAT/SLIT table define a logical NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NODE      (span a logical NUMA node)
  domain3 NUMA      (span a socket: 4 nodes)
  domain4 NUMA      (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
digetx pushed a commit that referenced this issue Oct 13, 2017
Tetsuo Handa and Fengguang Wu reported a panic in the unwinder:

  BUG: unable to handle kernel NULL pointer dereference at 000001f2
  IP: update_stack_state+0xd4/0x340
  *pde = 00000000

  Oops: 0000 [#1] PREEMPT SMP
  CPU: 0 PID: 18728 Comm: 01-cpu-hotplug Not tainted 4.13.0-rc4-00170-gb09be67 #592
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
  task: bb0b53c0 task.stack: bb3ac000
  EIP: update_stack_state+0xd4/0x340
  EFLAGS: 00010002 CPU: 0
  EAX: 0000a570 EBX: bb3adccb ECX: 0000f401 EDX: 0000a570
  ESI: 00000001 EDI: 000001ba EBP: bb3adc6b ESP: bb3adc3f
   DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
  CR0: 80050033 CR2: 000001f2 CR3: 0b3a7000 CR4: 00140690
  DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
  DR6: fffe0ff0 DR7: 00000400
  Call Trace:
   ? unwind_next_frame+0xea/0x400
   ? __unwind_start+0xf5/0x180
   ? __save_stack_trace+0x81/0x160
   ? save_stack_trace+0x20/0x30
   ? __lock_acquire+0xfa5/0x12f0
   ? lock_acquire+0x1c2/0x230
   ? tick_periodic+0x3a/0xf0
   ? _raw_spin_lock+0x42/0x50
   ? tick_periodic+0x3a/0xf0
   ? tick_periodic+0x3a/0xf0
   ? debug_smp_processor_id+0x12/0x20
   ? tick_handle_periodic+0x23/0xc0
   ? local_apic_timer_interrupt+0x63/0x70
   ? smp_trace_apic_timer_interrupt+0x235/0x6a0
   ? trace_apic_timer_interrupt+0x37/0x3c
   ? strrchr+0x23/0x50
  Code: 0f 95 c1 89 c7 89 45 e4 0f b6 c1 89 c6 89 45 dc 8b 04 85 98 cb 74 bc 88 4d e3 89 45 f0 83 c0 01 84 c9 89 04 b5 98 cb 74 bc 74 3b <8b> 47 38 8b 57 34 c6 43 1d 01 25 00 00 02 00 83 e2 03 09 d0 83
  EIP: update_stack_state+0xd4/0x340 SS:ESP: 0068:bb3adc3f
  CR2: 00000000000001f2
  ---[ end trace 0d147fd4aba8ff50 ]---
  Kernel panic - not syncing: Fatal exception in interrupt

On x86-32, after decoding a frame pointer to get a regs address,
regs_size() dereferences the regs pointer when it checks regs->cs to see
if the regs are user mode.  This is dangerous because it's possible that
what looks like a decoded frame pointer is actually a corrupt value, and
we don't want the unwinder to make things worse.

Instead of calling regs_size() on an unsafe pointer, just assume they're
kernel regs to start with.  Later, once it's safe to access the regs, we
can do the user mode check and corresponding safety check for the
remaining two regs.

Reported-and-tested-by: Tetsuo Handa <[email protected]>
Reported-and-tested-by: Fengguang Wu <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Cc: Byungchul Park <[email protected]>
Cc: LKP <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 5ed8d8b ("x86/unwind: Move common code into update_stack_state()")
Link: http://lkml.kernel.org/r/7f95b9a6993dec7674b3f3ab3dcd3294f7b9644d.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
digetx pushed a commit that referenced this issue Oct 13, 2017
…ug deadlock

4.14-rc1 gained the fancy new cross-release support in lockdep, which
seems to have uncovered a few more rules about what is allowed and
isn't.

This one here seems to indicate that allocating a work-queue while
holding mmap_sem is a no-go, so let's try to preallocate it.

Of course another way to break this chain would be somewhere in the
cpu hotplug code, since this isn't the only trace we're finding now
which goes through msr_create_device.

Full lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc1-CI-CI_DRM_3118+ #1 Tainted: G     U
------------------------------------------------------
prime_mmap/1551 is trying to acquire lock:
 (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8109dbb7>] apply_workqueue_attrs+0x17/0x50

but task is already holding lock:
 (&dev_priv->mm_lock){+.+.}, at: [<ffffffffa01a7b2a>] i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (&dev_priv->mm_lock){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_nested+0x1b/0x20
       i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]
       i915_gem_userptr_ioctl+0x222/0x2c0 [i915]
       drm_ioctl_kernel+0x69/0xb0
       drm_ioctl+0x2f9/0x3d0
       do_vfs_ioctl+0x94/0x670
       SyS_ioctl+0x41/0x70
       entry_SYSCALL_64_fastpath+0x1c/0xb1

-> #5 (&mm->mmap_sem){++++}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __might_fault+0x68/0x90
       _copy_to_user+0x23/0x70
       filldir+0xa5/0x120
       dcache_readdir+0xf9/0x170
       iterate_dir+0x69/0x1a0
       SyS_getdents+0xa5/0x140
       entry_SYSCALL_64_fastpath+0x1c/0xb1

-> #4 (&sb->s_type->i_mutex_key#5){++++}:
       down_write+0x3b/0x70
       handle_create+0xcb/0x1e0
       devtmpfsd+0x139/0x180
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #3 ((complete)&req.done){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       wait_for_common+0x58/0x210
       wait_for_completion+0x1d/0x20
       devtmpfs_create_node+0x13d/0x160
       device_add+0x5eb/0x620
       device_create_groups_vargs+0xe0/0xf0
       device_create+0x3a/0x40
       msr_device_create+0x2b/0x40
       cpuhp_invoke_callback+0xa3/0x840
       cpuhp_thread_fun+0x7a/0x150
       smpboot_thread_fn+0x18a/0x280
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #2 (cpuhp_state){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpuhp_issue_call+0x10b/0x170
       __cpuhp_setup_state_cpuslocked+0x134/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_writeback_init+0x43/0x67
       pagecache_init+0x3d/0x42
       start_kernel+0x3a8/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #1 (cpuhp_state_mutex){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_nested+0x1b/0x20
       __cpuhp_setup_state_cpuslocked+0x52/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_alloc_init+0x28/0x30
       start_kernel+0x145/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #0 (cpu_hotplug_lock.rw_sem){++++}:
       check_prev_add+0x430/0x840
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpus_read_lock+0x3d/0xb0
       apply_workqueue_attrs+0x17/0x50
       __alloc_workqueue_key+0x1d8/0x4d9
       i915_gem_userptr_init__mmu_notifier+0x1fb/0x270 [i915]
       i915_gem_userptr_ioctl+0x222/0x2c0 [i915]
       drm_ioctl_kernel+0x69/0xb0
       drm_ioctl+0x2f9/0x3d0
       do_vfs_ioctl+0x94/0x670
       SyS_ioctl+0x41/0x70
       entry_SYSCALL_64_fastpath+0x1c/0xb1

other info that might help us debug this:

Chain exists of:
  cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev_priv->mm_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&dev_priv->mm_lock);
                               lock(&mm->mmap_sem);
                               lock(&dev_priv->mm_lock);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

2 locks held by prime_mmap/1551:
 #0:  (&mm->mmap_sem){++++}, at: [<ffffffffa01a7b18>] i915_gem_userptr_init__mmu_notifier+0x138/0x270 [i915]
 #1:  (&dev_priv->mm_lock){+.+.}, at: [<ffffffffa01a7b2a>] i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]

stack backtrace:
CPU: 4 PID: 1551 Comm: prime_mmap Tainted: G     U          4.14.0-rc1-CI-CI_DRM_3118+ #1
Hardware name: Dell Inc. XPS 8300  /0Y2MRG, BIOS A06 10/17/2011
Call Trace:
 dump_stack+0x68/0x9f
 print_circular_bug+0x235/0x3c0
 ? lockdep_init_map_crosslock+0x20/0x20
 check_prev_add+0x430/0x840
 __lock_acquire+0x1420/0x15e0
 ? __lock_acquire+0x1420/0x15e0
 ? lockdep_init_map_crosslock+0x20/0x20
 lock_acquire+0xb0/0x200
 ? apply_workqueue_attrs+0x17/0x50
 cpus_read_lock+0x3d/0xb0
 ? apply_workqueue_attrs+0x17/0x50
 apply_workqueue_attrs+0x17/0x50
 __alloc_workqueue_key+0x1d8/0x4d9
 ? __lockdep_init_map+0x57/0x1c0
 i915_gem_userptr_init__mmu_notifier+0x1fb/0x270 [i915]
 i915_gem_userptr_ioctl+0x222/0x2c0 [i915]
 ? i915_gem_userptr_release+0x140/0x140 [i915]
 drm_ioctl_kernel+0x69/0xb0
 drm_ioctl+0x2f9/0x3d0
 ? i915_gem_userptr_release+0x140/0x140 [i915]
 ? __do_page_fault+0x2a4/0x570
 do_vfs_ioctl+0x94/0x670
 ? entry_SYSCALL_64_fastpath+0x5/0xb1
 ? __this_cpu_preempt_check+0x13/0x20
 ? trace_hardirqs_on_caller+0xe3/0x1b0
 SyS_ioctl+0x41/0x70
 entry_SYSCALL_64_fastpath+0x1c/0xb1
RIP: 0033:0x7fbb83c39587
RSP: 002b:00007fff188dc228 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: ffffffff81492963 RCX: 00007fbb83c39587
RDX: 00007fff188dc260 RSI: 00000000c0186473 RDI: 0000000000000003
RBP: ffffc90001487f88 R08: 0000000000000000 R09: 00007fff188dc2ac
R10: 00007fbb83efcb58 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000003 R14: 00000000c0186473 R15: 00007fff188dc2ac
 ? __this_cpu_preempt_check+0x13/0x20

Note that this also has the minor benefit of slightly reducing the
critical section where we hold mmap_sem.

v2: Set ret correctly when we raced with another thread.

v3: Use Chris' diff. Attach the right lockdep splat.

v4: Repaint in Tvrtko's colors (aka don't report ENOMEM if we race and
some other thread managed to not also get an ENOMEM and successfully
install the mmu notifier. Note that the kernel guarantees that small
allocations succeed, so this never actually happens).

Cc: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Marta Lofstedt <[email protected]>
Cc: Tejun Heo <[email protected]>
References: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3180/shard-hsw3/igt@prime_mmap@test_userptr.html
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102939
Reviewed-by: Tvrtko Ursulin <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
digetx pushed a commit that referenced this issue Oct 13, 2017
This patch separates enable/disable of RC6 and RPS for gen6+
platforms prior to VLV.

v2: Fixed checkpatch issue. (Sagar)

Signed-off-by: Sagar Arun Kamble <[email protected]>
Cc: Imre Deak <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Reviewed-by: Radoslaw Szwichtenberg <[email protected]> #1
Reviewed-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Acked-by: Imre Deak <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
digetx pushed a commit that referenced this issue Oct 13, 2017
We were using dev_priv->pm for runtime power management related state.
This patch renames it to "runtime_pm" which looks more apt.

v2: s/rpm/runtime_pm (Chris)

Signed-off-by: Sagar Arun Kamble <[email protected]>
Cc: Imre Deak <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Reviewed-by: Radoslaw Szwichtenberg <[email protected]> #1
Reviewed-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Acked-by: Imre Deak <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
digetx pushed a commit that referenced this issue Oct 13, 2017
…gt_pm"

Prepared substructure rps for RPS related state. autoenable_work is
used for RC6 too hence it is defined outside rps structure. As we do
this lot many functions are refactored to use intel_rps *rps to access
rps related members. Hence renamed intel_rps_client pointer variables
to rps_client in various functions.

v2: Rebase.

v3: s/pm/gt_pm (Chris)
Refactored access to rps structure by declaring struct intel_rps * in
many functions.

Signed-off-by: Sagar Arun Kamble <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Imre Deak <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Reviewed-by: Radoslaw Szwichtenberg <[email protected]> #1
Reviewed-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Acked-by: Imre Deak <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
digetx pushed a commit that referenced this issue Oct 13, 2017
This function gives the status of RC6, whether disabled or if
enabled then which state. intel_enable_rc6 will be used for
enabling RC6 in the next patch.

v2: Rebase.

v3: Rebase.

Signed-off-by: Sagar Arun Kamble <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Imre Deak <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Reviewed-by: Radoslaw Szwichtenberg <[email protected]> #1
Reviewed-by: Ewelina Musial <[email protected]> #1
Reviewed-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Acked-by: Imre Deak <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
digetx pushed a commit that referenced this issue Oct 13, 2017
The dummy-hcd driver calls the gadget driver's disconnect callback
under the wrong conditions.  It should invoke the callback when Vbus
power is turned off, but instead it does so when the D+ pullup is
turned off.

This can cause a deadlock in the composite core when a gadget driver
is unregistered:

[   88.361471] ============================================
[   88.362014] WARNING: possible recursive locking detected
[   88.362580] 4.14.0-rc2+ #9 Not tainted
[   88.363010] --------------------------------------------
[   88.363561] v4l_id/526 is trying to acquire lock:
[   88.364062]  (&(&cdev->lock)->rlock){....}, at: [<ffffffffa0547e03>] composite_disconnect+0x43/0x100 [libcomposite]
[   88.365051]
[   88.365051] but task is already holding lock:
[   88.365826]  (&(&cdev->lock)->rlock){....}, at: [<ffffffffa0547b09>] usb_function_deactivate+0x29/0x80 [libcomposite]
[   88.366858]
[   88.366858] other info that might help us debug this:
[   88.368301]  Possible unsafe locking scenario:
[   88.368301]
[   88.369304]        CPU0
[   88.369701]        ----
[   88.370101]   lock(&(&cdev->lock)->rlock);
[   88.370623]   lock(&(&cdev->lock)->rlock);
[   88.371145]
[   88.371145]  *** DEADLOCK ***
[   88.371145]
[   88.372211]  May be due to missing lock nesting notation
[   88.372211]
[   88.373191] 2 locks held by v4l_id/526:
[   88.373715]  #0:  (&(&cdev->lock)->rlock){....}, at: [<ffffffffa0547b09>] usb_function_deactivate+0x29/0x80 [libcomposite]
[   88.374814]  #1:  (&(&dum_hcd->dum->lock)->rlock){....}, at: [<ffffffffa05bd48d>] dummy_pullup+0x7d/0xf0 [dummy_hcd]
[   88.376289]
[   88.376289] stack backtrace:
[   88.377726] CPU: 0 PID: 526 Comm: v4l_id Not tainted 4.14.0-rc2+ #9
[   88.378557] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   88.379504] Call Trace:
[   88.380019]  dump_stack+0x86/0xc7
[   88.380605]  __lock_acquire+0x841/0x1120
[   88.381252]  lock_acquire+0xd5/0x1c0
[   88.381865]  ? composite_disconnect+0x43/0x100 [libcomposite]
[   88.382668]  _raw_spin_lock_irqsave+0x40/0x54
[   88.383357]  ? composite_disconnect+0x43/0x100 [libcomposite]
[   88.384290]  composite_disconnect+0x43/0x100 [libcomposite]
[   88.385490]  set_link_state+0x2d4/0x3c0 [dummy_hcd]
[   88.386436]  dummy_pullup+0xa7/0xf0 [dummy_hcd]
[   88.387195]  usb_gadget_disconnect+0xd8/0x160 [udc_core]
[   88.387990]  usb_gadget_deactivate+0xd3/0x160 [udc_core]
[   88.388793]  usb_function_deactivate+0x64/0x80 [libcomposite]
[   88.389628]  uvc_function_disconnect+0x1e/0x40 [usb_f_uvc]

This patch changes the code to test the port-power status bit rather
than the port-connect status bit when deciding whether to isue the
callback.

Signed-off-by: Alan Stern <[email protected]>
Reported-by: David Tulloh <[email protected]>
CC: <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
digetx pushed a commit that referenced this issue Oct 13, 2017
I forgot to account for the fact that the device core holds a
reference to a device added with device_initialize() that need
to be released with a corresponding put_device() to reach a 0
refcount at the end of the lifecycle.

This led to a NULL pointer reference when freeing the device
when e.g. unbidning the host device in sysfs.

Fix this and use the device .release() callback to free the
IDA and free:ing the memory used by the RPMB device.

Before this patch:

/sys/bus/amba/drivers/mmci-pl18x$ echo 80114000.sdi4_per2 > unbind
[   29.797332] mmc3: card 0001 removed
[   29.810791] Unable to handle kernel NULL pointer dereference at
               virtual address 00000050
[   29.818878] pgd = de70c000
[   29.821624] [00000050] *pgd=1e70a831, *pte=00000000, *ppte=00000000
[   29.827911] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[   29.833282] Modules linked in:
[   29.836334] CPU: 1 PID: 154 Comm: sh Not tainted
               4.14.0-rc3-00039-g83318e309566-dirty #736
[   29.844604] Hardware name: ST-Ericsson Ux5x0 platform (Device Tree Support)
[   29.851562] task: de572700 task.stack: de742000
[   29.856079] PC is at kernfs_find_ns+0x8/0x100
[   29.860443] LR is at kernfs_find_and_get_ns+0x30/0x48

After this patch:

/sys/bus/amba/drivers/mmci-pl18x$ echo 80005000.sdi4_per2 > unbind
[   20.623382] mmc3: card 0001 removed

Fixes: 0ab0c7c ("mmc: block: Convert RPMB to a character device")
Reported-by: Adrian Hunter <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Acked-by: Adrian Hunter <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
digetx pushed a commit that referenced this issue Oct 13, 2017
stop_machine is not really a locking primitive we should use, except
when the hw folks tell us the hw is broken and that's the only way to
work around it.

This patch tries to address the locking abuse of stop_machine() from

commit 20e4933
Author: Chris Wilson <[email protected]>
Date:   Tue Nov 22 14:41:21 2016 +0000

    drm/i915: Stop the machine as we install the wedged submit_request handler

Chris said parts of the reasons for going with stop_machine() was that
it's no overhead for the fast-path. But these callbacks use irqsave
spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.

To stay as close as possible to the stop_machine semantics we first
update all the submit function pointers to the nop handler, then call
synchronize_rcu() to make sure no new requests can be submitted. This
should give us exactly the huge barrier we want.

I pondered whether we should annotate engine->submit_request as __rcu
and use rcu_assign_pointer and rcu_dereference on it. But the reason
behind those is to make sure the compiler/cpu barriers are there for
when you have an actual data structure you point at, to make sure all
the writes are seen correctly on the read side. But we just have a
function pointer, and .text isn't changed, so no need for these
barriers and hence no need for annotations.

Unfortunately there's a complication with the call to
intel_engine_init_global_seqno:

- Without stop_machine we must hold the corresponding spinlock.

- Without stop_machine we must ensure that all requests are marked as
  having failed with dma_fence_set_error() before we call it. That
  means we need to split the nop request submission into two phases,
  both synchronized with rcu:

  1. Only stop submitting the requests to hw and mark them as failed.

  2. After all pending requests in the scheduler/ring are suitably
  marked up as failed and we can force complete them all, also force
  complete by calling intel_engine_init_global_seqno().

This should fix the followwing lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc3-CI-CI_DRM_3179+ #1 Tainted: G     U
------------------------------------------------------
kworker/3:4/562 is trying to acquire lock:
 (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8113d4bc>] stop_machine+0x1c/0x40

but task is already holding lock:
 (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (&dev->struct_mutex){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_interruptible_nested+0x1b/0x20
       i915_mutex_lock_interruptible+0x51/0x130 [i915]
       i915_gem_fault+0x209/0x650 [i915]
       __do_fault+0x1e/0x80
       __handle_mm_fault+0xa08/0xed0
       handle_mm_fault+0x156/0x300
       __do_page_fault+0x2c5/0x570
       do_page_fault+0x28/0x250
       page_fault+0x22/0x30

-> #5 (&mm->mmap_sem){++++}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __might_fault+0x68/0x90
       _copy_to_user+0x23/0x70
       filldir+0xa5/0x120
       dcache_readdir+0xf9/0x170
       iterate_dir+0x69/0x1a0
       SyS_getdents+0xa5/0x140
       entry_SYSCALL_64_fastpath+0x1c/0xb1

-> #4 (&sb->s_type->i_mutex_key#5){++++}:
       down_write+0x3b/0x70
       handle_create+0xcb/0x1e0
       devtmpfsd+0x139/0x180
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #3 ((complete)&req.done){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       wait_for_common+0x58/0x210
       wait_for_completion+0x1d/0x20
       devtmpfs_create_node+0x13d/0x160
       device_add+0x5eb/0x620
       device_create_groups_vargs+0xe0/0xf0
       device_create+0x3a/0x40
       msr_device_create+0x2b/0x40
       cpuhp_invoke_callback+0xc9/0xbf0
       cpuhp_thread_fun+0x17b/0x240
       smpboot_thread_fn+0x18a/0x280
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #2 (cpuhp_state-up){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpuhp_issue_call+0x133/0x1c0
       __cpuhp_setup_state_cpuslocked+0x139/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_writeback_init+0x43/0x67
       pagecache_init+0x3d/0x42
       start_kernel+0x3a8/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #1 (cpuhp_state_mutex){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_nested+0x1b/0x20
       __cpuhp_setup_state_cpuslocked+0x53/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_alloc_init+0x28/0x30
       start_kernel+0x145/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #0 (cpu_hotplug_lock.rw_sem){++++}:
       check_prev_add+0x430/0x840
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpus_read_lock+0x3d/0xb0
       stop_machine+0x1c/0x40
       i915_gem_set_wedged+0x1a/0x20 [i915]
       i915_reset+0xb9/0x230 [i915]
       i915_reset_device+0x1f6/0x260 [i915]
       i915_handle_error+0x2d8/0x430 [i915]
       hangcheck_declare_hang+0xd3/0xf0 [i915]
       i915_hangcheck_elapsed+0x262/0x2d0 [i915]
       process_one_work+0x233/0x660
       worker_thread+0x4e/0x3b0
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

other info that might help us debug this:

Chain exists of:
  cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev->struct_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&dev->struct_mutex);
                               lock(&mm->mmap_sem);
                               lock(&dev->struct_mutex);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

3 locks held by kworker/3:4/562:
 #0:  ("events_long"){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
 #1:  ((&(&i915->gpu_error.hangcheck_work)->work)){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
 #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]

stack backtrace:
CPU: 3 PID: 562 Comm: kworker/3:4 Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3179+ #1
Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
Workqueue: events_long i915_hangcheck_elapsed [i915]
Call Trace:
 dump_stack+0x68/0x9f
 print_circular_bug+0x235/0x3c0
 ? lockdep_init_map_crosslock+0x20/0x20
 check_prev_add+0x430/0x840
 ? irq_work_queue+0x86/0xe0
 ? wake_up_klogd+0x53/0x70
 __lock_acquire+0x1420/0x15e0
 ? __lock_acquire+0x1420/0x15e0
 ? lockdep_init_map_crosslock+0x20/0x20
 lock_acquire+0xb0/0x200
 ? stop_machine+0x1c/0x40
 ? i915_gem_object_truncate+0x50/0x50 [i915]
 cpus_read_lock+0x3d/0xb0
 ? stop_machine+0x1c/0x40
 stop_machine+0x1c/0x40
 i915_gem_set_wedged+0x1a/0x20 [i915]
 i915_reset+0xb9/0x230 [i915]
 i915_reset_device+0x1f6/0x260 [i915]
 ? gen8_gt_irq_ack+0x170/0x170 [i915]
 ? work_on_cpu_safe+0x60/0x60
 i915_handle_error+0x2d8/0x430 [i915]
 ? vsnprintf+0xd1/0x4b0
 ? scnprintf+0x3a/0x70
 hangcheck_declare_hang+0xd3/0xf0 [i915]
 ? intel_runtime_pm_put+0x56/0xa0 [i915]
 i915_hangcheck_elapsed+0x262/0x2d0 [i915]
 process_one_work+0x233/0x660
 worker_thread+0x4e/0x3b0
 kthread+0x152/0x190
 ? process_one_work+0x660/0x660
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x27/0x40
Setting dangerous option reset - tainting kernel
i915 0000:00:02.0: Resetting chip after gpu hang
Setting dangerous option reset - tainting kernel
i915 0000:00:02.0: Resetting chip after gpu hang

v2: Have 1 global synchronize_rcu() barrier across all engines, and
improve commit message.

v3: We need to protect the seqno update with the timeline spinlock (in
set_wedged) to avoid racing with other updates of the seqno, like we
already do in nop_submit_request (Chris).

v4: Use two-phase sequence to plug the race Chris spotted where we can
complete requests before they're marked up with -EIO.

v5: Review from Chris:
- simplify nop_submit_request.
- Add comment to rcu_read_lock section.
- Align comments with the new style.

v6: Remove unused variable to appease CI.

Reviewed-by: Chris Wilson <[email protected]>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102886
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103096
Cc: Chris Wilson <[email protected]>
Cc: Mika Kuoppala <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Marta Lofstedt <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
digetx pushed a commit that referenced this issue Oct 13, 2017
Purely to silence lockdep, as we know that no bo can exist at this time
and so the inversion is impossible. Nevertheless, lockdep currently
warns on unload:

[  137.522565] WARNING: possible circular locking dependency detected
[  137.522568] 4.14.0-rc4-CI-CI_DRM_3209+ #1 Tainted: G     U
[  137.522570] ------------------------------------------------------
[  137.522572] drv_module_relo/1532 is trying to acquire lock:
[  137.522574]  ("i915-userptr-acquire"){+.+.}, at: [<ffffffff8109a831>] flush_workqueue+0x91/0x540
[  137.522581]
               but task is already holding lock:
[  137.522583]  (&dev->struct_mutex){+.+.}, at: [<ffffffffa014fb3f>] i915_gem_fini+0x3f/0xc0 [i915]
[  137.522605]
               which lock already depends on the new lock.

[  137.522608]
               the existing dependency chain (in reverse order) is:
[  137.522611]
               -> #3 (&dev->struct_mutex){+.+.}:
[  137.522615]        __lock_acquire+0x1420/0x15e0
[  137.522618]        lock_acquire+0xb0/0x200
[  137.522621]        __mutex_lock+0x86/0x9b0
[  137.522623]        mutex_lock_interruptible_nested+0x1b/0x20
[  137.522640]        i915_mutex_lock_interruptible+0x51/0x130 [i915]
[  137.522657]        i915_gem_fault+0x20b/0x720 [i915]
[  137.522660]        __do_fault+0x1e/0x80
[  137.522662]        __handle_mm_fault+0xa08/0xed0
[  137.522664]        handle_mm_fault+0x156/0x300
[  137.522666]        __do_page_fault+0x2c5/0x570
[  137.522668]        do_page_fault+0x28/0x250
[  137.522671]        page_fault+0x22/0x30
[  137.522672]
               -> #2 (&mm->mmap_sem){++++}:
[  137.522677]        __lock_acquire+0x1420/0x15e0
[  137.522679]        lock_acquire+0xb0/0x200
[  137.522682]        down_read+0x3e/0x70
[  137.522699]        __i915_gem_userptr_get_pages_worker+0x141/0x240 [i915]
[  137.522701]        process_one_work+0x233/0x660
[  137.522704]        worker_thread+0x4e/0x3b0
[  137.522706]        kthread+0x152/0x190
[  137.522708]        ret_from_fork+0x27/0x40
[  137.522710]
               -> #1 ((&work->work)){+.+.}:
[  137.522714]        __lock_acquire+0x1420/0x15e0
[  137.522717]        lock_acquire+0xb0/0x200
[  137.522719]        process_one_work+0x206/0x660
[  137.522721]        worker_thread+0x4e/0x3b0
[  137.522723]        kthread+0x152/0x190
[  137.522725]        ret_from_fork+0x27/0x40
[  137.522727]
               -> #0 ("i915-userptr-acquire"){+.+.}:
[  137.522731]        check_prev_add+0x430/0x840
[  137.522733]        __lock_acquire+0x1420/0x15e0
[  137.522735]        lock_acquire+0xb0/0x200
[  137.522738]        flush_workqueue+0xb4/0x540
[  137.522740]        drain_workqueue+0xd4/0x1b0
[  137.522742]        destroy_workqueue+0x1c/0x200
[  137.522758]        i915_gem_cleanup_userptr+0x15/0x20 [i915]
[  137.522770]        i915_gem_fini+0x5f/0xc0 [i915]
[  137.522782]        i915_driver_unload+0x122/0x180 [i915]
[  137.522794]        i915_pci_remove+0x19/0x30 [i915]
[  137.522797]        pci_device_remove+0x39/0xb0
[  137.522800]        device_release_driver_internal+0x15d/0x220
[  137.522803]        driver_detach+0x40/0x80
[  137.522805]        bus_remove_driver+0x58/0xd0
[  137.522807]        driver_unregister+0x2c/0x40
[  137.522809]        pci_unregister_driver+0x36/0xb0
[  137.522828]        i915_exit+0x1a/0x8b [i915]
[  137.522831]        SyS_delete_module+0x18c/0x1e0
[  137.522834]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  137.522835]
               other info that might help us debug this:

[  137.522838] Chain exists of:
                 "i915-userptr-acquire" --> &mm->mmap_sem --> &dev->struct_mutex

[  137.522844]  Possible unsafe locking scenario:

[  137.522846]        CPU0                    CPU1
[  137.522848]        ----                    ----
[  137.522850]   lock(&dev->struct_mutex);
[  137.522852]                                lock(&mm->mmap_sem);
[  137.522854]                                lock(&dev->struct_mutex);
[  137.522857]   lock("i915-userptr-acquire");
[  137.522859]
                *** DEADLOCK ***

[  137.522862] 3 locks held by drv_module_relo/1532:
[  137.522864]  #0:  (&dev->mutex){....}, at: [<ffffffff8161d47b>] device_release_driver_internal+0x2b/0x220
[  137.522869]  #1:  (&dev->mutex){....}, at: [<ffffffff8161d489>] device_release_driver_internal+0x39/0x220
[  137.522873]  #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa014fb3f>] i915_gem_fini+0x3f/0xc0 [i915]
[  137.522888]
               stack backtrace:
[  137.522891] CPU: 0 PID: 1532 Comm: drv_module_relo Tainted: G     U          4.14.0-rc4-CI-CI_DRM_3209+ #1
[  137.522894] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
[  137.522897] Call Trace:
[  137.522900]  dump_stack+0x68/0x9f
[  137.522902]  print_circular_bug+0x235/0x3c0
[  137.522905]  ? lockdep_init_map_crosslock+0x20/0x20
[  137.522908]  check_prev_add+0x430/0x840
[  137.522919]  ? i915_gem_fini+0x5f/0xc0 [i915]
[  137.522922]  ? __kernel_text_address+0x12/0x40
[  137.522925]  ? __save_stack_trace+0x66/0xd0
[  137.522928]  __lock_acquire+0x1420/0x15e0
[  137.522930]  ? __lock_acquire+0x1420/0x15e0
[  137.522933]  ? lockdep_init_map_crosslock+0x20/0x20
[  137.522936]  ? __this_cpu_preempt_check+0x13/0x20
[  137.522938]  lock_acquire+0xb0/0x200
[  137.522940]  ? flush_workqueue+0x91/0x540
[  137.522943]  flush_workqueue+0xb4/0x540
[  137.522945]  ? flush_workqueue+0x91/0x540
[  137.522948]  ? __mutex_unlock_slowpath+0x43/0x2c0
[  137.522951]  ? trace_hardirqs_on_caller+0xe3/0x1b0
[  137.522954]  drain_workqueue+0xd4/0x1b0
[  137.522956]  ? drain_workqueue+0xd4/0x1b0
[  137.522958]  destroy_workqueue+0x1c/0x200
[  137.522975]  i915_gem_cleanup_userptr+0x15/0x20 [i915]
[  137.522987]  i915_gem_fini+0x5f/0xc0 [i915]
[  137.523000]  i915_driver_unload+0x122/0x180 [i915]
[  137.523015]  i915_pci_remove+0x19/0x30 [i915]
[  137.523018]  pci_device_remove+0x39/0xb0
[  137.523021]  device_release_driver_internal+0x15d/0x220
[  137.523023]  driver_detach+0x40/0x80
[  137.523026]  bus_remove_driver+0x58/0xd0
[  137.523028]  driver_unregister+0x2c/0x40
[  137.523030]  pci_unregister_driver+0x36/0xb0
[  137.523049]  i915_exit+0x1a/0x8b [i915]
[  137.523052]  SyS_delete_module+0x18c/0x1e0
[  137.523055]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[  137.523057] RIP: 0033:0x7f7bd0609287
[  137.523059] RSP: 002b:00007ffef694bc18 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
[  137.523062] RAX: ffffffffffffffda RBX: ffffffff81493f33 RCX: 00007f7bd0609287
[  137.523065] RDX: 0000000000000001 RSI: 0000000000000800 RDI: 0000564f999f9fc8
[  137.523067] RBP: ffffc90005c4ff88 R08: 0000000000000000 R09: 0000000000000080
[  137.523069] R10: 00007f7bd20ef8c0 R11: 0000000000000246 R12: 0000000000000000
[  137.523072] R13: 00007ffef694be00 R14: 0000000000000000 R15: 0000000000000000
[  137.523075]  ? __this_cpu_preempt_check+0x13/0x20

Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Daniel Vetter <[email protected]>
digetx pushed a commit that referenced this issue Oct 19, 2017
Stack trace output during a stress test:
 [    4.310049] Freeing initrd memory: 22592K
[    4.310646] rtas_flash: no firmware flash support
[    4.313341] cpuhp/64: page allocation failure: order:0, mode:0x14480c0(GFP_KERNEL|__GFP_ZERO|__GFP_THISNODE), nodemask=(null)
[    4.313465] cpuhp/64 cpuset=/ mems_allowed=0
[    4.313521] CPU: 64 PID: 392 Comm: cpuhp/64 Not tainted 4.11.0-39.el7a.ppc64le #1
[    4.313588] Call Trace:
[    4.313622] [c000000f1fb1b8e0] [c000000000c09388] dump_stack+0xb0/0xf0 (unreliable)
[    4.313694] [c000000f1fb1b920] [c00000000030ef6c] warn_alloc+0x12c/0x1c0
[    4.313753] [c000000f1fb1b9c0] [c00000000030ff68] __alloc_pages_nodemask+0xea8/0x1000
[    4.313823] [c000000f1fb1bbb0] [c000000000113a8c] core_imc_mem_init+0xbc/0x1c0
[    4.313892] [c000000f1fb1bc00] [c000000000113cdc] ppc_core_imc_cpu_online+0x14c/0x170
[    4.313962] [c000000f1fb1bc90] [c000000000125758] cpuhp_invoke_callback+0x198/0x5d0
[    4.314031] [c000000f1fb1bd00] [c00000000012782c] cpuhp_thread_fun+0x8c/0x3d0
[    4.314101] [c000000f1fb1bd60] [c0000000001678d0] smpboot_thread_fn+0x290/0x2a0
[    4.314169] [c000000f1fb1bdc0] [c00000000015ee78] kthread+0x168/0x1b0
[    4.314229] [c000000f1fb1be30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74
[    4.314313] Mem-Info:
[    4.314356] active_anon:0 inactive_anon:0 isolated_anon:0

core_imc_mem_init() at system boot use alloc_pages_node() to get memory
and alloc_pages_node() throws this stack dump when tried to allocate
memory from a node which has no memory behind it. Add a ___GFP_NOWARN
flag in allocation request as a fix.

Signed-off-by: Anju T Sudhakar <[email protected]>
Reported-by: Michael Ellerman <[email protected]>
Reported-by: Venkat R.B <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
digetx pushed a commit that referenced this issue Oct 19, 2017
We want to keep ucode xfer functions separate from other
initialization. Once separated, add explicit forcewake.

v2: use BLITTER domain only and add comment (Daniele)

Suggested-by: Joonas Lahtinen <[email protected]>
Signed-off-by: Michal Wajdeczko <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Sagar Arun Kamble <[email protected]>
Cc: Daniele Ceraolo Spurio <[email protected]>
Reviewed-by: Chris Wilson <[email protected]> #1
Reviewed-by: Daniele Ceraolo Spurio <[email protected]>
Signed-off-by: Joonas Lahtinen <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
digetx pushed a commit that referenced this issue Oct 19, 2017
v4.10 commit 6f2ce1c ("scsi: zfcp: fix rport unblock race with LUN
recovery") extended accessing parent pointer fields of struct
zfcp_erp_action for tracing.  If an erp_action has never been enqueued
before, these parent pointer fields are uninitialized and NULL. Examples
are zfcp objects freshly added to the parent object's children list,
before enqueueing their first recovery subsequently. In
zfcp_erp_try_rport_unblock(), we iterate such list. Accessing erp_action
fields can cause a NULL pointer dereference.  Since the kernel can read
from lowcore on s390, it does not immediately cause a kernel page
fault. Instead it can cause hangs on trying to acquire the wrong
erp_action->adapter->dbf->rec_lock in zfcp_dbf_rec_action_lvl()
                      ^bogus^
while holding already other locks with IRQs disabled.

Real life example from attaching lots of LUNs in parallel on many CPUs:

crash> bt 17723
PID: 17723  TASK: ...               CPU: 25  COMMAND: "zfcperp0.0.1800"
 LOWCORE INFO:
  -psw      : 0x0404300180000000 0x000000000038e424
  -function : _raw_spin_lock_wait_flags at 38e424
...
 #0 [fdde8fc90] zfcp_dbf_rec_action_lvl at 3e0004e9862 [zfcp]
 #1 [fdde8fce8] zfcp_erp_try_rport_unblock at 3e0004dfddc [zfcp]
 #2 [fdde8fd38] zfcp_erp_strategy at 3e0004e0234 [zfcp]
 #3 [fdde8fda8] zfcp_erp_thread at 3e0004e0a12 [zfcp]
 #4 [fdde8fe60] kthread at 173550
 #5 [fdde8feb8] kernel_thread_starter at 10add2

zfcp_adapter
 zfcp_port
  zfcp_unit <address>, 0x404040d600000000
  scsi_device NULL, returning early!
zfcp_scsi_dev.status = 0x40000000
0x40000000 ZFCP_STATUS_COMMON_RUNNING

crash> zfcp_unit <address>
struct zfcp_unit {
  erp_action = {
    adapter = 0x0,
    port = 0x0,
    unit = 0x0,
  },
}

zfcp_erp_action is always fully embedded into its container object. Such
container object is never moved in its object tree (only add or delete).
Hence, erp_action parent pointers can never change.

To fix the issue, initialize the erp_action parent pointers before
adding the erp_action container to any list and thus before it becomes
accessible from outside of its initializing function.

In order to also close the time window between zfcp_erp_setup_act()
memsetting the entire erp_action to zero and setting the parent pointers
again, drop the memset and instead explicitly initialize individually
all erp_action fields except for parent pointers. To be extra careful
not to introduce any other unintended side effect, even keep zeroing the
erp_action fields for list and timer. Also double-check with
WARN_ON_ONCE that erp_action parent pointers never change, so we get to
know when we would deviate from previous behavior.

Signed-off-by: Steffen Maier <[email protected]>
Fixes: 6f2ce1c ("scsi: zfcp: fix rport unblock race with LUN recovery")
Cc: <[email protected]> #2.6.32+
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
digetx pushed a commit that referenced this issue Oct 19, 2017
The rcutorture test suite occasionally provokes a splat due to invoking
resched_cpu() on an offline CPU:

WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
Modules linked in:
CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff902ede9daf00 task.stack: ffff96c50010c000
RIP: 0010:native_smp_send_reschedule+0x37/0x40
RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
FS:  0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
Call Trace:
 resched_curr+0x8f/0x1c0
 resched_cpu+0x2c/0x40
 rcu_implicit_dynticks_qs+0x152/0x220
 force_qs_rnp+0x147/0x1d0
 ? sync_rcu_exp_select_cpus+0x450/0x450
 rcu_gp_kthread+0x5a9/0x950
 kthread+0x142/0x180
 ? force_qs_rnp+0x1d0/0x1d0
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x27/0x40
Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
---[ end trace 26df9e5df4bba4ac ]---

This splat cannot be generated by expedited grace periods because they
always invoke resched_cpu() on the current CPU, which is good because
expedited grace periods require that resched_cpu() unconditionally
succeed.  However, other parts of RCU can tolerate resched_cpu() acting
as a no-op, at least as long as it doesn't happen too often.

This commit therefore makes resched_cpu() invoke resched_curr() only if
the CPU is either online or is the current CPU.

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
digetx pushed a commit that referenced this issue Oct 19, 2017
The rcutorture test suite occasionally provokes a splat due to invoking
rt_mutex_lock() which needs to boost the priority of a task currently
sitting on a runqueue that belongs to an offline CPU:

WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
Modules linked in:
CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000
RIP: 0010:native_smp_send_reschedule+0x37/0x40
RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082
RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004
RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff
RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d
R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d
R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200
FS:  0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0
Call Trace:
 resched_curr+0x61/0xd0
 switched_to_rt+0x8f/0xa0
 rt_mutex_setprio+0x25c/0x410
 task_blocks_on_rt_mutex+0x1b3/0x1f0
 rt_mutex_slowlock+0xa9/0x1e0
 rt_mutex_lock+0x29/0x30
 rcu_boost_kthread+0x127/0x3c0
 kthread+0x104/0x140
 ? rcu_report_unblock_qs_rnp+0x90/0x90
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x22/0x30
Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48

But the target task's priority has already been adjusted, so the only
purpose of switched_to_rt() invoking resched_curr() is to wake up the
CPU running some task that needs to be preempted by the boosted task.
But the CPU is offline, which presumably means that the task must be
migrated to some other CPU, and that this other CPU will undertake any
needed preemption at the time of migration.  Because the runqueue lock
is held when resched_curr() is invoked, we know that the boosted task
cannot go anywhere, so it is not necessary to invoke resched_curr()
in this particular case.

This commit therefore makes switched_to_rt() refrain from invoking
resched_curr() when the target CPU is offline.

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
digetx pushed a commit that referenced this issue Oct 19, 2017
dma-debug reports the following warning:

[name:panic&]WARNING: CPU: 3 PID: 298 at kernel-4.4/lib/dma-debug.c:604
debug _dma_assert_idle+0x1a8/0x230()
DMA-API: cpu touching an active dma mapped cacheline [cln=0x00000882300]
CPU: 3 PID: 298 Comm: vold Tainted: G        W  O    4.4.22+ #1
Hardware name: MT6739 (DT)
Call trace:
[<ffffff800808acd0>] dump_backtrace+0x0/0x1d4
[<ffffff800808affc>] show_stack+0x14/0x1c
[<ffffff800838019c>] dump_stack+0xa8/0xe0
[<ffffff80080a0594>] warn_slowpath_common+0xf4/0x11c
[<ffffff80080a061c>] warn_slowpath_fmt+0x60/0x80
[<ffffff80083afe24>] debug_dma_assert_idle+0x1a8/0x230
[<ffffff80081dca9c>] wp_page_copy.isra.96+0x118/0x520
[<ffffff80081de114>] do_wp_page+0x4fc/0x534
[<ffffff80081e0a14>] handle_mm_fault+0xd4c/0x1310
[<ffffff8008098798>] do_page_fault+0x1c8/0x394
[<ffffff800808231c>] do_mem_abort+0x50/0xec

I found that debug_dma_alloc_coherent() and debug_dma_free_coherent()
assume that dma_alloc_coherent() always returns a linear address.  However
it's possible that dma_alloc_coherent() returns a non-linear address.  In
this case, page_to_pfn(virt_to_page(virt)) will return an incorrect pfn.
If the pfn is valid and mapped as a COW page, we will hit the warning when
doing wp_page_copy().

Fix this by calculating pfn for linear and non-linear addresses.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Miles Chen <[email protected]>
Reviewed-by: Robin Murphy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Mark Brown <[email protected]>
digetx pushed a commit that referenced this issue Oct 19, 2017
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by
going through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data.  One example is page_to_pfn() might access page->flags if this
is where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the exiting
memory from pfn 1 (i.e.  KVM).

Since struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
	for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().

===

Here is more detailed example of problem that this patch is addressing:

Run tested on qemu with the following arguments:

	-enable-kvm -cpu kvm64 -m 512 -smp 2

This patch reports that there are 98 unavailable pages.

They are: pfn 0 and pfns in range [159, 255].

Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
not reserve [159, 255] ones.

e820__memblock_setup() reports linux that the following physical ranges are
available:
    [1 , 158]
[256, 130783]

Notice, that exactly unavailable pfns are missing!

Now, lets check what we have in zone 0: [1, 131039]

pfn 0, is not part of the zone, but pfns [1, 158], are.

However, the bigger problem we have if we do not initialize these struct
pages is with memory hotplug.  Because, that path operates at 2M
boundaries (section_nr).  And checks if 2M range of pages is hot
removable.  It starts with first pfn from zone, rounds it down to 2M
boundary (sturct pages are allocated at 2M boundaries when vmemmap is
created), and checks if that section is hot removable.  In this case start
with pfn 1 and convert it down to pfn 0.  Later pfn is converted to struct
page, and some fields are checked.  Now, if we do not zero struct pages,
we get unpredictable results.

In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all vmemmap
memory to ones, the following panic is observed with kernel test without
this patch applied:

BUG: unable to handle kernel NULL pointer dereference at          (null)
IP: is_pageblock_removable_nolock+0x35/0x90
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT
...
task: ffff88001f4e2900 task.stack: ffffc90000314000
RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
RSP: 0018:ffffc90000317d60 EFLAGS: 00010202
RAX: ffffffffffffffff RBX: ffff88001d92b000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000200000 RDI: ffff88001d92b000
RBP: ffffc90000317d80 R08: 00000000000010c8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88001db2b000
R13: ffffffff81af6d00 R14: ffff88001f7d5000 R15: ffffffff82a1b6c0
FS:  00007f4eb857f7c0(0000) GS:ffffffff81c27000(0000) knlGS:0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001f4e6000 CR4: 00000000000006b0
Call Trace:
 ? is_mem_section_removable+0x5a/0xd0
 show_mem_removable+0x6b/0xa0
 dev_attr_show+0x1b/0x50
 sysfs_kf_seq_show+0xa1/0x100
 kernfs_seq_show+0x22/0x30
 seq_read+0x1ac/0x3a0
 kernfs_fop_read+0x36/0x190
 ? security_file_permission+0x90/0xb0
 __vfs_read+0x16/0x30
 vfs_read+0x81/0x130
 SyS_read+0x44/0xa0
 entry_SYSCALL_64_fastpath+0x1f/0xbd

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Steven Sistare <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Reviewed-by: Bob Picco <[email protected]>
Tested-by: Bob Picco <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Mark Brown <[email protected]>
digetx pushed a commit that referenced this issue Nov 30, 2017
When a thread mlocks an address space backed either by file pages which
are currently not present in memory or swapped out anon pages (not in
swapcache), a new page is allocated and added to the local pagevec
(lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
On I/O completion, the thread can wake on a different CPU, the mlock
syscall will then sets the PageMlocked() bit of the page but will not be
able to put that page in unevictable LRU as the page is on the pagevec of
a different CPU.  Even on drain, that page will go to evictable LRU
because the PageMlocked() bit is not checked on pagevec drain.

The page will eventually go to right LRU on reclaim but the LRU stats will
remain skewed for a long time.

This patch puts all the pages, even unevictable, to the pagevecs and on
the drain, the pages will be added on their LRUs correctly by checking
their evictability.  This resolves the mlocked pages on pagevec of other
CPUs issue because when those pagevecs will be drained, the mlocked file
pages will go to unevictable LRU.  Also this makes the race with munlock
easier to resolve because the pagevec drains happen in LRU lock.

However there is still one place which makes a page evictable and does
PageLRU check on that page without LRU lock and needs special attention.
TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().

	#0: __pagevec_lru_add_fn	#1: clear_page_mlock

	SetPageLRU()			if (!TestClearPageMlocked())
					  return
	smp_mb() // <--required
					// inside does PageLRU
	if (!PageMlocked())		if (isolate_lru_page())
	  move to evictable LRU		  putback_lru_page()
	else
	  move to unevictable LRU

In '#1', TestClearPageMlocked() provides full memory barrier semantics and
thus the PageLRU check (inside isolate_lru_page) can not be reordered
before it.

In '#0', without explicit memory barrier, the PageMlocked() check can be
reordered before SetPageLRU().  If that happens, '#0' can put a page in
unevictable LRU and '#1' might have just cleared the Mlocked bit of that
page but fails to isolate as PageLRU fails as '#0' still hasn't set
PageLRU bit of that page.  That page will be stranded on the unevictable
LRU.

There is one (good) side effect though.  Without this patch, the pages
allocated for System V shared memory segment are added to evictable LRUs
even after shmctl(SHM_LOCK) on that segment.  This patch will correctly
put such pages to unevictable LRU.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this issue Dec 1, 2017
When a thread mlocks an address space backed either by file pages which
are currently not present in memory or swapped out anon pages (not in
swapcache), a new page is allocated and added to the local pagevec
(lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
On I/O completion, the thread can wake on a different CPU, the mlock
syscall will then sets the PageMlocked() bit of the page but will not be
able to put that page in unevictable LRU as the page is on the pagevec of
a different CPU.  Even on drain, that page will go to evictable LRU
because the PageMlocked() bit is not checked on pagevec drain.

The page will eventually go to right LRU on reclaim but the LRU stats will
remain skewed for a long time.

This patch puts all the pages, even unevictable, to the pagevecs and on
the drain, the pages will be added on their LRUs correctly by checking
their evictability.  This resolves the mlocked pages on pagevec of other
CPUs issue because when those pagevecs will be drained, the mlocked file
pages will go to unevictable LRU.  Also this makes the race with munlock
easier to resolve because the pagevec drains happen in LRU lock.

However there is still one place which makes a page evictable and does
PageLRU check on that page without LRU lock and needs special attention.
TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().

	#0: __pagevec_lru_add_fn	#1: clear_page_mlock

	SetPageLRU()			if (!TestClearPageMlocked())
					  return
	smp_mb() // <--required
					// inside does PageLRU
	if (!PageMlocked())		if (isolate_lru_page())
	  move to evictable LRU		  putback_lru_page()
	else
	  move to unevictable LRU

In '#1', TestClearPageMlocked() provides full memory barrier semantics and
thus the PageLRU check (inside isolate_lru_page) can not be reordered
before it.

In '#0', without explicit memory barrier, the PageMlocked() check can be
reordered before SetPageLRU().  If that happens, '#0' can put a page in
unevictable LRU and '#1' might have just cleared the Mlocked bit of that
page but fails to isolate as PageLRU fails as '#0' still hasn't set
PageLRU bit of that page.  That page will be stranded on the unevictable
LRU.

There is one (good) side effect though.  Without this patch, the pages
allocated for System V shared memory segment are added to evictable LRUs
even after shmctl(SHM_LOCK) on that segment.  This patch will correctly
put such pages to unevictable LRU.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this issue Dec 11, 2017
When IOMMU support was enabled, dma-buf import in Exynos DRM was broken
since commit f43c359 ("drm/exynos: use real device for DMA-mapping
operations") due to using wrong struct device in drm_gem_prime_import()
function. This patch fixes following kernel BUG caused by incorrect buffer
mapping to DMA address space:

exynos-sysmmu 14650000.sysmmu: 14450000.mixer: PAGE FAULT occurred at 0xb2e00000
------------[ cut here ]------------
kernel BUG at drivers/iommu/exynos-iommu.c:449!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc4-next-20171016-00033-g990d723669fd #3165
Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
task: c0e0b7c0 task.stack: c0e00000
PC is at exynos_sysmmu_irq+0x1d0/0x24c
LR is at exynos_sysmmu_irq+0x154/0x24c
------------[ cut here ]------------

Reported-by: Marian Mihailescu <[email protected]>
Fixes: f43c359 ("drm/exynos: use real device for DMA-mapping operations")
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Tobias Jakobi <[email protected]>
Signed-off-by: Inki Dae <[email protected]>
digetx pushed a commit that referenced this issue Dec 11, 2017
There is nothing that says that number of TX queues == number of RX
queues. E.g. the ARTPEC-6 SoC has 2 TX queues and 1 RX queue.

This code is obviously wrong:
for (chan = 0; chan < tx_channel_count; chan++) {
    struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan];

priv->rx_queue has size MTL_MAX_RX_QUEUES, so this will send an
uninitialized napi_struct to __napi_schedule(), causing us to
crash in net_rx_action(), because napi_struct->poll is zero.

[12846.759880] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[12846.768014] pgd = (ptrval)
[12846.770742] [00000000] *pgd=39ec7831, *pte=00000000, *ppte=00000000
[12846.777023] Internal error: Oops: 80000007 [#1] PREEMPT SMP ARM
[12846.782942] Modules linked in:
[12846.785998] CPU: 0 PID: 161 Comm: dropbear Not tainted 4.15.0-rc2-00285-gf5fb5f2f39a7 #36
[12846.794177] Hardware name: Axis ARTPEC-6 Platform
[12846.798879] task: (ptrval) task.stack: (ptrval)
[12846.803407] PC is at 0x0
[12846.805942] LR is at net_rx_action+0x274/0x43c
[12846.810383] pc : [<00000000>]    lr : [<80bff064>]    psr: 200e0113
[12846.816648] sp : b90d9ae8  ip : b90d9ae8  fp : b90d9b44
[12846.821871] r10: 00000008  r9 : 0013250e  r8 : 00000100
[12846.827094] r7 : 0000012c  r6 : 00000000  r5 : 00000001  r4 : bac84900
[12846.833619] r3 : 00000000  r2 : b90d9b08  r1 : 00000000  r0 : bac84900

Since each DMA channel can be used for rx and tx simultaneously,
the current code should probably be rewritten so that napi_struct is
embedded in a new struct stmmac_channel.
That way, stmmac_poll() can call stmmac_tx_clean() on just the tx queue
where we got the IRQ, instead of looping through all tx queues.
This is also how the xgbe driver does it (another driver for this IP).

Fixes: c22a3f4 ("net: stmmac: adding multiple napi mechanism")
Signed-off-by: Niklas Cassel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this issue Dec 11, 2017
…ligned access

ubsan reports a warning like:

UBSAN: Undefined behaviour in ../include/linux/etherdevice.h:386:9
load of misaligned address ffffffc069ba0482 for type 'long unsigned int'
which requires 8 byte alignment
CPU: 0 PID: 901 Comm: sshd Not tainted 4.xx+ #1
Hardware name: linux,dummy-virt (DT)
Call trace:
[<ffffffc000093600>] dump_backtrace+0x0/0x348
[<ffffffc000093968>] show_stack+0x20/0x30
[<ffffffc001651664>] dump_stack+0x144/0x1b4
[<ffffffc0016519b0>] ubsan_epilogue+0x18/0x74
[<ffffffc001651bac>] __ubsan_handle_type_mismatch+0x1a0/0x25c
[<ffffffc00125d8a0>] dev_gro_receive+0x17d8/0x1830
[<ffffffc00125d928>] napi_gro_receive+0x30/0x158
[<ffffffc000f4f93c>] virtnet_receive+0xad4/0x1fa8

The reason is that when enabling the CONFIG_UBSAN_ALIGNMENT, ubsan will
report the unaligned access even if the system supports it
(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y).  This produces a lot of noise
in the log and causes confusion.

Prevent the detection of unaligned access when the system support
unaligned access.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ding Tianhong <[email protected]>
Cc: David Laight <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this issue Nov 30, 2021
In commit 510410b ("drm/msm: Implement mmap as GEM object
function") we switched to a new/cleaner method of doing things. That's
good, but we missed a little bit.

Before that commit, we used to _first_ run through the
drm_gem_mmap_obj() case where `obj->funcs->mmap()` was NULL. That meant
that we ran:

  vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
  vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
  vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);

...and _then_ we modified those mappings with our own. Now that
`obj->funcs->mmap()` is no longer NULL we don't run the default
code. It looks like the fact that the vm_flags got VM_IO / VM_DONTDUMP
was important because we're now getting crashes on Chromebooks that
use ARC++ while logging out. Specifically a crash that looks like this
(this is on a 5.10 kernel w/ relevant backports but also seen on a
5.15 kernel):

  Unable to handle kernel paging request at virtual address ffffffc008000000
  Mem abort info:
    ESR = 0x96000006
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000006
    CM = 0, WnR = 0
  swapper pgtable: 4k pages, 39-bit VAs, pgdp=000000008293d000
  [ffffffc008000000] pgd=00000001002b3003, p4d=00000001002b3003,
                     pud=00000001002b3003, pmd=0000000000000000
  Internal error: Oops: 96000006 [#1] PREEMPT SMP
  [...]
  CPU: 7 PID: 15734 Comm: crash_dump64 Tainted: G W 5.10.67 #1 [...]
  Hardware name: Qualcomm Technologies, Inc. sc7280 IDP SKU2 platform (DT)
  pstate: 80400009 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
  pc : __arch_copy_to_user+0xc0/0x30c
  lr : copyout+0xac/0x14c
  [...]
  Call trace:
   __arch_copy_to_user+0xc0/0x30c
   copy_page_to_iter+0x1a0/0x294
   process_vm_rw_core+0x240/0x408
   process_vm_rw+0x110/0x16c
   __arm64_sys_process_vm_readv+0x30/0x3c
   el0_svc_common+0xf8/0x250
   do_el0_svc+0x30/0x80
   el0_svc+0x10/0x1c
   el0_sync_handler+0x78/0x108
   el0_sync+0x184/0x1c0
  Code: f8408423 f80008c3 910020c6 36100082 (b8404423)

Let's add the two flags back in.

While we're at it, the fact that we aren't running the default means
that we _don't_ need to clear out VM_PFNMAP, so remove that and save
an instruction.

NOTE: it was confirmed that VM_IO was the important flag to fix the
problem I was seeing, but adding back VM_DONTDUMP seems like a sane
thing to do so I'm doing that too.

Fixes: 510410b ("drm/msm: Implement mmap as GEM object function")
Reported-by: Stephen Boyd <[email protected]>
Signed-off-by: Douglas Anderson <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Tested-by: Stephen Boyd <[email protected]>
Link: https://lore.kernel.org/r/20211110113334.1.I1687e716adb2df746da58b508db3f25423c40b27@changeid
Signed-off-by: Rob Clark <[email protected]>
digetx pushed a commit that referenced this issue Nov 30, 2021
If you happened to try to access `/dev/drm_dp_aux` devices provided by
the MSM DP AUX driver too early at bootup you could go boom. Let's
avoid that by only allowing AUX transfers when the controller is
powered up.

Specifically the crash that was seen (on Chrome OS 5.4 tree with
relevant backports):
  Kernel panic - not syncing: Asynchronous SError Interrupt
  CPU: 0 PID: 3131 Comm: fwupd Not tainted 5.4.144-16620-g28af11b73efb #1
  Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
  Call trace:
   dump_backtrace+0x0/0x14c
   show_stack+0x20/0x2c
   dump_stack+0xac/0x124
   panic+0x150/0x390
   nmi_panic+0x80/0x94
   arm64_serror_panic+0x78/0x84
   do_serror+0x0/0x118
   do_serror+0xa4/0x118
   el1_error+0xbc/0x160
   dp_catalog_aux_write_data+0x1c/0x3c
   dp_aux_cmd_fifo_tx+0xf0/0x1b0
   dp_aux_transfer+0x1b0/0x2bc
   drm_dp_dpcd_access+0x8c/0x11c
   drm_dp_dpcd_read+0x64/0x10c
   auxdev_read_iter+0xd4/0x1c4

I did a little bit of tracing and found that:
* We register the AUX device very early at bootup.
* Power isn't actually turned on for my system until
  hpd_event_thread() -> dp_display_host_init() -> dp_power_init()
* You can see that dp_power_init() calls dp_aux_init() which is where
  we start allowing AUX channel requests to go through.

In general this patch is a bit of a bandaid but at least it gets us
out of the current state where userspace acting at the wrong time can
fully crash the system.
* I think the more proper fix (which requires quite a bit more
  changes) is to power stuff on while an AUX transfer is
  happening. This is like the solution we did for ti-sn65dsi86. This
  might be required for us to move to populating the panel via the
  DP-AUX bus.
* Another fix considered was to dynamically register / unregister. I
  tried that at <https://crrev.com/c/3169431/3> but it got
  ugly. Currently there's a bug where the pm_runtime() state isn't
  tracked properly and that causes us to just keep registering more
  and more.

Signed-off-by: Douglas Anderson <[email protected]>
Reviewed-by: Kuogee Hsieh <[email protected]>
Reviewed-by: Abhinav Kumar <[email protected]>
Link: https://lore.kernel.org/r/20211109100403.1.I4e23470d681f7efe37e2e7f1a6466e15e9bb1d72@changeid
Signed-off-by: Rob Clark <[email protected]>
digetx pushed a commit that referenced this issue Nov 30, 2021
The capture code is typically run entirely in the fence signalling
critical path. We're about to add lockdep annotation in an upcoming patch
which reveals a lockdep splat similar to the below one.

Fix the associated potential deadlocks using __GFP_KSWAPD_RECLAIM
(which is the same as GFP_WAIT, but open-coded for clarity) rather than
GFP_KERNEL for memory allocation in the capture path. This has the
potential drawback that capture might fail in situations with memory
pressure.

[  234.842048] WARNING: possible circular locking dependency detected
[  234.842050] 5.15.0-rc7+ #20 Tainted: G     U  W
[  234.842052] ------------------------------------------------------
[  234.842054] gem_exec_captur/1180 is trying to acquire lock:
[  234.842056] ffffffffa3e51c00 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc+0x4d/0x330
[  234.842063]
               but task is already holding lock:
[  234.842064] ffffffffa3f57620 (dma_fence_map){++++}-{0:0}, at: i915_vma_snapshot_resource_pin+0x27/0x30 [i915]
[  234.842138]
               which lock already depends on the new lock.

[  234.842140]
               the existing dependency chain (in reverse order) is:
[  234.842142]
               -> #2 (dma_fence_map){++++}-{0:0}:
[  234.842145]        __dma_fence_might_wait+0x41/0xa0
[  234.842149]        dma_resv_lockdep+0x1dc/0x28f
[  234.842151]        do_one_initcall+0x58/0x2d0
[  234.842154]        kernel_init_freeable+0x273/0x2bf
[  234.842157]        kernel_init+0x16/0x120
[  234.842160]        ret_from_fork+0x1f/0x30
[  234.842163]
               -> #1 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
[  234.842166]        fs_reclaim_acquire+0x6d/0xd0
[  234.842168]        __kmalloc_node+0x51/0x3a0
[  234.842171]        alloc_cpumask_var_node+0x1b/0x30
[  234.842174]        native_smp_prepare_cpus+0xc7/0x292
[  234.842177]        kernel_init_freeable+0x160/0x2bf
[  234.842179]        kernel_init+0x16/0x120
[  234.842181]        ret_from_fork+0x1f/0x30
[  234.842184]
               -> #0 (fs_reclaim){+.+.}-{0:0}:
[  234.842186]        __lock_acquire+0x1161/0x1dc0
[  234.842189]        lock_acquire+0xb5/0x2b0
[  234.842192]        fs_reclaim_acquire+0xa1/0xd0
[  234.842193]        __kmalloc+0x4d/0x330
[  234.842196]        i915_vma_coredump_create+0x78/0x5b0 [i915]
[  234.842253]        intel_engine_coredump_add_vma+0x36/0xe0 [i915]
[  234.842307]        __i915_gpu_coredump+0x290/0x5e0 [i915]
[  234.842365]        i915_capture_error_state+0x57/0xa0 [i915]
[  234.842415]        intel_gt_handle_error+0x348/0x3e0 [i915]
[  234.842462]        intel_gt_debugfs_reset_store+0x3c/0x90 [i915]
[  234.842504]        simple_attr_write+0xc1/0xe0
[  234.842507]        full_proxy_write+0x53/0x80
[  234.842509]        vfs_write+0xbc/0x350
[  234.842513]        ksys_write+0x58/0xd0
[  234.842514]        do_syscall_64+0x38/0x90
[  234.842516]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  234.842519]
               other info that might help us debug this:

[  234.842521] Chain exists of:
                 fs_reclaim --> mmu_notifier_invalidate_range_start --> dma_fence_map

[  234.842526]  Possible unsafe locking scenario:

[  234.842528]        CPU0                    CPU1
[  234.842529]        ----                    ----
[  234.842531]   lock(dma_fence_map);
[  234.842532]                                lock(mmu_notifier_invalidate_range_start);
[  234.842535]                                lock(dma_fence_map);
[  234.842537]   lock(fs_reclaim);
[  234.842539]
                *** DEADLOCK ***

[  234.842540] 4 locks held by gem_exec_captur/1180:
[  234.842543]  #0: ffff9007812d9460 (sb_writers#17){.+.+}-{0:0}, at: ksys_write+0x58/0xd0
[  234.842547]  #1: ffff900781d9ecb8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_write+0x3a/0xe0
[  234.842552]  #2: ffffffffc11913a8 (capture_mutex){+.+.}-{3:3}, at: i915_capture_error_state+0x1a/0xa0 [i915]
[  234.842602]  #3: ffffffffa3f57620 (dma_fence_map){++++}-{0:0}, at: i915_vma_snapshot_resource_pin+0x27/0x30 [i915]
[  234.842656]
               stack backtrace:
[  234.842658] CPU: 0 PID: 1180 Comm: gem_exec_captur Tainted: G     U  W         5.15.0-rc7+ #20
[  234.842661] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 0403 01/26/2021
[  234.842664] Call Trace:
[  234.842666]  dump_stack_lvl+0x57/0x72
[  234.842669]  check_noncircular+0xde/0x100
[  234.842672]  ? __lock_acquire+0x3bf/0x1dc0
[  234.842675]  __lock_acquire+0x1161/0x1dc0
[  234.842678]  lock_acquire+0xb5/0x2b0
[  234.842680]  ? __kmalloc+0x4d/0x330
[  234.842683]  ? finish_task_switch.isra.0+0xf2/0x360
[  234.842686]  ? i915_vma_coredump_create+0x78/0x5b0 [i915]
[  234.842734]  fs_reclaim_acquire+0xa1/0xd0
[  234.842737]  ? __kmalloc+0x4d/0x330
[  234.842739]  __kmalloc+0x4d/0x330
[  234.842742]  i915_vma_coredump_create+0x78/0x5b0 [i915]
[  234.842793]  ? capture_vma+0xbe/0x110 [i915]
[  234.842844]  intel_engine_coredump_add_vma+0x36/0xe0 [i915]
[  234.842892]  __i915_gpu_coredump+0x290/0x5e0 [i915]
[  234.842939]  i915_capture_error_state+0x57/0xa0 [i915]
[  234.842985]  intel_gt_handle_error+0x348/0x3e0 [i915]
[  234.843032]  ? __mutex_lock+0x81/0x830
[  234.843035]  ? simple_attr_write+0x3a/0xe0
[  234.843038]  ? __lock_acquire+0x3bf/0x1dc0
[  234.843041]  intel_gt_debugfs_reset_store+0x3c/0x90 [i915]
[  234.843083]  ? _copy_from_user+0x45/0x80
[  234.843086]  simple_attr_write+0xc1/0xe0
[  234.843089]  full_proxy_write+0x53/0x80
[  234.843091]  vfs_write+0xbc/0x350
[  234.843094]  ksys_write+0x58/0xd0
[  234.843096]  do_syscall_64+0x38/0x90
[  234.843098]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  234.843101] RIP: 0033:0x7fa467480877
[  234.843103] Code: 75 05 48 83 c4 58 c3 e8 37 4e ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  234.843108] RSP: 002b:00007ffd14d79b08 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  234.843112] RAX: ffffffffffffffda RBX: 00007ffd14d79b60 RCX: 00007fa467480877
[  234.843114] RDX: 0000000000000014 RSI: 00007ffd14d79b60 RDI: 0000000000000007
[  234.843116] RBP: 0000000000000007 R08: 0000000000000000 R09: 00007ffd14d79ab0
[  234.843119] R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000014
[  234.843121] R13: 0000000000000000 R14: 00007ffd14d79b60 R15: 0000000000000005

v5:
- Use __GFP_KSWAPD_RECLAIM rather than __GFP_NOWAIT for clarity.
  (Daniel Vetter)
v6:
- Include an instance in execlists_capture_work().
- Rework the commit message due to patch reordering.

Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Ramalingam C <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
digetx pushed a commit that referenced this issue Nov 30, 2021
 into HEAD

KVM/riscv fixes for 5.16, take #1

- Fix incorrect KVM_MAX_VCPUS value

- Unmap stage2 mapping when deleting/moving a memslot
  (This was due to empty kvm_arch_flush_shadow_memslot())
digetx pushed a commit that referenced this issue Nov 30, 2021
…han_radar_detection

Fix the following NULL pointer dereference in
cfg80211_stop_offchan_radar_detection routine that occurs when hostapd
is stopped during the CAC on offchannel chain:

Sat Jan  1 0[  779.567851]   ESR = 0x96000005
0:12:50 2000 dae[  779.572346]   EC = 0x25: DABT (current EL), IL = 32 bits
mon.debug hostap[  779.578984]   SET = 0, FnV = 0
d: hostapd_inter[  779.583445]   EA = 0, S1PTW = 0
face_deinit_free[  779.587936] Data abort info:
: num_bss=1 conf[  779.592224]   ISV = 0, ISS = 0x00000005
->num_bss=1
Sat[  779.597403]   CM = 0, WnR = 0
 Jan  1 00:12:50[  779.601749] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000418b2000
 2000 daemon.deb[  779.609601] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
ug hostapd: host[  779.619657] Internal error: Oops: 96000005 [#1] SMP
[  779.770810] CPU: 0 PID: 2202 Comm: hostapd Not tainted 5.10.75 #0
[  779.776892] Hardware name: MediaTek MT7622 RFB1 board (DT)
[  779.782370] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[  779.788384] pc : cfg80211_chandef_valid+0x10/0x490 [cfg80211]
[  779.794128] lr : cfg80211_check_station_change+0x3190/0x3950 [cfg80211]
[  779.800731] sp : ffffffc01204b7e0
[  779.804036] x29: ffffffc01204b7e0 x28: ffffff80039bdc00
[  779.809340] x27: 0000000000000000 x26: ffffffc008cb3050
[  779.814644] x25: 0000000000000000 x24: 0000000000000002
[  779.819948] x23: ffffff8002630000 x22: ffffff8003e748d0
[  779.825252] x21: 0000000000000cc0 x20: ffffff8003da4a00
[  779.830556] x19: 0000000000000000 x18: ffffff8001bf7ce0
[  779.835860] x17: 00000000ffffffff x16: 0000000000000000
[  779.841164] x15: 0000000040d59200 x14: 00000000000019c0
[  779.846467] x13: 00000000000001c8 x12: 000636b9e9dab1c6
[  779.851771] x11: 0000000000000141 x10: 0000000000000820
[  779.857076] x9 : 0000000000000000 x8 : ffffff8003d7d038
[  779.862380] x7 : 0000000000000000 x6 : ffffff8003d7d038
[  779.867683] x5 : 0000000000000e90 x4 : 0000000000000038
[  779.872987] x3 : 0000000000000002 x2 : 0000000000000004
[  779.878291] x1 : 0000000000000000 x0 : 0000000000000000
[  779.883594] Call trace:
[  779.886039]  cfg80211_chandef_valid+0x10/0x490 [cfg80211]
[  779.891434]  cfg80211_check_station_change+0x3190/0x3950 [cfg80211]
[  779.897697]  nl80211_radar_notify+0x138/0x19c [cfg80211]
[  779.903005]  cfg80211_stop_offchan_radar_detection+0x7c/0x8c [cfg80211]
[  779.909616]  __cfg80211_leave+0x2c/0x190 [cfg80211]
[  779.914490]  cfg80211_register_netdevice+0x1c0/0x6d0 [cfg80211]
[  779.920404]  raw_notifier_call_chain+0x50/0x70
[  779.924841]  call_netdevice_notifiers_info+0x54/0xa0
[  779.929796]  __dev_close_many+0x40/0x100
[  779.933712]  __dev_change_flags+0x98/0x190
[  779.937800]  dev_change_flags+0x20/0x60
[  779.941628]  devinet_ioctl+0x534/0x6d0
[  779.945370]  inet_ioctl+0x1bc/0x230
[  779.948849]  sock_do_ioctl+0x44/0x200
[  779.952502]  sock_ioctl+0x268/0x4c0
[  779.955985]  __arm64_sys_ioctl+0xac/0xd0
[  779.959900]  el0_svc_common.constprop.0+0x60/0x110
[  779.964682]  do_el0_svc+0x1c/0x24
[  779.967990]  el0_svc+0x10/0x1c
[  779.971036]  el0_sync_handler+0x9c/0x120
[  779.974950]  el0_sync+0x148/0x180
[  779.978259] Code: a9bc7bfd 910003fd a90153f3 aa0003f3 (f9400000)
[  779.984344] ---[ end trace 0e67b4f5d6cdeec7 ]---
[  779.996400] Kernel panic - not syncing: Oops: Fatal exception
[  780.002139] SMP: stopping secondary CPUs
[  780.006057] Kernel Offset: disabled
[  780.009537] CPU features: 0x0000002,04002004
[  780.013796] Memory Limit: none

Fixes: b8f5fac ("cfg80211: implement APIs for dedicated radar detection HW")
Reported-by: Evelyn Tsai <[email protected]>
Tested-by: Evelyn Tsai <[email protected]>
Signed-off-by: Lorenzo Bianconi <[email protected]>
Link: https://lore.kernel.org/r/c2e34c065bf8839c5ffa45498ae154021a72a520.1635958796.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <[email protected]>
digetx pushed a commit that referenced this issue Nov 30, 2021
Currently, with an unknown recv_type, mwifiex_usb_recv
just return -1 without restoring the skb. Next time
mwifiex_usb_rx_complete is invoked with the same skb,
calling skb_put causes skb_over_panic.

The bug is triggerable with a compromised/malfunctioning
usb device. After applying the patch, skb_over_panic
no longer shows up with the same input.

Attached is the panic report from fuzzing.
skbuff: skb_over_panic: text:000000003bf1b5fa
 len:2048 put:4 head:00000000dd6a115b data:000000000a9445d8
 tail:0x844 end:0x840 dev:<NULL>
kernel BUG at net/core/skbuff.c:109!
invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 PID: 198 Comm: in:imklog Not tainted 5.6.0 #60
RIP: 0010:skb_panic+0x15f/0x161
Call Trace:
 <IRQ>
 ? mwifiex_usb_rx_complete+0x26b/0xfcd [mwifiex_usb]
 skb_put.cold+0x24/0x24
 mwifiex_usb_rx_complete+0x26b/0xfcd [mwifiex_usb]
 __usb_hcd_giveback_urb+0x1e4/0x380
 usb_giveback_urb_bh+0x241/0x4f0
 ? __hrtimer_run_queues+0x316/0x740
 ? __usb_hcd_giveback_urb+0x380/0x380
 tasklet_action_common.isra.0+0x135/0x330
 __do_softirq+0x18c/0x634
 irq_exit+0x114/0x140
 smp_apic_timer_interrupt+0xde/0x380
 apic_timer_interrupt+0xf/0x20
 </IRQ>

Reported-by: Brendan Dolan-Gavitt <[email protected]>
Signed-off-by: Zekun Shen <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
digetx pushed a commit that referenced this issue Nov 30, 2021
when the number of DRR classes decreases, the round-robin active list can
contain elements that have already been freed in ets_qdisc_change(). As a
consequence, it's possible to see a NULL dereference crash, caused by the
attempt to call cl->qdisc->ops->peek(cl->qdisc) when cl->qdisc is NULL:

 BUG: kernel NULL pointer dereference, address: 0000000000000018
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 1 PID: 910 Comm: mausezahn Not tainted 5.16.0-rc1+ #475
 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
 RIP: 0010:ets_qdisc_dequeue+0x129/0x2c0 [sch_ets]
 Code: c5 01 41 39 ad e4 02 00 00 0f 87 18 ff ff ff 49 8b 85 c0 02 00 00 49 39 c4 0f 84 ba 00 00 00 49 8b ad c0 02 00 00 48 8b 7d 10 <48> 8b 47 18 48 8b 40 38 0f ae e8 ff d0 48 89 c3 48 85 c0 0f 84 9d
 RSP: 0000:ffffbb36c0b5fdd8 EFLAGS: 00010287
 RAX: ffff956678efed30 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000002 RSI: ffffffff9b938dc9 RDI: 0000000000000000
 RBP: ffff956678efed30 R08: e2f3207fe360129c R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000001 R12: ffff956678efeac0
 R13: ffff956678efe800 R14: ffff956611545000 R15: ffff95667ac8f100
 FS:  00007f2aa9120740(0000) GS:ffff95667b800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000018 CR3: 000000011070c000 CR4: 0000000000350ee0
 Call Trace:
  <TASK>
  qdisc_peek_dequeued+0x29/0x70 [sch_ets]
  tbf_dequeue+0x22/0x260 [sch_tbf]
  __qdisc_run+0x7f/0x630
  net_tx_action+0x290/0x4c0
  __do_softirq+0xee/0x4f8
  irq_exit_rcu+0xf4/0x130
  sysvec_apic_timer_interrupt+0x52/0xc0
  asm_sysvec_apic_timer_interrupt+0x12/0x20
 RIP: 0033:0x7f2aa7fc9ad4
 Code: b9 ff ff 48 8b 54 24 18 48 83 c4 08 48 89 ee 48 89 df 5b 5d e9 ed fc ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa <53> 48 83 ec 10 48 8b 05 10 64 33 00 48 8b 00 48 85 c0 0f 85 84 00
 RSP: 002b:00007ffe5d33fab8 EFLAGS: 00000202
 RAX: 0000000000000002 RBX: 0000561f72c31460 RCX: 0000561f72c31720
 RDX: 0000000000000002 RSI: 0000561f72c31722 RDI: 0000561f72c31720
 RBP: 000000000000002a R08: 00007ffe5d33fa40 R09: 0000000000000014
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000561f7187e380
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000561f72c31460
  </TASK>
 Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt intel_rapl_msr iTCO_vendor_support intel_rapl_common joydev virtio_balloon lpc_ich i2c_i801 i2c_smbus pcspkr ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci ghash_clmulni_intel serio_raw libata virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod
 CR2: 0000000000000018

Ensuring that 'alist' was never zeroed [1] was not sufficient, we need to
remove from the active list those elements that are no more SP nor DRR.

[1] https://lore.kernel.org/netdev/60d274838bf09777f0371253416e8af71360bc08.1633609148.git.dcaratti@redhat.com/

v3: fix race between ets_qdisc_change() and ets_qdisc_dequeue() delisting
    DRR classes beyond 'nbands' in ets_qdisc_change() with the qdisc lock
    acquired, thanks to Cong Wang.

v2: when a NULL qdisc is found in the DRR active list, try to dequeue skb
    from the next list item.

Reported-by: Hangbin Liu <[email protected]>
Fixes: dcc68b4 ("net: sch_ets: Add a new Qdisc")
Signed-off-by: Davide Caratti <[email protected]>
Link: https://lore.kernel.org/r/7a5c496eed2d62241620bdbb83eb03fb9d571c99.1637762721.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>
digetx pushed a commit that referenced this issue Nov 30, 2021
When the `rmmod sata_fsl.ko` command is executed in the PPC64 GNU/Linux,
a bug is reported:
 ==================================================================
 BUG: Unable to handle kernel data access on read at 0x80000800805b502c
 Oops: Kernel access of bad area, sig: 11 [#1]
 NIP [c0000000000388a4] .ioread32+0x4/0x20
 LR [80000000000c6034] .sata_fsl_port_stop+0x44/0xe0 [sata_fsl]
 Call Trace:
  .free_irq+0x1c/0x4e0 (unreliable)
  .ata_host_stop+0x74/0xd0 [libata]
  .release_nodes+0x330/0x3f0
  .device_release_driver_internal+0x178/0x2c0
  .driver_detach+0x64/0xd0
  .bus_remove_driver+0x70/0xf0
  .driver_unregister+0x38/0x80
  .platform_driver_unregister+0x14/0x30
  .fsl_sata_driver_exit+0x18/0xa20 [sata_fsl]
  .__se_sys_delete_module+0x1ec/0x2d0
  .system_call_exception+0xfc/0x1f0
  system_call_common+0xf8/0x200
 ==================================================================

The triggering of the BUG is shown in the following stack:

driver_detach
  device_release_driver_internal
    __device_release_driver
      drv->remove(dev) --> platform_drv_remove/platform_remove
        drv->remove(dev) --> sata_fsl_remove
          iounmap(host_priv->hcr_base);			<---- unmap
          kfree(host_priv);                             <---- free
      devres_release_all
        release_nodes
          dr->node.release(dev, dr->data) --> ata_host_stop
            ap->ops->port_stop(ap) --> sata_fsl_port_stop
                ioread32(hcr_base + HCONTROL)           <---- UAF
            host->ops->host_stop(host)

The iounmap(host_priv->hcr_base) and kfree(host_priv) functions should
not be executed in drv->remove. These functions should be executed in
host_stop after port_stop. Therefore, we move these functions to the
new function sata_fsl_host_stop and bind the new function to host_stop.

Fixes: faf0b2e ("drivers/ata: add support to Freescale 3.0Gbps SATA Controller")
Cc: [email protected]
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Baokun Li <[email protected]>
Reviewed-by: Sergei Shtylyov <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>
digetx pushed a commit that referenced this issue Nov 30, 2021
There is a kernel panic caused by pcpu_alloc_pages() passing offlined and
uninitialized node to alloc_pages_node() leading to panic by NULL
dereferencing uninitialized NODE_DATA(nid).

 CPU2 has been hot-added
 BUG: unable to handle page fault for address: 0000000000001608
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

 RIP: 0010:__alloc_pages+0x127/0x290
 Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
 RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
 RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
 RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
 R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
 FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
 Call Trace:
  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
  pcpu_populate_chunk+0x33/0xb0
  pcpu_alloc+0x4d3/0x6f0
  __alloc_percpu_gfp+0xd/0x10
  alloc_mem_cgroup_per_node_info+0x54/0xb0
  mem_cgroup_alloc+0xed/0x2f0
  mem_cgroup_css_alloc+0x33/0x2f0
  css_create+0x3a/0x1f0
  cgroup_apply_control_enable+0x12b/0x150
  cgroup_mkdir+0xdd/0x110
  kernfs_iop_mkdir+0x4f/0x80
  vfs_mkdir+0x178/0x230
  do_mkdirat+0xfd/0x120
  __x64_sys_mkdir+0x47/0x70
  ? syscall_exit_to_user_mode+0x21/0x50
  do_syscall_64+0x43/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Panic can be easily reproduced by disabling udev rule for automatic
onlining hot added CPU followed by CPU with memoryless node (NUMA node
with CPU only) hot add.

Hot adding CPU and memoryless node does not bring the node to online
state.  Memoryless node will be onlined only during the onlining its CPU.

Node can be in one of the following states:
1. not present.(nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
				NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
				NODE_DATA(nid) != NULL)

Percpu code is doing allocations for all possible CPUs.  The issue happens
when it serves hot added but not yet onlined CPU when its node is in 2nd
state.  This node is not ready to use, fallback to numa_mem_id().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Makhalov <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this issue Nov 30, 2021
A kernel panic was observed during reading /proc/kpageflags for first few
pfns allocated by pmem namespace:

BUG: unable to handle page fault for address: fffffffffffffffe
[  114.495280] #PF: supervisor read access in kernel mode
[  114.495738] #PF: error_code(0x0000) - not-present page
[  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
[  114.496713] Oops: 0000 [#1] SMP PTI
[  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1 #1
[  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
[  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
[  114.500788] RSP: 0018:ffffa5e601a0fe60 EFLAGS: 00010202
[  114.501373] RAX: fffffffffffffffe RBX: ffffffffffffffff RCX: 0000000000000000
[  114.502009] RDX: 0000000000000001 RSI: 00007ffca13a7310 RDI: ffffd07489000000
[  114.502637] RBP: ffffd07489000000 R08: 0000000000000001 R09: 0000000000000000
[  114.503270] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000240000
[  114.503896] R13: 0000000000080000 R14: 00007ffca13a7310 R15: ffffa5e601a0ff08
[  114.504530] FS:  00007f0266c7f540(0000) GS:ffff962dbbac0000(0000) knlGS:0000000000000000
[  114.505245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  114.505754] CR2: fffffffffffffffe CR3: 000000023a204000 CR4: 00000000000006e0
[  114.506401] Call Trace:
[  114.506660]  kpageflags_read+0xb1/0x130
[  114.507051]  proc_reg_read+0x39/0x60
[  114.507387]  vfs_read+0x8a/0x140
[  114.507686]  ksys_pread64+0x61/0xa0
[  114.508021]  do_syscall_64+0x5f/0x1a0
[  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  114.508844] RIP: 0033:0x7f0266ba426b

The reason for the panic is that stable_page_flags() which parses the page
flags uses uninitialized struct pages reserved by the ZONE_DEVICE driver.

Earlier approach to fix this was discussed here:
https://marc.info/?l=linux-mm&m=152964770000672&w=2

This is another approach.  To avoid using the uninitialized struct page,
immediately return with KPF_RESERVED at the beginning of
stable_page_flags() if the page is reserved by ZONE_DEVICE driver.

Dan said:

: The nvdimm implementation uses vmem_altmap to arrange for the 'struct
: page' array to be allocated from a reservation of a pmem namespace.  A
: namespace in this mode contains an info-block that consumes the first
: 8K of the namespace capacity, capacity designated for page mapping,
: capacity for padding the start of data to optionally 4K, 2MB, or 1GB
: (on x86), and then the namespace data itself.  The implementation
: specifies a section aligned (now sub-section aligned) address to
: arch_add_memory() to establish the linear mapping to map the metadata,
: and then vmem_altmap indicates to memmap_init_zone() which pfns
: represent data.  The implementation only specifies enough 'struct page'
: capacity for pfn_to_page() to operate on the data space, not the
: namespace metadata space.
:
: The proposal to validate ZONE_DEVICE pfns against the altmap seems the
: right approach to me.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Toshiki Fukasawa <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Junichi Nomura <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this issue Dec 2, 2021
When CONFIG_FSL_PMC is set to n, no value is assigned to cpu_up_prepare
in the mpc85xx_pm_ops structure. As a result, oops is triggered in
smp_85xx_start_cpu().

  smp: Bringing up secondary CPUs ...
  kernel tried to execute user page (0) - exploit attempt? (uid: 0)
  BUG: Unable to handle kernel instruction fetch (NULL pointer?)
  Faulting instruction address: 0x00000000
  Oops: Kernel access of bad area, sig: 11 [#1]
  ...
  NIP [00000000] 0x0
  LR [c0021d2c] smp_85xx_kick_cpu+0xe8/0x568
  Call Trace:
  [c1051da8] [c0021cb8] smp_85xx_kick_cpu+0x74/0x568 (unreliable)
  [c1051de8] [c0011460] __cpu_up+0xc0/0x228
  [c1051e18] [c0031bbc] bringup_cpu+0x30/0x224
  [c1051e48] [c0031f3c] cpu_up.constprop.0+0x180/0x33c
  [c1051e88] [c00322e8] bringup_nonboot_cpus+0x88/0xc8
  [c1051eb8] [c07e67bc] smp_init+0x30/0x78
  [c1051ed8] [c07d9e28] kernel_init_freeable+0x118/0x2a8
  [c1051f18] [c00032d8] kernel_init+0x14/0x124
  [c1051f38] [c0010278] ret_from_kernel_thread+0x14/0x1c

Fixes: c45361a ("powerpc/85xx: fix timebase sync issue when CONFIG_HOTPLUG_CPU=n")
Reported-by: Martin Kennedy <[email protected]>
Signed-off-by: Xiaoming Ni <[email protected]>
Tested-by: Martin Kennedy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
digetx pushed a commit that referenced this issue Dec 2, 2021
The func v4l2_m2m_ctx_release waits for currently running jobs
to finish and then stop streaming both queues and frees the buffers.
All this should be done before the call to mtk_vcodec_enc_release
which frees the encoder handler. This fixes null-pointer dereference bug:

[  638.028076] Mem abort info:
[  638.030932]   ESR = 0x96000004
[  638.033978]   EC = 0x25: DABT (current EL), IL = 32 bits
[  638.039293]   SET = 0, FnV = 0
[  638.042338]   EA = 0, S1PTW = 0
[  638.045474]   FSC = 0x04: level 0 translation fault
[  638.050349] Data abort info:
[  638.053224]   ISV = 0, ISS = 0x00000004
[  638.057055]   CM = 0, WnR = 0
[  638.060018] user pgtable: 4k pages, 48-bit VAs, pgdp=000000012b6db000
[  638.066485] [00000000000001a0] pgd=0000000000000000, p4d=0000000000000000
[  638.073277] Internal error: Oops: 96000004 [#1] SMP
[  638.078145] Modules linked in: rfkill mtk_vcodec_dec mtk_vcodec_enc uvcvideo mtk_mdp mtk_vcodec_common videobuf2_dma_contig v4l2_h264 cdc_ether v4l2_mem2mem videobuf2_vmalloc usbnet videobuf2_memops videobuf2_v4l2 r8152 videobuf2_common videodev cros_ec_sensors cros_ec_sensors_core industrialio_triggered_buffer kfifo_buf elan_i2c elants_i2c sbs_battery mc cros_usbpd_charger cros_ec_chardev cros_usbpd_logger crct10dif_ce mtk_vpu fuse ip_tables x_tables ipv6
[  638.118583] CPU: 0 PID: 212 Comm: kworker/u8:5 Not tainted 5.15.0-06427-g58a1d4dcfc74-dirty #109
[  638.127357] Hardware name: Google Elm (DT)
[  638.131444] Workqueue: mtk-vcodec-enc mtk_venc_worker [mtk_vcodec_enc]
[  638.137974] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  638.144925] pc : vp8_enc_encode+0x34/0x2b0 [mtk_vcodec_enc]
[  638.150493] lr : venc_if_encode+0xac/0x1b0 [mtk_vcodec_enc]
[  638.156060] sp : ffff8000124d3c40
[  638.159364] x29: ffff8000124d3c40 x28: 0000000000000000 x27: 0000000000000000
[  638.166493] x26: 0000000000000000 x25: ffff0000e7f252d0 x24: ffff8000124d3d58
[  638.173621] x23: ffff8000124d3d58 x22: ffff8000124d3d60 x21: 0000000000000001
[  638.180750] x20: ffff80001137e000 x19: 0000000000000000 x18: 0000000000000001
[  638.187878] x17: 000000040044ffff x16: 00400032b5503510 x15: 0000000000000000
[  638.195006] x14: ffff8000118536c0 x13: ffff8000ee1da000 x12: 0000000030d4d91d
[  638.202134] x11: 0000000000000000 x10: 0000000000000980 x9 : ffff8000124d3b20
[  638.209262] x8 : ffff0000c18d4ea0 x7 : ffff0000c18d44c0 x6 : ffff0000c18d44c0
[  638.216391] x5 : ffff80000904a3b0 x4 : ffff8000124d3d58 x3 : ffff8000124d3d60
[  638.223519] x2 : ffff8000124d3d78 x1 : 0000000000000001 x0 : ffff80001137efb8
[  638.230648] Call trace:
[  638.233084]  vp8_enc_encode+0x34/0x2b0 [mtk_vcodec_enc]
[  638.238304]  venc_if_encode+0xac/0x1b0 [mtk_vcodec_enc]
[  638.243525]  mtk_venc_worker+0x110/0x250 [mtk_vcodec_enc]
[  638.248918]  process_one_work+0x1f8/0x498
[  638.252923]  worker_thread+0x140/0x538
[  638.256664]  kthread+0x148/0x158
[  638.259884]  ret_from_fork+0x10/0x20
[  638.263455] Code: f90023f9 2a0103f5 aa0303f6 aa0403f8 (f940d277)
[  638.269538] ---[ end trace e374fc10f8e181f5 ]---

[gst-master] root@debian:~/gst-build# [  638.019193] Unable to handle kernel NULL pointer dereference at virtual address 00000000000001a0
Fixes: 4e855a6 ("[media] vcodec: mediatek: Add Mediatek V4L2 Video Encoder Driver")
Signed-off-by: Dafna Hirschfeld <[email protected]>
Signed-off-by: Hans Verkuil <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
digetx pushed a commit that referenced this issue Dec 2, 2021
The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT,
which in turn can then call alloc_pages(GFP_NOWAIT, ...).  In general, however,
it is not even possible to use either GFP_ATOMIC nor GFP_NOWAIT in certain
non-preemptive contexts/RT kernel including raw_spin_locks (see gfp.h and ab00db2).
Fix it by instructing stackdepot to not expand stack storage via alloc_pages()
in case it runs out by using kasan_record_aux_stack_noalloc().

Jianwei Hu reported:
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3
INFO: lockdep is turned off.
irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
  softirqs last  enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  CPU: 6 PID: 15319 Comm: python3 Tainted: G        W  O 5.15-rc7-preempt-rt #1
  Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018
  Call Trace:
    show_stack+0x52/0x58
    dump_stack+0xa1/0xd6
    ___might_sleep.cold+0x11c/0x12d
    rt_spin_lock+0x3f/0xc0
    rmqueue+0x100/0x1460
    rmqueue+0x100/0x1460
    mark_usage+0x1a0/0x1a0
    ftrace_graph_ret_addr+0x2a/0xb0
    rmqueue_pcplist.constprop.0+0x6a0/0x6a0
     __kasan_check_read+0x11/0x20
     __zone_watermark_ok+0x114/0x270
     get_page_from_freelist+0x148/0x630
     is_module_text_address+0x32/0xa0
     __alloc_pages_nodemask+0x2f6/0x790
     __alloc_pages_slowpath.constprop.0+0x12d0/0x12d0
     create_prof_cpu_mask+0x30/0x30
     alloc_pages_current+0xb1/0x150
     stack_depot_save+0x39f/0x490
     kasan_save_stack+0x42/0x50
     kasan_save_stack+0x23/0x50
     kasan_record_aux_stack+0xa9/0xc0
     __call_rcu+0xff/0x9c0
     call_rcu+0xe/0x10
     put_object+0x53/0x70
     __delete_object+0x7b/0x90
     kmemleak_free+0x46/0x70
     slab_free_freelist_hook+0xb4/0x160
     kfree+0xe5/0x420
     kfree_const+0x17/0x30
     kobject_cleanup+0xaa/0x230
     kobject_put+0x76/0x90
     netdev_queue_update_kobjects+0x17d/0x1f0
     ... ...
     ksys_write+0xd9/0x180
     __x64_sys_write+0x42/0x50
     do_syscall_64+0x38/0x50
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Links: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642
Fixes: 84109ab ("rcu: Record kvfree_call_rcu() call stack for KASAN")
Fixes: 26e760c ("rcu: kasan: record and print call_rcu() call stack")
Reported-by: Jianwei Hu <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Acked-by: Marco Elver <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Signed-off-by: Jun Miao <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
digetx pushed a commit that referenced this issue Dec 2, 2021
Tiny SRCU readers can appear at task level, but also in interrupt and
softirq handlers.  Because Tiny SRCU is selected only in kernels built
with CONFIG_SMP=n and CONFIG_PREEMPTION=n, it is not possible for a grace
period to start while there is a non-task-level SRCU reader executing.
This means that it does not make sense for __srcu_read_unlock() to awaken
the Tiny SRCU grace period, because that can only happen when the grace
period is waiting for one value of ->srcu_idx and __srcu_read_unlock()
is ending the last reader for some other value of ->srcu_idx.  After all,
any such wakeup will be redundant.

Worse yet, in some cases, such wakeups generate lockdep splats:

	======================================================
	WARNING: possible circular locking dependency detected
	5.15.0-rc1+ #3758 Not tainted
	------------------------------------------------------
	rcu_torture_rea/53 is trying to acquire lock:
	ffffffff9514e6a8 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}, at:
	xa/0x30

	but task is already holding lock:
	ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
	_extend+0x370/0x400

	which lock already depends on the new lock.

	the existing dependency chain (in reverse order) is:

	-> #1 (&p->pi_lock){-.-.}-{2:2}:
	       _raw_spin_lock_irqsave+0x2f/0x50
	       try_to_wake_up+0x50/0x580
	       swake_up_locked.part.7+0xe/0x30
	       swake_up_one+0x22/0x30
	       rcutorture_one_extend+0x1b6/0x400
	       rcu_torture_one_read+0x290/0x5d0
	       rcu_torture_timer+0x1a/0x70
	       call_timer_fn+0xa6/0x230
	       run_timer_softirq+0x493/0x4c0
	       __do_softirq+0xc0/0x371
	       irq_exit+0x73/0x90
	       sysvec_apic_timer_interrupt+0x63/0x80
	       asm_sysvec_apic_timer_interrupt+0x12/0x20
	       default_idle+0xb/0x10
	       default_idle_call+0x5e/0x170
	       do_idle+0x18a/0x1f0
	       cpu_startup_entry+0xa/0x10
	       start_kernel+0x678/0x69f
	       secondary_startup_64_no_verify+0xc2/0xcb

	-> #0 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}:
	       __lock_acquire+0x130c/0x2440
	       lock_acquire+0xc2/0x270
	       _raw_spin_lock_irqsave+0x2f/0x50
	       swake_up_one+0xa/0x30
	       rcutorture_one_extend+0x387/0x400
	       rcu_torture_one_read+0x290/0x5d0
	       rcu_torture_reader+0xac/0x200
	       kthread+0x12d/0x150
	       ret_from_fork+0x22/0x30

	other info that might help us debug this:

	 Possible unsafe locking scenario:

	       CPU0                    CPU1
	       ----                    ----
	  lock(&p->pi_lock);
				       lock(srcu_ctl.srcu_wq.lock);
				       lock(&p->pi_lock);
	  lock(srcu_ctl.srcu_wq.lock);

	 *** DEADLOCK ***

	1 lock held by rcu_torture_rea/53:
	 #0: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
	_extend+0x370/0x400

	stack backtrace:
	CPU: 0 PID: 53 Comm: rcu_torture_rea Not tainted 5.15.0-rc1+

	Hardware name: Red Hat KVM/RHEL-AV, BIOS
	e_el8.5.0+746+bbd5d70c 04/01/2014
	Call Trace:
	 check_noncircular+0xfe/0x110
	 ? find_held_lock+0x2d/0x90
	 __lock_acquire+0x130c/0x2440
	 lock_acquire+0xc2/0x270
	 ? swake_up_one+0xa/0x30
	 ? find_held_lock+0x72/0x90
	 _raw_spin_lock_irqsave+0x2f/0x50
	 ? swake_up_one+0xa/0x30
	 swake_up_one+0xa/0x30
	 rcutorture_one_extend+0x387/0x400
	 rcu_torture_one_read+0x290/0x5d0
	 rcu_torture_reader+0xac/0x200
	 ? rcutorture_oom_notify+0xf0/0xf0
	 ? __kthread_parkme+0x61/0x90
	 ? rcu_torture_one_read+0x5d0/0x5d0
	 kthread+0x12d/0x150
	 ? set_kthread_struct+0x40/0x40
	 ret_from_fork+0x22/0x30

This is a false positive because there is only one CPU, and both locks
are raw (non-preemptible) spinlocks.  However, it is worthwhile getting
rid of the redundant wakeup, which has the side effect of breaking
the theoretical deadlock cycle.  This commit therefore eliminates the
redundant wakeups.

Signed-off-by: Paul E. McKenney <[email protected]>
digetx pushed a commit that referenced this issue Dec 2, 2021
Adding a check on len parameter to avoid empty skb. This prevents a
division error in netem_enqueue function which is caused when skb->len=0
and skb->data_len=0 in the randomized corruption step as shown below.

skb->data[prandom_u32() % skb_headlen(skb)] ^= 1<<(prandom_u32() % 8);

Crash Report:
[  343.170349] netdevsim netdevsim0 netdevsim3: set [1, 0] type 2 family
0 port 6081 - 0
[  343.216110] netem: version 1.3
[  343.235841] divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
[  343.236680] CPU: 3 PID: 4288 Comm: reproducer Not tainted 5.16.0-rc1+
[  343.237569] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.11.0-2.el7 04/01/2014
[  343.238707] RIP: 0010:netem_enqueue+0x1590/0x33c0 [sch_netem]
[  343.239499] Code: 89 85 58 ff ff ff e8 5f 5d e9 d3 48 8b b5 48 ff ff
ff 8b 8d 50 ff ff ff 8b 85 58 ff ff ff 48 8b bd 70 ff ff ff 31 d2 2b 4f
74 <f7> f1 48 b8 00 00 00 00 00 fc ff df 49 01 d5 4c 89 e9 48 c1 e9 03
[  343.241883] RSP: 0018:ffff88800bcd7368 EFLAGS: 00010246
[  343.242589] RAX: 00000000ba7c0a9c RBX: 0000000000000001 RCX:
0000000000000000
[  343.243542] RDX: 0000000000000000 RSI: ffff88800f8edb10 RDI:
ffff88800f8eda40
[  343.244474] RBP: ffff88800bcd7458 R08: 0000000000000000 R09:
ffffffff94fb8445
[  343.245403] R10: ffffffff94fb8336 R11: ffffffff94fb8445 R12:
0000000000000000
[  343.246355] R13: ffff88800a5a7000 R14: ffff88800a5b5800 R15:
0000000000000020
[  343.247291] FS:  00007fdde2bd7700(0000) GS:ffff888109780000(0000)
knlGS:0000000000000000
[  343.248350] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  343.249120] CR2: 00000000200000c0 CR3: 000000000ef4c000 CR4:
00000000000006e0
[  343.250076] Call Trace:
[  343.250423]  <TASK>
[  343.250713]  ? memcpy+0x4d/0x60
[  343.251162]  ? netem_init+0xa0/0xa0 [sch_netem]
[  343.251795]  ? __sanitizer_cov_trace_pc+0x21/0x60
[  343.252443]  netem_enqueue+0xe28/0x33c0 [sch_netem]
[  343.253102]  ? stack_trace_save+0x87/0xb0
[  343.253655]  ? filter_irq_stacks+0xb0/0xb0
[  343.254220]  ? netem_init+0xa0/0xa0 [sch_netem]
[  343.254837]  ? __kasan_check_write+0x14/0x20
[  343.255418]  ? _raw_spin_lock+0x88/0xd6
[  343.255953]  dev_qdisc_enqueue+0x50/0x180
[  343.256508]  __dev_queue_xmit+0x1a7e/0x3090
[  343.257083]  ? netdev_core_pick_tx+0x300/0x300
[  343.257690]  ? check_kcov_mode+0x10/0x40
[  343.258219]  ? _raw_spin_unlock_irqrestore+0x29/0x40
[  343.258899]  ? __kasan_init_slab_obj+0x24/0x30
[  343.259529]  ? setup_object.isra.71+0x23/0x90
[  343.260121]  ? new_slab+0x26e/0x4b0
[  343.260609]  ? kasan_poison+0x3a/0x50
[  343.261118]  ? kasan_unpoison+0x28/0x50
[  343.261637]  ? __kasan_slab_alloc+0x71/0x90
[  343.262214]  ? memcpy+0x4d/0x60
[  343.262674]  ? write_comp_data+0x2f/0x90
[  343.263209]  ? __kasan_check_write+0x14/0x20
[  343.263802]  ? __skb_clone+0x5d6/0x840
[  343.264329]  ? __sanitizer_cov_trace_pc+0x21/0x60
[  343.264958]  dev_queue_xmit+0x1c/0x20
[  343.265470]  netlink_deliver_tap+0x652/0x9c0
[  343.266067]  netlink_unicast+0x5a0/0x7f0
[  343.266608]  ? netlink_attachskb+0x860/0x860
[  343.267183]  ? __sanitizer_cov_trace_pc+0x21/0x60
[  343.267820]  ? write_comp_data+0x2f/0x90
[  343.268367]  netlink_sendmsg+0x922/0xe80
[  343.268899]  ? netlink_unicast+0x7f0/0x7f0
[  343.269472]  ? __sanitizer_cov_trace_pc+0x21/0x60
[  343.270099]  ? write_comp_data+0x2f/0x90
[  343.270644]  ? netlink_unicast+0x7f0/0x7f0
[  343.271210]  sock_sendmsg+0x155/0x190
[  343.271721]  ____sys_sendmsg+0x75f/0x8f0
[  343.272262]  ? kernel_sendmsg+0x60/0x60
[  343.272788]  ? write_comp_data+0x2f/0x90
[  343.273332]  ? write_comp_data+0x2f/0x90
[  343.273869]  ___sys_sendmsg+0x10f/0x190
[  343.274405]  ? sendmsg_copy_msghdr+0x80/0x80
[  343.274984]  ? slab_post_alloc_hook+0x70/0x230
[  343.275597]  ? futex_wait_setup+0x240/0x240
[  343.276175]  ? security_file_alloc+0x3e/0x170
[  343.276779]  ? write_comp_data+0x2f/0x90
[  343.277313]  ? __sanitizer_cov_trace_pc+0x21/0x60
[  343.277969]  ? write_comp_data+0x2f/0x90
[  343.278515]  ? __fget_files+0x1ad/0x260
[  343.279048]  ? __sanitizer_cov_trace_pc+0x21/0x60
[  343.279685]  ? write_comp_data+0x2f/0x90
[  343.280234]  ? __sanitizer_cov_trace_pc+0x21/0x60
[  343.280874]  ? sockfd_lookup_light+0xd1/0x190
[  343.281481]  __sys_sendmsg+0x118/0x200
[  343.281998]  ? __sys_sendmsg_sock+0x40/0x40
[  343.282578]  ? alloc_fd+0x229/0x5e0
[  343.283070]  ? write_comp_data+0x2f/0x90
[  343.283610]  ? write_comp_data+0x2f/0x90
[  343.284135]  ? __sanitizer_cov_trace_pc+0x21/0x60
[  343.284776]  ? ktime_get_coarse_real_ts64+0xb8/0xf0
[  343.285450]  __x64_sys_sendmsg+0x7d/0xc0
[  343.285981]  ? syscall_enter_from_user_mode+0x4d/0x70
[  343.286664]  do_syscall_64+0x3a/0x80
[  343.287158]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  343.287850] RIP: 0033:0x7fdde24cf289
[  343.288344] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00
48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 db 2c 00 f7 d8 64 89 01 48
[  343.290729] RSP: 002b:00007fdde2bd6d98 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[  343.291730] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007fdde24cf289
[  343.292673] RDX: 0000000000000000 RSI: 00000000200000c0 RDI:
0000000000000004
[  343.293618] RBP: 00007fdde2bd6e20 R08: 0000000100000001 R09:
0000000000000000
[  343.294557] R10: 0000000100000001 R11: 0000000000000246 R12:
0000000000000000
[  343.295493] R13: 0000000000021000 R14: 0000000000000000 R15:
00007fdde2bd7700
[  343.296432]  </TASK>
[  343.296735] Modules linked in: sch_netem ip6_vti ip_vti ip_gre ipip
sit ip_tunnel geneve macsec macvtap tap ipvlan macvlan 8021q garp mrp
hsr wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64
ip6_udp_tunnel udp_tunnel libblake2s blake2s_x86_64 libblake2s_generic
curve25519_x86_64 libcurve25519_generic libchacha xfrm_interface
xfrm6_tunnel tunnel4 veth netdevsim psample batman_adv nlmon dummy team
bonding tls vcan ip6_gre ip6_tunnel tunnel6 gre tun ip6t_rpfilter
ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
ebtable_nat ebtable_broute ip6table_nat ip6table_mangle
ip6table_security ip6table_raw iptable_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security
iptable_raw ebtable_filter ebtables rfkill ip6table_filter ip6_tables
iptable_filter ppdev bochs drm_vram_helper drm_ttm_helper ttm
drm_kms_helper cec parport_pc drm joydev floppy parport sg syscopyarea
sysfillrect sysimgblt i2c_piix4 qemu_fw_cfg fb_sys_fops pcspkr
[  343.297459]  ip_tables xfs virtio_net net_failover failover sd_mod
sr_mod cdrom t10_pi ata_generic pata_acpi ata_piix libata virtio_pci
virtio_pci_legacy_dev serio_raw virtio_pci_modern_dev dm_mirror
dm_region_hash dm_log dm_mod
[  343.311074] Dumping ftrace buffer:
[  343.311532]    (ftrace buffer empty)
[  343.312040] ---[ end trace a2e3db5a6ae05099 ]---
[  343.312691] RIP: 0010:netem_enqueue+0x1590/0x33c0 [sch_netem]
[  343.313481] Code: 89 85 58 ff ff ff e8 5f 5d e9 d3 48 8b b5 48 ff ff
ff 8b 8d 50 ff ff ff 8b 85 58 ff ff ff 48 8b bd 70 ff ff ff 31 d2 2b 4f
74 <f7> f1 48 b8 00 00 00 00 00 fc ff df 49 01 d5 4c 89 e9 48 c1 e9 03
[  343.315893] RSP: 0018:ffff88800bcd7368 EFLAGS: 00010246
[  343.316622] RAX: 00000000ba7c0a9c RBX: 0000000000000001 RCX:
0000000000000000
[  343.317585] RDX: 0000000000000000 RSI: ffff88800f8edb10 RDI:
ffff88800f8eda40
[  343.318549] RBP: ffff88800bcd7458 R08: 0000000000000000 R09:
ffffffff94fb8445
[  343.319503] R10: ffffffff94fb8336 R11: ffffffff94fb8445 R12:
0000000000000000
[  343.320455] R13: ffff88800a5a7000 R14: ffff88800a5b5800 R15:
0000000000000020
[  343.321414] FS:  00007fdde2bd7700(0000) GS:ffff888109780000(0000)
knlGS:0000000000000000
[  343.322489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  343.323283] CR2: 00000000200000c0 CR3: 000000000ef4c000 CR4:
00000000000006e0
[  343.324264] Kernel panic - not syncing: Fatal exception in interrupt
[  343.333717] Dumping ftrace buffer:
[  343.334175]    (ftrace buffer empty)
[  343.334653] Kernel Offset: 0x13600000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  343.336027] Rebooting in 86400 seconds..

Reported-by: syzkaller <[email protected]>
Signed-off-by: Harshit Mogalapalli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
digetx pushed a commit that referenced this issue Dec 2, 2021
Driver needs to nullify the port select attributes of the LAG when
port selection is destroyed, otherwise it breaks recreation of the
LAG.
It fixes the below kernel oops:

 [  587.906377] BUG: kernel NULL pointer dereference, address: 0000000000000008
 [  587.908843] #PF: supervisor read access in kernel mode
 [  587.910730] #PF: error_code(0x0000) - not-present page
 [  587.912580] PGD 0 P4D 0
 [  587.913632] Oops: 0000 [#1] SMP PTI
 [  587.914644] CPU: 5 PID: 165 Comm: kworker/u20:5 Tainted: G           OE     5.9.0_mlnx #1
 [  587.916152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 [  587.918332] Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
 [  587.919479] RIP: 0010:mlx5_del_flow_rules+0x10/0x270 [mlx5_core]
 [  587.920568] mlx5_core 0000:08:00.1 enp8s0f1: Link up
 [  587.920680] Code: c0 09 80 a0 e8 cf 42 a4 e0 48 c7 c3 f4 ff ff ff e8 8a 88 dd e0 e9 ab fe ff ff 0f 1f 44 00 00 41 56 41 55 49 89 fd 41 54 55 53 <48> 8b 47 08 48 8b 68 28 48 85 ed 74 2e 48 8d 7d 38 e8 6a 64 34 e1
 [  587.925116] bond0: (slave enp8s0f1): Enslaving as an active interface with an up link
 [  587.930415] RSP: 0018:ffffc9000048fd88 EFLAGS: 00010282
 [  587.930417] RAX: ffff88846c14fac0 RBX: ffff88846cddcb80 RCX: 0000000080400007
 [  587.930417] RDX: 0000000080400008 RSI: ffff88846cddcb80 RDI: 0000000000000000
 [  587.930419] RBP: ffff88845fd80140 R08: 0000000000000001 R09: ffffffffa074ba00
 [  587.938132] R10: ffff88846c14fec0 R11: 0000000000000001 R12: ffff88846c122f10
 [  587.939473] R13: 0000000000000000 R14: 0000000000000001 R15: ffff88846d7a0000
 [  587.940800] FS:  0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
 [  587.942416] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [  587.943536] CR2: 0000000000000008 CR3: 000000000240a002 CR4: 0000000000770ee0
 [  587.944904] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [  587.946308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [  587.947639] PKRU: 55555554
 [  587.948236] Call Trace:
 [  587.948834]  mlx5_lag_destroy_definer.isra.3+0x16/0x90 [mlx5_core]
 [  587.950033]  mlx5_lag_destroy_definers+0x5b/0x80 [mlx5_core]
 [  587.951128]  mlx5_deactivate_lag+0x6e/0x80 [mlx5_core]
 [  587.952146]  mlx5_do_bond+0x150/0x450 [mlx5_core]
 [  587.953086]  mlx5_do_bond_work+0x3e/0x50 [mlx5_core]
 [  587.954086]  process_one_work+0x1eb/0x3e0
 [  587.954899]  worker_thread+0x2d/0x3c0
 [  587.955656]  ? process_one_work+0x3e0/0x3e0
 [  587.956493]  kthread+0x115/0x130
 [  587.957174]  ? kthread_park+0x90/0x90
 [  587.957929]  ret_from_fork+0x1f/0x30
 [  587.973055] ---[ end trace 71ccd6eca89f5513 ]---

Fixes: b726786 ("net/mlx5: Lag, add support to create/destroy/modify port selection")
Signed-off-by: Maor Gottlieb <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
digetx pushed a commit that referenced this issue Dec 2, 2021
[Why]
IGT bypass test will set crc source as DPRX,and display DM didn`t check
connection type, it run the test on the HDMI connector ,then the kernel
will be crashed because aux->transfer is set null for HDMI connection.
This patch will skip the invalid connection test and fix kernel crash issue.

[How]
Check the connector type while setting the pipe crc source as DPRX or
auto,if the type is not DP or eDP, the crtc crc source will not be set
and report error code to IGT test,IGT will show the this subtest as no
valid crtc/connector combinations found.

116.779714] [IGT] amd_bypass: starting subtest 8bpc-bypass-mode
[ 117.730996] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 117.731001] #PF: supervisor instruction fetch in kernel mode
[ 117.731003] #PF: error_code(0x0010) - not-present page
[ 117.731004] PGD 0 P4D 0
[ 117.731006] Oops: 0010 [#1] SMP NOPTI
[ 117.731009] CPU: 11 PID: 2428 Comm: amd_bypass Tainted: G OE 5.11.0-34-generic #36~20.04.1-Ubuntu
[ 117.731011] Hardware name: AMD CZN/, BIOS AB.FD 09/07/2021
[ 117.731012] RIP: 0010:0x0
[ 117.731015] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 117.731016] RSP: 0018:ffffa8d64225bab8 EFLAGS: 00010246
[ 117.731017] RAX: 0000000000000000 RBX: 0000000000000020 RCX: ffffa8d64225bb5e
[ 117.731018] RDX: ffff93151d921880 RSI: ffffa8d64225bac8 RDI: ffff931511a1a9d8
[ 117.731022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 117.731023] CR2: ffffffffffffffd6 CR3: 000000010d5a4000 CR4: 0000000000750ee0
[ 117.731023] PKRU: 55555554
[ 117.731024] Call Trace:
[ 117.731027] drm_dp_dpcd_access+0x72/0x110 [drm_kms_helper]
[ 117.731036] drm_dp_dpcd_read+0xb7/0xf0 [drm_kms_helper]
[ 117.731040] drm_dp_start_crc+0x38/0xb0 [drm_kms_helper]
[ 117.731047] amdgpu_dm_crtc_set_crc_source+0x1ae/0x3e0 [amdgpu]
[ 117.731149] crtc_crc_open+0x174/0x220 [drm]
[ 117.731162] full_proxy_open+0x168/0x1f0
[ 117.731165] ? open_proxy_open+0x100/0x100

BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1546
Reviewed-by: Harry Wentland <[email protected]>
Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Perry Yuan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
digetx pushed a commit that referenced this issue Dec 4, 2021
[Why]
IGT bypass test will set crc source as DPRX,and display DM didn`t check
connection type, it run the test on the HDMI connector ,then the kernel
will be crashed because aux->transfer is set null for HDMI connection.
This patch will skip the invalid connection test and fix kernel crash issue.

[How]
Check the connector type while setting the pipe crc source as DPRX or
auto,if the type is not DP or eDP, the crtc crc source will not be set
and report error code to IGT test,IGT will show the this subtest as no
valid crtc/connector combinations found.

116.779714] [IGT] amd_bypass: starting subtest 8bpc-bypass-mode
[ 117.730996] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 117.731001] #PF: supervisor instruction fetch in kernel mode
[ 117.731003] #PF: error_code(0x0010) - not-present page
[ 117.731004] PGD 0 P4D 0
[ 117.731006] Oops: 0010 [#1] SMP NOPTI
[ 117.731009] CPU: 11 PID: 2428 Comm: amd_bypass Tainted: G OE 5.11.0-34-generic #36~20.04.1-Ubuntu
[ 117.731011] Hardware name: AMD CZN/, BIOS AB.FD 09/07/2021
[ 117.731012] RIP: 0010:0x0
[ 117.731015] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 117.731016] RSP: 0018:ffffa8d64225bab8 EFLAGS: 00010246
[ 117.731017] RAX: 0000000000000000 RBX: 0000000000000020 RCX: ffffa8d64225bb5e
[ 117.731018] RDX: ffff93151d921880 RSI: ffffa8d64225bac8 RDI: ffff931511a1a9d8
[ 117.731022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 117.731023] CR2: ffffffffffffffd6 CR3: 000000010d5a4000 CR4: 0000000000750ee0
[ 117.731023] PKRU: 55555554
[ 117.731024] Call Trace:
[ 117.731027] drm_dp_dpcd_access+0x72/0x110 [drm_kms_helper]
[ 117.731036] drm_dp_dpcd_read+0xb7/0xf0 [drm_kms_helper]
[ 117.731040] drm_dp_start_crc+0x38/0xb0 [drm_kms_helper]
[ 117.731047] amdgpu_dm_crtc_set_crc_source+0x1ae/0x3e0 [amdgpu]
[ 117.731149] crtc_crc_open+0x174/0x220 [drm]
[ 117.731162] full_proxy_open+0x168/0x1f0
[ 117.731165] ? open_proxy_open+0x100/0x100

BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1546
Reviewed-by: Harry Wentland <[email protected]>
Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Perry Yuan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
digetx pushed a commit that referenced this issue Dec 4, 2021
In machine_kexec_post_load() we use __pa() on `empty_zero_page`, so that
we can use the physical address during arm64_relocate_new_kernel() to
switch TTBR1 to a new set of tables. While `empty_zero_page` is part of
the old kernel, we won't clobber it until after this switch, so using it
is benign.

However, `empty_zero_page` is part of the kernel image rather than a
linear map address, so it is not correct to use __pa(x), and we should
instead use __pa_symbol(x) or __pa(lm_alias(x)). Otherwise, when the
kernel is built with DEBUG_VIRTUAL, we'll encounter splats as below, as
I've seen when fuzzing v5.16-rc3 with Syzkaller:

| ------------[ cut here ]------------
| virt_to_phys used for non-linear address: 000000008492561a (empty_zero_page+0x0/0x1000)
| WARNING: CPU: 3 PID: 11492 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x120/0x1c0 arch/arm64/mm/physaddr.c:12
| CPU: 3 PID: 11492 Comm: syz-executor.0 Not tainted 5.16.0-rc3-00001-g48bd452a045c #1
| Hardware name: linux,dummy-virt (DT)
| pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : __virt_to_phys+0x120/0x1c0 arch/arm64/mm/physaddr.c:12
| lr : __virt_to_phys+0x120/0x1c0 arch/arm64/mm/physaddr.c:12
| sp : ffff80001af17bb0
| x29: ffff80001af17bb0 x28: ffff1cc65207b400 x27: ffffb7828730b120
| x26: 0000000000000e11 x25: 0000000000000000 x24: 0000000000000001
| x23: ffffb7828963e000 x22: ffffb78289644000 x21: 0000600000000000
| x20: 000000000000002d x19: 0000b78289644000 x18: 0000000000000000
| x17: 74706d6528206131 x16: 3635323934383030 x15: 303030303030203a
| x14: 1ffff000035e2eb8 x13: ffff6398d53f4f0f x12: 1fffe398d53f4f0e
| x11: 1fffe398d53f4f0e x10: ffff6398d53f4f0e x9 : ffffb7827c6f76dc
| x8 : ffff1cc6a9fa7877 x7 : 0000000000000001 x6 : ffff6398d53f4f0f
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff1cc66f2a99c0
| x2 : 0000000000040000 x1 : d7ce7775b09b5d00 x0 : 0000000000000000
| Call trace:
|  __virt_to_phys+0x120/0x1c0 arch/arm64/mm/physaddr.c:12
|  machine_kexec_post_load+0x284/0x670 arch/arm64/kernel/machine_kexec.c:150
|  do_kexec_load+0x570/0x670 kernel/kexec.c:155
|  __do_sys_kexec_load kernel/kexec.c:250 [inline]
|  __se_sys_kexec_load kernel/kexec.c:231 [inline]
|  __arm64_sys_kexec_load+0x1d8/0x268 kernel/kexec.c:231
|  __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
|  invoke_syscall+0x90/0x2e0 arch/arm64/kernel/syscall.c:52
|  el0_svc_common.constprop.2+0x1e4/0x2f8 arch/arm64/kernel/syscall.c:142
|  do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:181
|  el0_svc+0x60/0x248 arch/arm64/kernel/entry-common.c:603
|  el0t_64_sync_handler+0x90/0xb8 arch/arm64/kernel/entry-common.c:621
|  el0t_64_sync+0x180/0x184 arch/arm64/kernel/entry.S:572
| irq event stamp: 2428
| hardirqs last  enabled at (2427): [<ffffb7827c6f2308>] __up_console_sem+0xf0/0x118 kernel/printk/printk.c:255
| hardirqs last disabled at (2428): [<ffffb7828223df98>] el1_dbg+0x28/0x80 arch/arm64/kernel/entry-common.c:375
| softirqs last  enabled at (2424): [<ffffb7827c411c00>] softirq_handle_end kernel/softirq.c:401 [inline]
| softirqs last  enabled at (2424): [<ffffb7827c411c00>] __do_softirq+0xa28/0x11e4 kernel/softirq.c:587
| softirqs last disabled at (2417): [<ffffb7827c59015c>] do_softirq_own_stack include/asm-generic/softirq_stack.h:10 [inline]
| softirqs last disabled at (2417): [<ffffb7827c59015c>] invoke_softirq kernel/softirq.c:439 [inline]
| softirqs last disabled at (2417): [<ffffb7827c59015c>] __irq_exit_rcu kernel/softirq.c:636 [inline]
| softirqs last disabled at (2417): [<ffffb7827c59015c>] irq_exit_rcu+0x53c/0x688 kernel/softirq.c:648
| ---[ end trace 0ca578534e7ca938 ]---

With or without DEBUG_VIRTUAL __pa() will fall back to __kimg_to_phys()
for non-linear addresses, and will happen to do the right thing in this
case, even with the warning. But we should not depend upon this, and to
keep the warning useful we should fix this case.

Fix this issue by using __pa_symbol(), which handles kernel image
addresses (and checks its input is a kernel image address). This matches
what we do elsewhere, e.g. in arch/arm64/include/asm/pgtable.h:

| #define ZERO_PAGE(vaddr)       phys_to_page(__pa_symbol(empty_zero_page))

Fixes: 3744b52 ("arm64: kexec: install a copy of the linear-map")
Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
digetx pushed a commit that referenced this issue Dec 4, 2021
smc_lgr_cleanup_early() meant to delete the link
group from the link group list, but it deleted
the list head by mistake.

This may cause memory corruption since we didn't
remove the real link group from the list and later
memseted the link group structure.
We got a list corruption panic when testing:

[  231.277259] list_del corruption. prev->next should be ffff8881398a8000, but was 0000000000000000
[  231.278222] ------------[ cut here ]------------
[  231.278726] kernel BUG at lib/list_debug.c:53!
[  231.279326] invalid opcode: 0000 [#1] SMP NOPTI
[  231.279803] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.10.46+ #435
[  231.280466] Hardware name: Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
[  231.281248] Workqueue: events smc_link_down_work
[  231.281732] RIP: 0010:__list_del_entry_valid+0x70/0x90
[  231.282258] Code: 4c 60 82 e8 7d cc 6a 00 0f 0b 48 89 fe 48 c7 c7 88 4c
60 82 e8 6c cc 6a 00 0f 0b 48 89 fe 48 c7 c7 c0 4c 60 82 e8 5b cc 6a 00 <0f>
0b 48 89 fe 48 c7 c7 00 4d 60 82 e8 4a cc 6a 00 0f 0b cc cc cc
[  231.284146] RSP: 0018:ffffc90000033d58 EFLAGS: 00010292
[  231.284685] RAX: 0000000000000054 RBX: ffff8881398a8000 RCX: 0000000000000000
[  231.285415] RDX: 0000000000000001 RSI: ffff88813bc18040 RDI: ffff88813bc18040
[  231.286141] RBP: ffffffff8305ad40 R08: 0000000000000003 R09: 0000000000000001
[  231.286873] R10: ffffffff82803da0 R11: ffffc90000033b90 R12: 0000000000000001
[  231.287606] R13: 0000000000000000 R14: ffff8881398a8000 R15: 0000000000000003
[  231.288337] FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[  231.289160] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  231.289754] CR2: 0000000000e72058 CR3: 000000010fa96006 CR4: 00000000003706f0
[  231.290485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  231.291211] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  231.291940] Call Trace:
[  231.292211]  smc_lgr_terminate_sched+0x53/0xa0
[  231.292677]  smc_switch_conns+0x75/0x6b0
[  231.293085]  ? update_load_avg+0x1a6/0x590
[  231.293517]  ? ttwu_do_wakeup+0x17/0x150
[  231.293907]  ? update_load_avg+0x1a6/0x590
[  231.294317]  ? newidle_balance+0xca/0x3d0
[  231.294716]  smcr_link_down+0x50/0x1a0
[  231.295090]  ? __wake_up_common_lock+0x77/0x90
[  231.295534]  smc_link_down_work+0x46/0x60
[  231.295933]  process_one_work+0x18b/0x350

Fixes: a0a62ee ("net/smc: separate locks for SMCD and SMCR link group lists")
Signed-off-by: Dust Li <[email protected]>
Acked-by: Karsten Graul <[email protected]>
Reviewed-by: Tony Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this issue Dec 4, 2021
There is a kernel panic caused by pcpu_alloc_pages() passing offlined and
uninitialized node to alloc_pages_node() leading to panic by NULL
dereferencing uninitialized NODE_DATA(nid).

 CPU2 has been hot-added
 BUG: unable to handle page fault for address: 0000000000001608
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

 RIP: 0010:__alloc_pages+0x127/0x290
 Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
 RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
 RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
 RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
 R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
 FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
 Call Trace:
  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
  pcpu_populate_chunk+0x33/0xb0
  pcpu_alloc+0x4d3/0x6f0
  __alloc_percpu_gfp+0xd/0x10
  alloc_mem_cgroup_per_node_info+0x54/0xb0
  mem_cgroup_alloc+0xed/0x2f0
  mem_cgroup_css_alloc+0x33/0x2f0
  css_create+0x3a/0x1f0
  cgroup_apply_control_enable+0x12b/0x150
  cgroup_mkdir+0xdd/0x110
  kernfs_iop_mkdir+0x4f/0x80
  vfs_mkdir+0x178/0x230
  do_mkdirat+0xfd/0x120
  __x64_sys_mkdir+0x47/0x70
  ? syscall_exit_to_user_mode+0x21/0x50
  do_syscall_64+0x43/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Panic can be easily reproduced by disabling udev rule for automatic
onlining hot added CPU followed by CPU with memoryless node (NUMA node
with CPU only) hot add.

Hot adding CPU and memoryless node does not bring the node to online
state.  Memoryless node will be onlined only during the onlining its CPU.

Node can be in one of the following states:
1. not present.(nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
				NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
				NODE_DATA(nid) != NULL)

Percpu code is doing allocations for all possible CPUs.  The issue happens
when it serves hot added but not yet onlined CPU when its node is in 2nd
state.  This node is not ready to use, fallback to numa_mem_id().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Makhalov <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this issue Dec 4, 2021
A kernel panic was observed during reading /proc/kpageflags for first few
pfns allocated by pmem namespace:

BUG: unable to handle page fault for address: fffffffffffffffe
[  114.495280] #PF: supervisor read access in kernel mode
[  114.495738] #PF: error_code(0x0000) - not-present page
[  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
[  114.496713] Oops: 0000 [#1] SMP PTI
[  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1 #1
[  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
[  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
[  114.500788] RSP: 0018:ffffa5e601a0fe60 EFLAGS: 00010202
[  114.501373] RAX: fffffffffffffffe RBX: ffffffffffffffff RCX: 0000000000000000
[  114.502009] RDX: 0000000000000001 RSI: 00007ffca13a7310 RDI: ffffd07489000000
[  114.502637] RBP: ffffd07489000000 R08: 0000000000000001 R09: 0000000000000000
[  114.503270] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000240000
[  114.503896] R13: 0000000000080000 R14: 00007ffca13a7310 R15: ffffa5e601a0ff08
[  114.504530] FS:  00007f0266c7f540(0000) GS:ffff962dbbac0000(0000) knlGS:0000000000000000
[  114.505245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  114.505754] CR2: fffffffffffffffe CR3: 000000023a204000 CR4: 00000000000006e0
[  114.506401] Call Trace:
[  114.506660]  kpageflags_read+0xb1/0x130
[  114.507051]  proc_reg_read+0x39/0x60
[  114.507387]  vfs_read+0x8a/0x140
[  114.507686]  ksys_pread64+0x61/0xa0
[  114.508021]  do_syscall_64+0x5f/0x1a0
[  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  114.508844] RIP: 0033:0x7f0266ba426b

The reason for the panic is that stable_page_flags() which parses the page
flags uses uninitialized struct pages reserved by the ZONE_DEVICE driver.

Earlier approach to fix this was discussed here:
https://marc.info/?l=linux-mm&m=152964770000672&w=2

This is another approach.  To avoid using the uninitialized struct page,
immediately return with KPF_RESERVED at the beginning of
stable_page_flags() if the page is reserved by ZONE_DEVICE driver.

Dan said:

: The nvdimm implementation uses vmem_altmap to arrange for the 'struct
: page' array to be allocated from a reservation of a pmem namespace.  A
: namespace in this mode contains an info-block that consumes the first
: 8K of the namespace capacity, capacity designated for page mapping,
: capacity for padding the start of data to optionally 4K, 2MB, or 1GB
: (on x86), and then the namespace data itself.  The implementation
: specifies a section aligned (now sub-section aligned) address to
: arch_add_memory() to establish the linear mapping to map the metadata,
: and then vmem_altmap indicates to memmap_init_zone() which pfns
: represent data.  The implementation only specifies enough 'struct page'
: capacity for pfn_to_page() to operate on the data space, not the
: namespace metadata space.
:
: The proposal to validate ZONE_DEVICE pfns against the altmap seems the
: right approach to me.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Toshiki Fukasawa <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Junichi Nomura <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this issue Jan 12, 2022
MM defined the rule [1] very clearly that once page was set with PG_private
flag, we should increment the refcount in that page, also main flows like
pageout(), migrate_page() will assume there is one additional page
reference count if page_has_private() returns true. Otherwise, we may
get a BUG in page migration:

  page:0000000080d05b9d refcount:-1 mapcount:0 mapping:000000005f4d82a8
  index:0xe2 pfn:0x14c12
  aops:ubifs_file_address_operations [ubifs] ino:8f1 dentry name:"f30e"
  flags: 0x1fffff80002405(locked|uptodate|owner_priv_1|private|node=0|
  zone=1|lastcpupid=0x1fffff)
  page dumped because: VM_BUG_ON_PAGE(page_count(page) != 0)
  ------------[ cut here ]------------
  kernel BUG at include/linux/page_ref.h:184!
  invalid opcode: 0000 [#1] SMP
  CPU: 3 PID: 38 Comm: kcompactd0 Not tainted 5.15.0-rc5
  RIP: 0010:migrate_page_move_mapping+0xac3/0xe70
  Call Trace:
    ubifs_migrate_page+0x22/0xc0 [ubifs]
    move_to_new_page+0xb4/0x600
    migrate_pages+0x1523/0x1cc0
    compact_zone+0x8c5/0x14b0
    kcompactd+0x2bc/0x560
    kthread+0x18c/0x1e0
    ret_from_fork+0x1f/0x30

Before the time, we should make clean a concept, what does refcount means
in page gotten from grab_cache_page_write_begin(). There are 2 situations:
Situation 1: refcount is 3, page is created by __page_cache_alloc.
  TYPE_A - the write process is using this page
  TYPE_B - page is assigned to one certain mapping by calling
	   __add_to_page_cache_locked()
  TYPE_C - page is added into pagevec list corresponding current cpu by
	   calling lru_cache_add()
Situation 2: refcount is 2, page is gotten from the mapping's tree
  TYPE_B - page has been assigned to one certain mapping
  TYPE_A - the write process is using this page (by calling
	   page_cache_get_speculative())
Filesystem releases one refcount by calling put_page() in xxx_write_end(),
the released refcount corresponds to TYPE_A (write task is using it). If
there are any processes using a page, page migration process will skip the
page by judging whether expected_page_refs() equals to page refcount.

The BUG is caused by following process:
    PA(cpu 0)                           kcompactd(cpu 1)
				compact_zone
ubifs_write_begin
  page_a = grab_cache_page_write_begin
    add_to_page_cache_lru
      lru_cache_add
        pagevec_add // put page into cpu 0's pagevec
  (refcnf = 3, for page creation process)
ubifs_write_end
  SetPagePrivate(page_a) // doesn't increase page count !
  unlock_page(page_a)
  put_page(page_a)  // refcnt = 2
				[...]

    PB(cpu 0)
filemap_read
  filemap_get_pages
    add_to_page_cache_lru
      lru_cache_add
        __pagevec_lru_add // traverse all pages in cpu 0's pagevec
	  __pagevec_lru_add_fn
	    SetPageLRU(page_a)
				isolate_migratepages
                                  isolate_migratepages_block
				    get_page_unless_zero(page_a)
				    // refcnt = 3
                                      list_add(page_a, from_list)
				migrate_pages(from_list)
				  __unmap_and_move
				    move_to_new_page
				      ubifs_migrate_page(page_a)
				        migrate_page_move_mapping
					  expected_page_refs get 3
                                  (migration[1] + mapping[1] + private[1])
	 release_pages
	   put_page_testzero(page_a) // refcnt = 3
                                          page_ref_freeze  // refcnt = 0
	     page_ref_dec_and_test(0 - 1 = -1)
                                          page_ref_unfreeze
                                            VM_BUG_ON_PAGE(-1 != 0, page)

UBIFS doesn't increase the page refcount after setting private flag, which
leads to page migration task believes the page is not used by any other
processes, so the page is migrated. This causes concurrent accessing on
page refcount between put_page() called by other process(eg. read process
calls lru_cache_add) and page_ref_unfreeze() called by migration task.

Actually zhangjun has tried to fix this problem [2] by recalculating page
refcnt in ubifs_migrate_page(). It's better to follow MM rules [1], because
just like Kirill suggested in [2], we need to check all users of
page_has_private() helper. Like f2fs does in [3], fix it by adding/deleting
refcount when setting/clearing private for a page. BTW, according to [4],
we set 'page->private' as 1 because ubifs just simply SetPagePrivate().
And, [5] provided a common helper to set/clear page private, ubifs can
use this helper following the example of iomap, afs, btrfs, etc.

Jump [6] to find a reproducer.

[1] https://lore.kernel.org/lkml/[email protected]
[2] https://www.spinics.net/lists/linux-mtd/msg04018.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1903.0/03313.html
[4] https://lore.kernel.org/linux-f2fs-devel/[email protected]
[5] https://lore.kernel.org/all/[email protected]
[6] https://bugzilla.kernel.org/show_bug.cgi?id=214961

Fixes: 1e51764 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
digetx pushed a commit that referenced this issue Jan 16, 2022
…el/git/paulmck/linux-rcu into timers/core

Pull clocksource watchdog updates from Paul McKenney:

 - Avoid accidental unstable marking of clocksources by rejecting
   clocksource measurements where the source of the skew is the delay
   reading reference clocksource itself.  This change avoids many of the
   current false positives caused by epic cache-thrashing workloads.

 - Reduce the default clocksource_watchdog() retries to 2, thus offsetting
   the increased overhead due to #1 above rereading the reference
   clocksource.

Link: https://lore.kernel.org/lkml/20220105001723.GA536708@paulmck-ThinkPad-P17-Gen-1
digetx pushed a commit that referenced this issue Jan 16, 2022
…rors

Use down_read_nested() and down_write_nested() when taking the
ctrl->reset_lock rw-sem, passing the number of PCIe hotplug controllers in
the path to the PCI root bus as lock subclass parameter.

This fixes the following false-positive lockdep report when unplugging a
Lenovo X1C8 from a Lenovo 2nd gen TB3 dock:

  pcieport 0000:06:01.0: pciehp: Slot(1): Link Down
  pcieport 0000:06:01.0: pciehp: Slot(1): Card not present
  ============================================
  WARNING: possible recursive locking detected
  5.16.0-rc2+ #621 Not tainted
  --------------------------------------------
  irq/124-pciehp/86 is trying to acquire lock:
  ffff8e5ac4299ef8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_check_presence+0x23/0x80

  but task is already holding lock:
  ffff8e5ac4298af8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xf3/0x180

   other info that might help us debug this:
   Possible unsafe locking scenario:

	 CPU0
	 ----
    lock(&ctrl->reset_lock);
    lock(&ctrl->reset_lock);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  3 locks held by irq/124-pciehp/86:
   #0: ffff8e5ac4298af8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xf3/0x180
   #1: ffffffffa3b024e8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pciehp_unconfigure_device+0x31/0x110
   #2: ffff8e5ac1ee2248 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x1c/0x40

  stack backtrace:
  CPU: 4 PID: 86 Comm: irq/124-pciehp Not tainted 5.16.0-rc2+ #621
  Hardware name: LENOVO 20U90SIT19/20U90SIT19, BIOS N2WET30W (1.20 ) 08/26/2021
  Call Trace:
   <TASK>
   dump_stack_lvl+0x59/0x73
   __lock_acquire.cold+0xc5/0x2c6
   lock_acquire+0xb5/0x2b0
   down_read+0x3e/0x50
   pciehp_check_presence+0x23/0x80
   pciehp_runtime_resume+0x5c/0xa0
   device_for_each_child+0x45/0x70
   pcie_port_device_runtime_resume+0x20/0x30
   pci_pm_runtime_resume+0xa7/0xc0
   __rpm_callback+0x41/0x110
   rpm_callback+0x59/0x70
   rpm_resume+0x512/0x7b0
   __pm_runtime_resume+0x4a/0x90
   __device_release_driver+0x28/0x240
   device_release_driver+0x26/0x40
   pci_stop_bus_device+0x68/0x90
   pci_stop_bus_device+0x2c/0x90
   pci_stop_and_remove_bus_device+0xe/0x20
   pciehp_unconfigure_device+0x6c/0x110
   pciehp_disable_slot+0x5b/0xe0
   pciehp_handle_presence_or_link_change+0xc3/0x2f0
   pciehp_ist+0x179/0x180

This lockdep warning is triggered because with Thunderbolt, hotplug ports
are nested. When removing multiple devices in a daisy-chain, each hotplug
port's reset_lock may be acquired recursively. It's never the same lock, so
the lockdep splat is a false positive.

Because locks at the same hierarchy level are never acquired recursively, a
per-level lockdep class is sufficient to fix the lockdep warning.

The choice to use one lockdep subclass per pcie-hotplug controller in the
path to the root-bus was made to conserve class keys because their number
is limited and the complexity grows quadratically with number of keys
according to Documentation/locking/lockdep-design.rst.

Link: https://lore.kernel.org/linux-pci/[email protected]/
Link: https://lore.kernel.org/linux-pci/[email protected]/
Link: https://lore.kernel.org/r/[email protected]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=208855
Reported-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Lukas Wunner <[email protected]>
Cc: [email protected]
digetx pushed a commit that referenced this issue Jan 16, 2022
I missed that @ndev value can be NULL.

I prefer not factorizing this NULL check, and instead
clearly document where a NULL might be expected.

general protection fault, probably for non-canonical address 0xdffffc00000000ba: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000005d0-0x00000000000005d7]
CPU: 0 PID: 19875 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__lock_acquire+0xd7a/0x5470 kernel/locking/lockdep.c:4897
Code: 14 0e 41 bf 01 00 00 00 0f 86 c8 00 00 00 89 05 5c 20 14 0e e9 bd 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 9f 2e 00 00 49 81 3e 20 c5 1a 8f 0f 84 52 f3 ff
RSP: 0018:ffffc900057071d0 EFLAGS: 00010002
RAX: dffffc0000000000 RBX: 1ffff92000ae0e65 RCX: 1ffff92000ae0e4c
RDX: 00000000000000ba RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
R10: fffffbfff1b24ae2 R11: 000000000008808a R12: 0000000000000000
R13: ffff888040ca4000 R14: 00000000000005d0 R15: 0000000000000000
FS:  00007fbd683e0700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2be22000 CR3: 0000000013fea000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 lock_acquire kernel/locking/lockdep.c:5637 [inline]
 lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
 ref_tracker_alloc+0x182/0x440 lib/ref_tracker.c:84
 netdev_tracker_alloc include/linux/netdevice.h:3859 [inline]
 smc_pnet_add_eth net/smc/smc_pnet.c:372 [inline]
 smc_pnet_enter net/smc/smc_pnet.c:492 [inline]
 smc_pnet_add+0x49a/0x14d0 net/smc/smc_pnet.c:555
 genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
 genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
 genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
 netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:725
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2413
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: b606452 ("net/smc: add net device tracker to struct smc_pnetentry")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this issue Jan 16, 2022
If the key is already present then free the key used for lookup.

Found with:
$ perf stat -M IO_Read_BW /bin/true

==1749112==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 32 byte(s) in 4 object(s) allocated from:
    #0 0x7f6f6fa7d7cf in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x55acecd9d7a6 in check_per_pkg util/stat.c:343
    #2 0x55acecd9d9c5 in process_counter_values util/stat.c:365
    #3 0x55acecd9e0ab in process_counter_maps util/stat.c:421
    #4 0x55acecd9e292 in perf_stat_process_counter util/stat.c:443
    #5 0x55aceca8553e in read_counters ./tools/perf/builtin-stat.c:470
    #6 0x55aceca88fe3 in __run_perf_stat ./tools/perf/builtin-stat.c:1023
    #7 0x55aceca89146 in run_perf_stat ./tools/perf/builtin-stat.c:1048
    #8 0x55aceca90858 in cmd_stat ./tools/perf/builtin-stat.c:2555
    #9 0x55acecc05fa5 in run_builtin ./tools/perf/perf.c:313
    #10 0x55acecc064fe in handle_internal_command ./tools/perf/perf.c:365
    #11 0x55acecc068bb in run_argv ./tools/perf/perf.c:409
    #12 0x55acecc070aa in main ./tools/perf/perf.c:539

Reviewed-by: James Clark <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mathieu Poirier <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Clarke <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Suzuki Poulouse <[email protected]>
Cc: Vineet Singh <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
digetx pushed a commit that referenced this issue Jan 16, 2022
…nters

Pingfan reported that the following causes a fault:

  echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
  echo 1 > events/syscalls/sys_enter_at/enable

The reason is that trace event filter treats the user space pointer
defined by "filename" as a normal pointer to compare against the "cpu"
string. The following bug happened:

 kvm-03-guest16 login: [72198.026181] BUG: unable to handle page fault for address: 00007fffaae8ef60
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0001) - permissions violation
 PGD 80000001008b7067 P4D 80000001008b7067 PUD 2393f1067 PMD 2393ec067 PTE 8000000108f47867
 Oops: 0001 [#1] PREEMPT SMP PTI
 CPU: 1 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-32.el9.x86_64 #1
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:strlen+0x0/0x20
 Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11
       48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8
       48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
 RSP: 0018:ffffb5b900013e48 EFLAGS: 00010246
 RAX: 0000000000000018 RBX: ffff8fc1c49ede00 RCX: 0000000000000000
 RDX: 0000000000000020 RSI: ffff8fc1c02d601c RDI: 00007fffaae8ef60
 RBP: 00007fffaae8ef60 R08: 0005034f4ddb8ea4 R09: 0000000000000000
 R10: ffff8fc1c02d601c R11: 0000000000000000 R12: ffff8fc1c8a6e380
 R13: 0000000000000000 R14: ffff8fc1c02d6010 R15: ffff8fc1c00453c0
 FS:  00007fa86123db40(0000) GS:ffff8fc2ffd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fffaae8ef60 CR3: 0000000102880001 CR4: 00000000007706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  filter_pred_pchar+0x18/0x40
  filter_match_preds+0x31/0x70
  ftrace_syscall_enter+0x27a/0x2c0
  syscall_trace_enter.constprop.0+0x1aa/0x1d0
  do_syscall_64+0x16/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7fa861d88664

The above happened because the kernel tried to access user space directly
and triggered a "supervisor read access in kernel mode" fault. Worse yet,
the memory could not even be loaded yet, and a SEGFAULT could happen as
well. This could be true for kernel space accessing as well.

To be even more robust, test both kernel and user space strings. If the
string fails to read, then simply have the filter fail.

Note, TASK_SIZE is used to determine if the pointer is user or kernel space
and the appropriate strncpy_from_kernel/user_nofault() function is used to
copy the memory. For some architectures, the compare to TASK_SIZE may always
pick user space or kernel space. If it gets it wrong, the only thing is that
the filter will fail to match. In the future, this needs to be fixed to have
the event denote which should be used. But failing a filter is much better
than panicing the machine, and that can be solved later.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]

Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Tom Zanussi <[email protected]>
Reported-by: Pingfan Liu <[email protected]>
Tested-by: Pingfan Liu <[email protected]>
Fixes: 87a342f ("tracing/filters: Support filtering for char * strings")
Signed-off-by: Steven Rostedt <[email protected]>
digetx pushed a commit that referenced this issue Jan 16, 2022
Currently three dma atomic pools are initialized as long as the relevant
kernel codes are built in.  While in kdump kernel of x86_64, this is not
right when trying to create atomic_pool_dma, because there's no managed
pages in DMA zone.  In the case, DMA zone only has low 1M memory presented
and locked down by memblock allocator.  So no pages are added into buddy
of DMA zone.  Please check commit f1d4d47 ("x86/setup: Always reserve
the first 1M of RAM").

Then in kdump kernel of x86_64, it always prints below failure message:

 DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
 swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-0.rc5.20210611git929d931f2b40.42.fc35.x86_64 #1
 Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.12.0 06/04/2018
 Call Trace:
  dump_stack+0x7f/0xa1
  warn_alloc.cold+0x72/0xd6
  ? _raw_spin_unlock_irq+0x24/0x40
  ? __alloc_pages_direct_compact+0x90/0x1b0
  __alloc_pages_slowpath.constprop.0+0xf29/0xf50
  ? __cond_resched+0x16/0x50
  ? prepare_alloc_pages.constprop.0+0x19d/0x1b0
  __alloc_pages+0x24d/0x2c0
  ? __dma_atomic_pool_init+0x93/0x93
  alloc_page_interleave+0x13/0xb0
  atomic_pool_expand+0x118/0x210
  ? __dma_atomic_pool_init+0x93/0x93
  __dma_atomic_pool_init+0x45/0x93
  dma_atomic_pool_init+0xdb/0x176
  do_one_initcall+0x67/0x320
  ? rcu_read_lock_sched_held+0x3f/0x80
  kernel_init_freeable+0x290/0x2dc
  ? rest_init+0x24f/0x24f
  kernel_init+0xa/0x111
  ret_from_fork+0x22/0x30
 Mem-Info:
 ......
 DMA: failed to allocate 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation
 DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations

Here, let's check if DMA zone has managed pages, then create
atomic_pool_dma if yes.  Otherwise just skip it.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6f599d8 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: John Donnelly  <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Laight <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this issue Jan 16, 2022
After the recent changes made by commit c2e3930 ("btrfs: clear
extent buffer uptodate when we fail to write it") and its followup fix,
commit 651740a ("btrfs: check WRITE_ERR when trying to read an
extent buffer"), we can now end up not cleaning up space reservations of
log tree extent buffers after a transaction abort happens, as well as not
cleaning up still dirty extent buffers.

This happens because if writeback for a log tree extent buffer failed,
than we have cleared the EXTENT_BUFFER_UPTODATE from the extent buffer
and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
when trying to free the log tree with free_log_tree(), which iterates
over the tree, we can end up getting an -EIO error when trying to read
a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
immediately as we can not iterate over the entire tree.

In that case we never update the reserved space for an extent buffer in
the respective block group and space_info object, as well as for any other
extent buffers that we not yet iterated over.

When this happens we get the following traces when unmounting the fs:

[174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
[174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
[174957.399379] ------------[ cut here ]------------
[174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.407523] Modules linked in: btrfs overlay dm_zero (...)
[174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
[174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.429717] Code: 21 48 8b bd (...)
[174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
[174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
[174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
[174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
[174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
[174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
[174957.439317] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.440402] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.443948] Call Trace:
[174957.444264]  <TASK>
[174957.444538]  btrfs_free_block_groups+0x255/0x3c0 [btrfs]
[174957.445238]  close_ctree+0x301/0x357 [btrfs]
[174957.445803]  ? call_rcu+0x16c/0x290
[174957.446250]  generic_shutdown_super+0x74/0x120
[174957.446832]  kill_anon_super+0x14/0x30
[174957.447305]  btrfs_kill_super+0x12/0x20 [btrfs]
[174957.447890]  deactivate_locked_super+0x31/0xa0
[174957.448440]  cleanup_mnt+0x147/0x1c0
[174957.448888]  task_work_run+0x5c/0xa0
[174957.449336]  exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.449934]  syscall_exit_to_user_mode+0x16/0x40
[174957.450512]  do_syscall_64+0x48/0xc0
[174957.450980]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.451605] RIP: 0033:0x7f328fdc4a97
[174957.452059] Code: 03 0c 00 f7 (...)
[174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.460193]  </TASK>
[174957.460534] irq event stamp: 0
[174957.461003] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.463147] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.466323] ---[ end trace bc7ee0c490bce3af ]---
[174957.467282] ------------[ cut here ]------------
[174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.470066] Modules linked in: btrfs overlay dm_zero (...)
[174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
[174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.488050] Code: 00 00 00 ad de (...)
[174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
[174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
[174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
[174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
[174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
[174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
[174957.499438] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.500990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.506167] Call Trace:
[174957.506654]  <TASK>
[174957.507047]  close_ctree+0x301/0x357 [btrfs]
[174957.507867]  ? call_rcu+0x16c/0x290
[174957.508567]  generic_shutdown_super+0x74/0x120
[174957.509447]  kill_anon_super+0x14/0x30
[174957.510194]  btrfs_kill_super+0x12/0x20 [btrfs]
[174957.511123]  deactivate_locked_super+0x31/0xa0
[174957.511976]  cleanup_mnt+0x147/0x1c0
[174957.512610]  task_work_run+0x5c/0xa0
[174957.513309]  exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.514231]  syscall_exit_to_user_mode+0x16/0x40
[174957.515069]  do_syscall_64+0x48/0xc0
[174957.515718]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.516688] RIP: 0033:0x7f328fdc4a97
[174957.517413] Code: 03 0c 00 f7 d8 (...)
[174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.529404]  </TASK>
[174957.529843] irq event stamp: 0
[174957.530256] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.532075] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
[174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
[174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
[174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
[174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
[174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
[174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
[174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0

This also means that in case we have log tree extent buffers that are
still dirty, we can end up not cleaning them up in case we find an
extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
we have no way for iterating over the rest of the tree.

This issue is very often triggered with test cases generic/475 and
generic/648 from fstests.

So instead of trying to iterate a log tree when freeing it after a
transaction abort, iterate over the io tree that tracks the ranges of the
log tree's extent buffers, and unaccount each extent buffer's range as
well as clear any extent buffers that are still dirty. Also because when
we trigger writeback of log tree extent buffers we clear their range from
the ->dirty_log_pages io tree, we need to make sure the range is kept but
with a different bit, so that we can find it if we ever run into a
transaction abort - so when triggering the writeback, instead of clearing
the range, convert it to the EXTENT_UPTODATE bit.

As a final note, while the commits mentioned above make this issue easy to
trigger with tests that use the device mapper to simulate device errors,
the issue could happen before, it was just much less likely. For example
after writing an extent buffer of a log tree, it could be evicted from
memory due to memory pressure, and then later during transaction commit
while cleaning up the log tree through walk_log_tree() we get an -EIO when
trying to read the extent buffer from disk - resulting in cleanup of the
remaining log tree extent buffers not being made.

Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants