
Issues just used as internal communication and image storage #1

Open
martinberlin opened this issue Oct 21, 2022 · 0 comments

martinberlin (Owner) commented Oct 21, 2022

The issues in this repository won't be worked on. Please refer to the original Linux kernel repositories.
The modules included in this Linux kernel:

regulator-TPS65185
rockchip_ebc

martinberlin self-assigned this Oct 21, 2022
martinberlin pushed a commit that referenced this issue Oct 25, 2022
…_transaction()

We are seeing crashes similar to the following trace:

[38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
[38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54
[38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
[38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206
[38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14
[38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360
[38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000
[38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800
[38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360
[38.987146] FS:  00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000
[38.988662] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0
[38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[38.992528] Call Trace:
[38.992854]  <TASK>
[38.993148]  btrfs_relocate_chunk+0x27/0xe0 [btrfs]
[38.993941]  btrfs_balance+0x78e/0xea0 [btrfs]
[38.994801]  ? vsnprintf+0x33c/0x520
[38.995368]  ? __kmalloc_track_caller+0x351/0x440
[38.996198]  btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs]
[38.997084]  btrfs_ioctl+0x11b0/0x2da0 [btrfs]
[38.997867]  ? mod_objcg_state+0xee/0x340
[38.998552]  ? seq_release+0x24/0x30
[38.999184]  ? proc_nr_files+0x30/0x30
[38.999654]  ? call_rcu+0xc8/0x2f0
[39.000228]  ? __x64_sys_ioctl+0x84/0xc0
[39.000872]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[39.001973]  __x64_sys_ioctl+0x84/0xc0
[39.002566]  do_syscall_64+0x3a/0x80
[39.003011]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[39.003735] RIP: 0033:0x7f11c166959b
[39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b
[39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003
[39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0
[39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3
[39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001
[39.015040]  </TASK>
[39.015418] ---[ end trace 0000000000000000 ]---
[43.131559] ------------[ cut here ]------------
[43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717!
[43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G        W         5.17.0-rc4 #54
[43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
[43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
[43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
[43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
[43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
[43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
[43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
[43.145686] FS:  00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000
[43.146808] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0
[43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[43.150559] Call Trace:
[43.150904]  <TASK>
[43.151253]  btrfs_finish_extent_commit+0x88/0x290 [btrfs]
[43.152127]  btrfs_commit_transaction+0x74f/0xaa0 [btrfs]
[43.152932]  ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
[43.153786]  btrfs_ioctl+0x1edc/0x2da0 [btrfs]
[43.154475]  ? __check_object_size+0x150/0x170
[43.155170]  ? preempt_count_add+0x49/0xa0
[43.155753]  ? __x64_sys_ioctl+0x84/0xc0
[43.156437]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[43.157456]  __x64_sys_ioctl+0x84/0xc0
[43.157980]  do_syscall_64+0x3a/0x80
[43.158543]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[43.159231] RIP: 0033:0x7f7657f1e59b
[43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b
[43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003
[43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800
[43.166863]  </TASK>
[43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common
[43.169552] ---[ end trace 0000000000000000 ]---
[43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
[43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
[43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
[43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
[43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
[43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
[43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
[43.180071] FS:  00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000
[43.181073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0
[43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

We first hit the WARN_ON(rc->block_group->pinned > 0) in
btrfs_relocate_block_group() and then the BUG_ON(!cache) in
unpin_extent_range(). This tells us that we are exiting relocation and
removing the block group with bytes still pinned for that block group.
This is supposed to be impossible: the last thing relocate_block_group()
does is commit the transaction to get rid of pinned extents.

Commit d0c2f4f ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") introduced an optimization so that
commits from fsync don't have to wait for the previous commit to unpin
extents. This was only intended to affect fsync, but it inadvertently
made it possible for any commit to skip waiting for the previous commit
to unpin. This is because if a call to btrfs_commit_transaction() finds
that another thread is already committing the transaction, it waits for
the other thread to complete the commit and then returns. If that other
thread was in fsync, then it completes the commit without completing the
previous commit. This makes the following sequence of events possible:

Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________
btrfs_commit_transaction(N) |                                     |
  btrfs_run_delayed_refs    |                                     |
    pin extents             |                                     |
  ...                       |                                     |
  state = UNBLOCKED         |btrfs_sync_file                      |
                            |  btrfs_start_transaction(N + 1)     |relocate_block_group
                            |                                     |  btrfs_join_transaction(N + 1)
                            |  btrfs_commit_transaction(N + 1)    |
  ...                       |  trans->state = COMMIT_START        |
                            |                                     |  btrfs_commit_transaction(N + 1)
                            |                                     |    wait_for_commit(N + 1, COMPLETED)
                            |  wait_for_commit(N, SUPER_COMMITTED)|
  state = SUPER_COMMITTED   |  ...                                |
  btrfs_finish_extent_commit|                                     |
    unpin_extent_range()    |  trans->state = COMPLETED           |
                            |                                     |    return
                            |                                     |
    ...                     |                                     |Thread 1 isn't done, so pinned > 0
                            |                                     |and we WARN
                            |                                     |
                            |                                     |btrfs_remove_block_group
    unpin_extent_range()    |                                     |
      Thread 3 removed the  |                                     |
      block group, so we BUG|                                     |

There are other sequences involving SUPER_COMMITTED transactions that
can cause a similar outcome.

We could fix this by making relocation explicitly wait for unpinning,
but there may be other cases that need it. Josef mentioned ENOSPC
flushing and the free space cache inode as other potential victims.
Rather than playing whack-a-mole, this fix is conservative and makes all
commits not in fsync wait for all previous transactions, which is what
the optimization intended.
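
A minimal sketch of that conservative wait, under the assumption that the
check lives at the top of btrfs_commit_transaction(); the helper names here
are illustrative, not the literal patch:

	/* sketch: before starting our own commit */
	prev = previous_transaction(trans);		/* hypothetical helper */
	if (prev && !commit_comes_from_fsync(trans))	/* hypothetical helper */
		/*
		 * Block until the previous transaction is fully COMPLETED,
		 * i.e. btrfs_finish_extent_commit() has unpinned its
		 * extents, instead of returning at SUPER_COMMITTED.
		 */
		wait_for_commit(prev, TRANS_STATE_COMPLETED);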

Fixes: d0c2f4f ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: [email protected] # 5.15+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: David Sterba <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
There is a kernel panic caused by pcpu_alloc_pages() passing an offlined
and uninitialized node to alloc_pages_node(), which then NULL-dereferences
the uninitialized NODE_DATA(nid).

 CPU2 has been hot-added
 BUG: unable to handle page fault for address: 0000000000001608
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

 RIP: 0010:__alloc_pages+0x127/0x290
 Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
 RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
 RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
 RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
 R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
 FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
 Call Trace:
  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
  pcpu_populate_chunk+0x33/0xb0
  pcpu_alloc+0x4d3/0x6f0
  __alloc_percpu_gfp+0xd/0x10
  alloc_mem_cgroup_per_node_info+0x54/0xb0
  mem_cgroup_alloc+0xed/0x2f0
  mem_cgroup_css_alloc+0x33/0x2f0
  css_create+0x3a/0x1f0
  cgroup_apply_control_enable+0x12b/0x150
  cgroup_mkdir+0xdd/0x110
  kernfs_iop_mkdir+0x4f/0x80
  vfs_mkdir+0x178/0x230
  do_mkdirat+0xfd/0x120
  __x64_sys_mkdir+0x47/0x70
  ? syscall_exit_to_user_mode+0x21/0x50
  do_syscall_64+0x43/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

The panic can easily be reproduced by disabling the udev rule that
automatically onlines hot-added CPUs, and then hot-adding a CPU that sits
on a memoryless node (a NUMA node with CPUs only).

Hot-adding a CPU and a memoryless node does not bring the node online.  A
memoryless node is onlined only when one of its CPUs is onlined.

Node can be in one of the following states:
1. not present (nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
				NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
				NODE_DATA(nid) != NULL)

The percpu code does allocations for all possible CPUs.  The issue happens
when it serves a hot-added but not yet onlined CPU whose node is in the
2nd state.  Such a node is not ready to use, so fall back to
numa_mem_id().
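
A minimal sketch of that fallback, assuming the check sits in
pcpu_alloc_pages() where the target node is chosen per CPU:

	/* sketch: choosing the node to allocate from for each CPU */
	for_each_possible_cpu(cpu) {
		int nid = cpu_to_node(cpu);

		/*
		 * A hot-added but not yet onlined CPU can sit on a present
		 * but offline node (state 2 above): NODE_DATA(nid) is NULL
		 * and allocating from it would crash, so fall back to the
		 * nearest node with memory.
		 */
		if (nid == NUMA_NO_NODE || !node_online(nid))
			nid = numa_mem_id();

		page = alloc_pages_node(nid, gfp, 0);
		...
	}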

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Makhalov <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
A kernel panic was observed while reading /proc/kpageflags for the first
few pfns allocated by a pmem namespace:

BUG: unable to handle page fault for address: fffffffffffffffe
[  114.495280] #PF: supervisor read access in kernel mode
[  114.495738] #PF: error_code(0x0000) - not-present page
[  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
[  114.496713] Oops: 0000 [#1] SMP PTI
[  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1 #1
[  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
[  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
[  114.500788] RSP: 0018:ffffa5e601a0fe60 EFLAGS: 00010202
[  114.501373] RAX: fffffffffffffffe RBX: ffffffffffffffff RCX: 0000000000000000
[  114.502009] RDX: 0000000000000001 RSI: 00007ffca13a7310 RDI: ffffd07489000000
[  114.502637] RBP: ffffd07489000000 R08: 0000000000000001 R09: 0000000000000000
[  114.503270] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000240000
[  114.503896] R13: 0000000000080000 R14: 00007ffca13a7310 R15: ffffa5e601a0ff08
[  114.504530] FS:  00007f0266c7f540(0000) GS:ffff962dbbac0000(0000) knlGS:0000000000000000
[  114.505245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  114.505754] CR2: fffffffffffffffe CR3: 000000023a204000 CR4: 00000000000006e0
[  114.506401] Call Trace:
[  114.506660]  kpageflags_read+0xb1/0x130
[  114.507051]  proc_reg_read+0x39/0x60
[  114.507387]  vfs_read+0x8a/0x140
[  114.507686]  ksys_pread64+0x61/0xa0
[  114.508021]  do_syscall_64+0x5f/0x1a0
[  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  114.508844] RIP: 0033:0x7f0266ba426b

The reason for the panic is that stable_page_flags(), which parses the
page flags, uses uninitialized struct pages reserved by the ZONE_DEVICE
driver.

An earlier approach to fixing this was discussed here:
https://marc.info/?l=linux-mm&m=152964770000672&w=2

This is another approach.  To avoid using the uninitialized struct page,
immediately return with KPF_RESERVED at the beginning of
stable_page_flags() if the page is reserved by a ZONE_DEVICE driver.
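
A sketch of that early return; the reservation test is named hypothetically
here, only the shape of the check is taken from the description above:

	u64 stable_page_flags(struct page *page)
	{
		/*
		 * struct pages covering a ZONE_DEVICE driver's reserved
		 * metadata (e.g. the pmem info-block and padding) are never
		 * initialized, so report KPF_RESERVED instead of parsing
		 * their flags.
		 */
		if (pfn_zone_device_reserved(page_to_pfn(page)))  /* hypothetical */
			return BIT_ULL(KPF_RESERVED);
		...
	}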

Dan said:

: The nvdimm implementation uses vmem_altmap to arrange for the 'struct
: page' array to be allocated from a reservation of a pmem namespace.  A
: namespace in this mode contains an info-block that consumes the first
: 8K of the namespace capacity, capacity designated for page mapping,
: capacity for padding the start of data to optionally 4K, 2MB, or 1GB
: (on x86), and then the namespace data itself.  The implementation
: specifies a section aligned (now sub-section aligned) address to
: arch_add_memory() to establish the linear mapping to map the metadata,
: and then vmem_altmap indicates to memmap_init_zone() which pfns
: represent data.  The implementation only specifies enough 'struct page'
: capacity for pfn_to_page() to operate on the data space, not the
: namespace metadata space.
:
: The proposal to validate ZONE_DEVICE pfns against the altmap seems the
: right approach to me.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Toshiki Fukasawa <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Junichi Nomura <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
Patch series "mm: enforce pageblock_order < MAX_ORDER".

Having pageblock_order >= MAX_ORDER seems to be possible in corner cases,
and some parts of the kernel are not prepared for it.

For example, Aneesh has shown [1] that such kernels can be compiled on
ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will run
into a WARN_ON_ONCE(order >= MAX_ORDER) in compaction code right during
boot.

We can get pageblock_order >= MAX_ORDER when the default hugetlb size is
bigger than the maximum allocation granularity of the buddy, in which case
we are no longer talking about huge pages but instead gigantic pages.

Having pageblock_order >= MAX_ORDER can only make alloc_contig_range() of
such gigantic pages more likely to succeed.

Reliable use of gigantic pages either requires boot time allocation or
CMA; there is no need to overcomplicate some places in the kernel to
optimize for corner cases that are broken in other areas of the kernel.

This patch (of 2):

Let's enforce pageblock_order < MAX_ORDER and simplify.
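
As a rough sketch of what such an enforcement looks like (assuming the
clamp is applied where pageblock_order is derived from the huge page
order; this is the shape, not necessarily the literal patch):

	/* sketch: never let pageblock_order exceed what the buddy can serve */
	#define pageblock_order	min_t(unsigned int, HUGETLB_PAGE_ORDER, \
				      MAX_ORDER - 1)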

Especially patch #1 can be regarded as a cleanup before:
	[PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range
	alignment. [2]

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Acked-by: Rob Herring <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Frank Rowand <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: John Garry via iommu <[email protected]>

Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
ZONE_MOVABLE uses the remaining memory in each node.  Its starting pfn is
also aligned to MAX_ORDER_NR_PAGES.  It is possible for the remaining
memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is not
enough room for ZONE_MOVABLE on that node.

Unfortunately this condition is not checked for.  This leads to
zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
node.

calculate_node_totalpages() then sets zone->present_pages to be greater
than zone->spanned_pages which is invalid, as spanned_pages represents the
maximum number of pages in a zone assuming no holes.

Subsequently it is possible free_area_init_core() will observe a zone of
size zero with present pages.  In this case it will skip setting up the
zone, including the initialisation of free_lists[].

However, populated_zone() checks zone->present_pages to see if a zone has
memory available.  This is used by iterators such as walk_zones_in_node().
pagetypeinfo_showfree() uses this to walk the free_list of each zone in
each node, which are assumed to be initialised due to the zone not being
empty.  As free_area_init_core() never initialised the free_lists[], this
results in the following kernel crash when trying to read
/proc/pagetypeinfo:

[   67.534914] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   67.535429] #PF: supervisor read access in kernel mode
[   67.535789] #PF: error_code(0x0000) - not-present page
[   67.536128] PGD 0 P4D 0
[   67.536305] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
[   67.537096] CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
[   67.537096] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[   67.537638] RIP: 0010:pagetypeinfo_show+0x163/0x460
[   67.537992] Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
[   67.538259] RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
[   67.538259] RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
[   67.538259] RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
[   67.538259] RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
[   67.538259] R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
[   67.538259] R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
[   67.538259] FS:  00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[   67.538259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   67.538259] CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
[   67.538259] Call Trace:
[   67.538259]  <TASK>
[   67.538259]  seq_read_iter+0x128/0x460
[   67.538259]  ? aa_file_perm+0x1af/0x5f0
[   67.538259]  proc_reg_read_iter+0x51/0x80
[   67.538259]  ? lock_is_held_type+0xea/0x140
[   67.538259]  new_sync_read+0x113/0x1a0
[   67.538259]  vfs_read+0x136/0x1d0
[   67.538259]  ksys_read+0x70/0xf0
[   67.538259]  __x64_sys_read+0x1a/0x20
[   67.538259]  do_syscall_64+0x3b/0xc0
[   67.538259]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   67.538259] RIP: 0033:0x7f9c83e23cce
[   67.538259] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 13 0a 00 e8 c9 e3 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[   67.538259] RSP: 002b:00007fff116e1a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   67.538259] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f9c83e23cce
[   67.538259] RDX: 0000000000020000 RSI: 00007f9c83a2c000 RDI: 0000000000000003
[   67.538259] RBP: 00007f9c83a2c000 R08: 00007f9c83a2b010 R09: 0000000000000000
[   67.538259] R10: 00007f9c83f2d7d0 R11: 0000000000000246 R12: 0000000000000000
[   67.538259] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[   67.538259]  </TASK>

Fix this by checking that the aligned zone_movable_pfn[] does not exceed
the end of the node, and if it does skip creating a movable zone on this
node.
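
A minimal sketch of that check, assuming it lands in
find_zone_movable_pfns_for_nodes() right after the alignment step:

	/* sketch: align the start of ZONE_MOVABLE for this node */
	zone_movable_pfn[nid] = roundup(zone_movable_pfn[nid],
					MAX_ORDER_NR_PAGES);

	/*
	 * The remaining memory in the node may be smaller than
	 * MAX_ORDER_NR_PAGES, in which case there is no room for
	 * ZONE_MOVABLE: skip creating it on this node.
	 */
	if (zone_movable_pfn[nid] >= end_pfn)
		zone_movable_pfn[nid] = 0;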

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2a1e274 ("Create the ZONE_MOVABLE zone")
Signed-off-by: Alistair Popple <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
On x86, prior to ("mm: handle uninitialized numa nodes gracefully"), NUMA
nodes could be allocated at three different places.

- numa_register_memblks
- init_cpu_to_node
- init_gi_nodes

All these calls happen at setup_arch, and have the following order:

setup_arch
  ...
  x86_numa_init
   numa_init
    numa_register_memblks
  ...
  init_cpu_to_node
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node
  init_gi_nodes
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node

numa_register_memblks() is only interested in those nodes which have
memory, so it skips over any memoryless node it finds.  Later on, when we
have read ACPI's SRAT table, we call init_cpu_to_node() and
init_gi_nodes(), which initialize any memoryless nodes we might have that
have either CPU or Initiator affinity, meaning we allocate a pg_data_t
struct for them and mark them as ONLINE.

So far so good, but the thing is that after ("mm: handle uninitialized
numa nodes gracefully"), we allocate all possible NUMA nodes in
free_area_init(), meaning we have a picture like the following:

setup_arch
  x86_numa_init
   numa_init
    numa_register_memblks  <-- allocate non-memoryless node
  x86_init.paging.pagetable_init
   ...
    free_area_init
     free_area_init_memoryless <-- allocate memoryless node
  init_cpu_to_node
   alloc_node_data             <-- allocate memoryless node with CPU
   free_area_init_memoryless_node
  init_gi_nodes
   alloc_node_data             <-- allocate memoryless node with Initiator
   free_area_init_memoryless_node

free_area_init() already allocates all possible NUMA nodes, but
init_cpu_to_node() and init_gi_nodes() are clueless about that, so they go
ahead and allocate a new pg_data_t struct without checking anything,
meaning we end up allocating twice.

It should be made clear that this only happens in the case where a
memoryless NUMA node happens to have CPU/Initiator affinity.

So get rid of init_memory_less_node() and just set the node online.
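
A sketch of the resulting shape, assuming free_area_init() has already
allocated the pg_data_t for every possible node (the helper name is
hypothetical):

	/* sketch: called from init_cpu_to_node() / init_gi_nodes() */
	static void __init node_online_if_needed(int nid)	/* hypothetical */
	{
		/*
		 * free_area_init() already allocated NODE_DATA(nid) for all
		 * possible nodes; allocating again here would leak it.
		 * Only mark the node online.
		 */
		if (!node_online(nid))
			node_set_online(nid);
	}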

Note that setting the node online is needed, otherwise we choke down the
chain when bringup_nonboot_cpus() ends up calling
__try_online_node()->register_one_node()->...  and we blow up in
bus_add_device().  As can be seen here:

==========
[    0.585060] BUG: kernel NULL pointer dereference, address: 0000000000000060
[    0.586091] #PF: supervisor read access in kernel mode
[    0.586831] #PF: error_code(0x0000) - not-present page
[    0.586930] PGD 0 P4D 0
[    0.586930] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[    0.586930] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
[    0.586930] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
[    0.586930] RIP: 0010:bus_add_device+0x5a/0x140
[    0.586930] Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
[    0.586930] RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
[    0.586930] RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
[    0.586930] RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
[    0.586930] RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
[    0.586930] R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
[    0.586930] R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
[    0.586930] FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
[    0.586930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.586930] CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
[    0.586930] Call Trace:
[    0.586930]  <TASK>
[    0.586930]  device_add+0x4c0/0x910
[    0.586930]  __register_one_node+0x97/0x2d0
[    0.586930]  __try_online_node+0x85/0xc0
[    0.586930]  try_online_node+0x25/0x40
[    0.586930]  cpu_up+0x4f/0x100
[    0.586930]  bringup_nonboot_cpus+0x4f/0x60
[    0.586930]  smp_init+0x26/0x79
[    0.586930]  kernel_init_freeable+0x130/0x2f1
[    0.586930]  ? rest_init+0x100/0x100
[    0.586930]  kernel_init+0x17/0x150
[    0.586930]  ? rest_init+0x100/0x100
[    0.586930]  ret_from_fork+0x22/0x30
[    0.586930]  </TASK>
[    0.586930] Modules linked in:
[    0.586930] CR2: 0000000000000060
[    0.586930] ---[ end trace 0000000000000000 ]---
==========

The reason is simple: by the time bringup_nonboot_cpus() gets called, we
have not registered the node_subsys bus yet, so we crash when
bus_add_device() tries to dereference bus()->p.

The following shows the order of the calls:

kernel_init_freeable
 smp_init
  bringup_nonboot_cpus
   ...
     bus_add_device()      <- we did not register node_subsys yet
 do_basic_setup
  do_initcalls
   postcore_initcall(register_node_type);
    register_node_type
     subsys_system_register
      subsys_register
       bus_register         <- register node_subsys bus

Why does setting the node online save us, then?  Well, simply because
__try_online_node() backs off when the node is online, meaning we do not
end up calling register_one_node() in the first place.

This is subtle and broken, and deserves a deep analysis and thought about
how to put it into shape, but for now let us have this easy fix for the
memory-leaking issue.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: da4490c ("mm: handle uninitialized numa nodes gracefully")
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Alexey Makhalov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
…onize_rcu

On systems that run FIFO:1 applications that busy-loop on isolated CPUs,
executing tasks on such CPUs at a lower priority is undesired (since that
will either hang the system or cause a longer interruption to the FIFO
task due to execution of a lower-priority task with very small sched
slices).

Commit d479960 ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

However, it's possible to use synchronize_rcu(), which will provide the
same guarantees (see the comment this patch modifies on
lru_cache_disable()).
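
A hedged sketch of the resulting lru_cache_disable(), keeping the drain
but replacing the all-CPU work-queue handshake with synchronize_rcu():

	void lru_cache_disable(void)
	{
		atomic_inc(&lru_disable_count);
		/*
		 * Readers of lru_disable_count are protected by either
		 * disabling preemption or rcu_read_lock(), so one
		 * synchronize_rcu() is enough for them to observe the
		 * increment; no work items are queued on isolated CPUs.
		 */
		synchronize_rcu();
	#ifdef CONFIG_SMP
		__lru_add_drain_all(true);
	#else
		lru_add_and_bh_lrus_drain();
	#endif
	}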

Fixes:

[ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
[ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
[ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
[ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
[ 1873.243936] Call Trace:
[ 1873.243938]  __schedule+0x21b/0x5b0
[ 1873.243941]  schedule+0x43/0xe0
[ 1873.243943]  schedule_timeout+0x14d/0x190
[ 1873.243946]  ? resched_curr+0x20/0xe0
[ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
[ 1873.243958]  wait_for_completion+0x84/0xe0
[ 1873.243962]  __flush_work.isra.0+0x146/0x200
[ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
[ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
[ 1873.243978]  do_migrate_pages+0x3d/0x2d0
[ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
[ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
[ 1873.243992]  ? pick_next_task+0xb30/0xbd0
[ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
[ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
[ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
[ 1873.244005]  ? __switch_to+0x12f/0x510
[ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
[ 1873.244016]  process_one_work+0x1e0/0x410
[ 1873.244019]  worker_thread+0x50/0x3b0
[ 1873.244022]  ? process_one_work+0x410/0x410
[ 1873.244024]  kthread+0x173/0x190
[ 1873.244027]  ? set_kthread_struct+0x40/0x40
[ 1873.244031]  ret_from_fork+0x1f/0x30

Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
…k_under_node()

Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.

I remember talking to Michal in the past about removing
test_pages_in_a_zone(), which we use for:
* verifying that a memory block we intend to offline is really only managed
  by a single zone. We don't support offlining of memory blocks that are
  managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
* exposing that zone to user space via
  /sys/devices/system/memory/memory*/valid_zones

Now that I identified some more cases where test_pages_in_a_zone() might
go wrong, and we received a UBSAN report (see patch #3), let's get rid of
this PFN walker.

So instead of detecting the zone at runtime with test_pages_in_a_zone() by
scanning the memmap, let's determine and remember for each memory block if
it's managed by a single zone.  The stored zone can then be used for the
above two cases, avoiding a manual lookup using test_pages_in_a_zone().

This avoids eventually stumbling over uninitialized memmaps in corner
cases, especially when ZONE_DEVICE ranges partly fall into memory blocks
(which are responsible for managing System RAM).

Handling memory onlining is easy, because we online to exactly one zone.
Handling boot memory is more tricky, because we want to avoid scanning all
zones of all nodes to detect possible zones that overlap with the physical
memory region of interest.  Fortunately, we already have code that
determines the applicable nodes for a memory block, to create sysfs links
-- we'll hook into that.
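
As a sketch of the idea (assuming the zone is simply remembered on the
memory block device; the field placement is illustrative):

	struct memory_block {
		...
		/*
		 * Set iff the memory block is known to be spanned by a
		 * single zone (recorded when onlining, or derived for boot
		 * memory); NULL when multiple zones may apply.
		 */
		struct zone *zone;
	};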

Patch #1 is a simple cleanup I had lying around for a longer time.
Patch #2 contains the main logic to remove test_pages_in_a_zone() and
further details.

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]

This patch (of 2):

Let's adjust the stale terminology, making it match
unregister_memory_block_under_nodes() and
do_register_memory_block_under_node().  We're dealing with memory block
devices, which span 1..X memory sections.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Rafael Parra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
Lockdep noticed that there is a chance of a deadlock if we have concurrent
mmap, concurrent read, and the addition/removal of a callback.

As nicely explained by Boqun:

"
Lockdep warned about the above sequences because rw_semaphore is a fair
read-write lock, and the following can cause a deadlock:

	TASK 1			TASK 2		TASK 3
	======			======		======
	down_write(mmap_lock);
				down_read(vmcore_cb_rwsem)
						down_write(vmcore_cb_rwsem); // blocked
	down_read(vmcore_cb_rwsem); // cannot get the lock because of the fairness
				down_read(mmap_lock); // blocked

IOW, a reader can block another read if there is a writer queued by the
second reader and the lock is fair.
"

To fix, convert to srcu to make this deadlock impossible. We need srcu as
our callbacks can sleep. With this change, I cannot trigger any lockdep
warnings.
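
A sketch of the conversion, assuming a static SRCU domain replaces
vmcore_cb_rwsem (SRCU read sides may sleep, which the callbacks need):

	DEFINE_STATIC_SRCU(vmcore_cb_srcu);

	/* reader side, e.g. in the read and mmap paths */
	idx = srcu_read_lock(&vmcore_cb_srcu);
	/* ... iterate and invoke registered callbacks, which may sleep ... */
	srcu_read_unlock(&vmcore_cb_srcu, idx);

	/* writer side, when unregistering a callback */
	list_del_rcu(&cb->next);
	synchronize_srcu(&vmcore_cb_srcu);	/* wait for in-flight readers */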

[    6.386519] ======================================================
[    6.387203] WARNING: possible circular locking dependency detected
[    6.387965] 5.17.0-0.rc0.20220117git0c947b893d69.68.test.fc36.x86_64 #1 Not tainted
[    6.388899] ------------------------------------------------------
[    6.389657] makedumpfile/542 is trying to acquire lock:
[    6.390308] ffffffff832d2eb8 (vmcore_cb_rwsem){.+.+}-{3:3}, at: mmap_vmcore+0x340/0x580
[    6.391290]
[    6.391290] but task is already holding lock:
[    6.391978] ffff8880af226438 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x150
[    6.392898]
[    6.392898] which lock already depends on the new lock.
[    6.392898]
[    6.393866]
[    6.393866] the existing dependency chain (in reverse order) is:
[    6.394762]
[    6.394762] -> #1 (&mm->mmap_lock#2){++++}-{3:3}:
[    6.395530]        lock_acquire+0xc3/0x1a0
[    6.396047]        __might_fault+0x4e/0x70
[    6.396562]        _copy_to_user+0x1f/0x90
[    6.397093]        __copy_oldmem_page+0x72/0xc0
[    6.397663]        read_from_oldmem+0x77/0x1e0
[    6.398229]        read_vmcore+0x2c2/0x310
[    6.398742]        proc_reg_read+0x47/0xa0
[    6.399265]        vfs_read+0x101/0x340
[    6.399751]        __x64_sys_pread64+0x5d/0xa0
[    6.400314]        do_syscall_64+0x43/0x90
[    6.400778]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[    6.401390]
[    6.401390] -> #0 (vmcore_cb_rwsem){.+.+}-{3:3}:
[    6.402063]        validate_chain+0x9f4/0x2670
[    6.402560]        __lock_acquire+0x8f7/0xbc0
[    6.403054]        lock_acquire+0xc3/0x1a0
[    6.403509]        down_read+0x4a/0x140
[    6.403948]        mmap_vmcore+0x340/0x580
[    6.404403]        proc_reg_mmap+0x3e/0x90
[    6.404866]        mmap_region+0x504/0x880
[    6.405322]        do_mmap+0x38a/0x520
[    6.405744]        vm_mmap_pgoff+0xc1/0x150
[    6.406258]        ksys_mmap_pgoff+0x178/0x200
[    6.406823]        do_syscall_64+0x43/0x90
[    6.407339]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[    6.407975]
[    6.407975] other info that might help us debug this:
[    6.407975]
[    6.408945]  Possible unsafe locking scenario:
[    6.408945]
[    6.409684]        CPU0                    CPU1
[    6.410196]        ----                    ----
[    6.410703]   lock(&mm->mmap_lock#2);
[    6.411121]                                lock(vmcore_cb_rwsem);
[    6.411792]                                lock(&mm->mmap_lock#2);
[    6.412465]   lock(vmcore_cb_rwsem);
[    6.412873]
[    6.412873]  *** DEADLOCK ***
[    6.412873]
[    6.413522] 1 lock held by makedumpfile/542:
[    6.414006]  #0: ffff8880af226438 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x150
[    6.414944]
[    6.414944] stack backtrace:
[    6.415432] CPU: 0 PID: 542 Comm: makedumpfile Not tainted 5.17.0-0.rc0.20220117git0c947b893d69.68.test.fc36.x86_64 #1
[    6.416581] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[    6.417272] Call Trace:
[    6.417593]  <TASK>
[    6.417882]  dump_stack_lvl+0x5d/0x78
[    6.418346]  print_circular_bug+0x5d7/0x5f0
[    6.418821]  ? stack_trace_save+0x3a/0x50
[    6.419273]  ? save_trace+0x3d/0x330
[    6.419681]  check_noncircular+0xd1/0xe0
[    6.420217]  validate_chain+0x9f4/0x2670
[    6.420715]  ? __lock_acquire+0x8f7/0xbc0
[    6.421234]  ? __lock_acquire+0x8f7/0xbc0
[    6.421685]  __lock_acquire+0x8f7/0xbc0
[    6.422127]  lock_acquire+0xc3/0x1a0
[    6.422535]  ? mmap_vmcore+0x340/0x580
[    6.422965]  ? lock_is_held_type+0xe2/0x140
[    6.423432]  ? mmap_vmcore+0x340/0x580
[    6.423893]  down_read+0x4a/0x140
[    6.424321]  ? mmap_vmcore+0x340/0x580
[    6.424800]  mmap_vmcore+0x340/0x580
[    6.425237]  ? vm_area_alloc+0x1c/0x60
[    6.425661]  ? trace_kmem_cache_alloc+0x30/0xe0
[    6.426174]  ? kmem_cache_alloc+0x1e0/0x2f0
[    6.426641]  proc_reg_mmap+0x3e/0x90
[    6.427052]  mmap_region+0x504/0x880
[    6.427462]  do_mmap+0x38a/0x520
[    6.427842]  vm_mmap_pgoff+0xc1/0x150
[    6.428260]  ksys_mmap_pgoff+0x178/0x200
[    6.428701]  do_syscall_64+0x43/0x90
[    6.429126]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    6.429745] RIP: 0033:0x7fc7359b8fc7
[    6.430157] Code: 00 00 00 89 ef e8 69 b3 ff ff eb e4 e8 c2 64 01 00 66 90 f3 0f 1e fa 41 89 ca 41 f7 c1 ff 0f 00 00 75 10 b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 21 c3 48 8b 05 21 7e 0e 00 64 c7 00 16 00 00
[    6.432147] RSP: 002b:00007fff35b4c208 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
[    6.432970] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fc7359b8fc7
[    6.433746] RDX: 0000000000000001 RSI: 0000000000400000 RDI: 0000000000000000
[    6.434529] RBP: 000055a1125ecf10 R08: 0000000000000003 R09: 0000000000002000
[    6.435310] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000002000
[    6.436093] R13: 0000000000400000 R14: 000055a1124269e2 R15: 0000000000000000
[    6.436887]  </TASK>

Link: https://lkml.kernel.org/r/[email protected]
Fixes: cc5f270 ("proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Baoquan He <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3.

This series attempts to optimize and streamline the COW logic for ordinary
anon pages and THP anon pages, fixing two remaining instances of
CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information
can leak from a parent process to a child process via anonymous pages
shared during fork().

This issue, including other related COW issues, has been summarized in [2]:
"
  1. Observing Memory Modifications of Private Pages From A Child Process

  Long story short: process-private memory might not be as private as you
  think once you fork(): successive modifications of private memory
  regions in the parent process can still be observed by the child
  process, for example, by smart use of vmsplice()+munmap().

  The core problem is that pinning pages readable in a child process, such
  as done via the vmsplice system call, can result in a child process
  observing memory modifications done in the parent process the child is
  not supposed to observe. [1] contains an excellent summary and [2]
  contains further details. This issue was assigned CVE-2020-29374 [9].

  For this to trigger, it's required to use a fork() without subsequent
  exec(), for example, as used under Android zygote. Without further
  details about an application that forks less-privileged child processes,
  one cannot really say what's actually affected and what's not -- see the
  details section the end of this mail for a short sshd/openssh analysis.

  While commit 1783985 ("gup: document and work around "COW can break
  either way" issue") fixed this issue and resulted in other problems
  (e.g., ptrace on pmem), commit 09854ba ("mm: do_wp_page()
  simplification") re-introduced part of the problem unfortunately.

  The original reproducer can be modified quite easily to use THP [3] and
  make the issue appear again on upstream kernels. I modified it to use
  hugetlb [4] and it triggers as well. The problem is certainly less
  severe with hugetlb than with THP; it merely highlights that we still
  have plenty of open holes we should be closing/fixing.

  Regarding vmsplice(), the only known workaround is to disallow the
  vmsplice() system call ... or disable THP and hugetlb. But who knows
  what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
  the end, it's a more generic issue.
"

This security issue was first reported by Jann Horn on 27 May 2020 and it
currently affects anonymous pages during swapin, anonymous THP and hugetlb.
This series tackles anonymous pages during swapin and anonymous THP:
* do_swap_page() for handling COW on PTEs during swapin directly
* do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
  faults

With this series, we'll apply the same COW logic we have in do_wp_page()
to all swappable anon pages: don't reuse (map writable) the page in
case there are additional references (page_count() != 1). All users of
reuse_swap_page() are removed, and consequently reuse_swap_page() itself
is removed.

In general, we're struggling with the following COW-related issues:
(1) "missed COW": we miss to copy on write and reuse the page (map it
    writable) although we must copy because there are pending references
    from another process to this page. The result is a security issue.
(2) "wrong COW": we copy on write although we wouldn't have to and
    shouldn't: if there are valid GUP references, they will become out of
    sync with the pages mapped into the page table. We fail to detect that
    such a page can be reused safely, especially if never more than a
    single process mapped the page. The result is an intra process
    memory corruption.
(3) "unnecessary COW": we copy on write although we wouldn't have to:
    performance degradation and temporary increases swap+memory consumption
    can be the result.

While this series fixes (1) for swappable anon pages, it first tries to
reduce reported cases of (3) as well and as easily as possible, to limit
the impact when streamlining. The individual patches try to describe in
which cases we will run into (3).

This series certainly makes (2) worse for THP, because a THP will now get
PTE-mapped on write faults if there are additional references, even if
there was only ever a single process involved: once PTE-mapped, we'll copy
each and every subpage and won't reuse any subpage as long as the
underlying compound page wasn't split.

I'm working on an approach to fix (2) and improve (3): PageAnonExclusive to
mark anon pages that are exclusive to a single process, allow GUP pins only
on such exclusive pages, and allow turning exclusive pages shared
(clearing PageAnonExclusive) only if there are no GUP pins. Anon pages with
PageAnonExclusive set never have to be copied during write faults, but
eventually during fork() if they cannot be turned shared. The improved
reuse logic in this series will essentially also be the logic to reset
PageAnonExclusive. This work will certainly take a while, but I'm planning
on sharing details before having code fully ready.

#1-#5 can be applied independently of the rest. #6-#9 are mostly only
cleanups related to reuse_swap_page().

Notes:
* For now, I'll leave hugetlb code untouched: "unnecessary COW" might
  easily break existing setups because hugetlb pages are a scarce resource
  and we could just end up having to crash the application when we run out
  of hugetlb pages. We have to be very careful and the security aspect with
  hugetlb is most certainly less relevant than for unprivileged anon pages.
* Instead of lru_add_drain() we might actually just drain the lru_add list
  or even just remove the single page of interest from the lru_add list.
  This would require a new helper function, and could be added if the
  conditional lru_add_drain() turn out to be a problem.
* I extended the test case already included in [1] to also test for the
  newly found do_swap_page() case. I'll send that out separately once/if
  this part was merged.

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]

This patch (of 9):

Liang Zhang reported [1] that the current COW logic in do_wp_page() is
sub-optimal when it comes to swap+read fault+write fault of anonymous
pages that have a single user, visible via a performance degradation in
the redis benchmark.  Something similar was previously reported [2] by
Nadav with a simple reproducer.

After we put an anon page into the swapcache and unmapped it from a single
process, that process might read that page again and refault it read-only.
If that process then writes to that page, it is actually the exclusive
user of the page; however, the COW logic in do_wp_page() won't be able to
reuse it due to the additional reference from the swapcache.

Let's optimize for pages that have been added to the swapcache but only
have an exclusive user.  Try removing the swapcache reference if there is
hope that we're the exclusive user.

We will fail removing the swapcache reference in two scenarios:
(1) There are additional swap entries referencing the page: copying
    instead of reusing is the right thing to do.
(2) The page is under writeback: theoretically we might be able to reuse
    in some cases, however, we cannot remove the additional reference
    and will have to copy.

Note that we'll only try removing the page from the swapcache when it's
highly likely that we'll be the exclusive owner after removing the page
from the swapcache.  As we're about to map that page writable and redirty
it, that should not affect reclaim but is rather the right thing to do.
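
A hedged sketch of that heuristic as a helper on the write-fault path (the
helper name and its exact conditions are illustrative):

	static inline bool should_try_to_free_swap(struct page *page,
						   unsigned int fault_flags)
	{
		if (!PageSwapCache(page))
			return false;
		/*
		 * Only on a write fault, and only when the sole references
		 * are our page table mapping plus the swapcache itself
		 * (page_count() == 2), is it likely that dropping the
		 * swapcache entry leaves us as the exclusive user.
		 */
		return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
		       page_count(page) == 2;
	}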

Further, we might have additional references from the LRU pagevecs, which
will force us to copy instead of being able to reuse.  We'll try handling
such references for some scenarios next.  Concurrent writeback cannot be
handled easily and we'll always have to copy.

While at it, remove the superfluous page_mapcount() check: it's
implicitly covered by the page_count() for ordinary anon pages.

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Liang Zhang <[email protected]>
Reported-by: Nadav Amit <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Don Dutile <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
martinberlin pushed a commit that referenced this issue Oct 25, 2022
failed

[   10.939320] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[   10.940116] Mem abort info:
[   10.940373]   ESR = 0x86000006
[   10.940652]   Exception class = IABT (current EL), IL = 32 bits
[   10.941181]   SET = 0, FnV = 0
[   10.941458]   EA = 0, S1PTW = 0
[   10.941741] user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000c7870e86
[   10.942328] [0000000000000000] pgd=0000000059e40003, pud=0000000059e40003, pmd=0000000000000000
[   10.943150] Internal error: Oops: 86000006 [#1] PREEMPT SMP
[   10.943656] Modules linked in:
[   10.943940] Process servicemanager (pid: 141, stack limit = 0x000000001d4803a1)
[   10.944585] CPU: 0 PID: 141 Comm: servicemanager Not tainted 4.19.172 #1156
[   10.945192] Hardware name: Rockchip RK3566 RK817 EINK LP4X Board (DT)
[   10.945754] pstate: 60400009 (nZCv daif +PAN -UAO)
[   10.946179] pc :           (null)
[   10.946496] lr : call_timer_fn+0x2c/0x1c0
[   10.946854] sp : ffffff8008003dd0
[   10.947155] x29: ffffff8008003dd0 x28: ffffff8009ca5000
[   10.947627] x27: ffffff80099a2000 x26: 0000000000000100
[   10.948099] x25: 0000000000000010 x24: ffffff8009cadb00
[   10.948571] x23: ffffff80099a1018 x22: 0000000000000100
[   10.949043] x21: 0000000000000000 x20: ffffffc05ca13320
[   10.949515] x19: ffffffc05ca13320 x18: 0000000000000000
[   10.949986] x17: 0000000000000000 x16: 0000000000000000
[   10.950458] x15: 0000000000000000 x14: 0000000000000000

Signed-off-by: Zorro Liu <[email protected]>
Change-Id: I6bc332ac3f9904e95414f5769dc57dd742c8e7fc