Stop clearing access bits during page migration #6

Closed
wants to merge 90 commits

Conversation

apopple-nvidia

UVM uses the migrate_vma functionality to handle some page faults. Prior to a fix in kernel v6.1, this would clear the PTE access flag, leading to increased faults. This pull request backports the patches required to implement the fix.
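For context, a minimal sketch of the direction of the fix (simplified remove_migration_pte()-style code; it assumes the A/D-bit migration-entry helpers introduced by the backported patches below and is not the exact diff):

  /* Restore side: keep the cached access/dirty bits instead of always installing an "old" pte. */
  swp_entry_t entry = pte_to_swp_entry(*pvmw.pte);
  pte_t pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));

  if (is_migration_entry_young(entry))
          pte = pte_mkyoung(pte);
  if (is_migration_entry_dirty(entry))
          pte = pte_mkdirty(pte);
  set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);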

ianmay81 and others added 30 commits January 13, 2023 15:27
Ignore: yes
Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Dimitri John Ledkov <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1992162
Properties: no-test-build
Signed-off-by: Dimitri John Ledkov <[email protected]>
Ignore: yes
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1996300
Properties: no-test-build
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Ignore: yes
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1998795
Properties: no-test-build
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1999750

Currently, Ubuntu Linux kernel header packages use the stock Makefile as
shipped by the upstream kernel. During the linux kernel package build,
however, HOSTCC and CC are typically overridden to use an explicit
compiler version, $(DEB_HOST_GNU_TYPE)-gcc-12. This can lead to dkms
module build failures because, despite all efforts to reuse the matching
gcc version out of .config, various shell scripts / build systems /
makefiles do not pass the CC variable as a make variable to the final
make call that is used to build dkms modules. To avoid this, hardcode
the correct compiler in the Makefile shipped in the linux headers
package. This is similar to the Makefile includes that Debian ships,
albeit with fewer indirections.

Tested by applying this patch to the hwe-5.19 kernel, rebuilding it,
and verifying that dkms module builds now use the expected compilers.

Signed-off-by: Dimitri John Ledkov <[email protected]>
Acked-by: Kleber Sacilotto de Souza <[email protected]>
Acked-by: Stefan Bader <[email protected]>
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Ignore: yes
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1999745
Properties: no-test-build
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2002679

This reverts commit e96b739.

In order to use the same compiler as used for kinetic:linux (gcc-12) we have
made some changes to explicitly compile jammy:linux-hwe-5.19 with the same
compiler. However, such changes are breaking the build of a number of dkms
packages. Revert this commit, which would make the kernel compile with the
default GCC in Jammy (gcc-11), while a proper solution is being worked on.

Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Ignore: yes
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2002679

Changing the gcc version used for the build from gcc-12 to gcc (11)
removed CONFIG_INIT_STACK_ALL_ZERO and CONFIG_SHADOW_CALL_STACK as
options in the kernel config. Update the annotations file accordingly,
keeping the annotations instead of simply removing them, so that when
the configs are re-enabled in the future they remain as a reminder to
update the annotations.

Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2001755
Properties: no-test-build
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
nvidia-bfigg and others added 28 commits January 20, 2023 16:17
BugLink: https://bugs.launchpad.net/bugs/2002574

Tegra234 and later chips have a hardware feature that detects the
wait state raised by the slave, as mentioned in the TCG client doc.

Signed-off-by: Krishna Yarlagadda <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2002574

The Trusted Platform Module defines flow control where the slave will
drive the data line at specified clock cycles. Tegra241 has TPM wait
state detection support when using combined sequence transfers. Enable
the feature based on a DT flag.
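A rough sketch of how such a feature might be gated on a DT flag (the property name, field name, and flag below are illustrative assumptions, not taken from the actual patch):

  /* Hypothetical names: "nvidia,tpm-wait-polling" and TEGRA_SPI_TPM_WAIT are illustrative only. */
  tspi->tpm_wait_polling = of_property_read_bool(spi->dev.of_node,
                                                 "nvidia,tpm-wait-polling");
  if (tspi->tpm_wait_polling)
          command |= TEGRA_SPI_TPM_WAIT; /* enable hardware wait-state detection */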

Signed-off-by: Krishna Yarlagadda <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Brad Figg <[email protected]>
CONFIG_UBUNTU_ODM_DRIVERS to 'n'

Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
Commit e5251fd ("mm/hugetlb: introduce set_huge_swap_pte_at()
helper") added set_huge_swap_pte_at() to handle swap entries on
architectures that support hugepages consisting of contiguous ptes.
Currently, set_huge_swap_pte_at() is only overridden by arm64.

set_huge_swap_pte_at() provides a sz parameter to help determine the number
of entries to be updated.  But in fact, all hugetlb swap entries contain
pfn information, so we can find the corresponding folio through the pfn
recorded in the swap entry, and the folio size tells us how many entries
need to be updated.

Considering that users can easily cause bugs by ignoring the difference
between set_huge_swap_pte_at() and set_huge_pte_at(), let's handle swap
entries in set_huge_pte_at() and remove set_huge_swap_pte_at().  Then we
can call set_huge_pte_at() anywhere, which simplifies our coding.
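A simplified sketch of the resulting arm64 direction, based on the description above (not the verbatim patch):

  void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                       pte_t *ptep, pte_t pte)
  {
          size_t pgsize;
          int i, ncontig;

          if (!pte_present(pte)) {
                  /* Hugetlb swap entries carry a pfn, so the backing page's size
                   * tells us how many contiguous entries to update; no sz needed. */
                  struct page *page = pfn_swap_entry_to_page(pte_to_swp_entry(pte));

                  ncontig = num_contig_ptes(page_size(page), &pgsize);
                  for (i = 0; i < ncontig; i++, ptep++)
                          set_pte_at(mm, addr, ptep, pte);
                  return;
          }
          /* ... present-pte handling unchanged ... */
  }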

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Qi Zheng <[email protected]>
Acked-by: Muchun Song <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit 18f3962)
Signed-off-by: Alistair Popple <[email protected]>
…n_pmd

vma->vm_page_prot is read locklessly from the rmap_walk and may be updated
concurrently.  Use READ_ONCE() to prevent the risk of reading intermediate
values.
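A minimal sketch of the change in question (illustrative):

  /* Snapshot vm_page_prot once; the rmap walk reads it without the mmap lock. */
  pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));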

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit 4286f14)
Signed-off-by: Alistair Popple <[email protected]>
Replace all the magic "5" with the macro.

Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Huang Ying <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit 9c61d53)
Signed-off-by: Alistair Popple <[email protected]>
swapops.h contains quite a few layers of ifdefs, and some of the "else" and
"endif" lines don't get a comment naming the macro, so it's hard to follow
what they refer to.  Add the comments.
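The convention being added, sketched on one nesting level:

  #ifdef CONFIG_MIGRATION
  /* ... migration swap entry helpers ... */
  #else  /* CONFIG_MIGRATION */
  /* ... stubs ... */
  #endif /* CONFIG_MIGRATION */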

Suggested-by: Nadav Amit <[email protected]>
Reviewed-by: Huang Ying <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit eba4d77)
Signed-off-by: Alistair Popple <[email protected]>
We've got a bunch of special swap entries that store a PFN inside the swap
offset field.  To fetch the PFN, the user normally just calls swp_offset(),
assuming that will be the PFN.

Add a helper swp_offset_pfn() to fetch the PFN instead.  It fetches only the
maximum possible length of a PFN on the host, and a BUILD_BUG_ON() in
is_pfn_swap_entry() checks against MAX_PHYSMEM_BITS to make sure the swap
offset can always store the PFN properly.

One reason to do so is that we never tried to sanitize whether the swap
offset can really fit a PFN.  In the meantime, this patch also prepares for
the future possibility of storing more information inside the swp offset
field, so the assumption that "swp_offset(entry)" is the PFN will very soon
no longer hold.

Replace many of the swp_offset() callers with swp_offset_pfn() where
appropriate.  Note that many of the existing users are not candidates for
the replacement, e.g.:

  (1) when the swap entry is not a pfn swap entry at all, or,
  (2) when we want to keep the whole swp_offset but only change the swp type.

For the latter, this can happen when fork() is triggered on a write-migration
swap entry pte: we may want to only change the migration type from
write to read but keep the rest, so it's not "fetching the PFN" but "changing
the swap type only".  These cases are left aside so that when more
information is stored within the swp offset it will be carried over naturally.

While at it, drop hwpoison_entry_to_pfn(), because that's exactly what
the new swp_offset_pfn() is about.
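A minimal sketch of the new helper, assuming the SWP_PFN_MASK layout described by the later patches in this series:

  static inline unsigned long swp_offset_pfn(swp_entry_t entry)
  {
          VM_BUG_ON(!is_pfn_swap_entry(entry));
          return swp_offset(entry) & SWP_PFN_MASK;
  }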

Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit 0d206b5)
Signed-off-by: Alistair Popple <[email protected]>
Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
shouldn't be a correctness issue, since when pmd_dirty() is set the page is
marked dirty anyway; however, having the dirty bit carried over helps the
next initial writes to the split ptes on some archs like x86.
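Sketch of the split path with the bit carried over (simplified, not the exact hunk):

  bool dirty = pmd_dirty(old_pmd);
  /* ... per-pte loop in __split_huge_pmd_locked() ... */
  entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
  if (dirty)
          entry = pte_mkdirty(entry);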

Reviewed-by: Huang Ying <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit 0ccf7f1)
Signed-off-by: Alistair Popple <[email protected]>
When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable and mark the page as old in the new page table using
either pte_mkold() or pmd_mkold(), keeping the pte clean.

That's fine functionally, but it's not friendly to page reclaim because the
page being moved can be actively accessed during the procedure.  Not to
mention that hardware setting the young bit can bring quite some overhead on
some systems; e.g. x86_64 needs a few hundred nanoseconds to set the bit.
The same slowdown applies to the dirty bit when the memory is first written
after page migration.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated.  To achieve this, define a new set
of bits in the migration swap offset field to cache the A/D bits of the old
pte.  Then, when removing/recovering the migration entry, we can recover the
A/D bits even if the page has changed.

One thing to mention is that we use max_swapfile_size() to detect how many
swp offset bits we have, and we only enable this feature if we know the swp
offset is big enough to store both the PFN value and the A/D bits.
Otherwise the A/D bits are dropped as before.
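Sketch of the store side (try_to_migrate_one()-style code), assuming the helpers added by this patch:

  if (pte_young(pteval))
          entry = make_migration_entry_young(entry);
  if (pte_dirty(pteval))
          entry = make_migration_entry_dirty(entry);
  swp_pte = swp_entry_to_pte(entry);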

Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit 2e34687)
Signed-off-by: Alistair Popple <[email protected]>
We used to have max_swapfile_size() fetching the maximum swapfile size
per-arch.

As the callers of max_swapfile_size() grow, this patch introduces a variable
"swapfile_maximum_size" and caches the value of the old max_swapfile_size(),
so that we don't need to calculate the value every time.

Caching the value in swapfile_init() is safe because by the time we reach
that phase we should have initialized all the relevant information.  The
major arch to take care of here is x86, which defines the max swapfile size
based on the L1TF mitigation.

Both X86_BUG_L1TF and l1tf_mitigation should have been set up properly by
the time we reach swapfile_init().  As a reference, the code path looks like
this for x86:

- start_kernel
  - setup_arch
    - early_cpu_init
      - early_identify_cpu --> setup X86_BUG_L1TF
  - parse_early_param
    - l1tf_cmdline --> set l1tf_mitigation
  - check_bugs
    - l1tf_select_mitigation --> set l1tf_mitigation
  - arch_call_rest_init
    - rest_init
      - kernel_init
        - kernel_init_freeable
          - do_basic_setup
            - do_initcalls --> calls swapfile_init() (initcall level 4)

The swapfile size only depends on the swp pte format on non-x86 archs, so
caching it is safe there too.

While at it, rename max_swapfile_size() to arch_max_swapfile_size(), because
an arch can define its own function, so it's more straightforward to have
"arch_" as its prefix.  In the meantime, export swapfile_maximum_size to
replace the old usages of max_swapfile_size().
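A sketch of the resulting shape (simplified; the exact generic arch_max_swapfile_size() body is assumed here):

  unsigned long swapfile_maximum_size;

  unsigned long __weak arch_max_swapfile_size(void)
  {
          return swp_offset(pte_to_swp_entry(
                  swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
  }

  static int __init swapfile_init(void)
  {
          /* Compute once at initcall time instead of on every call. */
          swapfile_maximum_size = arch_max_swapfile_size();
          return 0;
  }
  subsys_initcall(swapfile_init);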

Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit be45a49)
Signed-off-by: Alistair Popple <[email protected]>
Introduce a variable swap_migration_ad_supported to cache whether the arch
supports swap migration A/D bits.

One thing to mention is that SWP_MIG_TOTAL_BITS internally references the
other macro MAX_PHYSMEM_BITS, which is a function call on x86 (and a
constant on all other archs).

It's safe to reference it in swapfile_init() because by then we're already
at initcall level 4, so the 5-level pgtable for x86_64 must already be
initialized (right after early_identify_cpu() finishes).

- start_kernel
  - setup_arch
    - early_cpu_init
      - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
      - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
  - arch_call_rest_init
    - rest_init
      - kernel_init
        - kernel_init_freeable
          - do_basic_setup
            - do_initcalls --> calls swapfile_init() (initcall level 4)

This should slightly speed up migration swap entry handling.
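Sketch of the check done in swapfile_init(), assuming SWP_MIG_TOTAL_BITS from the earlier A/D-bit patch:

  #ifdef CONFIG_MIGRATION
          if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
                  swap_migration_ad_supported = true;
  #endif  /* CONFIG_MIGRATION */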

Signed-off-by: Peter Xu <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit 5154e60)
Signed-off-by: Alistair Popple <[email protected]>
We use "unsigned long" to store a PFN in the kernel and phys_addr_t to
store a physical address.

On a 64bit system, both are 64bit wide.  However, on a 32bit system, the
latter might be 64bit wide.  This is, for example, the case on x86 with
PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only spans
32bit.

The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses
that case, and assumes that the maximum PFN is limited by a 32bit
phys_addr_t.  This implies that SWP_PFN_BITS will currently only be able
to cover 4 GiB - 1 on any 32bit system with a 4k page size, which is wrong.

Let's rely on the number of bits in phys_addr_t instead, but make sure to
not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in
is_pfn_swap_entry() unhappy.  Note that swp_entry_t is effectively an
unsigned long and the maximum swap offset shares that value with the swap
type.
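Sketch of the corrected definitions, essentially what the fix describes (bounded by phys_addr_t, capped at the swap offset width):

  #define SWP_PFN_BITS    min_t(int, \
                                sizeof(phys_addr_t) * 8 - PAGE_SHIFT, \
                                SWP_TYPE_SHIFT)
  #define SWP_PFN_MASK    (BIT(SWP_PFN_BITS) - 1)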

For example, on an 8 GiB x86 PAE system with a kernel config based on
Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently
fail removing migration entries (remove_migration_ptes()), because
mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as
swp_offset_pfn() wrongly masks off PFN bits.  For example,
split_huge_page_to_list()->...->remap_page() will leave migration entries
in place and continue to unlock the page.

Later, when we stumble over these migration entries (e.g., via
/proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these
migration entries shouldn't exist anymore and the page was unlocked.

[   33.067591] kernel BUG at include/linux/swapops.h:497!
[   33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G            E      6.1.0-rc8+ #16
[   33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[   33.067606] EIP: pagemap_pmd_range+0x644/0x650
[   33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
[   33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000
[   33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68
[   33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
[   33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0
[   33.067625] Call Trace:
[   33.067628]  ? madvise_free_pte_range+0x720/0x720
[   33.067632]  ? smaps_pte_range+0x4b0/0x4b0
[   33.067634]  walk_pgd_range+0x325/0x720
[   33.067637]  ? mt_find+0x1d6/0x3a0
[   33.067641]  ? mt_find+0x1d6/0x3a0
[   33.067643]  __walk_page_range+0x164/0x170
[   33.067646]  walk_page_range+0xf9/0x170
[   33.067648]  ? __kmem_cache_alloc_node+0x2a8/0x340
[   33.067653]  pagemap_read+0x124/0x280
[   33.067658]  ? default_llseek+0x101/0x160
[   33.067662]  ? smaps_account+0x1d0/0x1d0
[   33.067664]  vfs_read+0x90/0x290
[   33.067667]  ? do_madvise.part.0+0x24b/0x390
[   33.067669]  ? debug_smp_processor_id+0x12/0x20
[   33.067673]  ksys_pread64+0x58/0x90
[   33.067675]  __ia32_sys_ia32_pread64+0x1b/0x20
[   33.067680]  __do_fast_syscall_32+0x4c/0xc0
[   33.067683]  do_fast_syscall_32+0x29/0x60
[   33.067686]  do_SYSENTER_32+0x15/0x20
[   33.067689]  entry_SYSENTER_32+0x98/0xf1

Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep it
readable and consistent.

[[email protected]: rely on sizeof(phys_addr_t) and min_t() instead]
  Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: use "int" for comparison, as we're only comparing numbers < 64]
  Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0d206b5 ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Peter Xu <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
nvbug: https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/4067536
(cherry picked from commit 630dc25)
Signed-off-by: Alistair Popple <[email protected]>