transparent huge pages breaks KVM on AMD.
On Fri, May 13, 2016 at 07:23:34AM +0200, Marc Haber wrote:
> How do I apply this?

I'm attaching it.

$ patch -p1 --dry-run -i /tmp/01-mm-thp-calculate_the_mapcount_correctly_for_thp_pages_during_wp_faults.patch
checking file include/linux/mm.h
checking file include/linux/swap.h
checking file mm/huge_memory.c
checking file mm/memory.c
checking file mm/swapfile.c
$ patch -p1 -i /tmp/01-mm-thp-calculate_the_mapcount_correctly_for_thp_pages_during_wp_faults.patch
patching file include/linux/mm.h
patching file include/linux/swap.h
patching file mm/huge_memory.c
patching file mm/memory.c
patching file mm/swapfile.c

The --dry-run is to check first whether the patch applies.

That's on 4.6-rc7+ here.

HTH.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

From [email protected] Tue May 10 21:21:32 2016
From: Andrea Arcangeli <[email protected]>
To: Andrew Morton <[email protected]>, [email protected],
 [email protected]
Cc: Alex Williamson <[email protected]>, "Kirill A. Shutemov"
 <[email protected]>
Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
 during WP faults
Date:	Tue, 10 May 2016 21:21:22 +0200
Message-Id: <[email protected]>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8
Status: RO

This provides full accuracy in the mapcount calculation for
write-protect faults, so page pinning will not be broken by
false-positive copy-on-writes.

total_mapcount() isn't the calculation needed in reuse_swap_page(),
so this introduces page_trans_huge_mapcount(), which is effectively
the fully accurate value page_mapcount() should return when dealing
with Transparent Hugepages. However, page_trans_huge_mapcount() is
only used during COW faults, where that accuracy is strictly needed,
because of its higher runtime cost.

This also provides, at practically zero cost, the total_mapcount
information, which is needed to know whether we can still relocate
the page anon_vma to the local vma. If page_trans_huge_mapcount()
returns 1 we can reuse the page no matter whether a pte or a
pmd_trans_huge triggered the fault, but we can only relocate the page
anon_vma to the local vma->anon_vma if we're sure that only this
"vma" maps the whole THP physical range.

Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.

Andrew Morton discovered that a previous version wouldn't build with
CONFIG_SWAP=n, because reuse_swap_page must be a macro in order to
call page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. At least with this change the macro is
less dangerous than before, because "page" is now used only once;
with the previous code, reuse_swap_page(page++) would have called
page_mapcount on page+1 and incremented page twice instead of just
once.
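
The hazard can be illustrated with a trivial userspace sketch
(REUSE_OLD/REUSE_NEW, is_compound() and mapcount() are stand-ins for
the kernel macros and helpers, not their real definitions): the
old-style macro expands its argument twice, so an argument with side
effects such as page++ is evaluated twice and the mapcount check runs
on the wrong page, while the new-style macro expands the argument
exactly once.

#include <stdio.h>

static int is_compound(int v) { return v < 0; }	/* stand-in for PageTransCompound() */
static int mapcount(int v)    { return v; }	/* stand-in for page_mapcount() */

/* Old style: the argument appears twice in the expansion. */
#define REUSE_OLD(p)	(!is_compound(*(p)) && mapcount(*(p)) == 1)
/* New style: the argument appears exactly once. */
#define REUSE_NEW(p)	(mapcount(*(p)) == 1)

int main(void)
{
	int pages[2] = { 1, 5 };	/* pretend mapcounts of two adjacent pages */
	int *p;
	int reuse;

	p = pages;
	reuse = REUSE_OLD(p++);		/* p++ expands twice: mapcount() reads
					 * pages[1] and p advances by two */
	printf("old macro: reuse=%d, p advanced by %td\n", reuse, p - pages);

	p = pages;
	reuse = REUSE_NEW(p++);		/* the argument is evaluated exactly once */
	printf("new macro: reuse=%d, p advanced by %td\n", reuse, p - pages);
	return 0;
}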

Reviewed-by: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ebru Akagunduz <[email protected]>
Cc: Geliang Tang <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Xie XiuQi <[email protected]>
Cc: linux-mm <[email protected]>
Cc: lkml <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
bp3tk0v authored and 0day robot committed May 13, 2016
1 parent 44549e8 commit f7d3d07
Showing 5 changed files with 91 additions and 26 deletions.
9 changes: 9 additions & 0 deletions include/linux/mm.h
@@ -500,11 +500,20 @@ static inline int page_mapcount(struct page *page)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int total_mapcount(struct page *page);
int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
#else
static inline int total_mapcount(struct page *page)
{
return page_mapcount(page);
}
static inline int page_trans_huge_mapcount(struct page *page,
int *total_mapcount)
{
int mapcount = page_mapcount(page);
if (total_mapcount)
*total_mapcount = mapcount;
return mapcount;
}
#endif

static inline struct page *virt_to_head_page(const void *x)
6 changes: 3 additions & 3 deletions include/linux/swap.h
@@ -418,7 +418,7 @@ extern sector_t swapdev_block(int, pgoff_t);
extern int page_swapcount(struct page *);
extern int swp_swapcount(swp_entry_t entry);
extern struct swap_info_struct *page_swap_info(struct page *);
extern int reuse_swap_page(struct page *);
extern bool reuse_swap_page(struct page *, int *);
extern int try_to_free_swap(struct page *);
struct backing_dev_info;

@@ -513,8 +513,8 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0;
}

#define reuse_swap_page(page) \
(!PageTransCompound(page) && page_mapcount(page) == 1)
#define reuse_swap_page(page, total_mapcount) \
(page_trans_huge_mapcount(page, total_mapcount) == 1)

static inline int try_to_free_swap(struct page *page)
{
67 changes: 58 additions & 9 deletions mm/huge_memory.c
@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
/*
* We can only reuse the page if nobody else maps the huge page or it's
* part. We can do it by checking page_mapcount() on each sub-page, but
* it's expensive.
* The cheaper way is to check page_count() to be equal 1: every
* mapcount takes page reference reference, so this way we can
* guarantee, that the PMD is the only mapping.
* This can give false negative if somebody pinned the page, but that's
* fine.
* part.
*/
if (page_mapcount(page) == 1 && page_count(page) == 1) {
if (page_trans_huge_mapcount(page, NULL) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -2079,7 +2073,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
if (pte_write(pteval)) {
writable = true;
} else {
if (PageSwapCache(page) && !reuse_swap_page(page)) {
if (PageSwapCache(page) &&
!reuse_swap_page(page, NULL)) {
unlock_page(page);
result = SCAN_SWAP_CACHE_PAGE;
goto out;
@@ -3222,6 +3217,60 @@ int total_mapcount(struct page *page)
return ret;
}

/*
* This calculates accurately how many mappings a transparent hugepage
* has (unlike page_mapcount() which isn't fully accurate). This full
* accuracy is primarily needed to know if copy-on-write faults can
* reuse the page and change the mapping to read-write instead of
* copying them. At the same time this returns the total_mapcount too.
*
* The function returns the highest mapcount any one of the subpages
* has. If the return value is one, even if different processes are
* mapping different subpages of the transparent hugepage, they can
* all reuse it, because each process is reusing a different subpage.
*
* The total_mapcount is instead counting all virtual mappings of the
* subpages. If the total_mapcount is equal to "one", it tells the
* caller all mappings belong to the same "mm" and in turn the
* anon_vma of the transparent hugepage can become the vma->anon_vma
* local one as no other process may be mapping any of the subpages.
*
* It would be more accurate to replace page_mapcount() with
* page_trans_huge_mapcount(), however we only use
* page_trans_huge_mapcount() in the copy-on-write faults where we
* need full accuracy to avoid breaking page pinning, because
* page_trans_huge_mapcount() is slower than page_mapcount().
*/
int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
{
int i, ret, _total_mapcount, mapcount;

/* hugetlbfs shouldn't call it */
VM_BUG_ON_PAGE(PageHuge(page), page);

if (likely(!PageTransCompound(page)))
return atomic_read(&page->_mapcount) + 1;

page = compound_head(page);

_total_mapcount = ret = 0;
for (i = 0; i < HPAGE_PMD_NR; i++) {
mapcount = atomic_read(&page[i]._mapcount) + 1;
ret = max(ret, mapcount);
_total_mapcount += mapcount;
}
if (PageDoubleMap(page)) {
ret -= 1;
_total_mapcount -= HPAGE_PMD_NR;
}
mapcount = compound_mapcount(page);
ret += mapcount;
_total_mapcount += mapcount;
if (total_mapcount)
*total_mapcount = _total_mapcount;
return ret;
}

/*
* This function splits huge page into normal pages. @page can point to any
* subpage of huge page to split. Split doesn't change the position of @page.
22 changes: 14 additions & 8 deletions mm/memory.c
@@ -2373,6 +2373,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
* not dirty accountable.
*/
if (PageAnon(old_page) && !PageKsm(old_page)) {
int total_mapcount;
if (!trylock_page(old_page)) {
get_page(old_page);
pte_unmap_unlock(page_table, ptl);
@@ -2387,13 +2388,18 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
put_page(old_page);
}
if (reuse_swap_page(old_page)) {
/*
* The page is all ours. Move it to our anon_vma so
* the rmap code will not search our parent or siblings.
* Protected against the rmap code by the page lock.
*/
page_move_anon_rmap(old_page, vma, address);
if (reuse_swap_page(old_page, &total_mapcount)) {
if (total_mapcount == 1) {
/*
* The page is all ours. Move it to
* our anon_vma so the rmap code will
* not search our parent or siblings.
* Protected against the rmap code by
* the page lock.
*/
page_move_anon_rmap(compound_head(old_page),
vma, address);
}
unlock_page(old_page);
return wp_page_reuse(mm, vma, address, page_table, ptl,
orig_pte, old_page, 0, 0);
@@ -2617,7 +2623,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
inc_mm_counter_fast(mm, MM_ANONPAGES);
dec_mm_counter_fast(mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
13 changes: 7 additions & 6 deletions mm/swapfile.c
@@ -922,18 +922,19 @@ int swp_swapcount(swp_entry_t entry)
* to it. And as a side-effect, free up its swap: because the old content
* on disk will never be read, and seeking back there to write new content
* later would only waste time away from clustering.
*
* NOTE: total_mapcount should not be relied upon by the caller if
* reuse_swap_page() returns false, but it may be always overwritten
* (see the other implementation for CONFIG_SWAP=n).
*/
int reuse_swap_page(struct page *page)
bool reuse_swap_page(struct page *page, int *total_mapcount)
{
int count;

VM_BUG_ON_PAGE(!PageLocked(page), page);
if (unlikely(PageKsm(page)))
return 0;
/* The page is part of THP and cannot be reused */
if (PageTransCompound(page))
return 0;
count = page_mapcount(page);
return false;
count = page_trans_huge_mapcount(page, total_mapcount);
if (count <= 1 && PageSwapCache(page)) {
count += page_swapcount(page);
if (count == 1 && !PageWriteback(page)) {
