
PageStorage: Fix peak memory usage when running GC on PageDirectory #6168

Merged · 11 commits · Oct 26, 2022

Conversation

JaySon-Huang
Contributor

@JaySon-Huang JaySon-Huang commented Oct 19, 2022

What problem does this PR solve?

Issue Number: close #6163

Problem Summary:

After TiFlash has been running for a long time, the snapshot file keeps growing, causing high peak memory usage.
Introduced by #5357

#5357 resolved a bug caused by a race between deleting a page and Full GC. However, that fix makes the entries in the PageDirectory snapshot file grow without bound, because it cannot determine whether there will be more edits on the pages covered by a FileSnapshot.

What is changed and how it works?

When compacting log files, we freeze the current writing log file. Because compacting log files and full GC never run concurrently, once we freeze the current writing log file, we can see every "upsert" that may happen after a "del" for the same page id. So the newly compacted log files can correctly remove the page id.
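The idea behind freezing the current writing log file can be sketched as a toy model. This is a hedged illustration, not TiFlash's actual implementation: all names (`Edit`, `EditType`, `compactFrozenLogs`) are hypothetical, and the key assumption is that an "upsert" only relocates data and never revives a deleted page.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model of PageDirectory WAL edits. All names here are illustrative,
// not TiFlash's actual API.
enum class EditType { Put, Del, Upsert };

struct Edit
{
    int page_id;
    EditType type;
};

// Compact a *frozen* run of log files. Because the current writing log file
// is frozen and full GC never runs concurrently with compaction, every
// "upsert" that can follow a "del" for the same page id is already visible
// here. An Upsert only relocates data and does not revive a deleted page,
// so a page whose latest Put/Del state is "deleted" can be dropped entirely.
std::vector<int> compactFrozenLogs(const std::vector<Edit> & edits)
{
    std::map<int, bool> alive; // page_id -> still live after replaying edits
    for (const auto & e : edits)
    {
        if (e.type == EditType::Put)
            alive[e.page_id] = true;
        else if (e.type == EditType::Del)
            alive[e.page_id] = false;
        // EditType::Upsert: data moved by full GC, liveness unchanged.
    }

    std::vector<int> live_pages;
    for (const auto & [page_id, is_alive] : alive)
        if (is_alive)
            live_pages.push_back(page_id);
    return live_pages; // sorted, since std::map iterates in key order
}
```

Without the freeze, compaction could not tell whether an "upsert" for a deleted page would still arrive in a later log file, which is why entries had to be kept conservatively before this PR.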

Another approach that does not work: avoid persisting an "upsert" after a "del". However, because PageStorage supports RefPages, this behavior is hard to avoid.

For example, to avoid persisting the wrong "upsert", we could check whether the page is deleted when the upsert entry edit is being committed to the PageDirectory, and roll back the entry edit for that page id if so.
However, PageStorage supports RefPages: a page may itself be marked as deleted while still being referenced by another page id. In that case we still need to commit the upsert entry, or "full gc" cannot reduce space amplification as expected.
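The RefPage counterexample above can be made concrete with a small sketch. This is a hypothetical model, not TiFlash's code: `mustCommitUpsert`, `hasExternalRef`, and the `ref_map` representation are all illustrative assumptions.

```cpp
#include <cassert>
#include <map>
#include <set>

// Sketch of why an "upsert" after a "del" must sometimes be committed
// anyway: with RefPages, a deleted page id may still be referenced by
// another page id, so full GC still has to relocate its data.

// ref_map: referrer page id -> the page id it points at.
bool hasExternalRef(int page_id, const std::map<int, int> & ref_map)
{
    for (const auto & [referrer, target] : ref_map)
        if (target == page_id)
            return true;
    return false;
}

// An upsert entry for `page_id` could be rolled back only when the page is
// deleted AND no RefPage still points at it; otherwise full GC could not
// reduce space amplification for the still-referenced data.
bool mustCommitUpsert(int page_id, const std::set<int> & deleted, const std::map<int, int> & ref_map)
{
    if (deleted.count(page_id) == 0)
        return true; // page is live, the upsert is plainly needed
    return hasExternalRef(page_id, ref_map);
}
```

The "deleted but still referenced" case is exactly what defeats the rollback approach: the commit path cannot simply discard upserts for deleted pages.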


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Run the HeavySkewWriteRead pageworkload test (based on #6178). Before this PR, the WAL snapshot file keeps growing because deleted pages are never removed from the snapshot file. After this PR, the WAL snapshot file only keeps the available pages.

Before this PR

... ["Dumped directory snapshot ... [num_records=1085673] ... [size=76022852]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1125073] ... [size=78780852]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1168805] ... [size=81842092]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1213833] ... [size=84994052]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1258217] ... [size=88100932]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1307151] ... [size=91526312]."] [source=stress_test] [thread_id=23]

After this PR

... ["Dumped directory snapshot ... [num_records=754] ... [size=78420]."] [source=stress_test] [thread_id=24]
... ["Dumped directory snapshot ... [num_records=754] ... [size=78420]."] [source=stress_test] [thread_id=24]
... ["Dumped directory snapshot ... [num_records=755] ... [size=78524]."] [source=stress_test] [thread_id=24]
... ["Dumped directory snapshot ... [num_records=754] ... [size=78420]."] [source=stress_test] [thread_id=24]
... ["Dumped directory snapshot ... [num_records=754] ... [size=78420]."] [source=stress_test] [thread_id=24]

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Fix the bug that causes high peak memory usage when running GC

@ti-chi-bot
Member

ti-chi-bot commented Oct 19, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • hehechen
  • lidezhu

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. needs-cherry-pick-release-6.3 size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 19, 2022
@JaySon-Huang
Contributor Author

/run-all-tests

@JaySon-Huang
Contributor Author

Unit tests are added, but some full-stack tests are still needed and the metrics need polishing

@JaySon-Huang
Contributor Author

/run-all-tests

Contributor

@lidezhu lidezhu left a comment

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 25, 2022
@JaySon-Huang
Contributor Author

JaySon-Huang commented Oct 26, 2022

Metrics (grafana panels with this PR: #6175):
[image]
Before this fix:
[image]
After this fix:
[image]

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Oct 26, 2022
@JaySon-Huang
Contributor Author

/merge

@ti-chi-bot
Member

@JaySon-Huang: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: a8d677d

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Oct 26, 2022
@ti-chi-bot
Member

@JaySon-Huang: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@JaySon-Huang
Contributor Author

/run-unit-test

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #6184.

ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Oct 26, 2022
@JaySon-Huang JaySon-Huang deleted the fix_directory_snapshot branch October 26, 2022 07:49
Development

Successfully merging this pull request may close these issues.

Compaction on PageStorage WAL causes OOM