Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

topsql: use new cache policy for top-n SQL #25744

Merged
merged 21 commits into from
Jun 29, 2021

Conversation

crazycs520
Copy link
Contributor

@crazycs520 crazycs520 commented Jun 24, 2021

What problem does this PR solve?

#24875 implement the TOP-SQL feature. But the cache policy was not very good. Here is an example:

suppose the config is:

set @@global.tidb_top_sql_precision_seconds=1;
set @@global.tidb_top_sql_max_statement_count=2;
set @@global.tidb_top_sql_report_interval_seconds=3;

At timestamp_1: the SQL and CPU time is:

SQL1: 100ms
SQL2: 200ms

At timestamp_2: the SQL and CPU time is:

SQL3: 300ms
SQL1: 100ms

At timestamp_3: the SQL and CPU time is:

SQL4: 100ms
SQL1: 100ms

As you can see, between timestamp_1 and timestamp_3, the total SQL and CPU time is:

SQL1: 300ms
SQL2: 200ms
SQL3: 300ms
SQL4: 100ms

So, between timestamp_1 and timestamp_3, the final TOP-2 SQL should be:

SQL1: 300ms
SQL3: 300ms

But before this PR, the final Top-2 SQL is:

SQL2: 200ms
SQL3: 300ms

In this PR, the final TOP-2 SQL is:

SQL1: 300ms
SQL3: 300ms

What is changed and how it works?

Reference: https://github.com/breeswish/topn-stores

This PR uses a new cache policy, which saves each Top-N SQL for each timestamp and calculate the final Top-N SQL between the report interval.

This PR also change the default value of tidb_top_sql_max_statement_count variable from 2000 to 200. Since 200 is big enough for one second.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test

Side effects

  • Performance regression
    • Consumes more MEM

In the default config:

tidb_top_sql_report_interval_seconds = 60;
tidb_top_sql_precision_seconds = 1;
tidb_top_sql_max_statement_count=200;

MaxSQLTextSize=4 * 1024 
MaxPlanTextSize = 32 * 1024 

In the best case, doesn't have too much kind of SQL, the memory usage enlarge 1X. The memory usage is 200*36K * 2 =
14MB.

In the worse case, each second has many different SQL digest, the memory will become 200 * 36K * 60 = 422MB.

Release note

  • No release note.

Signed-off-by: crazycs <[email protected]>
Signed-off-by: crazycs <[email protected]>
@ti-chi-bot ti-chi-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 24, 2021
@crazycs520
Copy link
Contributor Author

@dragonly @breeswish PTAL

@breezewish breezewish added the sig/diagnosis SIG: Diagnosis label Jun 25, 2021
Copy link
Contributor

@dragonly dragonly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding a holistic description about the 2-round top N eviction algorithm in the code comments.
Rest LGTM.

util/topsql/reporter/reporter.go Outdated Show resolved Hide resolved
util/topsql/reporter/reporter.go Outdated Show resolved Hide resolved
@breezewish
Copy link
Member

We need to adjust back to maximum 200 statements if we switch the evict policy.

@breezewish
Copy link
Member

Also do we need to merge with pingcap/tipb#233?

@crazycs520
Copy link
Contributor Author

We need to adjust back to maximum 200 statements if we switch the evict policy.

Nice catch, already change
to 200.

Also do we need to merge with pingcap/tipb#233?

Ok, already update the latest tipb.

util/topsql/reporter/reporter.go Outdated Show resolved Hide resolved
util/topsql/reporter/reporter.go Outdated Show resolved Hide resolved
util/topsql/reporter/reporter.go Outdated Show resolved Hide resolved
records = append(records, v)
}
var evicted []*dataPoints
records, evicted = tsr.getTopNDataPoints(records)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to not to evict it here, since immediately after that the map will be dropped entirely. Maybe we can simply ignore them at the report time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This evict is use to avoid report useless SQL/Plan. And since the data structure of evicted is a slice, it's a little bit of trouble for ignore them at the report time, since we should traverse the slice for each SQL/Plan record.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about traversing the records and generate a "retain map" first, then iterate all records in the hash map and send records that only in the retain map?

Copy link
Contributor

@dragonly dragonly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot
Copy link
Member

@dragonly: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

// When `reportDataChan` receives something, there could be ongoing `RegisterSQL` and `RegisterPlan` running,
// who writes to the data structure that `data` contains. So we wait for a little while to ensure that
// these writes are finished.
time.Sleep(time.Millisecond * 100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need this any more?

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 29, 2021
Signed-off-by: crazycs <[email protected]>
@ti-chi-bot
Copy link
Member

@tangenta: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@crazycs520 crazycs520 added the sig/sql-infra SIG: SQL Infra label Jun 29, 2021
@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • breeswish
  • tangenta

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 29, 2021
@crazycs520
Copy link
Contributor Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 3809ae7

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 29, 2021
Signed-off-by: crazycs <[email protected]>
@ti-chi-bot ti-chi-bot removed the status/can-merge Indicates a PR has been approved by a committer. label Jun 29, 2021
@crazycs520
Copy link
Contributor Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 0d12420

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 29, 2021
@ti-chi-bot
Copy link
Member

@crazycs520: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot merged commit fac17a2 into pingcap:master Jun 29, 2021
@crazycs520 crazycs520 deleted the new-top-sql-cache branch June 29, 2021 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/diagnosis SIG: Diagnosis sig/sql-infra SIG: SQL Infra size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants