Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor,metrics: add a metric for observing execution phases #35906

Merged
merged 20 commits into from
Jul 21, 2022

Conversation

zyguan
Copy link
Contributor

@zyguan zyguan commented Jul 4, 2022

Signed-off-by: zyguan [email protected]

What problem does this PR solve?

Issue Number: ref #34106

Problem Summary:

The execution duraition is the main part of the database time, however it cannot be broken down (in wall time) by current metrics.

What is changed and how it works?

The execution process can be described as the following.

2022-07-04_095849

The main parts are build, open, next, lock and commit and we may retry build -> open -> next [-> lock] when a pessimistic lock error returned. Thus these phases (except for commit) can be further split into two parts:

  1. trying to lock keys (but failed)
  2. the final iteration of the retry loop

By observing durations of these phases, we can identify some typical issues easily, eg:

  1. too much time spent on building executor (typically caused by waiting tso)
    2022-07-04_125624

  2. lock contention
    2022-07-04_125656

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Jul 4, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • cfzjywxk
  • you06

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 4, 2022
@sre-bot
Copy link
Contributor

sre-bot commented Jul 4, 2022

@zyguan zyguan added the sig/diagnosis SIG: Diagnosis label Jul 5, 2022
executor/adapter.go Outdated Show resolved Hide resolved
@zyguan zyguan marked this pull request as ready for review July 15, 2022 10:46
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 15, 2022
func (a *ExecStmt) observePhaseDurations(internal bool, commitDetails *util.CommitDetails) {
if d := a.phaseBuildDurations[0]; d > 0 {
if internal {
metrics.ExecPhaseDuration.WithLabelValues("build:final", "1").Observe(d.Seconds())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could abstract the label constants and pre-define the related metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

executor/adapter.go Show resolved Hide resolved
execBuildFinal = metrics.ExecPhaseDuration.WithLabelValues("build:final", "0")
execOpenFinal = metrics.ExecPhaseDuration.WithLabelValues("open:final", "0")
execNextFinal = metrics.ExecPhaseDuration.WithLabelValues("next:final", "0")
execLockFinal = metrics.ExecPhaseDuration.WithLabelValues("lock:final", "0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the next operation maybe we need to integrate more with the kv client metrics in the future.

executor/adapter.go Outdated Show resolved Hide resolved
@@ -724,6 +736,7 @@ func (a *ExecStmt) handlePessimisticDML(ctx context.Context, e Executor) error {
ctx = context.WithValue(ctx, util.LockKeysDetailCtxKey, &lockKeyStats)
startLocking := time.Now()
err = txn.LockKeys(ctx, lockCtx, keys...)
a.phaseLockDurations[0] += time.Since(startLocking)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point get and batch point get will lock keys internally, that duration will not be observed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, here the lock phase is only for pessimistic DMLs that have extra keys to lock. The internal lock key duration (of point-get, point-update, etc) is counted in next phase.

Comment on lines +941 to +955
phaseBuildLocking = "build:locking"
phaseOpenLocking = "open:locking"
phaseNextLocking = "next:locking"
phaseLockLocking = "lock:locking"
phaseBuildFinal = "build:final"
phaseOpenFinal = "open:final"
phaseNextFinal = "next:final"
phaseLockFinal = "lock:final"
phaseCommitPrewrite = "commit:prewrite"
phaseCommitCommit = "commit:commit"
phaseCommitWaitCommitTS = "commit:wait:commit-ts"
phaseCommitWaitLatestTS = "commit:wait:latest-ts"
phaseCommitWaitLatch = "commit:wait:local-latch"
phaseCommitWaitBinlog = "commit:wait:prewrite-binlog"
phaseWriteResponse = "write-response"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfzjywxk @you06 @sticnarf @longfangsong Any suggestion about naming? eg. build:locking and locking:build, which is better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build:locking IMO, it's better to put the related metrics together in the alphabetic order.

@zyguan
Copy link
Contributor Author

zyguan commented Jul 19, 2022

There are still some unknown spans. I found diagnosis (stmt summary & top sql, not included in this PR) may cost more time than writing response and there is 5.24s database time (execute:other) I cannot explain currently. Since the total duation (5.24s+1.65s) accounts for only 3.64% database time, I think we can just ignore them for now.
2022-07-19_163616

@cfzjywxk
Copy link
Contributor

cfzjywxk commented Jul 20, 2022

diagnosis (stmt summary & top sql, not included in this PR) may cost more time than writing response

Seems the workload does not return much data to the client?

@cfzjywxk
Copy link
Contributor

cfzjywxk commented Jul 20, 2022

@zyguan
Usually, the next operation may take much time, maybe we need to find a way to integrate it with the kv client durations later.
Also after merging this pr the performance-map needs to be updated.

@@ -46,4 +46,13 @@ var (
Name: "statement_db_total",
Help: "Counter of StmtNode by Database.",
}, []string{LblDb, LblType})

// ExecPhaseDuration records the duration of each execution phase.
ExecPhaseDuration = prometheus.NewSummaryVec(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why summary is used here? I have never seen summary type metrics used in TiDB before and histogram is always used....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the size of metrics data (histogram = summary + buckets). IMO, we won't care too much about sth like "what's the p99 latency of next phase". There are too many kinds of executors (as well as their combinations), some may be fast and other may be extremely slow, a higher or lower p99 latency may not provide more info (we do not known about the distribution of each kind of executors). Besides, it's hard to decide buckets here, some phases (like open) take very little time, but phases like lock may cost a few seconds.

@zyguan
Copy link
Contributor Author

zyguan commented Jul 20, 2022

@zyguan Usually, the next operation may take much time, maybe we need to find a way to integrate it with the kv client durations later. Also after merging this pr the performance-map needs to be updated.

Yes, we need to figure out a way to show these durations in wall time, since kv client / executors may run concurrently.

Copy link
Contributor

@cfzjywxk cfzjywxk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 20, 2022
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jul 21, 2022
@cfzjywxk
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 9e97b78

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jul 21, 2022
@sticnarf
Copy link
Contributor

/run-all-tests

@ti-chi-bot ti-chi-bot merged commit 23f25af into pingcap:master Jul 21, 2022
@sre-bot
Copy link
Contributor

sre-bot commented Jul 21, 2022

TiDB MergeCI notify

🔴 Bad News! New failing [1] after this pr merged.
These new failed integration tests seem to be caused by the current PR, please try to fix these new failed integration tests, thanks!

CI Name Result Duration Compare with Parent commit
idc-jenkins-ci-tidb/integration-ddl-test 🟥 failed 1, success 5, total 6 6 min 35 sec New failing
idc-jenkins-ci-tidb/integration-common-test 🔴 failed 3, success 8, total 11 48 min Existing failure
idc-jenkins-ci/integration-cdc-test 🟢 all 36 tests passed 27 min Existing passed
idc-jenkins-ci-tidb/common-test 🟢 all 12 tests passed 17 min Existing passed
idc-jenkins-ci-tidb/sqllogic-test-2 🟢 all 28 tests passed 6 min 25 sec Existing passed
idc-jenkins-ci-tidb/sqllogic-test-1 🟢 all 26 tests passed 5 min 55 sec Existing passed
idc-jenkins-ci-tidb/tics-test 🟢 all 1 tests passed 5 min 46 sec Existing passed
idc-jenkins-ci-tidb/mybatis-test 🟢 all 1 tests passed 3 min 29 sec Existing passed
idc-jenkins-ci-tidb/integration-compatibility-test 🟢 all 1 tests passed 3 min 11 sec Existing passed
idc-jenkins-ci-tidb/plugin-test 🟢 build success, plugin test success 4min Existing passed

morgo added a commit to morgo/tidb that referenced this pull request Jul 22, 2022
…ip-init

* upstream/master: (125 commits)
  infoschema: fix PromQL for `tidb_distsql_copr_cache` (pingcap#36450)
  test: stabilize TestTopSQLCPUProfile (pingcap#36468)
  parser: add support of 'ADMIN SHOW DDL JOB QUERIES LIMIT m OFFSET n' transferring to AST (pingcap#36285)
  *: enable flaky test for all test (pingcap#36385)
  expression: fix return type of agg func `bit_or` when handling varbinary column (pingcap#36415)
  executor: fix aggregating enum zero value gets different results from mysql  (pingcap#36208)
  server: skip check tiflash version (pingcap#36451)
  *: Minor update to SECURITY.md to improved clarity (pingcap#36346)
  table partition: add telemetry for partition table (pingcap#36204)
  ddl: invalid multiple MAXVALUE partitions (pingcap#36329) (pingcap#36345)
  planner: Fixed `Merge` hint in nested CTE (pingcap#36432)
  metric: impove concurrency ddl metrics (pingcap#36405)
  planner: add more test cases for leading outer join (pingcap#36409)
  ddl: only set concurrent variable if no error (pingcap#36437)
  ddl: fix update panic in the middle of multi-schema change (pingcap#36421)
  session: Mising OptimizeWithPlanAndThenWarmUp in prepare-execute path (pingcap#36347)
  executor,metrics: add a metric for observing execution phases (pingcap#35906)
  br: unified docker image align with tidb (pingcap#36016)
  ddl: skip to close nil sessPool  (pingcap#36425)
  log-backup: remove the timezone from log-date (pingcap#36369)
  ...
morgo added a commit to morgo/tidb that referenced this pull request Jul 22, 2022
* upstream/master: (280 commits)
  infoschema: fix PromQL for `tidb_distsql_copr_cache` (pingcap#36450)
  test: stabilize TestTopSQLCPUProfile (pingcap#36468)
  parser: add support of 'ADMIN SHOW DDL JOB QUERIES LIMIT m OFFSET n' transferring to AST (pingcap#36285)
  *: enable flaky test for all test (pingcap#36385)
  expression: fix return type of agg func `bit_or` when handling varbinary column (pingcap#36415)
  executor: fix aggregating enum zero value gets different results from mysql  (pingcap#36208)
  server: skip check tiflash version (pingcap#36451)
  *: Minor update to SECURITY.md to improved clarity (pingcap#36346)
  table partition: add telemetry for partition table (pingcap#36204)
  ddl: invalid multiple MAXVALUE partitions (pingcap#36329) (pingcap#36345)
  planner: Fixed `Merge` hint in nested CTE (pingcap#36432)
  metric: impove concurrency ddl metrics (pingcap#36405)
  planner: add more test cases for leading outer join (pingcap#36409)
  ddl: only set concurrent variable if no error (pingcap#36437)
  ddl: fix update panic in the middle of multi-schema change (pingcap#36421)
  session: Mising OptimizeWithPlanAndThenWarmUp in prepare-execute path (pingcap#36347)
  executor,metrics: add a metric for observing execution phases (pingcap#35906)
  br: unified docker image align with tidb (pingcap#36016)
  ddl: skip to close nil sessPool  (pingcap#36425)
  log-backup: remove the timezone from log-date (pingcap#36369)
  ...
joycse06 added a commit to joycse06/tidb that referenced this pull request Jul 23, 2022
…rimary-key

* upstream/master: (104 commits)
  br: fix compatibility issue with concurrent ddl (pingcap#36474)
  infoschema: fix PromQL for `tidb_distsql_copr_cache` (pingcap#36450)
  test: stabilize TestTopSQLCPUProfile (pingcap#36468)
  parser: add support of 'ADMIN SHOW DDL JOB QUERIES LIMIT m OFFSET n' transferring to AST (pingcap#36285)
  *: enable flaky test for all test (pingcap#36385)
  expression: fix return type of agg func `bit_or` when handling varbinary column (pingcap#36415)
  executor: fix aggregating enum zero value gets different results from mysql  (pingcap#36208)
  server: skip check tiflash version (pingcap#36451)
  *: Minor update to SECURITY.md to improved clarity (pingcap#36346)
  table partition: add telemetry for partition table (pingcap#36204)
  ddl: invalid multiple MAXVALUE partitions (pingcap#36329) (pingcap#36345)
  planner: Fixed `Merge` hint in nested CTE (pingcap#36432)
  metric: impove concurrency ddl metrics (pingcap#36405)
  planner: add more test cases for leading outer join (pingcap#36409)
  ddl: only set concurrent variable if no error (pingcap#36437)
  ddl: fix update panic in the middle of multi-schema change (pingcap#36421)
  session: Mising OptimizeWithPlanAndThenWarmUp in prepare-execute path (pingcap#36347)
  executor,metrics: add a metric for observing execution phases (pingcap#35906)
  br: unified docker image align with tidb (pingcap#36016)
  ddl: skip to close nil sessPool  (pingcap#36425)
  ...
zyguan added a commit to zyguan/tidb that referenced this pull request Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none Denotes a PR that doesn't merit a release note. sig/diagnosis SIG: Diagnosis size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants