executor,metrics: add a metric for observing execution phases #35906

zyguan · 2022-07-04T03:10:58Z

What problem does this PR solve?

Issue Number: ref #34106

Problem Summary:

The execution duration is the main part of database time, but current metrics cannot break it down (in wall time).

What is changed and how it works?

The execution process can be described as the following.

The main phases are build, open, next, lock, and commit. When a pessimistic lock error is returned, we may retry build -> open -> next [-> lock]. Thus each of these phases (except commit) can be further split into two parts:

trying to lock keys (but failed)
the final iteration of the retry loop

By observing the durations of these phases, we can easily identify some typical issues, e.g.:

too much time spent on building the executor (typically caused by waiting for TSO)
lock contention
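
The two-way split of each phase can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `attempt`, `run`, `rotate`, and `errLockConflict` are hypothetical names. Only the convention that index 0 holds the final iteration comes from the diff (where `phaseBuildDurations[0]` is observed as `build:final`); using index 1 for the failed locking attempts is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errLockConflict is a hypothetical stand-in for a pessimistic lock error.
var errLockConflict = errors.New("pessimistic lock conflict")

// phaseDurations keeps two slots per phase: index 0 for the final iteration
// of the retry loop, index 1 for earlier iterations that failed to lock keys.
type phaseDurations struct {
	build, open, next, lock [2]time.Duration
}

// rotate moves the time accumulated by the current attempt (slot 0) into the
// "locking" slot (slot 1), so slot 0 only ever describes the final attempt.
func (p *phaseDurations) rotate() {
	for _, d := range []*[2]time.Duration{&p.build, &p.open, &p.next, &p.lock} {
		d[1] += d[0]
		d[0] = 0
	}
}

// attempt is a hypothetical record of one pass through build -> open -> next [-> lock].
type attempt struct {
	build, open, next, lock time.Duration
	err                     error
}

// run replays the attempts, retrying on lock conflicts, and returns the
// per-phase durations split into "final" and "locking" parts.
func run(attempts []attempt) (phaseDurations, error) {
	var p phaseDurations
	for _, a := range attempts {
		p.build[0] += a.build
		p.open[0] += a.open
		p.next[0] += a.next
		p.lock[0] += a.lock
		if errors.Is(a.err, errLockConflict) {
			p.rotate() // this attempt was spent trying to lock keys
			continue
		}
		return p, a.err
	}
	return p, errors.New("no attempt succeeded")
}

func main() {
	p, err := run([]attempt{
		{build: 2 * time.Millisecond, open: time.Millisecond, next: 5 * time.Millisecond, lock: 40 * time.Millisecond, err: errLockConflict},
		{build: time.Millisecond, open: time.Millisecond, next: 4 * time.Millisecond, lock: 3 * time.Millisecond},
	})
	fmt.Println("err:", err)
	fmt.Println("lock:final  ", p.lock[0])
	fmt.Println("lock:locking", p.lock[1])
}
```

With this shape, a large `lock:locking` value relative to `lock:final` would point at lock contention, while a large `build:*` value would point at executor construction.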

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Signed-off-by: zyguan <[email protected]>

ti-chi-bot · 2022-07-04T03:10:59Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

cfzjywxk
you06


sre-bot · 2022-07-04T03:22:49Z

Code Coverage Details: https://codecov.io/github/pingcap/tidb/commit/1b25036ca36befb45e6588ee168b4cebefa10f4a

cfzjywxk · 2022-07-18T01:53:51Z

executor/adapter.go

+func (a *ExecStmt) observePhaseDurations(internal bool, commitDetails *util.CommitDetails) {
+	if d := a.phaseBuildDurations[0]; d > 0 {
+		if internal {
+			metrics.ExecPhaseDuration.WithLabelValues("build:final", "1").Observe(d.Seconds())


Maybe we could abstract the label constants and pre-define the related metrics?
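
A rough sketch of what this suggestion could look like: label values become constants and the labelled series are resolved once, so the hot path does no label lookup. The `fakeVec`, `sumCount`, and `observer` types below are hypothetical stand-ins written with the standard library only; they are not the Prometheus client API and not the code in this PR.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Phase label constant, using a name proposed in this PR.
const phaseBuildFinal = "build:final"

// observer is a hypothetical minimal interface for anything that records a value.
type observer interface{ Observe(seconds float64) }

// sumCount is a stand-in series that only tracks a sum and a count.
type sumCount struct {
	sum   float64
	count uint64
}

func (s *sumCount) Observe(v float64) { s.sum += v; s.count++ }

// fakeVec is a stand-in for a labelled metric vector (NOT the Prometheus client).
type fakeVec struct {
	series  map[string]*sumCount
	lookups int // number of label lookups performed so far
}

// withLabelValues finds or creates the series for the given label values.
func (v *fakeVec) withLabelValues(lvs ...string) observer {
	v.lookups++
	key := strings.Join(lvs, "\x00")
	s, ok := v.series[key]
	if !ok {
		s = &sumCount{}
		v.series[key] = s
	}
	return s
}

var (
	execPhaseDuration = &fakeVec{series: map[string]*sumCount{}}

	// Resolved once at package initialization; index 0 = general, 1 = internal.
	execBuildFinal = [2]observer{
		execPhaseDuration.withLabelValues(phaseBuildFinal, "0"),
		execPhaseDuration.withLabelValues(phaseBuildFinal, "1"),
	}
)

// observeBuildFinal records a build duration without any per-call label lookup.
func observeBuildFinal(internal bool, d time.Duration) {
	if d <= 0 {
		return
	}
	idx := 0
	if internal {
		idx = 1
	}
	execBuildFinal[idx].Observe(d.Seconds())
}

func main() {
	for i := 0; i < 1000; i++ {
		observeBuildFinal(false, time.Millisecond)
	}
	fmt.Println("label lookups:", execPhaseDuration.lookups)
}
```

The lookup counter stays at the number of pre-defined series no matter how many observations are made, which is the point of pre-defining them.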

cfzjywxk · 2022-07-18T01:59:51Z

executor/adapter.go

+	execBuildFinal   = metrics.ExecPhaseDuration.WithLabelValues("build:final", "0")
+	execOpenFinal    = metrics.ExecPhaseDuration.WithLabelValues("open:final", "0")
+	execNextFinal    = metrics.ExecPhaseDuration.WithLabelValues("next:final", "0")
+	execLockFinal    = metrics.ExecPhaseDuration.WithLabelValues("lock:final", "0")


For the next operation maybe we need to integrate more with the kv client metrics in the future.

you06 · 2022-07-19T09:01:20Z

executor/adapter.go

@@ -724,6 +736,7 @@ func (a *ExecStmt) handlePessimisticDML(ctx context.Context, e Executor) error {
 		ctx = context.WithValue(ctx, util.LockKeysDetailCtxKey, &lockKeyStats)
 		startLocking := time.Now()
 		err = txn.LockKeys(ctx, lockCtx, keys...)
+		a.phaseLockDurations[0] += time.Since(startLocking)


Point get and batch point get lock keys internally; that duration will not be observed here.

Yes, the lock phase here only covers pessimistic DMLs that have extra keys to lock. The internal lock-key duration (of point-get, point-update, etc.) is counted in the next phase.

zyguan · 2022-07-19T10:19:25Z

executor/adapter.go

+	phaseBuildLocking       = "build:locking"
+	phaseOpenLocking        = "open:locking"
+	phaseNextLocking        = "next:locking"
+	phaseLockLocking        = "lock:locking"
+	phaseBuildFinal         = "build:final"
+	phaseOpenFinal          = "open:final"
+	phaseNextFinal          = "next:final"
+	phaseLockFinal          = "lock:final"
+	phaseCommitPrewrite     = "commit:prewrite"
+	phaseCommitCommit       = "commit:commit"
+	phaseCommitWaitCommitTS = "commit:wait:commit-ts"
+	phaseCommitWaitLatestTS = "commit:wait:latest-ts"
+	phaseCommitWaitLatch    = "commit:wait:local-latch"
+	phaseCommitWaitBinlog   = "commit:wait:prewrite-binlog"
+	phaseWriteResponse      = "write-response"


@cfzjywxk @you06 @sticnarf @longfangsong Any suggestions on naming? E.g., build:locking vs. locking:build, which is better?

build:locking IMO; it's better to keep related metrics together in alphabetical order.

zyguan · 2022-07-19T10:35:24Z

There are still some unknown spans. I found that diagnosis (stmt summary & top sql, not included in this PR) may cost more time than writing the response, and there is 5.24s of database time (execute:other) that I cannot currently explain. Since the total duration (5.24s+1.65s) accounts for only 3.64% of database time, I think we can ignore them for now.

cfzjywxk · 2022-07-20T06:22:10Z

diagnosis (stmt summary & top sql, not included in this PR) may cost more time than writing response

Seems the workload does not return much data to the client?

cfzjywxk · 2022-07-20T06:24:14Z

@zyguan
Usually, the next operation may take much time, maybe we need to find a way to integrate it with the kv client durations later.
Also after merging this pr the performance-map needs to be updated.

sticnarf · 2022-07-20T06:41:48Z

metrics/executor.go

@@ -46,4 +46,13 @@ var (
 			Name:      "statement_db_total",
 			Help:      "Counter of StmtNode by Database.",
 		}, []string{LblDb, LblType})
+
+	// ExecPhaseDuration records the duration of each execution phase.
+	ExecPhaseDuration = prometheus.NewSummaryVec(


Can you explain why summary is used here? I have never seen summary type metrics used in TiDB before and histogram is always used....

To reduce the size of the metrics data (histogram = summary + buckets). IMO, we won't care much about something like "what's the p99 latency of the next phase". There are too many kinds of executors (as well as their combinations); some may be fast and others may be extremely slow, so a higher or lower p99 latency may not provide more information (we do not know the distribution of each kind of executor). Besides, it's hard to choose buckets here: some phases (like open) take very little time, but phases like lock may cost a few seconds.
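
To make the size argument concrete, here is a small back-of-the-envelope sketch. It assumes the summary is defined without quantile objectives, so each label combination exposes only a `_sum` and a `_count` series, while a histogram additionally exposes one `_bucket` series per configured bucket plus the implicit `+Inf` bucket. The bucket count of 20 is a hypothetical value chosen only for illustration, and 15 phases times 2 `internal` values is an upper bound on label combinations.

```go
package main

import "fmt"

// summarySeries returns the number of series exposed by a summary with no
// quantile objectives: one _sum and one _count per label combination.
func summarySeries(labelCombos int) int {
	return labelCombos * 2
}

// histogramSeries returns the number of series exposed by a histogram:
// _sum, _count, one _bucket per configured bucket, and the +Inf bucket.
func histogramSeries(labelCombos, buckets int) int {
	return labelCombos * (2 + buckets + 1)
}

func main() {
	const (
		phases   = 15 // phase label values listed in this PR
		internal = 2  // "0" and "1"
		buckets  = 20 // hypothetical bucket count, for illustration only
	)
	combos := phases * internal
	fmt.Println("summary series:  ", summarySeries(combos))
	fmt.Println("histogram series:", histogramSeries(combos, buckets))
}
```

The per-phase average is still available from `_sum / _count` in either case; what the summary gives up is the ability to estimate quantiles such as p99.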

zyguan · 2022-07-20T06:42:17Z

@zyguan Usually, the next operation may take much time, maybe we need to find a way to integrate it with the kv client durations later. Also after merging this pr the performance-map needs to be updated.

Yes, we need to figure out a way to show these durations in wall time, since kv client / executors may run concurrently.
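
One possible way to think about the wall-time problem, sketched under assumptions: if concurrent operations are recorded as (start, end) spans, summing their durations overcounts, whereas the length of the union of the spans gives wall time. The `span` type and `wallTime` function are hypothetical and only illustrate the idea; nothing like this is part of the PR.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// span is a hypothetical record of one operation's start and end offsets.
type span struct{ start, end time.Duration }

// wallTime returns the length of the union of the spans, so overlapping
// (concurrent) work is counted only once.
func wallTime(spans []span) time.Duration {
	if len(spans) == 0 {
		return 0
	}
	sorted := append([]span(nil), spans...) // do not mutate the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })

	var total time.Duration
	cur := sorted[0]
	for _, s := range sorted[1:] {
		if s.start <= cur.end {
			// Overlaps the current merged span: extend it if needed.
			if s.end > cur.end {
				cur.end = s.end
			}
			continue
		}
		total += cur.end - cur.start
		cur = s
	}
	return total + cur.end - cur.start
}

func main() {
	spans := []span{
		{0, 30 * time.Millisecond},
		{10 * time.Millisecond, 40 * time.Millisecond},
		{50 * time.Millisecond, 60 * time.Millisecond},
	}
	var sum time.Duration
	for _, s := range spans {
		sum += s.end - s.start
	}
	fmt.Println("sum of durations:", sum)
	fmt.Println("wall time:       ", wallTime(spans))
}
```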

cfzjywxk

LGTM

cfzjywxk · 2022-07-21T07:56:00Z

/merge

ti-chi-bot · 2022-07-21T07:56:04Z

This pull request has been accepted and is ready to merge.

Commit hash: 9e97b78

sticnarf · 2022-07-21T11:57:05Z

/run-all-tests

sre-bot · 2022-07-21T14:25:29Z

TiDB MergeCI notify

🔴 Bad News! New failing [1] after this pr merged.
These new failed integration tests seem to be caused by the current PR, please try to fix these new failed integration tests, thanks!

CI Name	Result	Duration	Compare with Parent commit
idc-jenkins-ci-tidb/integration-ddl-test	🟥 failed 1, success 5, total 6	6 min 35 sec	New failing
idc-jenkins-ci-tidb/integration-common-test	🔴 failed 3, success 8, total 11	48 min	Existing failure
idc-jenkins-ci/integration-cdc-test	🟢 all 36 tests passed	27 min	Existing passed
idc-jenkins-ci-tidb/common-test	🟢 all 12 tests passed	17 min	Existing passed
idc-jenkins-ci-tidb/sqllogic-test-2	🟢 all 28 tests passed	6 min 25 sec	Existing passed
idc-jenkins-ci-tidb/sqllogic-test-1	🟢 all 26 tests passed	5 min 55 sec	Existing passed
idc-jenkins-ci-tidb/tics-test	🟢 all 1 tests passed	5 min 46 sec	Existing passed
idc-jenkins-ci-tidb/mybatis-test	🟢 all 1 tests passed	3 min 29 sec	Existing passed
idc-jenkins-ci-tidb/integration-compatibility-test	🟢 all 1 tests passed	3 min 11 sec	Existing passed
idc-jenkins-ci-tidb/plugin-test	🟢 build success, plugin test success	4min	Existing passed

executor,metrics: add a metric for observing execution phases

81a5274

Signed-off-by: zyguan <[email protected]>

executor: record next durations more precise

631c38d

Signed-off-by: zyguan <[email protected]>

zyguan added the sig/diagnosis label Jul 5, 2022

zyguan added 2 commits July 5, 2022 07:01

executor: simplify the way to get next duration

732be48

Signed-off-by: zyguan <[email protected]>

revert changes on util/execdetails

cc0ca97

Signed-off-by: zyguan <[email protected]>

zyguan commented Jul 5, 2022

zyguan added 2 commits July 15, 2022 10:16

Merge remote-tracking branch 'origin/master' into exec-phase

6fc9782

observe duration of waiting latest-ts

1e2bff1

Signed-off-by: zyguan <[email protected]>

ti-chi-bot removed the do-not-merge/needs-linked-issue label Jul 15, 2022

zyguan marked this pull request as ready for review July 15, 2022 10:46

ti-chi-bot removed the do-not-merge/work-in-progress label Jul 15, 2022

cfzjywxk requested review from sticnarf, you06, longfangsong and cfzjywxk July 17, 2022 08:02

cfzjywxk reviewed Jul 18, 2022

zyguan added 3 commits July 19, 2022 13:17

Merge branch 'master' into exec-phase

519d053

add comments for phase durations

a61399f

Signed-off-by: zyguan <[email protected]>

address pingcap#35906 (comment)

4ccde74

Signed-off-by: zyguan <[email protected]>

you06 reviewed Jul 19, 2022

zyguan added 2 commits July 19, 2022 10:07

observe the duration of writing response

df8b914

Signed-off-by: zyguan <[email protected]>

Merge branch 'master' into exec-phase

0e90568

zyguan commented Jul 19, 2022

sticnarf reviewed Jul 20, 2022

Merge branch 'master' into exec-phase

4e2d00c

cfzjywxk approved these changes Jul 20, 2022

ti-chi-bot added the status/LGT1 label Jul 20, 2022

Merge remote-tracking branch 'origin/master' into exec-phase

9e97b78

Signed-off-by: zyguan <[email protected]>

you06 approved these changes Jul 21, 2022

ti-chi-bot added the status/LGT2 label and removed the status/LGT1 label Jul 21, 2022

ti-chi-bot added the status/can-merge label Jul 21, 2022

ti-chi-bot added 4 commits July 21, 2022 15:56

Merge branch 'master' into exec-phase

bfb259d

Merge branch 'master' into exec-phase

eca9088

Merge branch 'master' into exec-phase

f14fed6

Merge branch 'master' into exec-phase

9884452

ti-chi-bot added 3 commits July 21, 2022 20:01

Merge branch 'master' into exec-phase

eece875

Merge branch 'master' into exec-phase

58419d7

Merge branch 'master' into exec-phase

1b25036

ti-chi-bot merged commit 23f25af into pingcap:master Jul 21, 2022

zyguan added a commit to zyguan/tidb that referenced this pull request Aug 10, 2022

Revert "executor,metrics: add a metric for observing execution phases (…

ad9f4dd

…pingcap#35906)" This reverts commit 23f25af.

lcwangchao pushed a commit that referenced this pull request Aug 10, 2022

Revert "executor,metrics: add a metric for observing execution phases (…

d1a3dd1

…#35906)" This reverts commit 23f25af.

ti-chi-bot pushed a commit that referenced this pull request Aug 11, 2022

executor,metrics: revert #35906 to avoid perf regression (#37025)

1894360
