Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: add HTTP API to generate TiDB metric profile (#18272) #18531

Merged
merged 12 commits into from
Jul 29, 2020

Conversation

ti-srebot
Copy link
Contributor

cherry-pick #18272 to release-4.0


What problem does this PR solve?

TiDB has thousands of metrics, it is hard to know the related metric for users. This PR try to generate a metric profile SVG to describe the total time consumption of related metric in a specified time range.

Below is an example:

curl "127.0.0.1:10080/metrics/profile?start=2020-06-29T22:00:00%2B08:00&end=2020-06-29T22:20:00%2B08:00" > profile.dot

dot -Tsvg -o profile.svg profile.dot

The start and end use to specify the time range, the time format is RFC3339(2006-01-02T15:04:05Z07:00).

The profile.svg look like as below:

image

From the upper profile SVG, it's easy to see the time consumption ratio of each metric and the relationship between metrics.

Explain

This profile style, color, and format are very similar to the golang pprof profile.

Node

image

  • tidb_execute: it was the metric name, tidb_execute means the consuming time that spends on TiDB execution.
  • 20924.55s (30.51%) of 64938.09s (94.68%):
    • 20924.55s(30.51%) means tidb_execute it self consumes 20924.55 seconds, take up 30.51% of the total time.
    • 64938.09s (94.68%) means tidb_execute itself and all children consume 64938.09 seconds, take up 94.68% of the total time. So, all the children of tidb_execution consume 64938.09-20924.55=44013.54 seconds.

Maybe we can add more information about the metric in this node, such as avg_P99, avg_P80,total_count ...

Edge

image

The edge from tidb_txn_cmd to tidb_txn_cmd.get has comment with 17311.60s, it means 17311.60 seconds was consumed after this edge.

You must found that edges such as the edge from tidb_txn_cmd to tidb_kv_request, the edge line is dotted, and there is no comment to indicate how much time was consume by its child. This is because the child such as tidb_kv_request may have multiple parents, I only know the total time consumed by tidb_kv_request, I don't know the time consumed that after the dotted edge.

Attention

Currently, the total time was the tidb_query, so the total time of tidb_query and its children consumed was 100%.

Sometimes, you may found the nodes such as tidb_kv_request , itself and its children consume time more than the tidb_query total time. this is because 1 tidb_query may have many kv requests, and those kv requests may be executed concurrently.

For some metric that the total value was less than 0.01% of the total time, the metric will not display in the profile SVG.

Example in different workload

  • Calculate Cartesian join of two tables

image

  • Too much write conflict in update

image

  • TiKV disk io is the bottleneck when insert

image

What is changed and how it works?

Each node in the profile has a corresponding metric, such as the metric of tidb_query is tidb_server_handle_query_duration_seconds.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Side effects

  • No

Release note

  • Add HTTP API to generate TiDB metric profile.

@ti-srebot
Copy link
Contributor Author

/run-all-tests

@ti-srebot ti-srebot added sig/sql-infra SIG: SQL Infra sig/execution SIG execution component/metrics type/4.0-cherry-pick type/enhancement The issue or PR belongs to an enhancement. labels Jul 14, 2020
@ti-srebot ti-srebot added this to the v4.0.3 milestone Jul 14, 2020
@winoros winoros modified the milestones: v4.0.3, v4.0.4 Jul 15, 2020
Copy link
Contributor

@AilinKid AilinKid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 27, 2020
Copy link
Member

@wjhuang2016 wjhuang2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jul 27, 2020
@crazycs520
Copy link
Contributor

/run-all-tests

@crazycs520 crazycs520 added status/all-tests-passed status/can-merge Indicates a PR has been approved by a committer. labels Jul 28, 2020
@ti-srebot
Copy link
Contributor Author

Sorry @crazycs520, you don't have permission to trigger auto merge event on this branch.

@crazycs520
Copy link
Contributor

/run-all-tests

@crazycs520
Copy link
Contributor

/rebuild

@crazycs520
Copy link
Contributor

/run-sqllogic-test-1
/run-integration-ddl-test

@imtbkcat imtbkcat removed this from the v4.0.4 milestone Jul 28, 2020
@imtbkcat imtbkcat added this to the v4.0.5 milestone Jul 28, 2020
@ti-srebot ti-srebot added status/LGT3 The PR has already had 3 LGTM. and removed status/LGT2 Indicates that a PR has LGTM 2. labels Jul 28, 2020
@zz-jason
Copy link
Member

/merge

@ti-srebot
Copy link
Contributor Author

Your auto merge job has been accepted, waiting for:

  • 18806
  • 18667
  • 18368
  • 17926
  • 17823

@ti-srebot
Copy link
Contributor Author

/run-all-tests

@ti-srebot
Copy link
Contributor Author

@ti-srebot merge failed.

@zz-jason
Copy link
Member

/merge

@ti-srebot
Copy link
Contributor Author

Your auto merge job has been accepted, waiting for:

  • 18413
  • 18434
  • 18280
  • 18121
  • 18122
  • 18027
  • 18234
  • 18529
  • 18727
  • 18583
  • 18513
  • 17972
  • 17231
  • 18692
  • 17988
  • 18683
  • 18159
  • 17863
  • 17823

@ti-srebot
Copy link
Contributor Author

/run-all-tests

@ti-srebot
Copy link
Contributor Author

@ti-srebot merge failed.

@zz-jason
Copy link
Member

/merge

@ti-srebot
Copy link
Contributor Author

Your auto merge job has been accepted, waiting for:

  • 17988
  • 18121

@ti-srebot
Copy link
Contributor Author

/run-all-tests

@ti-srebot
Copy link
Contributor Author

/run-all-tests

@ti-srebot
Copy link
Contributor Author

@ti-srebot merge failed.

@zz-jason
Copy link
Member

/merge

@ti-srebot
Copy link
Contributor Author

/run-all-tests

@ti-srebot
Copy link
Contributor Author

@ti-srebot merge failed.

@zz-jason zz-jason merged commit 1dd6ade into pingcap:release-4.0 Jul 29, 2020
@crazycs520 crazycs520 deleted the release-4.0-9d536c426daa branch July 29, 2020 02:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/metrics sig/execution SIG execution sig/sql-infra SIG: SQL Infra status/can-merge Indicates a PR has been approved by a committer. status/LGT3 The PR has already had 3 LGTM. type/enhancement The issue or PR belongs to an enhancement. type/4.0-cherry-pick
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants