Throttler: return app name in check result, synthesize "why throttled" explanation from result #16416

Merged
merged 41 commits into from
Jul 28, 2024


@shlomi-noach shlomi-noach commented Jul 17, 2024

Description

A follow-up to #15988, this PR further enhances CheckThrottlerResponse by adding the name of the throttled/granted app.

Why is this needed? Isn't the app name the same as the app in CheckThrottlerRequest?

Not necessarily!

  • The requesting app might be vcopier:d666bbfc_169e_11ef_b0b3_0a43f95f28a3:vreplication:online-ddl. Suppose the request is throttled. Why is it being throttled? There could be many reasons, based on any of the specific identifiers in the app name. For example, someone might have issued an alter vitess_migration throttle all, in which case the throttled app will be online-ddl. So this is what we indicate back to the user.
  • There may not be any specific rule for the given app, but it's possible that the all app is being throttled, either because someone actually invoked UpdateThrottlerConfig --throttle-app=all ..., or just because all is assigned specific metrics, one of which crossed its threshold. One way or another, we want to tell the user the "all" app is at "fault" here.
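
To illustrate the two cases above, here is a minimal sketch of how a composite app name could be resolved to the app that is actually throttled. This is not the throttler's code: throttledAppFor, the map of throttled apps, and the lookup order (specific identifiers first, then "all") are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// throttledAppFor is a hypothetical helper (not part of Vitess). It splits a
// composite app name on ":" and returns the first identifier found in the
// throttled set, falling back to the catch-all "all" app.
// The lookup order is an assumption, not taken from the PR.
func throttledAppFor(requestApp string, throttled map[string]bool) (string, bool) {
	for _, part := range strings.Split(requestApp, ":") {
		if throttled[part] {
			return part, true
		}
	}
	if throttled["all"] {
		return "all", true
	}
	return "", false
}

func main() {
	requestApp := "vcopier:d666bbfc_169e_11ef_b0b3_0a43f95f28a3:vreplication:online-ddl"

	// Case 1: a rule exists for one of the specific identifiers.
	app, ok := throttledAppFor(requestApp, map[string]bool{"online-ddl": true})
	fmt.Println(app, ok)

	// Case 2: no specific rule, but the "all" app is throttled.
	app, ok = throttledAppFor(requestApp, map[string]bool{"all": true})
	fmt.Println(app, ok)
}
```

The point is only that the name reported back can differ from the name in the request, which is why the response carries its own app name.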

Check result summary

Based on that, we provide a concise summary, something like "app 'online-ddl' is throttled because 'lag' exceeded its threshold of 5", or "vreplication is denied access until ...".

Example actual summaries:

online-ddl is granted access
online-ddl is explicitly denied access
online-ddl is denied access due to threads_running metric value 4 exceeding threshold 2
online-ddl is denied access due to lag metric value 94.821447 exceeding threshold 5
online-ddl is denied access due to lag metric value 22.951372 exceeding threshold 5
all is explicitly denied access
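
The summaries above could be produced by something along these lines. This is a simplified sketch, not the code in check_result.go: the checkResult type and its fields are stand-ins, the mapping of HTTP status codes to "explicitly denied" and "threshold exceeded" is an assumption, and the default branch is invented for completeness.

```go
package main

import (
	"fmt"
	"net/http"
)

// checkResult is a hypothetical, trimmed-down stand-in for the PR's CheckResult.
type checkResult struct {
	StatusCode int
	AppName    string
	MetricName string
	Value      float64
	Threshold  float64
}

// Summary returns a human-readable summary of the check result.
// Which status code means what is assumed here, not taken from the PR.
func (c *checkResult) Summary() string {
	switch c.StatusCode {
	case http.StatusOK:
		return fmt.Sprintf("%s is granted access", c.AppName)
	case http.StatusExpectationFailed:
		return fmt.Sprintf("%s is explicitly denied access", c.AppName)
	case http.StatusTooManyRequests:
		return fmt.Sprintf("%s is denied access due to %s metric value %v exceeding threshold %v",
			c.AppName, c.MetricName, c.Value, c.Threshold)
	default:
		// Invented fallback so the sketch handles every status code.
		return fmt.Sprintf("%s is denied access (status %d)", c.AppName, c.StatusCode)
	}
}

func main() {
	results := []*checkResult{
		{StatusCode: http.StatusOK, AppName: "online-ddl"},
		{StatusCode: http.StatusExpectationFailed, AppName: "all"},
		{StatusCode: http.StatusTooManyRequests, AppName: "online-ddl", MetricName: "lag", Value: 94.821447, Threshold: 5},
	}
	for _, r := range results {
		fmt.Println(r.Summary())
	}
}
```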

This is in turn injected into _vt.vreplication and then into _vt.schema_migrations as a human-readable aid for analyzing production situations.

reason_throttled in _vt.vreplication

@rohit-nayak-ps you will like this.

Looks like this:

                   id: 3
             workflow: d741953d_4503_11ef_8761_0a43f95f28a3
               source: keyspace:"commerce" shard:"0" filter:{rules:{match:"_vt_vrp_d741953d450311ef87610a43f95f28a3_20240718124708_" filter:"select `customer_id` as `customer_id`, `email` as `email` from `customer`" source_unique_key_columns:"customer_id" target_unique_key_columns:"customer_id" source_unique_key_target_columns:"customer_id" force_unique_key:"PRIMARY"}}
                  pos: MySQL56/414befe7-4358-11ef-bc56-0a43f95f28a3:1-650
             stop_pos: NULL
              max_tps: 9223372036854775807
  max_replication_lag: 9223372036854775807
                 cell:
         tablet_types: in_order:REPLICA,PRIMARY
         time_updated: 1721306835
transaction_timestamp: 1721306834
                state: Stopped
              message: stopped for online DDL cutover
              db_name: vt_commerce
          rows_copied: 0
                 tags:
       time_heartbeat: 1721306835
        workflow_type: 5
       time_throttled: 1721306829
  component_throttled: vcopier
     reason_throttled: online-ddl is denied access due to shard/lag metric value 8.725835 exceeding threshold 5
    workflow_sub_type: 0
 defer_secondary_keys: 0
              options: {}

The same is copied to _vt.schema_migrations.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

…etricChan -> throttleMetricChan

Signed-off-by: Shlomi Noach <[email protected]>

vitess-bot bot commented Jul 17, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot added the labels NeedsBackportReason (If backport labels have been applied to a PR, a justification is required) and NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work) on Jul 17, 2024
codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 63.96396% with 40 lines in your changes missing coverage. Please review.

Project coverage is 68.63%. Comparing base (be8f9f4) to head (0eef755).
Report is 8 commits behind head on main.

Files                                                    Patch %   Lines
go/vt/vttablet/tabletserver/throttle/client.go           55.55%    8 Missing ⚠️
go/vt/vttablet/tabletserver/throttle/throttler.go        75.00%    8 Missing ⚠️
.../vt/vttablet/tabletserver/throttle/check_result.go    68.42%    6 Missing ⚠️
go/vt/vttablet/onlineddl/executor.go                     0.00%     4 Missing ⚠️
go/vt/binlog/binlogplayer/binlog_player.go               0.00%     2 Missing ⚠️
go/vt/vttablet/tabletmanager/rpc_throttler.go            0.00%     2 Missing ⚠️
.../vt/vttablet/tabletmanager/vreplication/vcopier.go    33.33%    2 Missing ⚠️
.../vt/vttablet/tabletmanager/vreplication/vplayer.go    33.33%    2 Missing ⚠️
...vttablet/tabletmanager/vreplication/vreplicator.go    0.00%     2 Missing ⚠️
go/vt/vttablet/tabletserver/gc/tablegc.go                0.00%     1 Missing ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16416      +/-   ##
==========================================
- Coverage   68.65%   68.63%   -0.02%     
==========================================
  Files        1550     1551       +1     
  Lines      199412   199515     +103     
==========================================
+ Hits       136900   136931      +31     
- Misses      62512    62584      +72     


…the Summary(), populate ThrottledReason added in proto, then in turn populate new _vt.vreplication.reason_throttled


@rohit-nayak-ps rohit-nayak-ps left a comment

This is brilliant!

I will enhance the throttler reporting added to vtadmin in #16308 to display the reason once this is merged.

@shlomi-noach shlomi-noach requested a review from a team July 25, 2024 08:20

@mattlord mattlord left a comment

Thank you! This is a nice enhancement. ❤️

@@ -85,14 +91,32 @@ func (c *CheckResult) IsOK() bool {
	return c.StatusCode == http.StatusOK
}

// Summary returns a human-readable summary of the check result
func (c *CheckResult) Summary() string {

Contributor

Any reason not to use String() so that CheckResult implements the Stringer interface?

Contributor Author

I'm ambivalent. There can only be one String() function, and I'm not sure what the expected implementation should be. Summary() does not capture all of the CheckResult's fields. Should String() capture all fields? Should String() be human readable or machine readable?

Contributor Author

Gonna keep this with Summary(). We can always easily change it to String() later.
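
For reference, switching later would be a small change. The sketch below shows one way it could look, with String() simply delegating to Summary(); checkResult and its Granted field are hypothetical stand-ins, not the PR's type.

```go
package main

import "fmt"

// checkResult is a hypothetical stand-in for the PR's CheckResult.
type checkResult struct {
	AppName string
	Granted bool // illustrative field, not from the PR
}

// Summary returns a short human-readable description of the result.
func (c *checkResult) Summary() string {
	if c.Granted {
		return fmt.Sprintf("%s is granted access", c.AppName)
	}
	return fmt.Sprintf("%s is explicitly denied access", c.AppName)
}

// String delegates to Summary, which makes *checkResult a fmt.Stringer.
func (c *checkResult) String() string { return c.Summary() }

// Compile-time check that the interface is satisfied.
var _ fmt.Stringer = (*checkResult)(nil)

func main() {
	r := &checkResult{AppName: "online-ddl", Granted: true}
	fmt.Println(r) // fmt uses String() when printing the value
}
```

Keeping both methods leaves room to decide later whether String() should stay human readable or cover every field.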

func (c *CheckResult) Summary() string {
	switch c.StatusCode {
	case http.StatusOK:
		return fmt.Sprintf("%v is granted access", c.AppName)

Contributor

Super nitty, but IMO better to use %s with strings like AppName.

Contributor Author

done

var (
	throttleTicks    int64
	throttleInit     sync.Once
	emptyCheckResult = &CheckResult{}

Contributor

Any reason for this not to be a const?

Contributor Author

golang does not permit this as a const
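
A small sketch of the restriction: Go constants are limited to booleans, numbers, runes, and strings, so a pointer to a struct has to be a package-level var. The type and the defaultAppName constant below are illustrative stand-ins, not the PR's code.

```go
package main

import "fmt"

// checkResult is a hypothetical stand-in for the PR's CheckResult.
type checkResult struct {
	StatusCode int
}

// The following would not compile, because &checkResult{} is not a
// constant expression:
//
//	const emptyCheckResult = &checkResult{}

// A package-level var is the usual substitute. Callers must treat it as
// read-only by convention, since the compiler does not enforce it.
var emptyCheckResult = &checkResult{}

// Strings, numbers, runes, and booleans can be constants.
// defaultAppName is an illustrative name, not from the PR.
const defaultAppName = "all"

func main() {
	fmt.Println(emptyCheckResult.StatusCode, defaultAppName)
}
```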

@@ -418,6 +418,7 @@ func TestApplyThrottlerConfigMetricThresholds(t *testing.T) {
	assert.EqualValues(t, 0.3, checkResult.Value) // self lag value
	assert.EqualValues(t, http.StatusOK, checkResult.StatusCode)
	assert.Len(t, checkResult.Metrics, 1)
	assert.Contains(t, checkResult.Summary(), "test is granted access")

Contributor

IIRC, the "test" value in all of these asserts is from the const test app name. Nitty, but better to use the const everywhere it applies. Similar for the other app names which are based on existing consts/vars.

Contributor Author

Fixed.


// continuing previous test, we had 3 throttled apps. "all" is a new app being throttled.
assert.Equal(t, 4, throttler.throttledApps.ItemCount())
})
//
// //

Contributor

Looks like we added an extraneous comment indicator.

Contributor Author

removed

@shlomi-noach shlomi-noach merged commit e341f23 into vitessio:main Jul 28, 2024
132 checks passed
@shlomi-noach shlomi-noach deleted the throttler-response-app branch July 28, 2024 07:04
venkatraju pushed a commit to slackhq/vitess that referenced this pull request Aug 29, 2024
Labels
Component: Throttler; Type: Enhancement (Logical improvement, somewhere between a bug and feature)