
Fail VReplication workflows on errors that persist and unrecoverable errors #10429

Merged · 8 commits · Jun 23, 2022

Conversation

@rohit-nayak-ps rohit-nayak-ps commented Jun 3, 2022

Description

As part of an initial design decision, VReplication workflows always retry after an error, sleeping for 5 seconds between attempts. The reasoning was that large reshards/migrations and perpetual Materialize workflows often encounter recoverable errors: a PRS, restarting vttablets/mysql servers, network partitions, etc. So rather than erroring out and waiting for an operator to manually restart the workflow, we decided to keep retrying.

Since we only retried every five seconds, any resource wastage from continuously retrying unrecoverable workflows was small, and in most cases we would transparently recover and make forward progress with minimal downtime. This is especially important for Materialize workflows, where the user expects near-realtime performance.

Usually, VReplication workflows were set up manually and the possibility of errors due to schema issues was minimal, so this approach worked well. However, with the introduction of VReplication-based Online DDL workflows, we see a lot of automated use where user-specified DDLs directly configure VReplication workflows. Incorrect DDLs can thus produce unrecoverable errors that lead to prolonged, futile retries.

Error reporting in VReplication is also not great: we update the message column in the _vt.vreplication table, but that can get overwritten when we retry. We also log errors to the _vt.vreplication_log table.

A change was recently introduced in Online DDL workflows to mitigate this: we check the error against a set of MySQL errors known to be unrecoverable and, on a match, put the workflow into an Error state. There are then no more automated retries; a manual restart after fixing the underlying issue is expected.

However, there are still unrecoverable schema-related errors that are not yet mapped, or do not map cleanly, to MySQL errors. There can also be misconfigured workflows (for example: no replicas in a keyspace when the tablet type is set to replica only, incorrect cell settings, etc.). Continuously retrying in such cases delays detecting them.

This PR:

  • extends the check for unrecoverable errors to all workflow types, not just Online DDLs
  • for all workflows, detects errors that persist longer than the 🚩 new vttablet flag
    --vreplication_max_time_to_retry_errors (default: 15 minutes).

In the above cases, the workflow is moved directly to the Error state, which is then reported in Workflow Show.

Checklist

  • "Backport me!" label has been added if this change should be backported
    • We should backport this to 14.0.0-rc, but no further
  • Tests were added or are not required
  • Documentation was added or is not required

@rohit-nayak-ps rohit-nayak-ps added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) Skip Upgrade Downgrade labels Jun 3, 2022
@rohit-nayak-ps rohit-nayak-ps marked this pull request as ready for review June 3, 2022 18:16
github-actions bot commented Jun 3, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has the correct release notes label. release notes none should only be used for PRs that are so trivial that they need not be included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

@rohit-nayak-ps rohit-nayak-ps added release notes and removed release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) labels Jun 3, 2022
@deepthi deepthi (Member) left a comment:

@ajm188 what is the guidance on new command-line flags added to binaries?
dash or underscore?

retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed binlog connection")
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed workflow event in the replication phase")

maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
Suggested change
maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop retrying after this time")

Contributor:
Should be dashes but I also think that consistency is important and that we should migrate all Vitess flag names from underscores to dashes, either early in 15.0 or 16.0 SNAPSHOT.

Member:

that makes sense. we can do them all at once (and allow both versions for at least one release).

@mattlord mattlord (Contributor) left a comment:

I had a few questions/nits, but I LOVE this! We can then store this information e.g. in the _vt.vdiff table too.

The only thing that kept me from approving -- as most of my comments are not blockers and subjective preference -- is that I think using vterrors would be better. What do you think?

Thanks!

ℹ️ Note: I think that we should backport this to 14.0.0, but no further. I added the label and updated the description accordingly, but if you disagree I can undo those changes (already discussed with @deepthi).

retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed binlog connection")
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed workflow event in the replication phase")

maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
@mattlord mattlord commented Jun 7, 2022:

Suggestion:
stop automatically retrying when we've had consecutive failures with the same error for this long after the first occurrence

ct.blpStats.ErrorCounts.Add([]string{"Stream Error"}, 1)
binlogplayer.LogError(fmt.Sprintf("error in stream %v, retrying after %v", ct.id, *retryDelay), err)
log.Flush()
@mattlord mattlord commented Jun 7, 2022:

Log flushes can potentially be slow, and IMO we should avoid them in the hot path when possible. Are you adding these for local debugging during feature development? It also shouldn't be necessary, I think (and may not be an issue to flush anyway), since I believe STDERR is used for the error log and is unbuffered by default.

Contributor:

Though I'm not sure why a flush is used here in particular, I think this code path is hit infrequently enough to justify the `flush`.

Contributor (Author):

Was leftover from a debug session, removed


ct.lastWorkflowError.record(err)
if isUnrecoverableError(err) /* mysql error that we know needs manual intervention */ ||
!ct.lastWorkflowError.canRetry() /* cannot detect if this is recoverable, but it is persisting too long */ {
Contributor:

Nit, but I think this method should be called shouldRetry() rather than canRetry().


type lastError struct {
name string
lastError error
Contributor:

Can we make this a vterror instead? That way we can enforce a meaningful error type and code as part of the structure. That should make it much easier to classify different ones based on the error code.

*/

type lastError struct {
name string
@mattlord mattlord commented Jun 7, 2022:

Is this intended to be a short summary or just a key?

Comment on lines 33 to 35
lastError error
lastErrorStartTime time.Time
lastErrorMu sync.Mutex
Contributor:

The type is lastError so using lastError in the field names feels redundant. Can we call them ~:

error         vterror
firstSeen     time.Time
mu            sync.Mutex

le.lastError = err
}

func (le *lastError) canRetry() bool {
Contributor:

Again, IMO we should call this shouldRetry() as we always can. It's an (annoying) nit though, I know.

le.lastErrorMu.Lock()
defer le.lastErrorMu.Unlock()
if !time.Time.IsZero(le.lastErrorStartTime) && time.Since(le.lastErrorStartTime) > le.maxTimeInError {
log.Errorf("Got same error since %s, will not retry anymore: you will need to manually restart workflow once error '%s' is fixed",
@mattlord mattlord commented Jun 7, 2022:

log.Errorf("The same error has been seen continuously since %s, we will assume this is a non-recoverable error and will not retry anymore; the workflow will need to be manually restarted once error '%s' has been addressed",

@shlomi-noach shlomi-noach (Contributor) left a comment:

This looks great! The code change is small and simple and solves a big problem. My only request for change is handling unexpected input values for --vreplication_max_time_to_retry_errors.

go/vt/vttablet/tabletmanager/vreplication/controller.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletmanager/vreplication/controller.go Outdated Show resolved Hide resolved
@mattlord (Contributor):

@rohit-nayak-ps I tried to address my own nits and comments here: 6e26308

I can revert some or all of it if you disagree, but I thought you'd be fine with them so attempted to save you some time. 🙂

@shlomi-noach for the input validation, were you thinking something like this?

diff --git a/go/vt/vttablet/tabletmanager/vreplication/controller.go b/go/vt/vttablet/tabletmanager/vreplication/controller.go
index cf906cebcf..945fcc695c 100644
--- a/go/vt/vttablet/tabletmanager/vreplication/controller.go
+++ b/go/vt/vttablet/tabletmanager/vreplication/controller.go
@@ -48,7 +48,7 @@ var (
        _          = flag.Duration("vreplication_healthcheck_timeout", 1*time.Minute, "healthcheck retry delay")
        retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed workflow event in the replication phase")

-       maxTimeToRetryError = flag.Duration("vreplication_max_time_to_retry_on_error", 15*time.Minute, "stop automatically retrying when we've had consecutive failures with the same error for this long after the first occurrence")
+       maxTimeToRetryError = flag.Duration("vreplication_max_time_to_retry_on_error", 15*time.Minute, "stop automatically retrying when we've had consecutive failures with the same error for this long after the first occurrence (min: vreplication_retry_delay * 5, max: 24 hours)")
 )

 // controller is created by Engine. Members are initialized upfront.
@@ -81,6 +81,11 @@ func newController(ctx context.Context, params map[string]string, dbClientFactor
        if blpStats == nil {
                blpStats = binlogplayer.NewStats()
        }
+       if *maxTimeToRetryError < *retryDelay*5 {
+               *maxTimeToRetryError = *retryDelay * 5
+       } else if *maxTimeToRetryError > 24*time.Hour {
+               *maxTimeToRetryError = 24 * time.Hour
+       }
        ct := &controller{
                vre:               vre,
                dbClientFactory:   dbClientFactory,

Or did I misunderstand? Thanks!

@shlomi-noach (Contributor):

@shlomi-noach for the input validation, were you thinking something like this?

Sure! Although, I notice this adds a new dependency on --vreplication_retry_delay, which in itself isn't constrained (what happens if --vreplication_retry_delay is zero or negative?)

@shlomi-noach (Contributor):

All, let's pursue this PR? As far as I'm concerned, there is just the remaining issue of validating the input; but I'm also OK to merge this as-is to get this fix sooner into main, in the assumption that we will follow up with validation on a different PR. Opinions?

@rohit-nayak-ps (Author):

All, let's pursue this PR? As far as I'm concerned, there is just the remaining issue of validating the input; but I'm also OK to merge this as-is to get this fix sooner into main, in the assumption that we will follow up with validation on a different PR. Opinions?

Agreed that we should merge this soon. I will take a final look at this later today to review the newer commits and aim to push this today itself.

@frouioui frouioui mentioned this pull request Jun 22, 2022
43 tasks
@shlomi-noach (Contributor):

flag value validation added in bd1ba9b

@mattlord mattlord (Contributor) left a comment:

Had a very small nit, but LGTM!

@@ -161,6 +163,7 @@ func isUnrecoverableError(err error) bool {
mysql.ERInvalidCastToJSON,
mysql.ERJSONValueTooBig,
mysql.ERJSONDocumentTooDeep:
log.Errorf("got unrecoverable error: %v", sqlErr)
@mattlord mattlord commented Jun 22, 2022:

Nit, but we should capitalize the beginning of log messages.

Contributor:

I thought it was the other way around? That we had to uncapitalize everything, because those messages are preceded by prefix like Error: ?

Contributor:

Capitalized per request.

@shlomi-noach (Contributor):

Cluster (shardedrecovery_stress_verticalsplit_heavy) is super flaky.

@shlomi-noach shlomi-noach merged commit cc7974f into vitessio:main Jun 23, 2022
@shlomi-noach shlomi-noach deleted the rn-vr-fail-on-repeated-errors branch June 23, 2022 07:44
shlomi-noach added a commit to planetscale/vitess that referenced this pull request Jun 23, 2022
…errors (vitessio#10429)

* Fail workflow if same error persists too long. Fail for unrecoverable errors also in non-online ddl workflows

Signed-off-by: Rohit Nayak <[email protected]>

* Update max time default to 15m, was 1m for testing purposes

Signed-off-by: Rohit Nayak <[email protected]>

* Leverage vterrors for Equals; attempt to address my own nits

Signed-off-by: Matt Lord <[email protected]>

* sanity: validate range of vreplication_retry_delay and of vreplication_max_time_to_retry_on_error

Signed-off-by: Shlomi Noach <[email protected]>

* Fix flags test

Signed-off-by: Rohit Nayak <[email protected]>

* Remove leftover log.Flush()

Signed-off-by: Rohit Nayak <[email protected]>

* Revert validations min/max settings on retry delay since it is breaking unit tests that set the value to a very small value

Signed-off-by: Rohit Nayak <[email protected]>

* captilize per request

Signed-off-by: Shlomi Noach <[email protected]>

Co-authored-by: Matt Lord <[email protected]>
Co-authored-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach (Contributor):

Backport: #10573

shlomi-noach added a commit that referenced this pull request Jun 23, 2022
…sist and unrecoverable errors (#10573)

* Fail VReplication workflows on errors that persist and unrecoverable errors (#10429)

(commit message body identical to the #10429 commit listed above)

* fix TestHelpOutput

Signed-off-by: Shlomi Noach <[email protected]>

* spaces, not tabs

Signed-off-by: Shlomi Noach <[email protected]>

Co-authored-by: Rohit Nayak <[email protected]>
Co-authored-by: Matt Lord <[email protected]>
rsajwani pushed a commit to planetscale/vitess that referenced this pull request Jun 27, 2022
…errors (vitessio#10429) (vitessio#783)

(commit message body identical to the #10429 commit listed above)