Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wavefront output should distinguish between retryable and non-retryable errors #8404

Merged
merged 1 commit into from
Nov 13, 2020

Conversation

fishy
Copy link
Contributor

@fishy fishy commented Nov 13, 2020

Currently we assume all wavefront send errors are retryable, and when an
error happens during Write function we will reject the buffer and keep
retrying the next tick. This means that when an actually non-retryable
error happens, we'll just keep getting the same error on every tick, and
never flush the buffer.

One such error we encountered is "empty metric name" error.

Add isRetryable function to detect non-retryable errors, and make it
default to assume that all errors are retryable (so it matches the
current behavior), but make it possible to mark certain errors as
non-retryable.

Currently only handled that "empty metric name" error as non-retryable.
A support ticket has been filed against wavefront to provide a canonical
way to distinguish between retryable and non-retryable errors.

Signed-off-by: Yuxuan 'fishy' Wang [email protected]

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

@ssoroka
Copy link
Contributor

ssoroka commented Nov 13, 2020

Hey @fishy . it's up to each output to decide what's retry-able and not. If an output gets an error trying to post a metric and it doesn't want it to be retried, it should log the error and continue, not returning an error from the Write() function.

@ssoroka ssoroka changed the title Distinguish between retryable and non-retryable output errors Wavefront output should distinguish between retryable and non-retryable errors Nov 13, 2020
@ssoroka ssoroka added area/wavefront bug unexpected problem or unintended behavior labels Nov 13, 2020
@fishy
Copy link
Contributor Author

fishy commented Nov 13, 2020

@ssoroka Thanks for the feedback, Steven! That sounds reasonable, but the problems are that if we go that route we need to make sure that the output plugin has the actual working logger configured, or the attempt to log and swallow the error will go into a blackhole, or even cause panic. Looking at the plugin code, it does have the logger in the struct, but I'm not entirely sure whether or not it's configured (for example, it's not initialized during the init function:

return &Wavefront{
Token: "DUMMY_TOKEN",
MetricSeparator: ".",
ConvertPaths: true,
ConvertBool: true,
TruncateTags: false,
ImmediateFlush: true,
}
). Can you point me to the code actually set the Log field in wavefront plugin?

@ssoroka
Copy link
Contributor

ssoroka commented Nov 13, 2020

if you're looking at Log Telegraf.Logger, it gets set automatically by Telegraf. you don't see it because it's happening with reflection. It's an internal function called SetLoggerOnPlugin. You can use the logger safely anywhere within the struct's functions

@fishy fishy force-pushed the wavefront-non-retryable-errors branch from 07913a6 to ace138e Compare November 13, 2020 21:31
@fishy
Copy link
Contributor Author

fishy commented Nov 13, 2020

Thanks, @ssoroka . I updated the PR to only check retryable inside wavefront plugin. Please take another look.

(commit message and PR description also updated accordingly)

return fmt.Errorf("Wavefront sending error: %v", err)
}
w.Log.Errorf("non-retryable error during Wavefront.Write: %v", err)
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might not want to return here, and instead finish writing the batches, as some of those metrics are probably still good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done.

…le errors

Currently we assume all wavefront send errors are retryable, and when an
error happens during Write function we will reject the buffer and keep
retrying the next tick. This means that when an actually non-retryable
error happens, we'll just keep getting the same error on every tick, and
never flush the buffer.

One such error we encountered is "empty metric name" error.

Add isRetryable function to detect non-retryable errors, and make it
default to assume that all errors are retryable (so it matches the
current behavior), but make it possible to mark certain errors as
non-retryable.

Currently only handled that "empty metric name" error as non-retryable.
A support ticket has been filed against wavefront to provide a canonical
way to distinguish between retryable and non-retryable errors.

Signed-off-by: Yuxuan 'fishy' Wang <[email protected]>
@fishy fishy force-pushed the wavefront-non-retryable-errors branch from ace138e to 12a7d53 Compare November 13, 2020 21:58
@ssoroka ssoroka merged commit 18460e1 into influxdata:master Nov 13, 2020
@ssoroka
Copy link
Contributor

ssoroka commented Nov 13, 2020

Merged! thank you

@fishy fishy deleted the wavefront-non-retryable-errors branch November 13, 2020 22:11
@fishy
Copy link
Contributor Author

fishy commented Nov 13, 2020

@ssoroka Thanks! I assume this will be included in 1.16.3 release? Do you know when will that release happen?

@ssoroka
Copy link
Contributor

ssoroka commented Nov 13, 2020

Yep! I think we might have another one in a week or two. Until then it'll be in the nightly release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/wavefront bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants