
Retry backoff never resets on reaching MaxElapsedTime for subsequent fn invocations #3506

Closed
tim-mwangi opened this issue Dec 2, 2022 · 4 comments
Labels: bug Something isn't working

@tim-mwangi

Description

In our agent (https://github.com/hypertrace/goagent), which uses opentelemetry-go, we observed that when the exporter errors out, it invokes the retry logic in https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/internal/retry/retry.go. Once MaxElapsedTime is reached for one of the batches being exported, every subsequent export attempt hits the same "max retry time elapsed" error, even after a short recovery during which exports succeed and the exporter then starts failing again. There is no real retry for these subsequent failures.

Tracing through the code, it appears that once the backoff is initialized and reset (the b := &backoff.ExponentialBackOff{ ... } block in RequestFunc), it is never reset again: RequestFunc returns a function that keeps reusing that local backoff b without ever resetting it. I also verified this with a unit test added to retry_test.go; a simplified sketch of the problematic shape follows.
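The sketch below is a simplified, self-contained illustration of that shape, not the actual retry.go source: error evaluation and throttling are omitted, and the Config, EvaluateFunc, and RequestFunc types here merely mirror the names referenced above.

package retrysketch

import (
  "context"
  "fmt"
  "time"

  "github.com/cenkalti/backoff/v4"
)

// Illustrative mirrors of the retry package types referenced in this issue.
type Config struct {
  Enabled         bool
  InitialInterval time.Duration
  MaxInterval     time.Duration
  MaxElapsedTime  time.Duration
}

type EvaluateFunc func(error) (bool, time.Duration)

type RequestFunc func(context.Context, func(context.Context) error) error

func (c Config) RequestFunc(evaluate EvaluateFunc) RequestFunc {
  // The backoff is created and reset exactly once, when RequestFunc is
  // called (i.e. when the exporter is constructed)...
  b := &backoff.ExponentialBackOff{
    InitialInterval:     c.InitialInterval,
    RandomizationFactor: backoff.DefaultRandomizationFactor,
    Multiplier:          backoff.DefaultMultiplier,
    MaxInterval:         c.MaxInterval,
    MaxElapsedTime:      c.MaxElapsedTime,
    Stop:                backoff.Stop,
    Clock:               backoff.SystemClock,
  }
  b.Reset()

  // ...but the returned closure captures that single b and never resets it.
  // Once b.NextBackOff() starts returning backoff.Stop (MaxElapsedTime hit),
  // every later export call fails immediately with "max retry time elapsed".
  return func(ctx context.Context, fn func(context.Context) error) error {
    for {
      err := fn(ctx)
      if err == nil {
        return nil
      }
      delay := b.NextBackOff()
      if delay == backoff.Stop {
        return fmt.Errorf("max retry time elapsed: %w", err)
      }
      select {
      case <-ctx.Done():
        return ctx.Err()
      case <-time.After(delay):
      }
    }
  }
}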

Unit test:

func TestVerifyThatBackoffResetIsNotInvoked(t *testing.T) {
  whCallCount := 0
  // Evaluate func: treat every error as retryable with a 20ms throttle delay.
  ev := func(error) (bool, time.Duration) {
    return true, 20 * time.Millisecond
  }
  reqFunc := Config{
    Enabled:         true,
    InitialInterval: 30 * time.Millisecond,
    MaxInterval:     300 * time.Millisecond,
    MaxElapsedTime:  1 * time.Second,
  }.RequestFunc(ev)

  wh := func(context.Context) error {
    fmt.Printf("current time: %v\n", time.Now())
    // Succeed on every 7th call (0, 7, 14, ...); fail otherwise.
    if whCallCount%7 == 0 {
      whCallCount++
      return nil
    }
    whCallCount++
    return assert.AnError
  }

  ctx := context.Background()
  for i := 0; i < 10; i++ {
    fmt.Printf("call no. %d\n", i)
    err := reqFunc(ctx, wh)
    if err != nil {
      fmt.Printf("error from call: %v\n", err)
    } else {
      fmt.Println("success: no error from call")
    }
  }

  time.Sleep(3 * time.Second)
}

Results of the unit test:

call no. 0
current time: 2022-12-01 15:58:39.547316 -0800 PST m=+0.001479155
success: no error from call
call no. 1
current time: 2022-12-01 15:58:39.54752 -0800 PST m=+0.001682540
current time: 2022-12-01 15:58:39.581243 -0800 PST m=+0.035405325
current time: 2022-12-01 15:58:39.646719 -0800 PST m=+0.100881369
current time: 2022-12-01 15:58:39.725428 -0800 PST m=+0.179589531
current time: 2022-12-01 15:58:39.822869 -0800 PST m=+0.277029490
current time: 2022-12-01 15:58:39.963639 -0800 PST m=+0.417798602
current time: 2022-12-01 15:58:40.234658 -0800 PST m=+0.688816086
success: no error from call
call no. 2
current time: 2022-12-01 15:58:40.234692 -0800 PST m=+0.688850010
current time: 2022-12-01 15:58:40.405348 -0800 PST m=+0.859504867
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 3
current time: 2022-12-01 15:58:40.405445 -0800 PST m=+0.859601564
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 4
current time: 2022-12-01 15:58:40.405478 -0800 PST m=+0.859634655
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 5
current time: 2022-12-01 15:58:40.405499 -0800 PST m=+0.859656005
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 6
current time: 2022-12-01 15:58:40.405524 -0800 PST m=+0.859681161
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 7
current time: 2022-12-01 15:58:40.405538 -0800 PST m=+0.859694707
success: no error from call
call no. 8
current time: 2022-12-01 15:58:40.405546 -0800 PST m=+0.859702641
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 9
current time: 2022-12-01 15:58:40.405554 -0800 PST m=+0.859710996
error from call: max retry time elapsed: assert.AnError general error for testing

Call 1 above retries several times and gets through on the 7th attempt. Call 2 picks up the backoff where call 1 left it instead of starting with a reset backoff. Every call after that uses a backoff whose MaxElapsedTime has already been reached, and even the successful call 7 does not reset it.

Environment

  • OS: Mac OS X Monterey 12.6.1
  • Architecture: x86
  • Go Version: 1.17
  • opentelemetry-go version: 1.10.0

Steps To Reproduce

See the unit test above.
You can also reproduce this against a running collector: take the collector endpoint down until the "max retry time elapsed" message is observed, bring the collector back up so exports recover, then take it down again. Instead of retrying, the exporter now returns the "max retry time elapsed" error immediately. A hedged sketch of such a setup follows the log output below.

http-server % go run main.go
2022/11/29 12:35:21 context deadline exceeded
2022/11/29 12:35:31 context deadline exceeded

// MaxElapsedTime reached
2022/11/29 12:37:15 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
2022/11/29 12:37:16 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"

// Short recovery

// Break again 
2022/11/29 12:40:00 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
2022/11/29 12:40:01 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
2022/11/29 12:40:01 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
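
For reference, a minimal program along these lines drives the same retry path through the OTLP gRPC trace exporter. This is a hedged sketch; the endpoint, intervals, and tracer name are illustrative rather than taken from the original setup.

package main

import (
  "context"
  "log"
  "time"

  "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
  sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
  ctx := context.Background()

  // OTLP gRPC exporter pointing at a local collector; retry settings sized so
  // MaxElapsedTime can be hit while the collector is down.
  exp, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("127.0.0.1:4317"),
    otlptracegrpc.WithInsecure(),
    otlptracegrpc.WithRetry(otlptracegrpc.RetryConfig{
      Enabled:         true,
      InitialInterval: 5 * time.Second,
      MaxInterval:     30 * time.Second,
      MaxElapsedTime:  time.Minute,
    }),
  )
  if err != nil {
    log.Fatal(err)
  }

  tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
  defer func() { _ = tp.Shutdown(context.Background()) }()

  tracer := tp.Tracer("retry-repro")

  // Emit a span every second; export errors surface via the SDK's global
  // error handler (stderr by default). Toggle the collector down/up/down and
  // watch "max retry time elapsed" appear immediately after the second outage
  // instead of after renewed retries.
  for {
    _, span := tracer.Start(ctx, "ping")
    span.End()
    time.Sleep(time.Second)
  }
}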

Expected behavior

Each export call should retry its own batch, and after a recovery followed by a new failure the backoff should reset and retrying should resume.
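
One possible shape of a fix, relative to the simplified sketch in the description, is to construct and reset the backoff inside the returned closure so every export call starts with a fresh retry budget; resetting the existing backoff after a successful export would also address this. This is a hedged sketch only (it reuses the illustrative types and imports from that earlier sketch); the actual change landed in #3756 and may differ.

func (c Config) RequestFunc(evaluate EvaluateFunc) RequestFunc {
  return func(ctx context.Context, fn func(context.Context) error) error {
    // A fresh backoff per invocation: exhausting MaxElapsedTime in one call
    // no longer poisons subsequent calls.
    b := &backoff.ExponentialBackOff{
      InitialInterval:     c.InitialInterval,
      RandomizationFactor: backoff.DefaultRandomizationFactor,
      Multiplier:          backoff.DefaultMultiplier,
      MaxInterval:         c.MaxInterval,
      MaxElapsedTime:      c.MaxElapsedTime,
      Stop:                backoff.Stop,
      Clock:               backoff.SystemClock,
    }
    b.Reset()

    for {
      err := fn(ctx)
      if err == nil {
        return nil
      }
      delay := b.NextBackOff()
      if delay == backoff.Stop {
        return fmt.Errorf("max retry time elapsed: %w", err)
      }
      select {
      case <-ctx.Done():
        return ctx.Err()
      case <-time.After(delay):
      }
    }
  }
}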

@tim-mwangi tim-mwangi added the bug Something isn't working label Dec 2, 2022
@matej-g
Contributor

matej-g commented Jan 25, 2023

I hit this today with the metrics exporter. Is anyone working on a fix?

@dmathieu
Member

I don't think anyone is explicitly working on a fix.
I would think we should reset the backoff if the export succeeds.

@matej-g
Contributor

matej-g commented Jan 25, 2023

Cool, happy to take this on 👍

@dmathieu
Member

This has been fixed in #3756.

@XSAM XSAM added this to the untracked milestone Nov 7, 2024