You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In our agent - https://github.com/hypertrace/goagent - which uses opentelemetry-go we observed that if the exporter errors out, it ends up invoking the retry logic here https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/internal/retry/retry.go and when MaxElapsedTime is reached for one of the batches being exported, all the attempts to export end up hitting this even after a short recovery where the exports are successful and then the exporter starts to fail again. There's no real retry for these subsequent failures. I traced down the code and it seems as if once the Backoff is initialized and reset here
, it is never reset again since the return value of RequestFunc is a function which never resets the local Backoff b. I also verified using a unit test I added to retry_test.go.
call no. 0
current time: 2022-12-01 15:58:39.547316 -0800 PST m=+0.001479155
success: no error from call
call no. 1
current time: 2022-12-01 15:58:39.54752 -0800 PST m=+0.001682540
current time: 2022-12-01 15:58:39.581243 -0800 PST m=+0.035405325
current time: 2022-12-01 15:58:39.646719 -0800 PST m=+0.100881369
current time: 2022-12-01 15:58:39.725428 -0800 PST m=+0.179589531
current time: 2022-12-01 15:58:39.822869 -0800 PST m=+0.277029490
current time: 2022-12-01 15:58:39.963639 -0800 PST m=+0.417798602
current time: 2022-12-01 15:58:40.234658 -0800 PST m=+0.688816086
success: no error from call
call no. 2
current time: 2022-12-01 15:58:40.234692 -0800 PST m=+0.688850010
current time: 2022-12-01 15:58:40.405348 -0800 PST m=+0.859504867
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 3
current time: 2022-12-01 15:58:40.405445 -0800 PST m=+0.859601564
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 4
current time: 2022-12-01 15:58:40.405478 -0800 PST m=+0.859634655
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 5
current time: 2022-12-01 15:58:40.405499 -0800 PST m=+0.859656005
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 6
current time: 2022-12-01 15:58:40.405524 -0800 PST m=+0.859681161
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 7
current time: 2022-12-01 15:58:40.405538 -0800 PST m=+0.859694707
success: no error from call
call no. 8
current time: 2022-12-01 15:58:40.405546 -0800 PST m=+0.859702641
error from call: max retry time elapsed: assert.AnError general error for testing
call no. 9
current time: 2022-12-01 15:58:40.405554 -0800 PST m=+0.859710996
error from call: max retry time elapsed: assert.AnError general error for testing
Call 1 above makes some retries and is able to get through on the 7th try. Call 2 picks up the backoff where it left off in Call 1 instead of having a reset backoff. All the other calls after this use a backoff with the MaxElapsedTime reached and even the successful call 7 does not reset it.
Environment
OS: Mac OS X Monterey 12.6.1
Architecture: x86
Go Version: 1.17
opentelemetry-go version: 1.10.0
Steps To Reproduce
See unit test above.
You can also break the exporting behavior by disabling the collector endpoint until the max retry time elapsed messaged is observed, bring the collector back up for recovery and the bring it down again and observe that instead of retries, max retry time elapsed error is received immediately.
http-server % go run main.go
2022/11/29 12:35:21 context deadline exceeded
2022/11/29 12:35:31 context deadline exceeded
// MaxElapsedTime reached
2022/11/29 12:37:15 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
2022/11/29 12:37:16 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
// Short recovery
// Break again
2022/11/29 12:40:00 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
2022/11/29 12:40:01 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
2022/11/29 12:40:01 max retry time elapsed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused"
Expected behavior
For separate export calls, there should be retries on each batch and on recovery followed failure the backoff should reset and resume retrying.
The text was updated successfully, but these errors were encountered:
Description
In our agent - https://github.com/hypertrace/goagent - which uses opentelemetry-go we observed that if the exporter errors out, it ends up invoking the retry logic here https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/internal/retry/retry.go and when
MaxElapsedTime
is reached for one of the batches being exported, all the attempts to export end up hitting this even after a short recovery where the exports are successful and then the exporter starts to fail again. There's no real retry for these subsequent failures. I traced down the code and it seems as if once the Backoff is initialized and reset hereopentelemetry-go/exporters/otlp/internal/retry/retry.go
Line 82 in 69ad652
RequestFunc
is a function which never resets the local Backoffb
. I also verified using a unit test I added to retry_test.go.Unit test:
Results of the unit test:
Call 1 above makes some retries and is able to get through on the 7th try. Call 2 picks up the backoff where it left off in Call 1 instead of having a reset backoff. All the other calls after this use a backoff with the MaxElapsedTime reached and even the successful call 7 does not reset it.
Environment
Steps To Reproduce
See unit test above.
You can also break the exporting behavior by disabling the collector endpoint until the
max retry time elapsed
messaged is observed, bring the collector back up for recovery and the bring it down again and observe that instead of retries,max retry time elapsed
error is received immediately.Expected behavior
For separate export calls, there should be retries on each batch and on recovery followed failure the backoff should reset and resume retrying.
The text was updated successfully, but these errors were encountered: