Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(otlp-exporter-base): add retries #3207

Merged

Conversation

svetlanabrennan
Copy link
Contributor

@svetlanabrennan svetlanabrennan commented Aug 26, 2022

Which problem is this PR solving?

Adding retries to otlp http & proto exporters.

Note: this pr does not add retries to the grpc otlp exporter.

Fixes #1233

Short description of the changes

Added retry logic to sendWithHttp and sendWithXhr functions. It does the following:

If the exporter gets a failed response that is retryable, we check if the response included a "retry-after" header.

  • If yes, retry again in the time specified in the retry-header which can be seconds or a date/time.
  • If no, we use an exponential backoff with jitter

This logic is retried until there is a successful response or until the exporter timer times out (set via timeoutMillis in the exporter config or via env vars). If the exporter timers times outs, the request is aborted with a "Request Timeout" error.

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Unit Tests - for browser only

Checklist:

  • Followed the style guidelines of this project
  • Unit tests have been added
  • Documentation has been updated

@svetlanabrennan
Copy link
Contributor Author

svetlanabrennan commented Aug 26, 2022

For some reason, using req.destroy() in a real http request export here is not triggering any of the following events:

  • 'aborted' on the res object
  • 'error' on the res object with an error with message 'Error: aborted' and code 'ECONNRESET'.
  • 'error' with an error with message 'Error: socket hang up' and code 'ECONNRESET'

These events should fire depending on when the request is aborted (before socket assigned, before connection succeeds or after response is received).

If I use req.abort() then it works (fires the correct events) but req.abort() was deprecated in node 14 so I was trying to avoid using it in node 14 and higher.

Any idea what I'm missing here or what's going on?

If you run this test and use req.destroy() here instead of `req.abort() and use node 16, you'll see what I'm referring to.

@codecov
Copy link

codecov bot commented Aug 26, 2022

Codecov Report

Merging #3207 (30ba4ec) into main (708afd0) will decrease coverage by 0.18%.
The diff coverage is 57.14%.

❗ Current head 30ba4ec differs from pull request most recent head 02572c4. Consider uploading reports for the commit 02572c4 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3207      +/-   ##
==========================================
- Coverage   93.79%   93.62%   -0.18%     
==========================================
  Files         264      265       +1     
  Lines        7349     7417      +68     
  Branches     1491     1511      +20     
==========================================
+ Hits         6893     6944      +51     
- Misses        456      473      +17     
Impacted Files Coverage Δ
...kages/otlp-exporter-base/src/platform/node/util.ts 64.15% <48.33%> (-12.48%) ⬇️
...perimental/packages/otlp-exporter-base/src/util.ts 93.10% <88.23%> (-2.02%) ⬇️
...-trace-base/src/platform/node/RandomIdGenerator.ts 87.50% <0.00%> (-6.25%) ⬇️
.../src/platform/browser/export/BatchSpanProcessor.ts 50.00% <0.00%> (ø)
packages/opentelemetry-sdk-trace-base/src/Span.ts 98.58% <0.00%> (+0.70%) ⬆️
api/karma.worker.js 100.00% <0.00%> (+100.00%) ⬆️
packages/opentelemetry-resources/karma.worker.js 100.00% <0.00%> (+100.00%) ⬆️
...ckages/opentelemetry-sdk-trace-web/karma.worker.js 100.00% <0.00%> (+100.00%) ⬆️
...kages/opentelemetry-sdk-trace-base/karma.worker.js 100.00% <0.00%> (+100.00%) ⬆️

@dyladan
Copy link
Member

dyladan commented Aug 30, 2022

For some reason, using req.destroy() in a real http request export here is not triggering any of the following events:

  • 'aborted' on the res object
  • 'error' on the res object with an error with message 'Error: aborted' and code 'ECONNRESET'.
  • 'error' with an error with message 'Error: socket hang up' and code 'ECONNRESET'

These events should fire depending on when the request is aborted (before socket assigned, before connection succeeds or after response is received).

If I use req.abort() then it works (fires the correct events) but req.abort() was deprecated in node 14 so I was trying to avoid using it in node 14 and higher.

Any idea what I'm missing here or what's going on?

If you run this test and use req.destroy() here instead of `req.abort() and use node 16, you'll see what I'm referring to.

Those events only fire if you call req.destroy after the response is received https://nodejs.org/api/http.html#httprequesturl-options-callback.

@svetlanabrennan
Copy link
Contributor Author

svetlanabrennan commented Oct 6, 2022

I would love some feedback on the current implementation of the retry logic.

I still need to confirm and add the following items:

  • "jitter" to the retry logic
  • confirm the retryable codes (I think there is an issue open about this in the spec)
  • fix a test that's timing out

Copy link
Member

@legendecas legendecas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, thank you for working on this!

@svetlanabrennan svetlanabrennan marked this pull request as ready for review October 24, 2022 19:17
@svetlanabrennan svetlanabrennan requested a review from a team October 24, 2022 19:17
Signed-off-by: Svetlana Brennan <[email protected]>
@svetlanabrennan
Copy link
Contributor Author

I've added the following:

  • jitter to exponential backoff
  • added logic to check for "retry-header" to use in retry logic
  • fix some tests and removed a flaky real http request test (node)

@svetlanabrennan
Copy link
Contributor Author

I'm unable to add real node tests to test this retry logic because the tests are very flaky when I use a real node http request. I tried to use "nock" but I found out that it overrides the client http request so I can't test the retry logic.

@weyert
Copy link
Contributor

weyert commented Oct 31, 2022

I'm unable to add real node tests to test this retry logic because the tests are very flaky when I use a real node http request. I tried to use "nock" but I found out that it overrides the client http request so I can't test the retry logic.

I am wondering if you could use msw for this?

Copy link
Member

@legendecas legendecas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay.

I'm wondering if it would be possible to unify the retry handling around the sendWithXhr and sendWithHttp -- it only depends on the 'Retry-After' header and the response status code.

It may be simpler if the platform-dependent part can be minimized.

@svetlanabrennan
Copy link
Contributor Author

I'm wondering if it would be possible to unify the retry handling around the sendWithXhr and sendWithHttp -- it only depends on the 'Retry-After' header and the response status code.

Sure I can give that a shot after thanksgiving break.

@svetlanabrennan
Copy link
Contributor Author

@legendecas Regarding your comment:

I'm wondering if it would be possible to unify the retry handling around the sendWithXhr and sendWithHttp -- it only depends on the 'Retry-After' header and the response status code.

It may be simpler if the platform-dependent part can be minimized.

Sorry for the delay on this.

I don't think I can unify the retry handling between sendWithXhr and sendWithHttp. There's quite a few difference between the two (ex. how the retry-after header is retrieve, how the OTLPExporterError is created, how the request status codes are retrieved....etc). Did you happen to see a way this can be done and point me in the right direction?

@legendecas
Copy link
Member

legendecas commented Jan 17, 2023

ex. how the retry-after header is retrieve, how the OTLPExporterError is created, how the request status codes are retrieved....etc.

I think we should create a dedicated http client abstraction on both Node.js and browser, either by using widely adopted npm package (like superagent or axios) or creating our own one (since we may need to tweak it to be able to use sendBeacon too). Anyway, I believe this can be done in a follow up PR.

With the only comment left (#3207 (comment)) resolved, I think this should be ready to land.

@svetlanabrennan
Copy link
Contributor Author

I've updated the parseRetryAfterToMillis function based on @legendecas feedback. This PR is ready for review.

@legendecas
Copy link
Member

I'm going to merge this as @opentelemetry/instrumentation-http:test failure on node-windows-tests is being tracked at #3628.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Collector exporter retries and backoff
4 participants