Support retries #108

Open
Mr0grog opened this issue Jan 4, 2023 · 0 comments


Mr0grog commented Jan 4, 2023

One thing I’ve noticed while using this in production is that retry functionality is not built in (I did some digging in the @datadog/datadog-api-client source just to make sure). Datadog’s API is pretty robust and doesn’t fail often, but there are occasional errors (really, you can never avoid them entirely). It might be nice if some form of retry support were built in.

For comparison, the Datadog Agent retries any metrics that fail. If I understand the architecture described there correctly, it simply keeps retrying forever until the queue is full (the default queue size is 15 MB). It can optionally save old metrics that go beyond the max queue size to disk instead of dropping them, but does not do so by default.
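To make that size-capped behavior concrete, here is a minimal sketch of a retry buffer that stops growing at a byte limit. It is not the Agent's actual code: the `BoundedRetryQueue` name, the JSON-based size estimate, and the choice to drop the oldest entries first are all assumptions made for illustration.

```javascript
// Hypothetical sketch only; not taken from the Datadog Agent or this library.
class BoundedRetryQueue {
  constructor(maxBytes) {
    this.maxBytes = maxBytes; // e.g. 15 * 1024 * 1024 for a 15 MB cap
    this.items = [];          // { payload, bytes }, oldest first
    this.totalBytes = 0;
    this.dropped = 0;         // how many payloads were discarded
  }

  // Add a failed payload, evicting the oldest entries if the cap is exceeded.
  // Returns false if the payload alone is larger than the cap.
  push(payload) {
    // Assumption: serialized JSON size is a good enough size estimate.
    const bytes = Buffer.byteLength(JSON.stringify(payload));
    if (bytes > this.maxBytes) {
      this.dropped += 1;
      return false;
    }
    this.items.push({ payload, bytes });
    this.totalBytes += bytes;
    while (this.totalBytes > this.maxBytes) {
      const oldest = this.items.shift();
      this.totalBytes -= oldest.bytes;
      this.dropped += 1;
    }
    return true;
  }

  // Remove and return everything queued so it can be sent again.
  drain() {
    const payloads = this.items.map((item) => item.payload);
    this.items = [];
    this.totalBytes = 0;
    return payloads;
  }
}
```

A caller would `push()` each payload that failed to send and `drain()` the queue whenever it wants to try again. Anything evicted is simply lost here, which matches the "drop by default" behavior described above.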

I think there are three pretty straightforward approaches here:

  1. Just retry all metrics send requests up to N times (where N is configurable) when they fail. This is really straightforward, and every flush() call either succeeds or fails in the same way as today.

  2. Keep a queue of any metrics that failed, and add them to the series being sent in the next flush (so if one send fails, those metrics are included 10 seconds later in the next send). This doesn’t cover the case of someone explicitly flushing because their program is exiting (e.g. a cron job is ending, a service is gracefully shutting down after getting a SIGINT, etc.); those cases would need something like (1) above. (Maybe failures while flushing go onto a queue by default, but an option passed to flush() could tell it to retry immediately instead of queueing the failures for the next flush.) One downside is that this depends on flush() being called on a regular interval. That normally happens, but if someone has turned automatic flushing off, retries won’t happen until the next explicit flush() call, which might be subtly problematic.

  3. Have a totally separate queue for retries that just keeps running automatically. This solves the issue in (2) where retries might not happen if autoflushing is off, but it still has problems similar to (2) when a program is exiting. It also complicates the mechanics of giving an explicit flush() call an option to either retry or fail.
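
To make (1) and (2) a bit more concrete, here is a rough sketch of how they might combine. Everything in it is hypothetical: `sendWithRetries`, `RetryingFlusher`, the `retries`, `delayMs`, and `retryImmediately` options, and the `send` callback (assumed to return a promise that rejects on failure) are illustrative names, not part of this library or of @datadog/datadog-api-client.

```javascript
// Approach (1): retry a single send up to `retries` extra times.
// `send` is a stand-in for whatever actually posts the series to Datadog.
async function sendWithRetries(send, series, { retries = 2, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await send(series);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Fixed delay between attempts; a real version might back off.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Approach (2): carry failed metrics over into the next flush, with an
// opt-in flag for callers that need an immediate retry (e.g. on shutdown).
class RetryingFlusher {
  constructor(send, retryOptions = {}) {
    this.send = send;
    this.retryOptions = retryOptions;
    this.failed = []; // metrics from earlier flushes that did not send
  }

  async flush(series, { retryImmediately = false } = {}) {
    const batch = this.failed.concat(series);
    this.failed = [];
    if (batch.length === 0) return;
    try {
      if (retryImmediately) {
        await sendWithRetries(this.send, batch, this.retryOptions);
      } else {
        await this.send(batch);
      }
    } catch (error) {
      // Keep the batch for the next flush, ahead of anything queued meanwhile,
      // and still surface the failure to the caller.
      this.failed = batch.concat(this.failed);
      throw error;
    }
  }
}
```

A periodic flush would call `flush(series)` and let failures ride along with the next interval, while a shutdown handler would call `flush(series, { retryImmediately: true })`. The sketch does not cap how large `failed` can grow, which a real implementation would probably need to do.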
