One thing I’ve noticed while using this in production is that retry functionality is not built in (I did some digging in the `@datadog/datadog-api-client` source just to make sure). Datadog’s API is pretty robust and doesn’t fail often, but there are occasional errors (really, you can never avoid this). It might be nice if some form of retry support were built in.
For comparison, the Datadog Agent retries any metrics that fail. If I understand the documented architecture correctly, it simply keeps retrying until the queue is full (the default queue size is 15 MB). It can optionally save old metrics that go beyond the max queue size to disk instead of dropping them, but does not do so by default.
I think there are three pretty straightforward approaches here:
1. Just retry every metric send request up to N times (where N is configurable) when it fails. This is really straightforward, and every `flush()` call either succeeds or fails in the same way as today.
2. Keep a queue of any metrics that failed, and add them to the series sent in the next flush (so if one send fails, those metrics go out 10 seconds later with the next send). This doesn’t cover the case of someone explicitly flushing because their program is exiting (e.g. a cron job is ending, or a service is gracefully shutting down after getting a SIGINT); those cases would need something like (1) above. (Maybe failures while flushing go onto a queue, but an option on `flush()` could tell it to retry immediately instead of queueing the failures for the next flush.) One downside here is that it depends on `flush()` being called on a regular interval. That normally happens, but if someone has turned autoflushing off, retries won’t happen until the next explicit `flush()` call, which could be subtly problematic.
3. Have a totally separate queue for retries that keeps running automatically. This solves the issue in (2) where retries might not happen if autoflushing is off, but has similar problems to (2) when it comes to a program exiting. It also complicates the mechanics of letting an explicit `flush()` call choose between retrying and failing.
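Approach (1) could be little more than a retry wrapper around the send call. A minimal sketch; `sendWithRetry`, its parameters, and the backoff strategy are all hypothetical, not part of the library’s API:

```typescript
// Hypothetical sketch of approach (1): retry a failed send up to
// `maxRetries` extra times, with exponential backoff between attempts.
async function sendWithRetry(
  send: () => Promise<void>,
  maxRetries: number,
  baseDelayMs = 1000
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await send();
    } catch (error) {
      // Give up once the configured retry budget is spent.
      if (attempt >= maxRetries) throw error;
      // Wait longer after each failure: base, 2x base, 4x base, ...
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt)
      );
    }
  }
}
```

From the caller’s perspective, `flush()` would behave exactly as today: it resolves once a send finally succeeds, or rejects after the last retry fails.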
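Approach (2) amounts to carrying failed series over into the next flush. A sketch under assumed names (`Series` and `RetryingReporter` are stand-ins for whatever the library uses internally, not real API):

```typescript
// Hypothetical sketch of approach (2): series from a failed send are
// kept and prepended to the batch sent by the next flush.
type Series = { metric: string; points: [number, number][] };

class RetryingReporter {
  private pending: Series[] = [];

  constructor(private send: (series: Series[]) => Promise<void>) {}

  async flush(series: Series[]): Promise<void> {
    // Merge any previously failed series into this batch.
    const batch = [...this.pending, ...series];
    this.pending = [];
    try {
      await this.send(batch);
    } catch (error) {
      // Requeue the whole batch so the next flush retries it.
      this.pending = batch;
      throw error;
    }
  }
}
```

A hypothetical option like `flush({ retryNow: true })` could then bypass the queue and retry inline, which would cover the shutdown case.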