Retry mechanism behaves incorrectly when HTTP 429 is returned by Datadog #1923

Open
pbaranow opened this issue Mar 25, 2024 · 5 comments

@pbaranow

Describe the bug
Summary:
A script that fetches all dashboards in a loop exits with an error when the rate limit runs out, even though the enable_retry option is turned on. According to the debug logs, the script exits on the first HTTP 429 response, with no retry attempted.

Details:
We have a script that runs every night to fetch all dashboards from Datadog. It fetches the list of dashboards and then fetches the details of each one in turn. As the number of our dashboards grew, we started running into HTTP 429 errors due to Datadog's rate limit.

We decided to use the retry option, which has been built into the library since 2.16.0, but it does not seem to handle the way Datadog responds when the rate limit is hit.

When I run the script in a loop with the debug option enabled, I see that Datadog returns HTTP 200 until the rate limit is reached, and then the next request gets HTTP 429 (API keys removed from the logs below):

# normal request before rate limit runs out
send: b'GET /api/v1/dashboard/<id-of-dashboard-59> Host: us5.datadoghq.com Accept-Encoding: gzip User-Agent: datadog-api-client-python/2.23.0
reply: 'HTTP/1.1 200 OK'
...
header: content-encoding: gzip
header: x-ratelimit-limit: 60
header: x-ratelimit-period: 60
header: x-ratelimit-remaining: 1
header: x-ratelimit-reset: 29
header: x-ratelimit-name: dashboards_get_custom_api

# normal request, last one within the limits
send: b'GET /api/v1/dashboard/<id-of-dashboard-60> Host: us5.datadoghq.com Accept-Encoding: gzip User-Agent: datadog-api-client-python/2.23.0
reply: 'HTTP/1.1 200 OK'
...
header: content-encoding: gzip
header: x-ratelimit-limit: 60
header: x-ratelimit-period: 60
header: x-ratelimit-remaining: 0
header: x-ratelimit-reset: 28
header: x-ratelimit-name: dashboards_get_custom_api

# next request, this one gets HTTP 429
send: b'GET /api/v1/dashboard/<id-of-dashboard-61> Host: us5.datadoghq.com Accept-Encoding: gzip User-Agent: datadog-api-client-python/2.23.0
reply: 'HTTP/1.1 429 Too Many Requests'
...
header: x-ratelimit-limit: 60
header: x-ratelimit-period: 60
header: x-ratelimit-remaining: 0
header: x-ratelimit-reset: 28
header: x-ratelimit-name: dashboards_get_custom_api

# and at this point script fails with
Error: (429)
Reason: Too Many Requests
HTTP response headers: {'x-ratelimit-limit': '60', 'x-ratelimit-period': '60', 'x-ratelimit-remaining': '0', 'x-ratelimit-reset': '28', 'x-ratelimit-name': 'dashboards_get_custom_api', 'content-type': 'application/json', 'Content-Length': '183', 'x-content-type-options': 'nosniff', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload', 'date': 'Mon, 25 Mar 2024 09:05:32 GMT', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'}
HTTP response body: {'status': 'error', 'code': 429, 'errors': ['Too many requests'], 'statuspage': 'http://status.us5.datadoghq.com', 'twitter': 'http://twitter.com/datadogops', 'email': '[email protected]'}

To Reproduce
See description above

Expected behavior
I expect the library to sleep for the x-ratelimit-reset time, as described in the tests that introduced this functionality. Right now I need to add a sleep between API requests as a workaround.
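For illustration, the expected behavior could be approximated on the caller's side with a small wrapper like the one below. This is only a sketch, not the library's retry implementation: `call_with_rate_limit_retry` and its parameters are hypothetical names, and it assumes the raised exception exposes `status` and `headers` attributes matching the error output shown above.

```python
import time


def call_with_rate_limit_retry(func, max_retries=5, default_wait=1.0,
                               margin=1.0, sleep=time.sleep):
    """Call func(); on HTTP 429, wait for x-ratelimit-reset seconds and retry.

    Hypothetical helper. Assumes the exception carries `status` (int) and
    `headers` (mapping with lowercase header names), as in the logs above.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not a rate-limit error, and give up
            # once the retry budget is exhausted.
            if getattr(exc, "status", None) != 429 or attempt == max_retries:
                raise
            headers = getattr(exc, "headers", None) or {}
            try:
                wait = float(headers.get("x-ratelimit-reset", default_wait))
            except (TypeError, ValueError):
                wait = default_wait
            # Small safety margin so the retry lands after the window resets.
            sleep(max(wait, 0.0) + margin)
```

In a script like ours, each API call would be wrapped in a lambda and passed as `func`. The `sleep` parameter exists only so the waiting can be replaced in tests.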

Screenshots
N/A - logs attached

Environment and Versions (please complete the following information):
client library version 2.23.0


@therve
Contributor

therve commented Mar 25, 2024

Hi,

Can you share the script that you're using? It does work as expected for a simple script that I tried. Thanks.

@pbaranow
Author

Hi @therve

I managed to reduce the script to the short version below. We have 70+ dashboards now, and if I start the script right at the start of a minute (e.g. 12:05:01), the rate limit runs out after ~40 seconds. If I wait a bit and start the script at e.g. 12:05:40, it finishes successfully.

I see in the debug logs that in the latter case the rate limit resets to 60 at the full minute (12:06:00 in the example above). In other words, for this script to fail, it needs to send 60+ requests to the Datadog API in less than 60 seconds, and it needs to start right at the start of a minute.
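The timing dependence can be illustrated with a small simulation. It is a sketch under stated assumptions: the limit is modeled as a fixed window of 60 requests that resets at every full minute (which is what the debug logs suggest, not something confirmed by Datadog), and the 0.65 seconds per request is a made-up value chosen to roughly match the "~40 seconds" observation. `first_rejected_request` is a hypothetical helper, not part of the library.

```python
def first_rejected_request(start_second, total_requests, seconds_per_request,
                           limit=60, period=60):
    """Return the 1-based index of the first request that would get HTTP 429,
    or None if every request fits within the limit.

    Assumes a fixed window: at most `limit` requests per `period` seconds,
    with windows aligned to multiples of `period` (i.e. full minutes).
    """
    used = {}  # window index -> number of requests sent in that window
    for i in range(total_requests):
        t = start_second + i * seconds_per_request
        window = int(t // period)
        used[window] = used.get(window, 0) + 1
        if used[window] > limit:
            return i + 1
    return None


# Starting at second 1 of the minute: all of the first 61 requests fall
# into the same window, so the 61st is rejected.
print(first_rejected_request(1, 70, 0.65))   # 61

# Starting at second 40: the requests are split across two windows,
# so none of them exceeds the limit.
print(first_rejected_request(40, 70, 0.65))  # None
```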

import os
import sys

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.dashboards_api import DashboardsApi
from datadog_api_client.v1.api.dashboard_lists_api import DashboardListsApi
from datadog_api_client.v2.api.dashboard_lists_api import DashboardListsApi as DashboardListsApiV2


configuration = Configuration()
configuration.debug = True
configuration.host = os.environ.get("DD_HOST", "us5.datadoghq.com")
configuration.enable_retry = True
configuration.max_retries = 5
with ApiClient(configuration) as api_client:
    for dashboard_list in DashboardListsApi(api_client).list_dashboard_lists()["dashboard_lists"]:
        for dashboard in DashboardListsApiV2(api_client).get_dashboard_list_items(dashboard_list["id"])["dashboards"]:
            try:
                dashboard_remote = DashboardsApi(api_client).get_dashboard(dashboard["id"]).to_dict()
            except Exception as e:
                if type(e).__name__ == 'NotFoundException':
                    continue
                else:
                    print("Error:", e)
                    sys.exit(1)

@therve
Contributor

therve commented Mar 25, 2024

That's very weird. What's your python version? And your urllib3 version?

@pbaranow
Author

pbaranow commented Mar 25, 2024

Script executed on macOS with:

$ python3 -V
Python 3.9.6
$
$ pip show urllib3
Name: urllib3
Version: 2.2.1

Thanks for your contribution!

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

If you would like this issue to remain open:

  1. Verify that you can still reproduce the issue in the latest version of this project.

  2. Comment that the issue is still reproducible and include updated details requested in the issue template.

@github-actions github-actions bot added the stale label Apr 25, 2024