[Metricbeat][Azure] Azure monitor/storage: Fix 429 Too Many Requests error #38294
Conversation
Honoring the HTTP `Retry-After` header is an excellent approach to this problem.
The effect is that the metricset sleeps for 5 minutes, and this may have several consequences:
- This may interfere if Metricbeat needs to reload the metricset while the goroutine is blocked sleeping.
- If the metricset collects metrics with PT1M timegrain, we may not have any metrics for 5 minutes or more.
We need to find an answer to these questions:
- What's the rate limit for the "get metric definitions" endpoint?
- The metricset also calls other endpoints (for example, to get the metric values): do we also need to manage the retry for them, and what is the rate limit for that endpoint?
If we know how many resources the metricset can realistically collect metrics for in 1 to 5 minutes, we can at least log a warning message saying that there are too many resources to honor the time grain.
Have you tried to run Metricbeat on a real Azure account with enough resources to trigger the throttling?
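If we go down the warning route, here is a hypothetical sketch of what that check could look like (the names `estimatedCostPerResource`, `period`, `resources`, and `logger` are all assumptions, not existing code):

```go
// Warn when the configured period cannot realistically accommodate all resources,
// so missing documents for a time grain are not a silent surprise.
estimatedTotal := time.Duration(len(resources)) * estimatedCostPerResource
if estimatedTotal > period {
	logger.Warnf("collecting metrics for %d resources is expected to take %s, which exceeds the %s period; documents may not honor the configured time grain",
		len(resources), estimatedTotal, period)
}
```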
I cannot find this limit anywhere; I believe it probably changes according to the subscription tier.
We could also add a new configuration option to control whether the customer wants to retry, so the metricset is not blocked if they don't want to wait.
Update on testing: for this test, I added 4 resources. Each of these resources produces 101 calls, so we need to run at least 101 calls * 4 resources = 404 calls (if no retry happens). But as we know, we cannot make 400 calls in 5 minutes, so around call 402 we get the error and Metricbeat sleeps for 309 seconds (5 min 9 s). Only after that does it run the missing calls, so the function only finishes after 538 seconds in the case I tested. That is 8.96 minutes to finish the 404 calls (plus 1 retry call). Then we get the documents. So in this scenario, each cycle takes about 9 minutes...
Update from offline conversations during testing this PR, where we learned that the metric definitions phase could be more efficient. For example, suppose Metricbeat uses a configuration with three metric groups sharing the same namespace, like the following:

```yaml
resources:
  - metrics:
      - ignore_unsupported: true
        name:
          - ServiceApiLatency
          - Availability
          - ServiceApiResult
        namespace: Microsoft.MyNamespace/MyResourceType
        timegrain: PT1M
      - ignore_unsupported: true
        name:
          - ServiceApiHit
        namespace: Microsoft.MyNamespace/MyResourceType
        timegrain: PT1M
      - ignore_unsupported: true
        name:
          - SaturationShoebox
        namespace: Microsoft.MyNamespace/MyResourceType
        timegrain: PT1M
    resource_query: resourceType eq 'Microsoft.MyNamespace/MyResourceType'
```

The relevant code is in beats/x-pack/metricbeat/module/azure/monitor/client_helper.go, lines 21 to 29 at 8e790de.
With the current implementation, 10 resources and 3 groups (same namespace) would require 30 API calls. Instead, it should only need one call per unique combination of resource and namespace. As a proof of concept, I tried to see if we can get the metric definitions only once (untested):

```go
// mapMetrics should validate and map the metric related configuration to relevant azure monitor api parameters
func mapMetrics(client *azure.Client, resources []*armresources.GenericResourceExpanded, resourceConfig azure.ResourceConfig) ([]azure.Metric, error) {
	var metrics []azure.Metric

	// index to find the unique combination of resource and metric namespace
	index := map[string]struct {
		Resource                   *armresources.GenericResourceExpanded
		MetricConfig               *azure.MetricConfig
		MetricDefinitionCollection *armmonitor.MetricDefinitionCollection
	}{}

	// group resources and resourceConfig by resource.ID and metric.Namespace
	for _, resource := range resources {
		for _, metric := range resourceConfig.Metrics {
			key := *resource.ID + metric.Namespace
			if _, exists := index[key]; exists {
				// we already have the metric definitions for this resource and namespace
				continue
			}
			metricDefinitions, err := client.AzureMonitorService.GetMetricDefinitions(*resource.ID, metric.Namespace)
			if err != nil {
				return nil, fmt.Errorf("no metric definitions were found for resource %s and namespace %s %w", *resource.ID, metric.Namespace, err)
			}
			index[key] = struct {
				Resource                   *armresources.GenericResourceExpanded
				MetricConfig               *azure.MetricConfig
				MetricDefinitionCollection *armmonitor.MetricDefinitionCollection
			}{
				Resource:                   resource,
				MetricConfig:               &metric,
				MetricDefinitionCollection: &metricDefinitions,
			}
		}
	}

	for _, resource := range resources {
		for _, metric := range resourceConfig.Metrics {
			// get all metrics supported by the namespace provided
			//metricDefinitions, err := client.AzureMonitorService.GetMetricDefinitions(*resource.ID, metric.Namespace)
			//if err != nil {
			//	return nil, fmt.Errorf("no metric definitions were found for resource %s and namespace %s %w", *resource.ID, metric.Namespace, err)
			//}

			// look up the metric definitions for the resource and namespace
			key := *resource.ID + metric.Namespace
			if _, ok := index[key]; !ok {
				// this should never happen
				return nil, fmt.Errorf("no metric definitions were found for resource %s and namespace %s", *resource.ID, metric.Namespace)
			}
			metricDefinitions := index[key].MetricDefinitionCollection
			_ = metricDefinitions // the rest of the mapping logic would use the cached definitions here
		}
	}
	return metrics, nil
}
```

With an implementation that identifies unique combinations of resource and namespace, 10 resources and 3 groups (same namespace) would require only 10 API calls.
While the goroutine sleeps for 309 seconds, the metricset is probably not collecting metric values, right? Sleeping for so long is a problem for integrations using metrics with PT1M timegrain, where users expect one document for each resource every 60s. But I'm not 100% sure, and we need to check. @constanca-m, could you see gaps in the metrics when the time grain is PT1M?
Yes, I was already seeing that before. I am only making 100 calls, so no timeout and no sleep. And Discover looks like this:

The configuration I am using is:
I want to add that even with just 100 calls, we can reproduce the 429 error. After running it 6 times (I think it was 6), the 7th run blocked because of the 429 error, so it is still very present here. Results: I believe the big time gap is when the Metricbeat sleep occurred. The number of documents being received remains the same. I don't think we are losing data; it makes sense to have this huge gap, since the API was not available for requests anymore.
Good start. What does Discover look like with this configuration (it matches the config a user reported)?
(In a real config, the multiple metric groups would have different dimensions, but this should have the same effect.) We also need to measure:
With PT1M metrics, the user's expectation is one measurement every minute. This is not happening.
My take on this issue: the current implementation limits the number of resources for which we can successfully collect PT5M (one document every 5 minutes) and PT1M (one document every minute) metrics. So, I guess we need to:
@axw, I'd love to hear what you think! 🙇
Update on comment #38294 (comment). The results of the 3 sleeps are:

I will now test the scenario from comment #38294 (comment).
For example, we should fetch the metric definitions asynchronously and cache them, like the otel azuremonitorreceiver does.
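To make the idea concrete, here is a minimal sketch of that approach (not the otel azuremonitorreceiver code; the `definitionsCache` type and its methods are hypothetical names):

```go
package azure

import (
	"context"
	"sync"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/monitor/armmonitor"
)

// definitionsCache keeps metric definitions per resource/namespace and refreshes them
// in a background goroutine, so collection cycles never block on "get metric definitions" calls.
type definitionsCache struct {
	mu   sync.RWMutex
	defs map[string]armmonitor.MetricDefinitionCollection // key: resource ID + namespace
}

// refreshEvery re-fetches all definitions on a fixed interval until ctx is cancelled.
func (c *definitionsCache) refreshEvery(ctx context.Context, interval time.Duration,
	fetch func(context.Context) (map[string]armmonitor.MetricDefinitionCollection, error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if defs, err := fetch(ctx); err == nil {
			c.mu.Lock()
			c.defs = defs
			c.mu.Unlock()
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// get returns the cached definitions for a resource/namespace key, if present.
func (c *definitionsCache) get(key string) (armmonitor.MetricDefinitionCollection, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	defs, ok := c.defs[key]
	return defs, ok
}
```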
Following #38294 (comment), I expected the same results, but we don't see them.

As expected, it makes 100 calls as well. This is the same.

It does not. This seems to be unexpected. Check the results in Discover: we can see that the results from the 100 calls now come every 3 minutes. So we stay under the limit of 400 calls per 5 minutes, and that is why we don't see the 429 error. The main question then is: what defines this period between calls? Why do we now have 3 minutes per call, when before it was 2 minutes? Also, why is there a bigger gap between the first and second call, which also happened before? Another detail: running with this configuration takes around 50 s to complete the 100 calls. Running with the previous one (#38294 (comment)) also takes around 50 s, so the cause is not the time it takes to complete all the calls. The configuration used is:
Great, I see the code is now caching the metric definitions for each namespace.

Hmm, I don't have a convincing explanation for this. Between the "get metric definitions" calls, we have the other calls to get the metric values. I can only think of different network conditions (latency, bandwidth) causing these differences.
We should also write a KB article describing the current limits. In particular, we need to collect this information:

I guess this number may be different because PT1M requires collecting metrics every 60 s, probably causing throttling with fewer resources. Do we already have this information from previous tests, or do we need to perform new ad-hoc ones?
```go
for {
	nextPage, err := pager.NextPage(service.context)
	if err != nil {
		retryError := service.sleepIfPossible(err, resourceId, namespace)
		if retryError != nil {
			return armmonitor.MetricDefinitionCollection{}, err
		}
	} else {
		metricDefinitionCollection.Value = append(metricDefinitionCollection.Value, nextPage.Value...)
		break
	}
}
```
Are the `for` loop and the `else` clause required?

Suppose the result for a resource contains two pages and:

- the call to `pager.NextPage()` succeeds on the first page,
- the call to `pager.NextPage()` fails on the second page with the 429 error,
- the goroutine sleeps for 5 minutes.

At this point, we haven't successfully gotten page 2; would `pager.More()` return `true` or `false`?

If it returns `true`, we don't need the inner `for` loop and could use something like:
```go
for pager.More() {
	nextPage, err := pager.NextPage(service.context)
	if err != nil {
		retryError := service.sleepIfPossible(err, resourceId, namespace)
		if retryError != nil {
			return armmonitor.MetricDefinitionCollection{}, err
		}
		continue
	}
	metricDefinitionCollection.Value = append(metricDefinitionCollection.Value, nextPage.Value...)
}
```
WDYT?
@constanca-m, I tested the metricset with 400 resources and PT5M, and it works reasonably well: it only takes longer than 5 minutes when it reloads the metric definitions. Here's the configuration I used:

```yaml
- module: azure
  metricsets:
    - monitor
  enabled: true
  period: 300s
  refresh_list_interval: 600s
  resources:
    - resource_query: "resourceType eq 'Microsoft.ContainerRegistry/registries'"
      metrics:
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount", "RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT5M"
```

On average, it takes ~190 s to collect values for 400 metrics (about 0.5 s for each metric). So with the current implementation, we can collect no more than:
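Back-of-the-envelope, assuming the ~0.5 s per metric value call holds and calls are sequential: 300 s / 0.5 s ≈ 600 resources within a PT5M period, and 60 s / 0.5 s ≈ 120 resources within a PT1M period, before the collection time alone exceeds the period; API throttling would lower these figures further.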
We still need to improve the metricset's efficiency and performance, but this is a step in the right direction.
@constanca-m, thank you for your effort and patience.
…error (#38294)

* Fiz azure too many requests error
* Add CHANGELOG.next.asciidoc entry
* Run mage check
* Get metrics 1 time for each namespace/id
* Add error comparison
* Move retry to monitor service
* Remove GetMetricDefinitions without retry
* fix storage unit tests
* fix monitor unit tests
* Remove unnecessary for cycle

Signed-off-by: constanca <[email protected]>
(cherry picked from commit 10ff992)
…error (#38294) (#38457)

* Fiz azure too many requests error
* Add CHANGELOG.next.asciidoc entry
* Run mage check
* Get metrics 1 time for each namespace/id
* Add error comparison
* Move retry to monitor service
* Remove GetMetricDefinitions without retry
* fix storage unit tests
* fix monitor unit tests
* Remove unnecessary for cycle

Signed-off-by: constanca <[email protected]>
(cherry picked from commit 10ff992)
Co-authored-by: Constança Manteigas <[email protected]>
Proposed commit message
Please see the issue elastic/integrations#8575.
Metricsets affected by this change: storage and monitor.
This PR fixes the issue the following way: when a request gets a 429 response, we `time.Sleep` for the number of seconds given by the `Retry-After` header and try again after the time is up. This is also the retry suggestion given by the Retry Usage Guidance from Azure.
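As an illustration of that approach, here is a minimal sketch (not the exact code in this PR) of a helper that honors `Retry-After`; the name mirrors the `sleepIfPossible` helper discussed in the review, but its real signature may differ:

```go
package azure

import (
	"errors"
	"net/http"
	"strconv"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
)

// sleepIfPossible inspects the Azure SDK error; if it is a 429 with a Retry-After
// header expressed in seconds, it sleeps and returns nil so the caller can retry.
func sleepIfPossible(err error) error {
	var respErr *azcore.ResponseError
	if !errors.As(err, &respErr) || respErr.StatusCode != http.StatusTooManyRequests {
		return err // not a throttling error: propagate it
	}
	if respErr.RawResponse == nil {
		return err
	}
	seconds, convErr := strconv.Atoi(respErr.RawResponse.Header.Get("Retry-After"))
	if convErr != nil {
		return err // header missing or not expressed in seconds
	}
	time.Sleep(time.Duration(seconds) * time.Second)
	return nil // safe to retry the request now
}
```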
For the monitor metricset, the previous implementation also had a different problem affecting the number of calls: we were trying to get the metric definitions for every metric, without checking whether we had already done that for the namespace + resource ID combination. So, for example, this configuration:
would lead to 2 calls for each resource, even though the namespace is the same for every metric group. This is also fixed in this PR.
Checklist
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
How to test this PR locally
Enable the azure monitor metricset, as in this example configuration.
To reproduce the error, you can change the code a bit to force it to make many requests; for example, you can put the `getMetricsDefinitionsWithRetry` call inside a for loop (see the sketch below). I added some printing lines to the code. You should see an output like this:
Please note that it takes around 7 minutes of running before you first encounter the error. After that, you have to wait for the sleep time, and only then do requests start again.
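A hypothetical sketch of that test loop (the `getMetricsDefinitionsWithRetry` name comes from this PR, but the receiver, arguments, and return value shown here are assumptions, and the snippet is meant for local testing only):

```go
// Hammer the metric definitions endpoint to trigger HTTP 429 and exercise the retry path.
for i := 0; i < 500; i++ {
	defs, err := service.getMetricsDefinitionsWithRetry(resourceID, namespace)
	if err != nil {
		fmt.Printf("call %d failed: %v\n", i, err)
		continue
	}
	fmt.Printf("call %d returned %d metric definitions\n", i, len(defs.Value))
}
```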
Related issues
Closes elastic/integrations#8575.