
[Metricbeat][Azure] Azure monitor/storage: Fix 429 Too Many Requests error #38294

Merged: 10 commits merged into elastic:main on Mar 20, 2024

Conversation

@constanca-m (Contributor) commented Mar 13, 2024

Proposed commit message

Please see the issue elastic/integrations#8575.

Metricsets affected by this change: storage and monitor.

This PR fixes the issue in the following way:

  1. We check whether the error returned carries the "Retry-After" header.
  2. If the header is present, we call time.Sleep for the number of seconds it specifies and retry once the time is up.
  3. If the header is not present, it is most likely a different kind of error and we return it.

This is also the retry approach suggested by the Retry Usage Guidance from Azure.
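
As a rough sketch of this approach (not the exact code in this PR; the helper name and error handling are illustrative, and it assumes the azcore.ResponseError type from the Azure SDK for Go):

import (
	"errors"
	"net/http"
	"strconv"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
)

// sleepIfRetryable is an illustrative helper: if err is a 429 response that
// carries a Retry-After header, it sleeps for that many seconds and returns
// nil so the caller can retry; otherwise it returns the original error.
func sleepIfRetryable(err error) error {
	var respErr *azcore.ResponseError
	if !errors.As(err, &respErr) || respErr.RawResponse == nil {
		return err // not an HTTP error from the Azure SDK
	}
	retryAfter := respErr.RawResponse.Header.Get("Retry-After")
	if respErr.StatusCode != http.StatusTooManyRequests || retryAfter == "" {
		return err // not a throttling error we can wait out
	}
	seconds, convErr := strconv.Atoi(retryAfter)
	if convErr != nil {
		return err // Retry-After may also be an HTTP date; not handled in this sketch
	}
	time.Sleep(time.Duration(seconds) * time.Second)
	return nil // the Retry-After period is over, safe to retry
}

A caller would then loop: issue the request, and if this helper returns nil, try the same request again; any other error is propagated as before.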

For the monitor metricset, the previous implementation also had a different problem affecting the number of calls: we were fetching the metric definitions for every metric group without checking whether we had already done so for the same namespace + resource ID combination. So, for example, this configuration:

    - resource_query: "resourceType eq 'Microsoft.ContainerRegistry/registries'"
      metrics:
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1M"
        - name: [ "StorageUsed" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1H"
          dimensions:
            - name: "Geolocation"
              value: "*"

would lead to 2 calls for each resource, even though the namespace is the same for every metric group. This is also fixed in this PR.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Enable the azure monitor metricset, like in this example configuration.

To reproduce the error, you can change the code a bit to force it to make many requests, for example by putting the getMetricsDefinitionsWithRetry call inside a for loop.
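
For example (illustrative only: the loop bound, surrounding wiring, and exact signature are made up; getMetricsDefinitionsWithRetry is the helper this PR introduces), the forced-requests loop could look something like this:

	// Illustrative: repeatedly fetch the metric definitions so the Azure
	// Monitor API starts returning 429 Too Many Requests.
	for i := 0; i < 1000; i++ {
		fmt.Println(">> Trying request ", i)
		if _, err := service.getMetricsDefinitionsWithRetry(resourceID, namespace); err != nil {
			fmt.Println(">> giving up:", err)
			break
		}
	}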

I added some print statements to the code. You should see output like this:

[...]
>> Trying request  400
>> Trying request  401
>> Trying request  402
>> Metricbeat will sleep for 309 seconds and retry again.
>> Trying request  402  <---------- Same request as before
>> Trying request  403
[...]

Note that it takes around 7 minutes of running before the error is first encountered. After that, you have to wait for the sleep time, and only then do requests start being made again.

Related issues

Closes elastic/integrations#8575.

@constanca-m constanca-m added the Team:Cloud-Monitoring and backport-v8.13.0 (automated backport with mergify) labels on Mar 13, 2024
@constanca-m constanca-m self-assigned this Mar 13, 2024
@constanca-m constanca-m requested a review from a team as a code owner March 13, 2024 08:50
@elasticmachine (Collaborator) commented Mar 13, 2024

💚 Build Succeeded

  • Duration: 54 min 12 sec
  • Flaky test report: no test was executed to be analysed
@constanca-m constanca-m requested a review from zmoog March 13, 2024 09:01
@zmoog (Contributor) left a comment

Honoring the HTTP Retry-After header is an excellent approach to this problem.

The effect is that the metricset sleeps for 5 minutes, and this may have several consequences:

  • This may interfere if Metricbeat needs to reload the metricset while the goroutine is blocked sleeping.
  • If the metricset collects metrics with PT1M timegrain, we may not have any metrics for 5 minutes or more.

We need to find an answer to these questions:

  1. What's the rate limit for the "get metric definitions" endpoint?
  2. The metricset also calls other endpoints (for example, to get the metric values): do we also need to manage the retry for them, and what is the rate limit for that endpoint?

If we know how many resources the metricset can realistically collect metrics for in 1 to 5 minutes, we can at least log a warning message saying that there are too many resources to honor the time grain.

Have you tried to run Metricbeat on a real Azure account with enough resources to trigger the throttling?

@constanca-m (Contributor, Author) commented Mar 13, 2024

What's the rate limit for the "get metric definitions" endpoint?

I cannot find this limit documented anywhere; I believe it probably changes according to the subscription tier.

This may interfere if Metricbeat needs to reload the metricset while the goroutine is blocked sleeping.

We could also add a new configuration option to control whether the customer wants to retry, so the metricset is not blocked if they don't want to.
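
For illustration, such an option could be a simple flag in the metricset configuration; the field name below is hypothetical and does not exist in this PR:

// Hypothetical config field: lets users opt out of the Retry-After sleep so a
// throttled request fails fast instead of blocking the goroutine.
type Config struct {
	// ...existing azure module options...

	// When false, a 429 response is returned as an error immediately instead
	// of sleeping for the Retry-After duration and retrying.
	RetryOnThrottling bool `config:"retry_on_throttling"`
}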

@constanca-m (Contributor, Author) commented Mar 13, 2024

Update on testing:

For this test, I added 4 resources. Each of these resources produces 101 calls.

This means we need to run 101 calls * 4 resources = 404 calls at minimum (if no retry happens).

But as we know, we cannot make 400 calls in 5 minutes, so by call 402 we get the error and Metricbeat sleeps for 309 seconds (5 min 9 s).

Only after that does it run the remaining calls, so the function only finishes after 538 seconds (in the case I tested). That is 8.96 minutes to finish the 404 calls (plus 1 retry call). Then we get the documents.

So in this scenario each cycle takes around 9 minutes...

@zmoog (Contributor) commented Mar 13, 2024

Update from offline conversations while testing this PR, where we learned that the metric definitions phase could be more efficient:

For example, if Metricbeat uses a configuration with three metric groups sharing the same namespace, like the following:

      resources:
        - metrics:
            - ignore_unsupported: true
              name:
                - ServiceApiLatency
                - Availability
                - ServiceApiResult
              namespace: Microsoft.MyNamespace/MyResourceType
              timegrain: PT1M
            - ignore_unsupported: true
              name:
                - ServiceApiHit
              namespace: Microsoft.MyNamespace/MyResourceType
              timegrain: PT1M
            - ignore_unsupported: true
              name:
                - SaturationShoebox
              namespace: Microsoft.MyNamespace/MyResourceType
              timegrain: PT1M
          resource_query: resourceType eq 'Microsoft.MyNamespace/MyResourceType'

The mapMetrics() helper would call client.AzureMonitorService.GetMetricDefinitions() for every resource and for each metric group:

func mapMetrics(client *azure.Client, resources []*armresources.GenericResourceExpanded, resourceConfig azure.ResourceConfig) ([]azure.Metric, error) {
	var metrics []azure.Metric
	for _, resource := range resources {
		for _, metric := range resourceConfig.Metrics {
			// get all metrics supported by the namespace provided
			metricDefinitions, err := client.AzureMonitorService.GetMetricDefinitions(*resource.ID, metric.Namespace)
			if err != nil {
				return nil, fmt.Errorf("no metric definitions were found for resource %s and namespace %s %w", *resource.ID, metric.Namespace, err)
			}

With the current implementation, 10 resources and 3 groups (same namespace), would require 30 API calls.

Instead, it should call client.AzureMonitorService.GetMetricDefinitions() only once for each unique combination of resource and namespace.

As a proof of concept, I tried to see if we can only get the metrics definition once (untested):

// mapMetrics should validate and map the metric related configuration to relevant azure monitor api parameters
func mapMetrics(client *azure.Client, resources []*armresources.GenericResourceExpanded, resourceConfig azure.ResourceConfig) ([]azure.Metric, error) {
	var metrics []azure.Metric

	// index to find the unique combination of resource and metric namespace
	index := map[string]struct {
		Resource                   *armresources.GenericResourceExpanded
		MetricConfig               *azure.MetricConfig
		MetricDefinitionCollection *armmonitor.MetricDefinitionCollection
	}{}

	// group resources and resourceConfig by resource.ID and metric.Namespace
	for _, resource := range resources {
		for _, metric := range resourceConfig.Metrics {
			key := *resource.ID + metric.Namespace
			if _, exists := index[key]; exists {
				// we already have the metric definitions for this resource and namespace
				continue
			}

			metricDefinitions, err := client.AzureMonitorService.GetMetricDefinitions(*resource.ID, metric.Namespace)
			if err != nil {
				return nil, fmt.Errorf("no metric definitions were found for resource %s and namespace %s %w", *resource.ID, metric.Namespace, err)
			}

			index[key] = struct {
				Resource                   *armresources.GenericResourceExpanded
				MetricConfig               *azure.MetricConfig
				MetricDefinitionCollection *armmonitor.MetricDefinitionCollection
			}{
				Resource:                   resource,
				MetricConfig:               &metric,
				MetricDefinitionCollection: &metricDefinitions,
			}
		}
	}

	for _, resource := range resources {
		for _, metric := range resourceConfig.Metrics {
			//get all metrics supported by the namespace provided
			//metricDefinitions, err := client.AzureMonitorService.GetMetricDefinitions(*resource.ID, metric.Namespace)
			//if err != nil {
			//	return nil, fmt.Errorf("no metric definitions were found for resource %s and namespace %s %w", *resource.ID, metric.Namespace, err)
			//}

			// lookup the metric definitions for the resource and namespace
			key := *resource.ID + metric.Namespace
			if _, ok := index[key]; !ok {
				// this should never happen
				return nil, fmt.Errorf("no metric definitions were found for resource %s and namespace %s", *resource.ID, metric.Namespace)
			}

			metricDefinitions := index[key].MetricDefinitionCollection

With an implementation that identifies unique combinations of resource and namespace, 10 resources and 3 groups (same namespace), would require 10 API calls.

@zmoog (Contributor) commented Mar 13, 2024

Only after that it runs the calls that are missing. So the function only finishes after 538seconds (in this case that I tested). This is 8.96 minutes to finish the 404 calls (plus 1 retry call). Then we get the documents.

So in this scenario, for each cycle, we are taking 9 minutes...

While the goroutine sleeps for 309 seconds, the metricset is probably not collecting metric values, right?

Sleeping for so long is a problem for integrations using metrics with PT1M timegrain, where users expect one document for each resource every 60s.

But I'm not 100% sure, and we need to check. @constanca-m, could you see gaps in the metrics when the time grain is PT1M?

@constanca-m (Contributor, Author) commented Mar 14, 2024

Only after that it runs the calls that are missing. So the function only finishes after 538seconds (in this case that I tested). This is 8.96 minutes to finish the 404 calls (plus 1 retry call). Then we get the documents.
So in this scenario, for each cycle, we are taking 9 minutes...

While the goroutine sleeps for 309 seconds, the metricset is probably not collecting metric values, right?

Sleeping for so long is a problem for integrations using metrics with PT1M timegrain, where users expect one document for each resource every 60s.

But I'm not 100% sure, and we need to check. @constanca-m, could you see gaps in the metrics when the time grain is PT1M?

Yes, I was already seeing that before. I am only making 100 calls, so no timeout and no sleep. And Discover looks like this:

[screenshot]

The configuration I am using is:

- module: azure
  metricsets:
  - monitor
  enabled: true
  period: 60s

  refresh_list_interval: 60s
  resources:
    - resource_query: "resourceType eq 'Microsoft.ContainerRegistry/registries'"
      metrics:
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1M"

@constanca-m (Contributor, Author)

I want to add that even with just 100 calls we can reproduce the 429 error. After running it 6 times (I think it was 6), the 7th run blocked because of the 429 error, so it is still very much present here.

Results:

[screenshot]

I believe the big time gap is when the Metricbeat sleep occurred. The number of documents being received remains the same. I don't think we are losing data; it makes sense to have this huge gap, since the API was no longer accepting requests.

@zmoog (Contributor) commented Mar 14, 2024

Good start.

How does Discover look with this configuration (it matches the config a user reported)?

- module: azure
  metricsets:
  - monitor
  enabled: true
  period: 60s

  refresh_list_interval: 60s
  resources:
    - resource_query: "resourceType eq 'Microsoft.ContainerRegistry/registries'"
      metrics:
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1M"
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1M"
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1M"

(in a real config, the multiple metric groups would have different dimensions, but this should have the same effect)

Also, we need to measure:

  • how many calls it makes
  • does it trigger the 429 error?
  • how many seconds it sleeps (if any)

@zmoog (Contributor) commented Mar 14, 2024

I don't think we are losing data, it makes sense to have this huge gap since the API was not available for requests anymore.

With PT1M metrics, the users' expectation is to have one measurement every minute. This is not happening.

@zmoog (Contributor) commented Mar 14, 2024

My take on this issue.

The current implementation limits the number of resources for which we can successfully collect PT5M (one document every 5 minutes) and PT1M (one document every minute) metrics.

So, I guess we need to:

  • Identify the maximum number of Azure resources that allow a complete collection of PT5M and PT1M metrics.
  • Eliminate all inefficiencies that do not require rewriting a large portion of the metricset.
  • Document the maximum number of Azure resources supported for complete collection. Above this limit, the metric collection will be degraded (it will miss some measures).
  • Plan more significant changes for the next Metricbeat release.

@axw, I'd love to hear what you think! 🙇

@constanca-m (Contributor, Author)

Update on comment #38294 (comment).

The results of 3 sleeps are:

[screenshot]

I will now test the scenario from comment #38294 (comment).

@zmoog (Contributor) commented Mar 14, 2024

Plan more significant changes for the next Metricbeat release.

For example, we should fetch asynchronously and cache the metric definitions, like the OTel azuremonitorreceiver does.

@constanca-m (Contributor, Author) commented Mar 14, 2024

Following #38294 (comment)

I expected to see the same results, but we don't.

how many calls it makes

As expected, it makes 100 calls as well. This is the same.

does it trigger the 429 error?

It does not. This seems to be unexpected. Check the results in Discover:

[screenshot]

We can see that the results from the 100 calls are now coming every 3 minutes. So we stay under the 400-calls-per-5-minutes limit, and this is why we don't see the 429 error. The main question then is: what is defining this period between calls? Why do we have 3 minutes per cycle, when before it was 2 minutes? Also, why is there a bigger gap between the first and second call, which also happened before?

Another detail: running with this configuration takes around 50 s to complete the 100 calls. Running with the previous one (#38294 (comment)) takes around 50 s as well, so the cause is not the time it takes to complete all the calls.

The configuration used is:

- module: azure
  metricsets:
  - monitor
  enabled: true
  period: 60s

  refresh_list_interval: 60s
  resources:
    - resource_query: "resourceType eq 'Microsoft.ContainerRegistry/registries'"
      metrics:
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1M"
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1M"
        - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
          namespace: "Microsoft.ContainerRegistry/registries"
          timegrain: "PT1M"

@zmoog (Contributor) commented Mar 14, 2024

how many calls it makes

As expected, it makes 100 calls as well. This is the same.

Great, I see the code is now caching the metrics definitions for each namespace.

We can see that the results from the 100 calls are now coming every 3 minutes. So we stay under the 400-calls-per-5-minutes limit, and this is why we don't see the 429 error. The main question then is: what is defining this period between calls? Why do we have 3 minutes per cycle, when before it was 2 minutes? Also, why is there a bigger gap between the first and second call, which also happened before?

Uhm, I don't have a convincing explanation for this.

Between the "get metrics definition" calls, we have the other calls to get the metric values. I can only think of different network conditions (latency, bandwidth) causing these differences.

@zmoog (Contributor) commented Mar 14, 2024

We should also write a KB article describing the current limits.

In particular, we need to collect this information:

  • Maximum number of resources without throttling using PT1M metrics.
  • Maximum number of resources without throttling using PT5M metrics.

I guess this number may be different because PT1M requires collecting metrics every 60 s, probably causing throttling with fewer resources.

Do we already have this information from previous tests, or do we need to perform new ad hoc ones?

Comment on lines 252 to 258
for {
	nextPage, err := pager.NextPage(service.context)
	if err != nil {
		retryError := service.sleepIfPossible(err, resourceId, namespace)
		if retryError != nil {
			return armmonitor.MetricDefinitionCollection{}, err
		}
	} else {
		metricDefinitionCollection.Value = append(metricDefinitionCollection.Value, nextPage.Value...)
		break
	}
}

Review comment (Contributor):

Do we need this for loop and the else clause?

Suppose the result for a resource contains two pages and:

  • the call to pager.NextPage() succeeds on the first page.
  • the call to pager.NextPage() fails on the second page with the 429 error.
  • the goroutine sleeps for 5 minutes.

At this point, we haven't successfully gotten page 2; would the pager.More() return true or false?

If it returns true, we don't need the inner for loop.

If so, we could use something like:

	for pager.More() {
		nextPage, err := pager.NextPage(service.context)
		if err != nil {
			retryError := service.sleepIfPossible(err, resourceId, namespace)
			if retryError != nil {
				return armmonitor.MetricDefinitionCollection{}, err
			}
			continue
		}
		metricDefinitionCollection.Value = append(metricDefinitionCollection.Value, nextPage.Value...)
	}

WDYT?

@constanca-m (Contributor, Author) commented Mar 18, 2024

Yes, that is a much better solution! I tried it, and the number of documents after sleep remains the same:

[screenshot]

I also had some print statements, and I could see that the request was repeated.

@zmoog (Contributor) commented Mar 19, 2024

@constanca-m, I tested the metricset with 400 resources and PT5M, and it works reasonably well: it takes longer than 5 minutes when it reloads the metric definitions.

Here's the configuration I used:

- module: azure
  metricsets:
  - monitor
  enabled: true
  period: 300s
  refresh_list_interval: 600s
  resources:
  - resource_query: "resourceType eq 'Microsoft.ContainerRegistry/registries'"
    metrics:
      - name: [ "TotalPullCount", "SuccessfulPullCount", "TotalPushCount", "SuccessfulPushCount","RunDuration", "AgentPoolCPUTime" ]
        namespace: "Microsoft.ContainerRegistry/registries"
        timegrain: "PT5M"

On average, it takes ~190 s to collect values for 400 metrics (about 0.5 s per metric).

So with the current implementation, we can collect no more than:

  • 120 metric values with a PT1M time grain.
  • 600 metric values with a PT5M time grain (not tested yet).
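
A quick sanity check of these ceilings, assuming the ~0.5 s per metric value observed above holds: a PT1M collection period gives 60 s / 0.5 s ≈ 120 values, and a PT5M period gives 300 s / 0.5 s ≈ 600 values, matching the two numbers listed.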

@zmoog (Contributor) left a comment

We still need to improve the metricset's efficiency and performance, but this is a step in the right direction.

@constanca-m, thank you for your effort and patience.

@elasticmachine (Collaborator)

💚 Build Succeeded

cc @constanca-m

@constanca-m constanca-m merged commit 10ff992 into elastic:main Mar 20, 2024
39 of 44 checks passed
@constanca-m constanca-m deleted the azure-too-many-requests branch March 20, 2024 08:06
mergify bot pushed a commit that referenced this pull request Mar 20, 2024
…error (#38294)

* Fiz azure too many requests error

Signed-off-by: constanca <[email protected]>

* Add CHANGELOG.next.asciidoc entry

Signed-off-by: constanca <[email protected]>

* Run mage check

Signed-off-by: constanca <[email protected]>

* Get metrics 1 time for each namespace/id

Signed-off-by: constanca <[email protected]>

* Add error comparison

Signed-off-by: constanca <[email protected]>

* Move retry to monitor service

Signed-off-by: constanca <[email protected]>

* Remove GetMetricDefinitions without retry

Signed-off-by: constanca <[email protected]>

* fix storage unit tests

Signed-off-by: constanca <[email protected]>

* fix monitor unit tests

Signed-off-by: constanca <[email protected]>

* Remove unnecessary for cycle

Signed-off-by: constanca <[email protected]>

---------

Signed-off-by: constanca <[email protected]>
(cherry picked from commit 10ff992)
constanca-m added a commit that referenced this pull request Mar 20, 2024
…error (#38294) (#38457)

(same commit list as above, cherry picked from commit 10ff992)

Co-authored-by: Constança Manteigas <[email protected]>