Emit cloudprovider and throttling metrics #806

gnufied · 2021-03-19T18:20:01Z

The in-tree cloud provider emits cloud provider API metrics and throttling metrics.

These metrics are important for debugging: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/aws/aws_metrics.go

wongma7 · 2021-03-19T18:32:25Z

cc @ayberk @AndyXiangLi

We can simply add these metrics.

Or, better yet, I would like us to solve #393.

There are 3 cloud provider abstractions: our cloud.go, v1/legacy, and v2, and I want to avoid duplication. https://github.com/kubernetes/cloud-provider-aws/tree/6c10f4b1ba2d0377665cbaa7a2cbe6730591ec64/pkg/providers/v2

I think we should design the v2 interface to replace our cloud.go https://github.com/kubernetes/cloud-provider-aws/tree/6c10f4b1ba2d0377665cbaa7a2cbe6730591ec64/pkg/providers/v2 and then use it as a library.

There is also an argument for using v1: for migration purposes, if we want to ensure 1:1 compatibility and avoid AWS-related bugs such as the handling of eventual consistency, then we should use v1.

wongma7 · 2021-03-19T18:40:40Z

Also, if we use v1 then we emit these metrics for free, I think, but it's not that hard to add them ourselves, so I don't buy that argument.

wongma7 · 2021-04-20T17:48:35Z

/assign

wongma7 · 2021-04-20T18:02:46Z

The throttling metric in particular is hard to emit unless we copy the retry handler and SDK setup from the cloud provider: https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L100. Our custom cloud.go is much simpler.

I am still trying to weigh what is easier, copy/pasting what we need from the old cloud provider, or refactoring to depend on it completely...

gnufied · 2021-04-20T18:30:06Z

One bummer from when I last investigated the retry/throttling issue: ideally AWS should return a header (I forget what it is called) indicating when it is safe to retry a given request, and clients are supposed to respect it. But in my testing I found that header to be entirely missing, hence k8s implements its own retry/throttling. It might be nice to offload some of the code to aws-sdk if we can (though I know that is outside this project's scope).

wongma7 · 2021-04-20T19:15:40Z

Yeah, I agree: ideally we should rely on aws-sdk for retry handling as much as possible. We can configure a min delay, so even if that Retry-After header is missing, we shouldn't need the custom k8s retryer: https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/aws/client/default_retryer.go#L95. (In fact, we recently started tuning the SDK's built-in retryer to mitigate the issue with throttling causing leaked volumes: #769. My thinking now is that we should continue tuning it instead of going back and copying the cloudprovider retryer.)
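To illustrate the "min delay" idea, here is a minimal stdlib-only sketch. It is not the SDK's code and not what this driver ships: `retryDelay`, `minThrottleDelay`, and `maxThrottleDelay` are hypothetical names and values chosen for the example. It honours a server hint when one is present and otherwise falls back to a floor that doubles per attempt up to a cap. A real retryer would normally add jitter as well, which is left out here to keep the behaviour deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical bounds for the example; not values taken from the driver or the SDK.
const (
	minThrottleDelay = 500 * time.Millisecond
	maxThrottleDelay = 30 * time.Second
)

// retryDelay returns how long to wait before retry number attempt (0-based).
// retryAfter is the server-provided hint, or 0 when the header was missing.
func retryDelay(attempt int, retryAfter time.Duration) time.Duration {
	// Respect the server hint whenever it is present.
	if retryAfter > 0 {
		return retryAfter
	}
	// Otherwise start at the floor and double per attempt, up to the cap.
	d := minThrottleDelay
	for i := 0; i < attempt && d < maxThrottleDelay; i++ {
		d *= 2
	}
	if d > maxThrottleDelay {
		d = maxThrottleDelay
	}
	return d
}

func main() {
	// Print the delays used when no Retry-After hint is available.
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt, retryDelay(attempt, 0))
	}
}
```

The point of the floor is that a missing header never results in an immediate retry, which is the case the custom k8s retryer was guarding against.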

So... what I will do is copy the metric part [1] from the v1 cloudprovider retryer but not the actual retry logic part [2]. I guess the k8s implementation is already outdated anyway, because the SDK now lets us not only configure the default retryer but also provide a totally custom Retryer: https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/aws/request/retryer.go#L13

[1] https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L90
[2] https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L53
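The split described above (record metrics, leave retrying to something else) can be sketched with the standard library alone. This is an assumption-laden illustration, not the code from #842 or from the v1 cloudprovider: `apiMetrics`, `observe`, and `errThrottled` are hypothetical, and the in-memory maps stand in for a real metrics registry.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errThrottled is a hypothetical marker for a throttling response.
var errThrottled = errors.New("throttled")

// apiMetrics is a hypothetical in-memory stand-in for a metrics registry,
// keyed by API operation name.
type apiMetrics struct {
	mu        sync.Mutex
	calls     map[string]int
	errs      map[string]int
	throttles map[string]int
	total     map[string]time.Duration
}

func newAPIMetrics() *apiMetrics {
	return &apiMetrics{
		calls:     map[string]int{},
		errs:      map[string]int{},
		throttles: map[string]int{},
		total:     map[string]time.Duration{},
	}
}

// observe runs call exactly once, records duration, errors, and throttles,
// and returns the error unchanged. It never retries; retry policy is left
// to whatever wraps or configures the underlying client.
func (m *apiMetrics) observe(op string, call func() error) error {
	start := time.Now()
	err := call()
	elapsed := time.Since(start)

	m.mu.Lock()
	defer m.mu.Unlock()
	m.calls[op]++
	m.total[op] += elapsed
	if err != nil {
		m.errs[op]++
		if errors.Is(err, errThrottled) {
			m.throttles[op]++
		}
	}
	return err
}

func main() {
	m := newAPIMetrics()
	_ = m.observe("AttachVolume", func() error { return nil })
	_ = m.observe("AttachVolume", func() error { return errThrottled })
	fmt.Println(m.calls["AttachVolume"], m.errs["AttachVolume"], m.throttles["AttachVolume"])
}
```

Keeping the recorder free of retry decisions means the retryer can be tuned or swapped without touching the metrics path.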

In terms of the wider "cloudprovider dependency"/"duplicating code in 2/3 places" issue I was rambling about, now what I am picturing is this:

v1 cloudprovider https://github.com/kubernetes/cloud-provider-aws/tree/master/pkg/providers/v1 exists for us to copy and take lessons from. I realize now that depending on and/or changing it is not going to be worth the effort.
This repo's cloud.go can serve as a template for the v2 cloudprovider, which is still WIP. So let's continue to maintain it, even if it means copying bits and pieces from the v1 cloudprovider.
v2 cloudprovider https://github.com/kubernetes/cloud-provider-aws/tree/master/pkg/providers/v2 is WIP, so we will design it with this repo in mind such that it can someday easily replace cloud.go.
Then someday this repo will depend on the v2 cloudprovider instead of cloud.go.

wongma7 · 2021-04-20T20:21:46Z

Another major difference, called out in a code comment:

https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L38
// Note that we share a CrossRequestRetryDelay across multiple AWS requests; this is a process-wide back-off,
// whereas the aws-sdk-go implements a per-request exponential backoff/retry
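The difference can be made concrete with a small sketch of a process-wide back-off. This is not the v1 cloudprovider's implementation: `sharedBackoff`, `backoffStep`, and `backoffMax` are hypothetical, and the doubling/halving policy is an assumption chosen for brevity. What it shows is only the sharing: a throttle observed on one request raises the delay that every later request sees, whereas a per-request backoff starts from scratch for each call.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical bounds for the example.
const (
	backoffStep = 100 * time.Millisecond
	backoffMax  = 5 * time.Second
)

// sharedBackoff holds one delay for the whole process. Every request
// consults it before being sent and reports its outcome afterwards.
type sharedBackoff struct {
	mu    sync.Mutex
	delay time.Duration
}

// currentDelay returns how long any request should wait before being sent.
func (b *sharedBackoff) currentDelay() time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.delay
}

// report updates the shared delay: a throttled request doubles it (starting
// from backoffStep, capped at backoffMax); a successful one halves it and
// drops it to zero once it falls below backoffStep.
func (b *sharedBackoff) report(throttled bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if throttled {
		if b.delay == 0 {
			b.delay = backoffStep
		} else {
			b.delay *= 2
		}
		if b.delay > backoffMax {
			b.delay = backoffMax
		}
		return
	}
	b.delay /= 2
	if b.delay < backoffStep {
		b.delay = 0
	}
}

func main() {
	shared := &sharedBackoff{}
	// Two throttled calls, possibly to different API operations.
	shared.report(true)
	shared.report(true)
	// An unrelated request now sees the accumulated delay.
	fmt.Println("delay before next request:", shared.currentDelay())
}
```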

k8s-ci-robot assigned wongma7 Apr 20, 2021

wongma7 mentioned this issue Apr 20, 2021

Emit AWS API operation duration/error/throttle metrics #842

Merged

wongma7 closed this as completed in #842 Apr 22, 2021

wongma7 mentioned this issue Apr 22, 2021

instance metadata issue fix #846

Closed
