Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit cloudprovider and throttling metrics #806

Closed
gnufied opened this issue Mar 19, 2021 · 7 comments · Fixed by #842
Closed

Emit cloudprovider and throttling metrics #806

gnufied opened this issue Mar 19, 2021 · 7 comments · Fixed by #842
Assignees

Comments

@gnufied
Copy link
Contributor

gnufied commented Mar 19, 2021

The intree cloudprovider emits cloudprovider API metrics and throttling metrics.

These metrics are important for debugging - https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/aws/aws_metrics.go

@wongma7
Copy link
Contributor

wongma7 commented Mar 19, 2021

cc @ayberk @AndyXiangLi .

We can simply add these metrics.

Or better yet I would like us to solve #393.

there are 3 cloud provider abstractions. our cloud.go, v1/legacy, and v2 and I want to avoid duplication. https://github.com/kubernetes/cloud-provider-aws/tree/6c10f4b1ba2d0377665cbaa7a2cbe6730591ec64/pkg/providers/v2

i think we should design the v2 interface to replace our cloud.go https://github.com/kubernetes/cloud-provider-aws/tree/6c10f4b1ba2d0377665cbaa7a2cbe6730591ec64/pkg/providers/v2 and then use it as a library.

There is also an argument for using v1 because for migration purposes if we want to ensure 1:1 compatibility and avoid aws related bugs like handlign of eventual consistency then we should use v1.

@wongma7
Copy link
Contributor

wongma7 commented Mar 19, 2021

Also if we use v1 then we emit these metrics for free i think but it's not that hard to add them so i dont buy it

@wongma7
Copy link
Contributor

wongma7 commented Apr 20, 2021

/assign

@wongma7
Copy link
Contributor

wongma7 commented Apr 20, 2021

the throttling in particular is hard to emit unless we copy the retry handler and sdk setup from cloud provider : https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L100 . our custom cloud.go is much simpler.

I am still trying to weigh what is easier, copy/pasting what we need from the old cloud provider, or refactoring to depend on it completely...

@gnufied
Copy link
Contributor Author

gnufied commented Apr 20, 2021

One bummer when I last investigated around the retry/throttling issue is - ideally AWS should return a header(forgot what is called) when it is safe to retry a certain request and clients are supposed to respect it. But in my testing I found that header to be entirely missing and hence k8s implements its own retry/throttling. It might be nice to offload some the code in aws-sdk if we can (but I know outside of this project's scope).

@wongma7
Copy link
Contributor

wongma7 commented Apr 20, 2021

Yeah I agree ideally we should rely on aws-sdk for retry handling as much as possible. We can configure a min delay so even if that Retry-After header is missing, we should't need the custom k8s retryer: https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/aws/client/default_retryer.go#L95. (In fact we recently started tuning the sdk built-in retryer already to mitigate issue w/ throttling causing leaked vols: #769, my thinking now is that we should continue tuning it instead of going back and copying the cloudprovider retryer.)

SO... what I will do is copy the metric part [1] from the v1 cloudprovider retryer but not copy the actual retry logic part [2]. I guess the k8s implementation is already outdated anyway because now the SDK lets us not only configure the default retryer but provide a totaly custom Retryer https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/aws/request/retryer.go#L13

[1] https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L90
[2] https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L53


In terms of the wider "cloudprovider dependency"/"duplicating code in 2/3 places" issue I was rambling about, now what I am picturing is this:

@wongma7
Copy link
Contributor

wongma7 commented Apr 20, 2021

other major difference callled out in comment:

https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L38
Note that we share a CrossRequestRetryDelay across multiple AWS requests; this is a process-wide back-off,
// whereas the aws-sdk-go implements a per-request exponential backoff/retry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants