Skip to content
This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Treat ErrPolicyForbidden and ErrPodNotFound as permanent errors during retry backoff #250

Merged
merged 7 commits into from
Jul 3, 2019
Merged

Treat ErrPolicyForbidden and ErrPodNotFound as permanent errors during retry backoff #250

merged 7 commits into from
Jul 3, 2019

Conversation

hoelzro
Copy link
Contributor

@hoelzro hoelzro commented May 13, 2019

If a pod requests credentials it's not allowed to, kiam-agent will keep retrying the same request until the five second timeout is reached. This changeset treats ErrPolicyForbidden (as well as ErrPodNotFound) as permanent errors, so that the AWS client talking to kiam-agent gets a response in a more timely fashion.

Without this change, fetchCredentials continues to retry
GetPodCredentials until it times out, and it's unlikely that
the pod will magically appear with the given IP or that the policy
will change before the time out comes.
We can't compare error values by pointer as we tried in the previous
commit, because the error value we get from GetCredentials is one that
was serialized over the wire and placed somewhere different in memory.
Currently ErrPolicyForbidden and ErrPodNotFound correspond to
"Unknown" errors in gRPC, so our only recourse for identifying these
errors on the client side is comparing them by error message
This way if someone calls fetchCredentials and wants to respond to the
error values on a case-by-case basis, they don't have to go through
the whole rigmarole

Also, fix a copy/paste typo I made in the previous commit =(
@hoelzro
Copy link
Contributor Author

hoelzro commented May 13, 2019

On a side note, I would have preferred to avoid string comparisons for the error messages, but both of these error types are currently mapped to "Unknown" at the gRPC layer. Changing those would be more involved and potentially disruptive to other users of KIAM, but if you think it would be worth doing, please let me know and I can submit another PR with that change as well.

@pingles
Copy link
Contributor

pingles commented May 14, 2019

It's a 👍 from me, @uswitch/cloud?

@pingles
Copy link
Contributor

pingles commented May 14, 2019

Actually, having just read... I'd support this change for ErrPolicyForbidden but not ErrPodNotFound.

My rationale would be that there's a chance the cache/watcher will be behind when the pod requests credentials- causing ErrPodNotFound to be intermittent. I think it'd be better for the agent to keep retrying up to the SDK disconnection.

Any objection @hoelzro?

Feedback from Paul Ingles at #250 (comment):

> Actually, having just read... I'd support this change for ErrPolicyForbidden but not ErrPodNotFound.

> My rationale would be that there's a chance the cache/watcher will be behind when the pod requests
> credentials- causing ErrPodNotFound to be intermittent. I think it'd be better for the agent to keep
> retrying up to the SDK disconnection.
@hoelzro
Copy link
Contributor Author

hoelzro commented May 14, 2019

@pingles Yeah, that makes sense to me! I've added a commit removing permanent error status for ErrPodNotFound.

@pingles
Copy link
Contributor

pingles commented May 16, 2019

Thanks! Could you try and find a way to add a test too please- it's a little verbose but should be relatively easy to do in https://github.com/uswitch/kiam/blob/master/pkg/aws/metadata/handler_credentials_test.go

fetchCredentials shouldn't need to know that interaction with the server
is happening over gRPC
@hoelzro
Copy link
Contributor Author

hoelzro commented May 16, 2019

Test added! There are a few things that came to mind while writing the test:

  • I changed the code so that the gateway client translates gRPC errors into server.ErrPolicyForbidden and server.ErrPodNotFound - I felt that this belonged in the code directly invoking the RPC routines.
  • The test tests for a 500 error - there's no real specification for what the EC2 metadata server returns on error, but I figured that would be fine. What do you think? I happen to know that boto, for example, retries on every non-200 status, so maybe how clients handle different statuses should be the guiding star.
  • I noticed that while GetPodCredentials will translate a k8s.ErrPodNotFound into server.ErrPodNotFound, GetPodRole does not - should the code for the latter be updated to do that?

@pingles
Copy link
Contributor

pingles commented May 24, 2019

Sorry it's taken me ages to take a look; I've been stacked at work. I'll see if I can get a chance to review next week unless one of @uswitch/cloud beat me to it 😄

@hoelzro
Copy link
Contributor Author

hoelzro commented Jun 18, 2019

@pingles No worries - please ping me when you have a chance to review it!

@hoelzro
Copy link
Contributor Author

hoelzro commented Jul 2, 2019

@pingles Have you or @uswitch/cloud had a chance to look at this?

@pingles
Copy link
Contributor

pingles commented Jul 2, 2019

I'm really sorry I haven't. It looks ok to me, @tombooth / @Joseph-Irving could you take a look too please?

Copy link
Contributor

@Joseph-Irving Joseph-Irving left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Joseph-Irving Joseph-Irving merged commit 02f63bc into uswitch:master Jul 3, 2019
@hoelzro
Copy link
Contributor Author

hoelzro commented Jul 3, 2019

@pingles @Joseph-Irving Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants