Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected error from sdk: failed to decode <role> EC2 IMDS role credentials #1253

Closed
3 tasks done
gsaraf opened this issue May 10, 2021 · 9 comments · Fixed by #1448
Closed
3 tasks done

Unexpected error from sdk: failed to decode <role> EC2 IMDS role credentials #1253

gsaraf opened this issue May 10, 2021 · 9 comments · Fixed by #1448
Assignees
Labels
bug This issue is a bug.

Comments

@gsaraf
Copy link

gsaraf commented May 10, 2021

Describe the bug
In a service using the SDK, running in a docker container on a t2-micro EC2 instance, we get occasional errors:

operation error ECR: DescribeImages, failed to sign request: failed to retrieve credentials: failed to decode orchestrator EC2 IMDS role credentials, context canceled

orchestrator is the instance IAM role.
The context we are passing is context.TODO(), so it wasn't actually cancelled.
I checked, and the instance is almost idle - always has full cpu credits (if that matters).

The error seems to return 3-4 seconds after calling DescribeImages.
The errors aren't distributed randomly over time - sometimes they happen several times in a row.

Version of AWS SDK for Go?
1.3.2

Version of Go (go version)?
1.16.3

To Reproduce (observed behavior)

func getConfig() (aws.Config, error) {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		return aws.Config{}, fmt.Errorf("configuration error = %v", err)
	}

	if len(cfg.Region) == 0 {
		l.Logger.Warnf("Region not set in aws configuration, setting it to %s", awsRegion)
		cfg.Region = "eu-central-1"
	}

	return cfg, nil
}

var imageNotExistErr = types.ImageNotFoundException{}

func ImagePushTime(repositoryName, imageTag string) (*time.Time, error) {
	cfg, err := getConfig()
	if err != nil {
		return nil, err
	}
	client := ecr.NewFromConfig(cfg)
	describeImageInput := &ecr.DescribeImagesInput{
		ImageIds:       []types.ImageIdentifier{{ImageTag: &imageTag}},
		RepositoryName: &repositoryName,
	}
	describeOut, err := client.DescribeImages(context.TODO(), describeImageInput)
	if err != nil {
		var ae smithy.APIError
		if errors.As(err, &ae) {
			if ae.ErrorCode() == imageNotExistErr.ErrorCode() {
				return nil, os.ErrNotExist
			}
		}

		return nil, err
	}

	if len(describeOut.ImageDetails) != 1 {
		return nil, errors.New("there may be exactly one image returned at this point")
	}

	return describeOut.ImageDetails[0].ImagePushedAt, nil
}

Expected behavior
DescribeImages works consistently :)

Additional context

  • I did find references to metadata_service_num_attempts / metadata_service_timeout in searches for semi-related error messages, but those don't appear to be settable in the go sdk. Not sure if it would have helped.
  • I've upgraded our code to v1.4.0 of the aws go sdk. Since the errors aren't consistent, it may take a while before we can be sure it has been solved / still exists.
@gsaraf gsaraf added the bug This issue is a bug. label May 10, 2021
@skotambkar skotambkar self-assigned this May 14, 2021
@gsaraf
Copy link
Author

gsaraf commented May 15, 2021

Quick update: upgrading to AWS SDK v1.4.0 didn't seem to affect the incidence of these errors. Still happening occasionally.

@saj
Copy link

saj commented Jun 9, 2021

I have also encountered this problem, though I have found it incredibly difficult to reproduce.

failed to decode ... EC2 IMDS role credentials, context canceled

There are not many places in the SDK that call a context cancellation function. One stood out to me:

func (m *operationTimeout) HandleInitialize(
	ctx context.Context, input middleware.InitializeInput, next middleware.InitializeHandler,
) (
	output middleware.InitializeOutput, metadata middleware.Metadata, err error,
) {
	var cancelFn func()

	ctx, cancelFn = context.WithTimeout(ctx, m.Timeout)
	defer cancelFn()

	return next.HandleInitialize(ctx, input)
}

feature/ec2/imds/request_middleware.go

@skotambkar, is it possible for that deferred call to cancelFn to race with a response body read in ec2rolecreds? Some exported imds.Get* methods handle body reads as part of the middleware stack: those reads are therefore guaranteed to execute before the call to cancelFn. I don't think that guarantee applies here, however. GetMetadata returns a raw io.ReadCloser, and later reads escape the middleware stack:

func buildGetMetadataOutput(resp *smithyhttp.Response) (interface{}, error) {
	return &GetMetadataOutput{
		Content: resp.Body,
	}, nil
}

People like @gsaraf and myself -- who execute our programs from within containers -- may be more likely to encounter this race because the MTU on our container links are almost certainly set to a smaller value (1,500) than the interfaces on the host EC2 instance (9,000). The IMDS response body for /latest/meta-data/iam/security-credentials/... is long enough to fragment over multiple IP packets when the program is executing from within a container. (I was unable to reproduce outside a container despite hours of continuous testing.)

I call this a race because I think there is some internal buffering in Go's http package that may sometimes allow a body read to succeed even if the context is canceled.

@saj
Copy link

saj commented Jun 10, 2021

To test that hypothesis, I applied the following patch to the imds package:

--- a/feature/ec2/imds/api_op_GetMetadata.go
+++ b/feature/ec2/imds/api_op_GetMetadata.go
@@ -1,6 +1,7 @@
 package imds

 import (
+       "bytes"
        "context"
        "fmt"
        "io"
@@ -70,7 +71,12 @@ func buildGetMetadataPath(params interface{}) (string, error) {
 }

 func buildGetMetadataOutput(resp *smithyhttp.Response) (interface{}, error) {
+       buf := bytes.Buffer{}
+       _, err := io.Copy(&buf, resp.Body)
+       if err != nil {
+               return nil, err
+       }
        return &GetMetadataOutput{
-               Content: resp.Body,
+               Content: io.NopCloser(&buf),
        }, nil
 }

These eager reads should be sequenced before the call to cancelFn.

Seems to work OK so far. Though, given how difficult it was to reproduce this problem in the first place, I cannot be certain that I have made any difference. I'll let this little test program run for a few more hours. If it should eventually break, I will post a follow-up to say that I was on the wrong path.

@saj
Copy link

saj commented Jun 10, 2021

My test program has been running for over 12h now, and still seems fine. I was previously able to repro after about 30m. This looks promising.

@gsaraf
Copy link
Author

gsaraf commented Jun 14, 2021

@saj - nice investigation! Sounds really promising! On my end I added a trivial retry mechanism on top of all of our AWS calls which works but is a really ugly solution.

@skotambkar - Would a PR with @saj's suggested change be acceptable?

@saj
Copy link

saj commented Jun 18, 2021

Quick update. I deployed my earlier patch out to production jobs shortly after my last post. I have not been alerted to a recurrence of this problem since.

To help cement the priority of this issue relative to everything else in the maintainers' backlog, I think it is accurate to state that -- as a result of this problem -- aws-sdk-go-v2 may not function reliably for AWS customers who execute containerised Go workloads on EC2 and rely on role credentials (instance profiles). As far as I know, even the latest stable releases of Docker CE still default to an MTU of 1,500.

(We probably won't see anyone from the Kubernetes crowd in this bug. Typically, it would not make much sense for workloads placed by a multi-tenant job scheduler to source their credentials from the IMDS.)

@bcap
Copy link

bcap commented Aug 9, 2021

Indeed I do see this issue happening quite often on a local binary trying to list all S3 buckets (code). The binary runs on a machine in London-UK connected through VPN to Ashburn-Virginia-USA. Here instance roles are used as well instead of credentials.

Like mentioned above, because there is a race between cancelling the context and consuming the response body, I suspect that only responses that can have their bodies fully buffered in a single packet will always work. In this case we have enough buckets, which will likely cause multiple TCP round-trips.

This pretty much means that the current implementation of the request middleware makes it impossible to return a data stream that was not fully buffered already in memory. This basically defeats the whole point of streams, which golang's net/http package relies on (and most of golang io really).

Also, another important point here is why we have a hardcoded 5s timeout for metadata operations?

defaultOperationTimeout = 5 * time.Second

err = stack.Initialize.Add(&operationTimeout{
Timeout: defaultOperationTimeout,
}, middleware.Before)

This seems unavoidable, which is not really reasonable. Considering the API functions already take contexts from the caller, the caller has control to timeout requests and that does not need to be enforced within the API.

Shouldnt we just remove operationTimeout entirely from the middleware stack?

@jasdel
Copy link
Contributor

jasdel commented Oct 5, 2021

Thanks for proving additional details on this issue. I think we've identified the change that is needed to fix this bug. The IMDS client's response deserialization should read the full response body before handing back the response up the middleware stack. This will ensure that the HTTP response will of been ready before the cancelFn could be called.

This is independent of the built in default timeout of the IMDS client. I'll take a look at that separately after fixing the deserialization of response issue.

jasdel added a commit to jasdel/aws-sdk-go-v2 that referenced this issue Oct 5, 2021
Fixes aws#1253 race between reading a IMDS response, and the
operationTimeout middleware cleaning up its timeout context. Changes the
IMDS client to always buffer the response body received, before the
result is deserialized. This ensures that the consumer of the
operation's response body will not race with context cleanup within the
middleware stack.
jasdel added a commit to jasdel/aws-sdk-go-v2 that referenced this issue Oct 5, 2021
…imeout race

Fixes aws#1253 race between reading a IMDS response, and the
operationTimeout middleware cleaning up its timeout context. Changes the
IMDS client to always buffer the response body received, before the
result is deserialized. This ensures that the consumer of the
operation's response body will not race with context cleanup within the
middleware stack.
jasdel added a commit that referenced this issue Oct 6, 2021
…t race (#1448)

Fixes #1253 race between reading a IMDS response, and the
operationTimeout middleware cleaning up its timeout context. Changes the
IMDS client to always buffer the response body received, before the
result is deserialized. This ensures that the consumer of the
operation's response body will not race with context cleanup within the
middleware stack.

Updates the IMDS Client operations to not override the passed in
Context's Deadline or Timeout options. If an Client operation is called
with a Context with a Deadline or Timeout, the client will no longer
override it with the client's default timeout.

Updates operationTimeout so that if DefaultTimeout is unset (aka zero)
operationTimeout will not set a default timeout on the context.
@github-actions
Copy link

github-actions bot commented Oct 6, 2021

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

jrichard8 pushed a commit to jrichard8/aws-sdk-go-v2 that referenced this issue Feb 14, 2022
…t race (aws#1448)

Fixes aws#1253 race between reading a IMDS response, and the
operationTimeout middleware cleaning up its timeout context. Changes the
IMDS client to always buffer the response body received, before the
result is deserialized. This ensures that the consumer of the
operation's response body will not race with context cleanup within the
middleware stack.

Updates the IMDS Client operations to not override the passed in
Context's Deadline or Timeout options. If an Client operation is called
with a Context with a Deadline or Timeout, the client will no longer
override it with the client's default timeout.

Updates operationTimeout so that if DefaultTimeout is unset (aka zero)
operationTimeout will not set a default timeout on the context.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants