Unexpected error from sdk: failed to decode <role> EC2 IMDS role credentials #1253

gsaraf · 2021-05-10T15:01:52Z

I've gone though the API reference
I've checked AWS Forums and StackOverflow for answers
I've searched for previous similar issues and didn't find any solution

Describe the bug
In a service using the SDK, running in a docker container on a t2-micro EC2 instance, we get occasional errors:

operation error ECR: DescribeImages, failed to sign request: failed to retrieve credentials: failed to decode orchestrator EC2 IMDS role credentials, context canceled

orchestrator is the instance IAM role.
The context we are passing is context.TODO(), so it wasn't actually cancelled.
I checked, and the instance is almost idle - always has full cpu credits (if that matters).

The error seems to return 3-4 seconds after calling DescribeImages.
The errors aren't distributed randomly over time - sometimes they happen several times in a row.

Version of AWS SDK for Go?
1.3.2

Version of Go (go version)?
1.16.3

To Reproduce (observed behavior)

func getConfig() (aws.Config, error) {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		return aws.Config{}, fmt.Errorf("configuration error = %v", err)
	}

	if len(cfg.Region) == 0 {
		l.Logger.Warnf("Region not set in aws configuration, setting it to %s", awsRegion)
		cfg.Region = "eu-central-1"
	}

	return cfg, nil
}

var imageNotExistErr = types.ImageNotFoundException{}

func ImagePushTime(repositoryName, imageTag string) (*time.Time, error) {
	cfg, err := getConfig()
	if err != nil {
		return nil, err
	}
	client := ecr.NewFromConfig(cfg)
	describeImageInput := &ecr.DescribeImagesInput{
		ImageIds:       []types.ImageIdentifier{{ImageTag: &imageTag}},
		RepositoryName: &repositoryName,
	}
	describeOut, err := client.DescribeImages(context.TODO(), describeImageInput)
	if err != nil {
		var ae smithy.APIError
		if errors.As(err, &ae) {
			if ae.ErrorCode() == imageNotExistErr.ErrorCode() {
				return nil, os.ErrNotExist
			}
		}

		return nil, err
	}

	if len(describeOut.ImageDetails) != 1 {
		return nil, errors.New("there may be exactly one image returned at this point")
	}

	return describeOut.ImageDetails[0].ImagePushedAt, nil
}

Expected behavior
DescribeImages works consistently :)

Additional context

I did find references to metadata_service_num_attempts / metadata_service_timeout in searches for semi-related error messages, but those don't appear to be settable in the go sdk. Not sure if it would have helped.
I've upgraded our code to v1.4.0 of the aws go sdk. Since the errors aren't consistent, it may take a while before we can be sure it has been solved / still exists.

The text was updated successfully, but these errors were encountered:

gsaraf · 2021-05-15T17:54:09Z

Quick update: upgrading to AWS SDK v1.4.0 didn't seem to affect the incidence of these errors. Still happening occasionally.

saj · 2021-06-09T23:57:47Z

I have also encountered this problem, though I have found it incredibly difficult to reproduce.

failed to decode ... EC2 IMDS role credentials, context canceled

There are not many places in the SDK that call a context cancellation function. One stood out to me:

func (m *operationTimeout) HandleInitialize(
	ctx context.Context, input middleware.InitializeInput, next middleware.InitializeHandler,
) (
	output middleware.InitializeOutput, metadata middleware.Metadata, err error,
) {
	var cancelFn func()

	ctx, cancelFn = context.WithTimeout(ctx, m.Timeout)
	defer cancelFn()

	return next.HandleInitialize(ctx, input)
}

feature/ec2/imds/request_middleware.go

@skotambkar, is it possible for that deferred call to cancelFn to race with a response body read in ec2rolecreds? Some exported imds.Get* methods handle body reads as part of the middleware stack: those reads are therefore guaranteed to execute before the call to cancelFn. I don't think that guarantee applies here, however. GetMetadata returns a raw io.ReadCloser, and later reads escape the middleware stack:

func buildGetMetadataOutput(resp *smithyhttp.Response) (interface{}, error) {
	return &GetMetadataOutput{
		Content: resp.Body,
	}, nil
}

People like @gsaraf and myself -- who execute our programs from within containers -- may be more likely to encounter this race because the MTU on our container links are almost certainly set to a smaller value (1,500) than the interfaces on the host EC2 instance (9,000). The IMDS response body for /latest/meta-data/iam/security-credentials/... is long enough to fragment over multiple IP packets when the program is executing from within a container. (I was unable to reproduce outside a container despite hours of continuous testing.)

I call this a race because I think there is some internal buffering in Go's http package that may sometimes allow a body read to succeed even if the context is canceled.

saj · 2021-06-10T01:55:30Z

To test that hypothesis, I applied the following patch to the imds package:

--- a/feature/ec2/imds/api_op_GetMetadata.go
+++ b/feature/ec2/imds/api_op_GetMetadata.go
@@ -1,6 +1,7 @@
 package imds

 import (
+       "bytes"
        "context"
        "fmt"
        "io"
@@ -70,7 +71,12 @@ func buildGetMetadataPath(params interface{}) (string, error) {
 }

 func buildGetMetadataOutput(resp *smithyhttp.Response) (interface{}, error) {
+       buf := bytes.Buffer{}
+       _, err := io.Copy(&buf, resp.Body)
+       if err != nil {
+               return nil, err
+       }
        return &GetMetadataOutput{
-               Content: resp.Body,
+               Content: io.NopCloser(&buf),
        }, nil
 }

These eager reads should be sequenced before the call to cancelFn.

Seems to work OK so far. Though, given how difficult it was to reproduce this problem in the first place, I cannot be certain that I have made any difference. I'll let this little test program run for a few more hours. If it should eventually break, I will post a follow-up to say that I was on the wrong path.

saj · 2021-06-10T13:41:24Z

My test program has been running for over 12h now, and still seems fine. I was previously able to repro after about 30m. This looks promising.

gsaraf · 2021-06-14T10:53:20Z

@saj - nice investigation! Sounds really promising! On my end I added a trivial retry mechanism on top of all of our AWS calls which works but is a really ugly solution.

@skotambkar - Would a PR with @saj's suggested change be acceptable?

saj · 2021-06-18T01:26:37Z

Quick update. I deployed my earlier patch out to production jobs shortly after my last post. I have not been alerted to a recurrence of this problem since.

To help cement the priority of this issue relative to everything else in the maintainers' backlog, I think it is accurate to state that -- as a result of this problem -- aws-sdk-go-v2 may not function reliably for AWS customers who execute containerised Go workloads on EC2 and rely on role credentials (instance profiles). As far as I know, even the latest stable releases of Docker CE still default to an MTU of 1,500.

(We probably won't see anyone from the Kubernetes crowd in this bug. Typically, it would not make much sense for workloads placed by a multi-tenant job scheduler to source their credentials from the IMDS.)

bcap · 2021-08-09T18:50:38Z

Indeed I do see this issue happening quite often on a local binary trying to list all S3 buckets (code). The binary runs on a machine in London-UK connected through VPN to Ashburn-Virginia-USA. Here instance roles are used as well instead of credentials.

Like mentioned above, because there is a race between cancelling the context and consuming the response body, I suspect that only responses that can have their bodies fully buffered in a single packet will always work. In this case we have enough buckets, which will likely cause multiple TCP round-trips.

This pretty much means that the current implementation of the request middleware makes it impossible to return a data stream that was not fully buffered already in memory. This basically defeats the whole point of streams, which golang's net/http package relies on (and most of golang io really).

Also, another important point here is why we have a hardcoded 5s timeout for metadata operations?

aws-sdk-go-v2/feature/ec2/imds/request_middleware.go

Line 213 in bf1672a

defaultOperationTimeout = 5 * time.Second

aws-sdk-go-v2/feature/ec2/imds/request_middleware.go

Lines 54 to 56 in bf1672a

    
           err = stack.Initialize.Add(&operationTimeout{ 
        
           	Timeout: defaultOperationTimeout, 
        
           }, middleware.Before)

This seems unavoidable, which is not really reasonable. Considering the API functions already take contexts from the caller, the caller has control to timeout requests and that does not need to be enforced within the API.

Shouldnt we just remove operationTimeout entirely from the middleware stack?

jasdel · 2021-10-05T19:50:05Z

Thanks for proving additional details on this issue. I think we've identified the change that is needed to fix this bug. The IMDS client's response deserialization should read the full response body before handing back the response up the middleware stack. This will ensure that the HTTP response will of been ready before the cancelFn could be called.

This is independent of the built in default timeout of the IMDS client. I'll take a look at that separately after fixing the deserialization of response issue.

Fixes aws#1253 race between reading a IMDS response, and the operationTimeout middleware cleaning up its timeout context. Changes the IMDS client to always buffer the response body received, before the result is deserialized. This ensures that the consumer of the operation's response body will not race with context cleanup within the middleware stack.

…imeout race Fixes aws#1253 race between reading a IMDS response, and the operationTimeout middleware cleaning up its timeout context. Changes the IMDS client to always buffer the response body received, before the result is deserialized. This ensures that the consumer of the operation's response body will not race with context cleanup within the middleware stack.

…t race (#1448) Fixes #1253 race between reading a IMDS response, and the operationTimeout middleware cleaning up its timeout context. Changes the IMDS client to always buffer the response body received, before the result is deserialized. This ensures that the consumer of the operation's response body will not race with context cleanup within the middleware stack. Updates the IMDS Client operations to not override the passed in Context's Deadline or Timeout options. If an Client operation is called with a Context with a Deadline or Timeout, the client will no longer override it with the client's default timeout. Updates operationTimeout so that if DefaultTimeout is unset (aka zero) operationTimeout will not set a default timeout on the context.

github-actions · 2021-10-06T18:19:18Z

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

…t race (aws#1448) Fixes aws#1253 race between reading a IMDS response, and the operationTimeout middleware cleaning up its timeout context. Changes the IMDS client to always buffer the response body received, before the result is deserialized. This ensures that the consumer of the operation's response body will not race with context cleanup within the middleware stack. Updates the IMDS Client operations to not override the passed in Context's Deadline or Timeout options. If an Client operation is called with a Context with a Deadline or Timeout, the client will no longer override it with the client's default timeout. Updates operationTimeout so that if DefaultTimeout is unset (aka zero) operationTimeout will not set a default timeout on the context.

gsaraf added the bug This issue is a bug. label May 10, 2021

skotambkar self-assigned this May 14, 2021

OrEisenberg mentioned this issue Sep 23, 2021

Bug Report: Race condition in AWSTOE. aws/ec2-image-builder-roadmap#66

Closed

ytsssun mentioned this issue Oct 4, 2021

Container image builds now randomly fail when triggered through CloudFormation (due to race condition in Java SDK?) aws-cloudformation/aws-cloudformation-resource-providers-imagebuilder#4

Closed

jasdel mentioned this issue Oct 5, 2021

feature/ec2/imds: Fix Client's response handling and operation timeout race #1448

Merged

jasdel closed this as completed in #1448 Oct 6, 2021

toothbrush mentioned this issue Aug 17, 2022

Upgrade AWS KMS SDK to v1.18.4 realestate-com-au/shush#35

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected error from sdk: failed to decode <role> EC2 IMDS role credentials #1253

Unexpected error from sdk: failed to decode <role> EC2 IMDS role credentials #1253

gsaraf commented May 10, 2021 •

edited

Loading

gsaraf commented May 15, 2021

saj commented Jun 9, 2021

saj commented Jun 10, 2021

saj commented Jun 10, 2021

gsaraf commented Jun 14, 2021

saj commented Jun 18, 2021

bcap commented Aug 9, 2021 •

edited

Loading

jasdel commented Oct 5, 2021

github-actions bot commented Oct 6, 2021

Unexpected error from sdk: failed to decode <role> EC2 IMDS role credentials #1253

Unexpected error from sdk: failed to decode <role> EC2 IMDS role credentials #1253

Comments

gsaraf commented May 10, 2021 • edited Loading

gsaraf commented May 15, 2021

saj commented Jun 9, 2021

saj commented Jun 10, 2021

saj commented Jun 10, 2021

gsaraf commented Jun 14, 2021

saj commented Jun 18, 2021

bcap commented Aug 9, 2021 • edited Loading

jasdel commented Oct 5, 2021

github-actions bot commented Oct 6, 2021

⚠️COMMENT VISIBILITY WARNING⚠️

gsaraf commented May 10, 2021 •

edited

Loading

bcap commented Aug 9, 2021 •

edited

Loading