Unexpected error from sdk: failed to decode <role> EC2 IMDS role credentials #1253
Comments
Quick update: upgrading to AWS SDK v1.4.0 didn't seem to affect the incidence of these errors. Still happening occasionally.
I have also encountered this problem, though I have found it incredibly difficult to reproduce.
There are not many places in the SDK that call a context cancellation function. One stood out to me:

```go
func (m *operationTimeout) HandleInitialize(
	ctx context.Context, input middleware.InitializeInput, next middleware.InitializeHandler,
) (
	output middleware.InitializeOutput, metadata middleware.Metadata, err error,
) {
	var cancelFn func()
	ctx, cancelFn = context.WithTimeout(ctx, m.Timeout)
	defer cancelFn()
	return next.HandleInitialize(ctx, input)
}
```
@skotambkar, is it possible for that deferred call to `cancelFn` to fire before the caller has finished reading the response body? The IMDS deserializer hands the body back unread:

```go
func buildGetMetadataOutput(resp *smithyhttp.Response) (interface{}, error) {
	return &GetMetadataOutput{
		Content: resp.Body,
	}, nil
}
```

People like @gsaraf and myself -- who execute our programs from within containers -- may be more likely to encounter this race because the MTU on our container links is almost certainly set to a smaller value (1,500) than the interfaces on the host EC2 instance (9,000). The IMDS response body for role credentials is large enough to span more than one frame at the smaller MTU. I call this a race because I think there is some internal buffering in Go's net/http client: a small response may already be sitting in memory by the time the operation returns, while a larger one may still be in flight when the deferred cancellation fires.
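For illustration, here is a minimal, self-contained sketch of the suspected race outside the SDK. The `fetchBody` helper, test server, and body size are all made up for the example: a function creates a timeout context, defers its cancellation, and returns the unread body; the caller's subsequent read can then fail with a cancelled context.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// fetchBody mimics the operationTimeout middleware: it creates a timeout
// context, defers its cancellation, and hands the *unread* body back to the
// caller. The deferred cancel runs as soon as fetchBody returns.
func fetchBody(url string) (io.ReadCloser, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel() // fires before the caller ever reads the body

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	return resp.Body, nil
}

func main() {
	// A large body makes it unlikely that net/http has buffered everything
	// before fetchBody returns, which is what exposes the race.
	large := strings.Repeat("x", 1<<20)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, large)
	}))
	defer srv.Close()

	body, err := fetchBody(srv.URL)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer body.Close()

	// Reading after fetchBody has returned (and cancel has fired) can fail
	// with "context canceled" when the body was not already fully buffered.
	_, err = io.Copy(io.Discard, body)
	fmt.Println("read error:", err)
}
```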
To test that hypothesis, I applied the following patch to the imds package:

```diff
--- a/feature/ec2/imds/api_op_GetMetadata.go
+++ b/feature/ec2/imds/api_op_GetMetadata.go
@@ -1,6 +1,7 @@
 package imds
 
 import (
+	"bytes"
 	"context"
 	"fmt"
 	"io"
@@ -70,7 +71,12 @@ func buildGetMetadataPath(params interface{}) (string, error) {
 }
 
 func buildGetMetadataOutput(resp *smithyhttp.Response) (interface{}, error) {
+	buf := bytes.Buffer{}
+	_, err := io.Copy(&buf, resp.Body)
+	if err != nil {
+		return nil, err
+	}
 	return &GetMetadataOutput{
-		Content: resp.Body,
+		Content: io.NopCloser(&buf),
 	}, nil
 }
```

These eager reads should be sequenced before the call to `cancelFn`. Seems to work OK so far. Though, given how difficult it was to reproduce this problem in the first place, I cannot be certain that I have made any difference. I'll let this little test program run for a few more hours. If it should eventually break, I will post a follow-up to say that I was on the wrong path.
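The test program itself is only linked in the original comment, not reproduced here; a hypothetical loop along these lines (the role path and sleep interval are placeholders) exercises the same code path by repeatedly fetching role credentials through the imds client and draining the body:

```go
package main

import (
	"context"
	"io"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := imds.NewFromConfig(cfg)

	// "my-role" is a placeholder; substitute the instance profile's role name.
	const path = "iam/security-credentials/my-role"

	for {
		out, err := client.GetMetadata(context.Background(), &imds.GetMetadataInput{Path: path})
		if err != nil {
			log.Printf("GetMetadata failed: %v", err)
		} else {
			// The race, if present, surfaces while reading the body.
			if _, err := io.Copy(io.Discard, out.Content); err != nil {
				log.Printf("reading body failed: %v", err)
			}
			out.Content.Close()
		}
		time.Sleep(time.Second)
	}
}
```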
My test program has been running for over 12h now, and still seems fine. I was previously able to repro after about 30m. This looks promising.
@saj - nice investigation! Sounds really promising! On my end, I added a trivial retry mechanism on top of all of our AWS calls, which works but is a really ugly solution. @skotambkar - would a PR with @saj's suggested change be acceptable?
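For reference, the kind of blunt retry workaround described above might look something like this. This is a hypothetical sketch, not the commenter's actual code; the error-string match, retry count, and backoff are arbitrary.

```go
package workaround

import (
	"context"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// describeImagesWithRetry retries DescribeImages a few times when the error
// looks like the intermittent IMDS credential-decode failure.
func describeImagesWithRetry(ctx context.Context, client *ec2.Client, in *ec2.DescribeImagesInput) (*ec2.DescribeImagesOutput, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		out, err := client.DescribeImages(ctx, in)
		if err == nil {
			return out, nil
		}
		lastErr = err
		// Crude match on the observed error text; a real implementation would
		// inspect wrapped error types rather than strings.
		if !strings.Contains(err.Error(), "EC2 IMDS role credentials") {
			return nil, err
		}
		time.Sleep(time.Duration(attempt+1) * 500 * time.Millisecond)
	}
	return nil, lastErr
}
```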
Quick update. I deployed my earlier patch out to production jobs shortly after my last post. I have not been alerted to a recurrence of this problem since.

To help cement the priority of this issue relative to everything else in the maintainers' backlog, I think it is accurate to state that -- as a result of this problem -- aws-sdk-go-v2 may not function reliably for AWS customers who execute containerised Go workloads on EC2 and rely on role credentials (instance profiles). As far as I know, even the latest stable releases of Docker CE still default to an MTU of 1,500.

(We probably won't see anyone from the Kubernetes crowd in this bug. Typically, it would not make much sense for workloads placed by a multi-tenant job scheduler to source their credentials from the IMDS.)
Indeed I do see this issue happening quite often on a local binary trying to list all S3 buckets (code). The binary runs on a machine in London, UK, connected through VPN to Ashburn, Virginia, USA. Here instance roles are used as well, instead of static credentials. As mentioned above, because there is a race between cancelling the context and consuming the response body, I suspect that only responses whose bodies can be fully buffered from a single packet will always work. In this case we have enough buckets that the response likely requires multiple TCP round-trips.

This pretty much means that the current implementation of the request middleware makes it impossible to return a data stream that was not already fully buffered in memory. This basically defeats the whole point of streams, which Go's net/http package relies on (and most of Go I/O really).

Also, another important point here is why we have a hardcoded default operation timeout at all:
aws-sdk-go-v2/feature/ec2/imds/request_middleware.go, lines 54 to 56 at bf1672a
This seems unavoidable, which is not really reasonable. Considering the API functions already take contexts from the caller, the caller has control over request timeouts, and that does not need to be enforced within the API. Shouldn't we just remove the built-in default timeout?
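To illustrate that point: the caller can already bound the operation through the context it passes in, and it cancels only once it has finished with the response. In this sketch the metadata path and the five-second figure are arbitrary examples.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := imds.NewFromConfig(cfg)

	// The caller decides how long the whole operation may take, including
	// reading the body, and cancels only when *it* is done with the response.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	out, err := client.GetMetadata(ctx, &imds.GetMetadataInput{Path: "instance-id"})
	if err != nil {
		log.Fatal(err)
	}
	defer out.Content.Close()

	id, err := io.ReadAll(out.Content)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(id))
}
```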
Thanks for providing additional details on this issue. I think we've identified the change that is needed to fix this bug. The IMDS client's response deserialization should read the full response body before handing the response back up the middleware stack. This will ensure that the HTTP response will have been fully read before the operationTimeout middleware's deferred cancellation runs.

This is independent of the built-in default timeout of the IMDS client. I'll take a look at that separately after fixing the response deserialization issue.
…t race (#1448)

Fixes #1253, a race between reading an IMDS response and the operationTimeout middleware cleaning up its timeout context. Changes the IMDS client to always buffer the received response body before the result is deserialized. This ensures that the consumer of the operation's response body will not race with context cleanup within the middleware stack.

Updates the IMDS client operations to not override the passed-in Context's Deadline or Timeout. If a client operation is called with a Context that carries a Deadline or Timeout, the client will no longer override it with the client's default timeout. Updates operationTimeout so that if DefaultTimeout is unset (i.e. zero), operationTimeout will not set a default timeout on the context.
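Based on that description, a sketch of the shape the middleware guard might take. This is an illustration only, not the merged code; the `DefaultTimeout` field name is taken from the change notes above.

```go
func (m *operationTimeout) HandleInitialize(
	ctx context.Context, input middleware.InitializeInput, next middleware.InitializeHandler,
) (
	output middleware.InitializeOutput, metadata middleware.Metadata, err error,
) {
	// Respect a caller-supplied deadline, and skip wrapping entirely when no
	// default timeout is configured; only otherwise apply the default.
	if _, hasDeadline := ctx.Deadline(); !hasDeadline && m.DefaultTimeout != 0 {
		var cancelFn func()
		ctx, cancelFn = context.WithTimeout(ctx, m.DefaultTimeout)
		defer cancelFn()
	}
	return next.HandleInitialize(ctx, input)
}
```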
Describe the bug
In a service using the SDK, running in a Docker container on a t2.micro EC2 instance, we get occasional errors like the one in the issue title, where `orchestrator` is the instance IAM role. The context we are passing is `context.TODO()`, so it wasn't actually cancelled. I checked, and the instance is almost idle - it always has full CPU credits (if that matters).
The error seems to be returned 3-4 seconds after calling DescribeImages.
The errors aren't distributed randomly over time - sometimes they happen several times in a row.
Version of AWS SDK for Go?
1.3.2
Version of Go (go version)?
1.16.3
To Reproduce (observed behavior)
Expected behavior
DescribeImages works consistently :)
Additional context
Came across `metadata_service_num_attempts` / `metadata_service_timeout` in searches for semi-related error messages, but those don't appear to be settable in the Go SDK. Not sure if it would have helped.