retry on 400 RequestTimeout #411

Closed
wants to merge 1 commit into from

Conversation

kahing commented Oct 27, 2015

I sometimes get "Your socket connection to the server was not read
from or written to within the timeout period. Idle connections
will be closed."

Seems like these errors should be automatically retried.
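
The change proposed here is to treat S3's RequestTimeout error code (returned with an HTTP 400 status) as retryable. For illustration only, here is a minimal sketch of the same idea expressed as a custom retryer layered on the SDK's default one; the package layout, the client.DefaultRetryer embedding, and the session-based constructor assume a later aws-sdk-go v1 structure than the 0.9.x version shown in the logs, and this is not the actual patch:

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// requestTimeoutRetryer retries S3's RequestTimeout error code in addition to
// everything the default retryer already retries.
type requestTimeoutRetryer struct {
	client.DefaultRetryer
}

func (r requestTimeoutRetryer) ShouldRetry(req *request.Request) bool {
	if aerr, ok := req.Error.(awserr.Error); ok && aerr.Code() == "RequestTimeout" {
		return true
	}
	return r.DefaultRetryer.ShouldRetry(req)
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	svc := s3.New(sess, &aws.Config{
		Retryer: requestTimeoutRetryer{client.DefaultRetryer{NumMaxRetries: 3}},
	})
	_ = svc // use svc.UploadPart(...) as usual; RequestTimeout responses are now retried
}

The change eventually merged from this PR (ae0b628, see below) handles this inside the SDK, so callers do not need a wrapper like this.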

kahing commented Oct 27, 2015

@andrewgaul this may solve your problem


kahing commented Oct 27, 2015

I am unsure why I get this error to begin with. Here's a log excerpt for this error:

2015/10/27 22:02:15 DEBUG: Request s3/UploadPart Details:
---[ REQUEST POST-SIGN ]-----------------------------
PUT http://goofys.s3-us-west-2.amazonaws.com/test_dir/largefile?partNumber=29&uploadId=figScXN8Bs_SZCdsosYLepPjje7tCKufqjErL4x3irjOa5e1oxE7ZG3VhW.3sr47v_SEYXGBAbBVYLxU37MaKcldukqi9P21eAcqMjAgAWMSp_pmd4iqmBXYnUAqIQny HTTP/1.1
Host: goofys.s3-us-west-2.amazonaws.com
User-Agent: aws-sdk-go/0.9.17
Content-Length: 5242880
Authorization: <omitted>
X-Amz-Content-Sha256: c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29
X-Amz-Date: 20151027T220215Z
Accept-Encoding: gzip
...
2015/10/27 22:03:03 DEBUG: Response s3/UploadPart Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 400 Bad Request
Connection: close
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 27 Oct 2015 22:02:42 GMT
Server: AmazonS3
X-Amz-Id-2: T3xI0zSTjXYdr+BieOZL8P+TaNCLIZga5wbSdGpU5hMWTVk2RVZUg1ILLJPjikxctDoKQ3GEkJY=
X-Amz-Request-Id: 78438C4C74F7C9AD


2015/10/27 22:03:03 DEBUG: Validate Response s3/UploadPart failed, will retry, error UnknownError: unknown error
2015/10/27 22:03:03 DEBUG: Request s3/UploadPart Details:
---[ REQUEST POST-SIGN ]-----------------------------
PUT http://goofys.s3-us-west-2.amazonaws.com/test_dir/largefile?partNumber=29&uploadId=figScXN8Bs_SZCdsosYLepPjje7tCKufqjErL4x3irjOa5e1oxE7ZG3VhW.3sr47v_SEYXGBAbBVYLxU37MaKcldukqi9P21eAcqMjAgAWMSp_pmd4iqmBXYnUAqIQny HTTP/1.1
Host: goofys.s3-us-west-2.amazonaws.com
User-Agent: aws-sdk-go/0.9.17
Content-Length: 5242880
Authorization: <omitted>
X-Amz-Content-Sha256: c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29
X-Amz-Date: 20151027T220215Z
Accept-Encoding: gzip

This is run from EC2 to an S3 bucket in the same region. Notice that it took 48 seconds to get this error while uploading a 5 MB part.


jasdel commented Oct 28, 2015

Hi @kahing, thanks for creating the PR. In the error log you provided, were you using the aws.LogDebugWithRequestErrors LogLevel? From the log message it looks like the request was expected to be retried. Did you see multiple entries of the following?

2015/10/27 22:03:03 DEBUG: Validate Response s3/UploadPart failed, will retry, ...

Would you be able to include the error code and message that you received as a response for the UploadPart API call?
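
For reference, a minimal sketch of enabling that log level; the session-based constructor is an assumption matching a later aws-sdk-go v1 layout, not the 0.9.x API where the config was passed straight to s3.New:

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Log request/response details for requests that end in an error,
	// including the SDK's retry decisions.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("us-west-2"),
		LogLevel: aws.LogLevel(aws.LogDebugWithRequestErrors),
	}))
	svc := s3.New(sess)
	_ = svc
}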


kahing commented Oct 28, 2015

Ah yes, that log is with this PR applied; it wasn't retrying before. The original error looks something like this:

2015/10/28 22:22:53 DEBUG: Response s3/UploadPart Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 400 Bad Request
Connection: close
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Wed, 28 Oct 2015 22:22:33 GMT
Server: AmazonS3
X-Amz-Id-2: R/aUkurBzKrh4KosjiRw248a4TPF42ZvzfuWEtxlJM2Z/xD/M3PWG+xJ/0ax+Xm45iOqDWXN9NU=
X-Amz-Request-Id: 290F8AB76C12C5D2


2015/10/28 22:22:53 DEBUG: Validate Response s3/UploadPart failed, not retrying, error RequestTimeout: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
        status code: 400, request id:

I am still very baffled why this error happens at all.


jasdel commented Oct 29, 2015

From what I can find, this error occurs when the service does not read the full Content-Length specified in your request. Specifically, it can occur when the Content-Length is greater than the number of bytes actually sent in the request.

Sometimes this can happen when the SDK retries a request that failed but, for some reason, did not rewind the request body, or when data corruption occurs. Are you able to reproduce this bug consistently? I don't see any reason not to take in this change, but I want to make sure there isn't another issue in the SDK that will be hidden by the retries.
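
To illustrate the rewind point: a request body has to be seeked back to the start before a retry is sent; otherwise the retry transmits fewer bytes than the declared Content-Length. A standalone sketch of the idea (this is not the SDK's actual retry code, and sendOnce is a made-up stand-in for a transport that fails partway through):

package main

import (
	"bytes"
	"fmt"
	"io"
)

// sendOnce stands in for a transport: it consumes part of the body and then fails.
func sendOnce(body io.Reader) error {
	io.CopyN(io.Discard, body, 1024)
	return fmt.Errorf("simulated network error")
}

func main() {
	buf := make([]byte, 5*1024*1024)
	body := bytes.NewReader(buf) // an io.ReadSeeker, so it can be rewound

	for attempt := 0; attempt < 3; attempt++ {
		// Rewind before every attempt. Skipping this leaves the reader where the
		// previous attempt stopped, so the retry sends fewer bytes than Content-Length.
		if _, err := body.Seek(0, io.SeekStart); err != nil {
			panic(err)
		}
		if err := sendOnce(body); err != nil {
			fmt.Println("attempt", attempt, "failed:", err)
			continue
		}
		break
	}
}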

Related to:


kahing commented Oct 29, 2015

I see this very often. Initially I omitted the Content-Length and relied on the SDK to figure it out (the Body is a bytes.NewReader([]byte)), but both that and manually setting ContentLength to len(buf) gave the same result.

If I provide my own reader and log from WriteTo(), I can see that net/http's Client reads from my buffer 4 KB at a time and at some point stops; then, usually after another 43 seconds or so, I get the error.
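
A sketch of that kind of instrumentation (the wrapper below is illustrative, not the code referenced above): wrapping the part buffer in a reader that logs every Read makes it easy to see where the transfer stalls.

package main

import (
	"bytes"
	"io"
	"log"
)

// loggingReader wraps an io.Reader and logs how many bytes each Read returns.
type loggingReader struct {
	r     io.Reader
	total int64
}

func (l *loggingReader) Read(p []byte) (int, error) {
	n, err := l.r.Read(p)
	l.total += int64(n)
	log.Printf("read %d bytes (total %d), err=%v", n, l.total, err)
	return n, err
}

func main() {
	buf := make([]byte, 5*1024*1024)
	body := &loggingReader{r: bytes.NewReader(buf)}
	// Stand-in for the HTTP client draining the request body.
	if _, err := io.Copy(io.Discard, body); err != nil {
		log.Fatal(err)
	}
}

Note that the SDK expects an io.ReadSeeker body, so a wrapper like this would also need a Seek method before it could be passed as UploadPartInput.Body.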


kahing commented Oct 29, 2015

I should add that my buffers are always 5 MB, and usually the transfer stops after a few hundred KB.


jasdel commented Oct 29, 2015

So it sounds like in your case the full 5 MB is not being uploaded. Is this correct? Something happens that causes only part of the bytes to be sent. Are you connecting through a proxy? Also, is your app running on a server such as an EC2 instance, or locally on a dev box?


kahing commented Oct 29, 2015

This is running on an EC2 c4.xlarge in us-west-2a, connecting to a bucket in us-west-2.


jasdel commented Oct 29, 2015

Thanks for the info! Pulling in the PR.

I would not have expected that delay on an EC2 instance. Out of curiosity, is your data coming from an os.File or some other buffer? Is it possible the provider of the buffer is not supplying data as fast as expected, or gets stuck? Have you tried to send just something like bytes.NewBuffer(make([]byte, 5 * 1024 * 1024)) instead, to see if the issue still occurs?
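
A sketch of that experiment, assuming a later aws-sdk-go v1 layout; the bucket, key, and part number below are taken from the logs, the upload ID is a placeholder, and bytes.NewReader is used because UploadPartInput.Body wants an io.ReadSeeker:

package main

import (
	"bytes"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	svc := s3.New(sess)

	// Synthetic 5 MiB part: rules out the data source as the cause of the stall.
	buf := make([]byte, 5*1024*1024)

	out, err := svc.UploadPart(&s3.UploadPartInput{
		Bucket:     aws.String("goofys"),             // bucket from the logs
		Key:        aws.String("test_dir/largefile"), // key from the logs
		PartNumber: aws.Int64(29),
		UploadId:   aws.String("EXAMPLE-UPLOAD-ID"),  // placeholder upload ID
		Body:       bytes.NewReader(buf),             // io.ReadSeeker, so retries can rewind
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("uploaded part, ETag:", aws.StringValue(out.ETag))
}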


kahing commented Oct 29, 2015

I do not expect the delay either, and the buffer is a regular []byte. It seems to happen more over HTTPS and less (or not at all) over HTTP. I just captured a tcpdump and will look into it soon.


jasdel commented Oct 29, 2015

Merged in this PR, ae0b628
