S3 sometimes does not complete downloads #312
This is not an issue with the SDK but rather with general usage of S3. If you send many parallel requests, eventually S3 will throttle your access and kill your connection. One way you can mitigate the failures is by using HTTP's Expect: 100-continue header, which S3 supports:

request.on('build', function() {
  request.httpRequest.headers['Expect'] = '100-continue';
});

This will send headers and wait for a server response prior to attempting to upload/download any data, thereby allowing S3 to return any errors before accepting your connection. If this header is not passed, S3's only choice is to cut your connection off in the middle of the transfer, causing the behavior you are seeing. That said, you should generally expect abrupt disconnections for a variety of reasons (bad internet connection, routing issues, etc.), so you should design your application to recover from any download failures. The SDK has retry logic for most operations out of the box, which is why you generally don't see these failures, but retry logic is disabled for the raw stream interface. Hope that helps. |
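A minimal sketch of the difference being described here, assuming hypothetical bucket and key names: the buffered getObject callback goes through the SDK's normal retry handling for retryable failures, while the raw createReadStream path does not, so stream errors have to be handled by the application.

var AWS = require('aws-sdk');

// Hypothetical bucket/key names, for illustration only.
var params = { Bucket: 'my-bucket', Key: 'my-key' };

// Buffered form: retryable failures are retried by the SDK itself
// (maxRetries is a standard client option).
var s3 = new AWS.S3({ maxRetries: 5 });
s3.getObject(params, function(err, data) {
  if (err) return console.error('failed after retries:', err.message);
  console.log('got', data.Body.length, 'bytes');
});

// Raw stream form: no automatic retries, so an aborted connection has to
// be caught and handled (e.g. re-requested) by the application itself.
var stream = new AWS.S3().getObject(params).createReadStream();
stream.on('error', function(err) {
  console.error('stream download failed:', err.message);
});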
What's the best number of parallel requests to maximize bandwidth? |
The number will depend heavily on many different factors-- bandwidth, latency, I/O, and possibly even CPU. There is no exact value. An arbitrary value to start with might be 5, but you can do some tests to see what works best for you. |
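As a rough way to run such a test, one could time a fixed number of downloads while keeping N requests in flight and compare runs for different N. This is only a sketch with made-up bucket/key names, not a rigorous benchmark:

var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// Hypothetical values, for illustration only.
var params = { Bucket: 'my-bucket', Key: 'my-key' };
var CONCURRENCY = 5;  // the parallelism level being tested
var TOTAL = 50;       // downloads per run

var started = 0, finished = 0;
var t0 = Date.now();

function startOne() {
  if (started >= TOTAL) return;
  started++;
  s3.getObject(params, function(err) {
    if (err) console.error('download failed:', err.message);
    finished++;
    if (finished === TOTAL) {
      console.log(CONCURRENCY + ' in flight: ' + (Date.now() - t0) + 'ms');
    } else {
      startOne(); // keep the pipeline full
    }
  });
}

for (var i = 0; i < CONCURRENCY; i++) startOne();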
Wait wait wait. In this example code I am not sending any parallel requests. It is sending one request at a time. And I'm seeing this aborted request happening quite frequently. Parallel requests is not an explanation for the behavior I'm seeing. |
Also regarding retry logic - this example code doesn't have it but the real code does, and I was seeing a request sometimes fail three times in a row, each with a second delay before retrying. It should not be this spotty. |
Parallel requests are not the only way to hit S3's throttling limit; they are just an easier way. Downloading many files at once can also hit that limit. If you add the Expect 100-continue header to your requests, do you see any error returned by S3? |
How is this different than "parallel requests"? Again, this test code downloads a single file at once, then repeats.
I observe no difference:
S3 still gives a 200 status code and then aborts the stream early. I haven't done anything with S3 in many hours. I download a file once, and then once that is done I try to download it again and the download is aborted midway. There's no way I hit a real limit here. |
What version of Node.js are you using? I'm trying to reproduce this with no luck:
One thing I will note is that if S3 is actually killing the connection mid-way, this is likely not an SDK issue. If you are getting a 'close' event, that typically means Node.js itself is seeing the connection closing, not the SDK, which means the connection is being severed outside of the SDK's control. It's possible the Node.js version is related-- it's also possible that the way the SDK is setting up the connection is incorrect, but without being able to reproduce, it's hard to say for sure. |
v0.10.29 |
I'm confused as to why you're unable to duplicate the issue. It happens regularly and frequently for me. The most "downloaded nnn" number I've ever reached was 40. For what it's worth, here's a slightly simplified test case that only checks file size and removes the etag/md5 calculation from the equation:

var AWS = require('aws-sdk');
var fs = require('fs');

var client = new AWS.S3({
  accessKeyId: process.env.S3_KEY,
  secretAccessKey: process.env.S3_SECRET,
});

var s3Path = "1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json";
var localFile = "test_out.json";
var s3Params = {
  Bucket: process.env.S3_BUCKET,
  Key: s3Path,
};

var count = 1;
downloadOnce();

// Download the same object over and over until a failure occurs.
function downloadOnce() {
  doTheDownload(function(err) {
    if (err) throw err;
    console.log("downloaded", count++);
    downloadOnce();
  });
}

function doTheDownload(cb) {
  var request = client.getObject(s3Params);
  var response = request.createReadStream();
  var outStream = fs.createWriteStream(localFile);
  var errorOccurred = false;
  var contentLength;

  response.on('error', handleError);
  outStream.on('error', handleError);

  request.on('build', function() {
    request.httpRequest.headers.Expect = '100-continue';
  });

  request.on('httpHeaders', function(statusCode, headers, resp) {
    if (statusCode < 300) {
      contentLength = parseInt(headers['content-length'], 10);
    } else {
      handleError(new Error("http status code " + statusCode));
    }
  });

  // After the file is flushed, compare the bytes on disk with Content-Length.
  outStream.on('close', function() {
    if (errorOccurred) return;
    fs.stat(localFile, function(err, stat) {
      if (err) return handleError(err);
      if (stat.size !== contentLength) return handleError(new Error("size mismatch"));
      cb();
    });
  });

  response.pipe(outStream);

  function handleError(err) {
    if (errorOccurred) return;
    errorOccurred = true;
    cb(err);
  }
}
|
I'm also trying to do the same file download with s3cmd and it's working fine. I haven't seen a single one fail in 200+ downloads.
I ran the node code simultaneously to this and I've reproduced several failures in the node code all the while s3cmd is humming away, no problem, no retries. |
I'm running your simplified case with the same results. I'm currently on 875 and it's going smoothly. I grabbed the same all_npm.json file you are using just to ensure that the data isn't the culprit, too. Based on the code, I'm still not seeing anything pointing to the SDK. The 'close' event is coming from a plain ReadableStream object initiated by Node.js, not the SDK. If I could reproduce this on my end I could do some debugging to see if there's any discrepancy in how we forward certain events, but I'm not sure that would cause the socket to disconnect-- we are simply reading the events that come down off of the stream. |
Sorry I just noticed that pipe() is being used here, so I edited out my point about data/readable events. Node's pipe() should use the right thing. I'm taking a look at using the HTTP stream directly. I'll post code for you to try in a bit. |
What is different about our environments? Bucket? Node.js version? Network distance between local computer and S3 region? Operating system (Ubuntu 14.04 LTS 64-bit)? This issue is perfectly reproducible for me. The culprit must be in the delta between our environments. |
I'm running on OSX. My bucket is in the classic region, and I'm on the west coast. Using Node.js 0.10.28. I'll try 0.10.29 in a sec here, I do recall some streaming related changes in that patchlevel release. |
Related? nodejs/node-v0.x-archive@2efe4ab761666 (Note the comments at the bottom and mscdex/busboy@8aea552) |
Oh cool. I was able to reproduce this in 0.10.29. Have you tried 0.10.28 by any chance? |
Nope, I upgraded to 0.10.29 as soon as it came out and never looked back :P I'll try it. |
I don't think this would be related-- I don't think we ever call |
I can't seem to reproduce reliably (it got up to 200+ on the second attempt) in 0.10.29, but I've gotten it to fail more than a handful of times now. I can't seem to get it to fail at all in 0.10.28. I will spend some time looking into what changed between these versions, but I'm leaning toward some unexpected regression in Node.js, unless we are using the API incorrectly somehow. |
I'm at 400+ with v0.10.28. |
There may be something related to nodejs/node-v0.x-archive@31150df |
Although we are using 'end' instead of 'finish', which seems to be the right thing. I'm looking through nodejs/node-v0.x-archive@v0.10.28...v0.10.29 to see if anything stands out. |
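For reference (plain Node.js, nothing SDK-specific): 'end' is emitted by the readable side when the source data is exhausted, while 'finish' is emitted by the writable side once end() has been called and everything has been flushed. A quick way to see both, assuming a local blah.txt exists:

var fs = require('fs');

var readable = fs.createReadStream('blah.txt');
var writable = fs.createWriteStream('blah-copy.txt');

// Fires on the readable side when there is no more data to read.
readable.on('end', function() {
  console.log('readable: end');
});

// Fires on the writable side after end() and a final flush to disk.
writable.on('finish', function() {
  console.log('writable: finish');
});

readable.pipe(writable);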
Cool, glad we're on the right track. I'm going back to v0.10.29 now. It is the latest "stable" version after all. |
Won't that cause the issue to reappear until this is fixed? It's still unclear if the issue is an SDK bug or Node.js regression. If this is a Node.js regression, 0.10.29 is probably not the right version to be on. |
I've git bisected it to this commit: nodejs/node-v0.x-archive@c862c03 |
See https://github.com/orangemocha/node-connection-drop. I have pinpointed the performance degradation to ac2263b. This change brings performance back to the original levels. Signed-off-by: Fedor Indutny <[email protected]>
I tested v0.10.29 with that commit reverted and it fixes it. I've also tried setting |
Thanks for the deep dive @andrewrk! I will do some checking to see why that bit us and if we were incorrectly depending on API behavior. It looks like there is something related to timing (async vs. sync), but I'm not entirely sure what just yet. |
FWIW the actual issue here seems to be that I'm not entirely sure how the commit you found affects this behavior, and also why that happens, since those |
On Jul 8, 2014 5:50 PM, "Loren Segal" [email protected] wrote:
This is what the readable-stream module is for. Code your app for 0.10.x
This statement concerns me. Specifically, you should not iterate over the events and blithely re-emit them. If the tests are fixed now I can provide a patch that works for both 0.8 and 0.10. |
The events aren't exactly being blindly emitted. You'll notice that in the sample, all of the events are actually listened to. In all usages of https://gist.github.com/lsegal/e6966a03bcbf077724d1 (diff)
The tests have been working since Friday with the workaround provided in #313, and fixed today. I'd be happy to take a patch, but note that working around this is not the only concern; the other is understanding what the actual contract for Readable streams is. As I mentioned, it would be easy enough to switch to PassThrough and use something like readable-stream (which only became stable long after we implemented |
I just looked at your gist. This code is no good:

Object.keys(stream._events).forEach(function(event) {
  resp.on(event, function(arg) {
    console.log(event);
    stream.emit(event, arg);
  });
});

Nowhere in the API does it say you can read the _events property of a stream.

stream.on('newListener', function(event) {
  if (!sent && event === 'readable') {
    sent = true;
    process.nextTick(req.send);
  }
});

This is scary, error-prone, and unnecessary.

// Taken from:
// http://nodejs.org/api/stream.html#stream_example_simpleprotocol_v1_sub_optimal
// if the source doesn't have data, we don't have data yet.
if (chunk === null) stream.push('');

This hack is unnecessary if you use the API correctly. Just pipe everything and let Node.js handle all the tricky details.
This is trivially proven wrong:

var fs = require('fs');
var inStream = fs.createReadStream("blah.txt");
var PassThrough = require('stream').PassThrough;
var outStream = new PassThrough();
outStream.on('readable', function() {
console.log("got readable");
});
outStream.on('end', function() {
console.log("got end");
});
inStream.pipe(outStream);
var writeStream = fs.createWriteStream("blah-out.txt");
outStream.pipe(writeStream);

Output:
The code example I gave you earlier is simple and effective. Delete all the weird hacks, just use pipe, and everything is fixed. There's no bug in Node.js. The mystery has been solved. The bug is in aws-sdk-js. Please fix your code. |
Thanks for your patience on this @andrewrk and @jamessharp. The above commit is the shortest path fix to get things working. I plan on doing a refactor to use PassThrough with readable-stream for backward compatibility support for 0.8.x, but a first pass broke a few tests due to some of our mocks being a little too brittle for the change. |
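Not the actual SDK patch, but a rough sketch of the shape being described: pipe the raw HTTP response into a PassThrough (taken from readable-stream for 0.8.x compatibility) and hand that stream to the caller, so no events need to be re-emitted by hand. The URL and wrapper name here are made up for illustration.

var http = require('http');
var PassThrough = require('readable-stream').PassThrough;

// Hypothetical helper: expose any raw HTTP response as a streams2 readable.
function wrapResponse(url) {
  var out = new PassThrough();
  http.get(url, function(res) {
    if (res.statusCode >= 300) {
      out.emit('error', new Error('HTTP status ' + res.statusCode));
      return;
    }
    // pipe() takes care of backpressure, 'end', and cleanup.
    res.pipe(out);
  }).on('error', function(err) {
    out.emit('error', err);
  });
  return out;
}

// Usage: consumers just pipe the returned stream wherever they need it.
wrapResponse('http://example.com/').pipe(process.stdout);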
Thanks for solving the issue :) Looking forward to the next release. |
v2.0.6 is now out with these changes. |
👍 |
var params = {
  Bucket: 'my_bucket',
  Key: 'my_key'
}
s3.headObject(params, function(err, resp) {
  console.log('you will never see me!')
})

Maybe the problem is somehow related to the current issue, so I'll just leave it here. |
@ybogdanov this issue was specifically related to the use of streaming APIs in the SDK ( |
AWS S3, from which we download pre-built pkgpanda packages, might sometimes end the connection before the package has been fully downloaded. (See aws/aws-sdk-js#312 and boto/boto3#540 for more context on this.)

In our case that leads to invalid `tar.xz` archives which cause `unexpected end of input` errors and result in faulty builds of DC/OS (`dcos_generate_config.sh`) which contain these invalid archives and then result in install-time errors for users. See this comment for an investigation into how this can affect installation errors: https://jira.mesosphere.com/browse/DCOS_OSS-4097?focusedCommentId=219259&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-219259

This change adds a sanity check to the `download` function used by pkgpanda. It checks whether the final file size (after flushing the file to disk) matches the `Content-Length` we get from S3. If it doesn't match, it sleeps for 2, 4, 8, ... seconds and then retries until it either downloads the file completely or the retries are exhausted, in which case it raises an exception.
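The pkgpanda fix itself is Python, but the same idea expressed against the SDK usage discussed in this thread might look roughly like the following sketch (bucket, key, and file names are made up): compare the on-disk size to Content-Length and retry with exponential backoff.

var AWS = require('aws-sdk');
var fs = require('fs');
var s3 = new AWS.S3();

// Hypothetical bucket/key/file names, for illustration only.
var params = { Bucket: 'my-bucket', Key: 'packages/example.tar.xz' };
var localFile = 'example.tar.xz';

function downloadAndVerify(attempt, maxAttempts, cb) {
  var request = s3.getObject(params);
  var contentLength;
  var failed = false;

  request.on('httpHeaders', function(statusCode, headers) {
    contentLength = parseInt(headers['content-length'], 10);
  });

  var stream = request.createReadStream();
  var out = fs.createWriteStream(localFile);

  function retryOrGiveUp(err) {
    if (attempt >= maxAttempts) return cb(err);
    var delay = Math.pow(2, attempt) * 1000; // back off 2, 4, 8, ... seconds
    setTimeout(function() {
      downloadAndVerify(attempt + 1, maxAttempts, cb);
    }, delay);
  }

  function handleError(err) {
    if (failed) return;
    failed = true;
    out.end(); // make sure the partial file is flushed before retrying
    retryOrGiveUp(err);
  }

  stream.on('error', handleError);
  out.on('error', handleError);

  // After the file is flushed, check the byte count against Content-Length.
  out.on('close', function() {
    if (failed) return;
    fs.stat(localFile, function(err, stat) {
      if (err) return handleError(err);
      if (stat.size !== contentLength) return handleError(new Error('size mismatch'));
      cb(null);
    });
  });

  stream.pipe(out);
}

downloadAndVerify(1, 5, function(err) {
  if (err) throw err;
  console.log('verified download of', localFile);
});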
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread. |
I put together a test case for you:
Edit the s3Path variable to be some file on S3. My example file is 362KB. Here's an example run:
So the first 37 times it downloaded fine, but the 38th time we downloaded the file, AWS SDK said that the file was complete, when it really wasn't. If I go look at test_out.json on my computer, the first 356KB are downloaded correctly and then the file is truncated.
I am seeing this problem frequently.