S3 sometimes does not complete downloads #312

Closed
andrewrk opened this issue Jul 2, 2014 · 43 comments


andrewrk commented Jul 2, 2014

I put together a test case for you:

var AWS = require('aws-sdk');
var fs = require('fs');
var crypto = require('crypto');


var client = new AWS.S3({
    accessKeyId: process.env.S3_KEY,
    secretAccessKey: process.env.S3_SECRET,
});

var s3Path = "1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json";
var localFile = "test_out.json";
var s3Params = {
  Bucket: process.env.S3_BUCKET,
  Key: s3Path,
};
var count = 1;
downloadOnce();
function downloadOnce() {
  doTheDownload(function(err) {
    if (err) throw err;
    console.log("downloaded", count++);
    downloadOnce();
  });
}

function doTheDownload(cb) {
  var request = client.getObject(s3Params);
  var response = request.createReadStream();
  var outStream = fs.createWriteStream(localFile);
  var hash = crypto.createHash('md5');
  var errorOccurred = false;
  var eTag = "";

  response.on('error', handleError);
  outStream.on('error', handleError);

  request.on('httpHeaders', function(statusCode, headers, resp) {
    if (statusCode < 300) {
      var contentLength = parseInt(headers['content-length'], 10);
      eTag = headers.etag || "";
    } else {
      handleError(new Error("http status code " + statusCode));
    }
  });

  hash.on('data', function(digest) {
    if (!compareETag(eTag, digest)) {
      console.log("eTag", eTag, "digest", digest.toString('hex'), "path:", localFile);
      handleError(new Error("ETag does not match MD5 checksum"));
    }
  });

  outStream.on('close', function() {
    if (errorOccurred) return;
    cb();
  });

  response.pipe(outStream);
  response.pipe(hash);

  function handleError(err) {
    if (errorOccurred) return;
    errorOccurred = true;
    cb(err);
  }
}

function compareETag(eTag, md5Buffer) {
  eTag = eTag.replace(/^\s*'?\s*"?\s*(.*?)\s*"?\s*'?\s*$/, "$1");
  var hex = md5Buffer.toString('hex');
  return eTag === hex;
}

Edit the s3Path variable to point at some file in your bucket. My example file is 362KB.

Here's an example run:

$ S3_BUCKET='examplebucket' S3_KEY='examplekey' S3_SECRET='examplesecret' node test.js
downloaded 1
downloaded 2
downloaded 3
downloaded 4
downloaded 5
downloaded 6
downloaded 7
downloaded 8
downloaded 9
downloaded 10
downloaded 11
downloaded 12
downloaded 13
downloaded 14
downloaded 15
downloaded 16
downloaded 17
downloaded 18
downloaded 19
downloaded 20
downloaded 21
downloaded 22
downloaded 23
downloaded 24
downloaded 25
downloaded 26
downloaded 27
downloaded 28
downloaded 29
downloaded 30
downloaded 31
downloaded 32
downloaded 33
downloaded 34
downloaded 35
downloaded 36
downloaded 37
eTag "abac71913645791ed7a98e4461ee1a71" digest 70e838acf6649dc50bbac3e40272a674 path: test_out.json

/home/andy/tmp/npmtest/test.js:21
    if (err) throw err;
                   ^
Error: ETag does not match MD5 checksum
    at Hash.<anonymous> (/home/andy/tmp/npmtest/test.js:50:19)
    at Hash.emit (events.js:95:17)
    at Hash.<anonymous> (_stream_readable.js:748:14)
    at Hash.emit (events.js:92:17)
    at emitReadable_ (_stream_readable.js:410:10)
    at emitReadable (_stream_readable.js:406:5)
    at readableAddChunk (_stream_readable.js:168:9)
    at Hash.Readable.push (_stream_readable.js:130:10)
    at Hash.Transform.push (_stream_transform.js:140:32)
    at Hash._flush (crypto.js:201:8)

So the first 37 times it downloaded fine, but the 38th time we downloaded the file, AWS SDK said that the file was complete, when it really wasn't. If I go look at test_out.json on my computer, the first 356KB are downloaded correctly and then the file is truncated.

I am seeing this problem frequently.


lsegal commented Jul 2, 2014

This is not an issue with the SDK but rather with general usage of S3. If you send many parallel requests, eventually S3 will throttle your access and kill your connection.

One way you can mitigate the failures is by using HTTP's Expect 100-continue header, which S3 supports:

request.on('build', function() {
  request.httpRequest.headers['Expect'] = '100-continue';
});

This will send headers and wait for a server response prior to attempting to upload/download any data, thereby allowing S3 to return with any errors prior to accepting your connection. If this is not passed, S3's only choice is to cut your connection off in the middle of transfer, causing the behavior you are seeing.

That said, you should generally expect that you may encounter abrupt disconnections for a variety of reasons (bad internet connection, routing issues, etc.), so you should design your application to recover from any download failures. The SDK has retry logic for most operations out of the box, which is why you generally don't see these failures, but retry logic is disabled for the raw createReadStream() operation, since it is just that: a raw low-level socket operation.
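
As an illustration of that advice, a minimal retry wrapper might look like the following (a sketch only; downloadWithRetry and its arguments are hypothetical names, not SDK API, and the backoff values are arbitrary):

// Sketch: retry a streamed download a few times with a growing delay.
// `downloadFile` is assumed to be a function like doTheDownload() above
// that calls back with an error on truncation or checksum mismatch.
function downloadWithRetry(downloadFile, maxRetries, cb) {
  var attempt = 0;
  tryOnce();
  function tryOnce() {
    downloadFile(function(err) {
      if (!err) return cb();
      attempt += 1;
      if (attempt >= maxRetries) return cb(err);
      // back off 1s, 2s, 4s, ... before the next attempt
      setTimeout(tryOnce, 1000 * Math.pow(2, attempt - 1));
    });
  }
}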

Hope that helps.


andrewrk commented Jul 2, 2014

What's the best number of parallel requests to maximize bandwidth?


lsegal commented Jul 2, 2014

The number will depend heavily on many different factors-- bandwidth, latency, I/O, and possibly even CPU. There is no exact value. An arbitrary value to start with might be 5, but you can do some tests to see what works best for you.
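
For illustration, one way to cap the number of in-flight requests is a small worker-style loop like this (a sketch under the assumption that downloadKey(key, cb) performs one download; the limit of 5 is just the arbitrary starting point mentioned above):

// Sketch: run at most `limit` downloads at a time over a list of keys.
function downloadAllLimited(keys, limit, downloadKey, done) {
  if (keys.length === 0) return done(null);
  var inFlight = 0;
  var nextIndex = 0;
  var finished = 0;
  var failed = false;
  function pump() {
    while (!failed && inFlight < limit && nextIndex < keys.length) {
      inFlight += 1;
      downloadKey(keys[nextIndex++], function(err) {
        inFlight -= 1;
        finished += 1;
        if (err && !failed) { failed = true; return done(err); }
        if (!failed && finished === keys.length) return done(null);
        pump();
      });
    }
  }
  pump();
}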


andrewrk commented Jul 2, 2014

Wait wait wait. In this example code I am not sending any parallel requests. It is sending one request at a time. And I'm seeing this aborted request happening quite frequently.

Parallel requests is not an explanation for the behavior I'm seeing.


andrewrk commented Jul 2, 2014

Also regarding retry logic - this example code doesn't have it but the real code does, and I was seeing a request sometimes fail three times in a row, each with a second delay before retrying. It should not be this spotty.


lsegal commented Jul 2, 2014

Parallel requests are not the only way to hit S3's throttling limit, they are just an easier way. Downloading many files at once can also hit that limit.

If you add the Expect 100-continue header to your requests, do you see any error returned by S3?


andrewrk commented Jul 2, 2014

> Downloading many files at once can also hit that limit.

How is this different than "parallel requests"? Again, this test code downloads a single file at once, then repeats.

> If you add the Expect 100-continue header to your requests, do you see any error returned by S3?

I observe no difference:

$ node test.js
downloaded 1
eTag "abac71913645791ed7a98e4461ee1a71" digest 76b63bd096db02bcc83a092c1d479307 path: test_out.json

/home/andy/tmp/npmtest/test.js:21
    if (err) throw err;
                   ^
Error: ETag does not match MD5 checksum

S3 still gives a 200 status code and then aborts the stream early.

I haven't done anything with S3 in many hours. I download a file once, and then once that is done I try to download it again and the download is aborted midway. There's no way I hit a real limit here.


lsegal commented Jul 2, 2014

What version of Node.js are you using? I'm trying to reproduce this with no luck:

...
downloaded 3411
downloaded 3412
downloaded 3413
downloaded 3414
...

One thing I will note is that if S3 is actually killing the connection mid-way, this is likely not an SDK issue. If you are getting a 'close' event, that typically means Node.js itself is seeing the connection closing, not the SDK, which means the connection is being severed outside of the SDK's control.

It's possible the Node.js version is related-- it also is possible that the way the SDK is setting up the connection is incorrect, but without being able to reproduce, it's hard to say for sure.


andrewrk commented Jul 2, 2014

> What version of Node.js are you using?

v0.10.29


andrewrk commented Jul 2, 2014

I'm confused as to why you're unable to duplicate the issue. It happens regularly and frequently for me. The most "downloaded nnn" number I've ever reached was 40.

For what it's worth, here's a slightly simplified test case that only checks file size and removes the etag/md5 calculation from the equation:

var AWS = require('aws-sdk');
var fs = require('fs');

var client = new AWS.S3({
    accessKeyId: process.env.S3_KEY,
    secretAccessKey: process.env.S3_SECRET,
});

var s3Path = "1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json";
var localFile = "test_out.json";
var s3Params = {
  Bucket: process.env.S3_BUCKET,
  Key: s3Path,
};
var count = 1;
downloadOnce();
function downloadOnce() {
  doTheDownload(function(err) {
    if (err) throw err;
    console.log("downloaded", count++);
    downloadOnce();
  });
}

function doTheDownload(cb) {
  var request = client.getObject(s3Params);
  var response = request.createReadStream();
  var outStream = fs.createWriteStream(localFile);
  var errorOccurred = false;
  var contentLength;

  response.on('error', handleError);
  outStream.on('error', handleError);

  request.on('build', function() {
    request.httpRequest.headers.Expect = '100-continue';
  });

  request.on('httpHeaders', function(statusCode, headers, resp) {
    if (statusCode < 300) {
      contentLength = parseInt(headers['content-length'], 10);
    } else {
      handleError(new Error("http status code " + statusCode));
    }
  });

  outStream.on('close', function() {
    if (errorOccurred) return;
    fs.stat(localFile, function(err, stat) {
      if (err) return handleError(err);
      if (stat.size !== contentLength) return handleError(new Error("size mismatch"));
      cb();
    });
  });

  response.pipe(outStream);

  function handleError(err) {
    if (errorOccurred) return;
    errorOccurred = true;
    cb(err);
  }
}

Output:

...
downloaded 24
downloaded 25
downloaded 26
downloaded 27

/home/andy/tmp/npmtest/test.js:20
    if (err) throw err;
                   ^
Error: size mismatch
    at /home/andy/tmp/npmtest/test.js:54:59
    at Object.oncomplete (fs.js:107:15)

@andrewrk
Copy link
Author

andrewrk commented Jul 2, 2014

I'm also trying to do the same file download with s3cmd and it's working fine. I haven't seen a single one fail in 200+ downloads.

while true;
do rm -f test_out.json;
  s3cmd get s3://mybucket/1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json test_out2.json;
  printf "%d" $?;
  md5sum test_out2.json;
done

I ran the node code simultaneously to this and I've reproduced several failures in the node code all the while s3cmd is humming away, no problem, no retries.


lsegal commented Jul 2, 2014

I'm running your simplified case with the same results as before. I'm currently at 875 and it's going smoothly. I grabbed the same all_npm.json file you are using, just to ensure that the data isn't the culprit, too.

Based on the code, I'm still not seeing anything pointing to the SDK. The 'close' event is coming from a plain ReadableStream object initiated by Node.js, not the SDK. If I could reproduce this on my end I could do some debugging to see if there's any discrepancy in how we forward certain events, but I'm not sure that would cause the socket to disconnect-- we are simply reading the events that come down off of the stream.


lsegal commented Jul 2, 2014

Sorry, I just noticed that pipe() is being used here, so I edited out my point about data/readable events. Node's pipe() should do the right thing.

I'm taking a look at using the HTTP stream directly. I'll post code for you to try in a bit.


andrewrk commented Jul 2, 2014

What is different about our environments? Bucket? Node.js version? Network distance between local computer and S3 region? Operating system (Ubuntu 14.04 LTS 64-bit)?

This issue is perfectly reproducible for me. The culprit must be in the delta between our environments.


lsegal commented Jul 2, 2014

I'm running on OS X. My bucket is in the classic region, and I'm on the west coast. Using Node.js 0.10.28. I'll try 0.10.29 in a sec here; I do recall some streaming-related changes in that patch-level release.


andrewrk commented Jul 2, 2014

Related? nodejs/node-v0.x-archive@2efe4ab761666

(Note the comments at the bottom and mscdex/busboy@8aea552)


lsegal commented Jul 2, 2014

Oh cool. I was able to reproduce this in 0.10.29. Have you tried 0.10.28 by any chance?


andrewrk commented Jul 2, 2014

Nope, I upgraded to 0.10.29 as soon as it came out and never looked back :P

I'll try it.


lsegal commented Jul 2, 2014

> Related? nodejs/node-v0.x-archive@2efe4ab

I don't think this would be related-- I don't think we ever call .read() with .on('data') events-- unless Node is doing this internally. But also, this would only hide an 'error' event, which would mean the error already occurred (not what we want to be seeing anyway).


lsegal commented Jul 2, 2014

I can't seem to reproduce this reliably in 0.10.29 (it got up to 200+ on the second attempt), but I've gotten it to fail more than a handful of times now. I can't seem to get it to fail at all in 0.10.28. I will spend some time looking into what changed between these versions, but I'm leaning toward some unexpected regression in Node.js, unless we are using the API incorrectly somehow.


andrewrk commented Jul 2, 2014

I'm at 400+ with v0.10.28.


lsegal commented Jul 2, 2014

There may be something related to nodejs/node-v0.x-archive@31150df


lsegal commented Jul 2, 2014

Although we are using 'end' instead of 'finish', which seems to be the right thing. I'm looking through nodejs/node-v0.x-archive@v0.10.28...v0.10.29 to see if anything stands out.


andrewrk commented Jul 2, 2014

Cool, glad we're on the right track. I'm going back to v0.10.29 now. It is the latest "stable" version after all.


lsegal commented Jul 2, 2014

> I'm going back to v0.10.29 now.

Won't that cause the issue to reappear until this is fixed? It's still unclear if the issue is an SDK bug or Node.js regression. If this is a Node.js regression, 0.10.29 is probably not the right version to be on.


andrewrk commented Jul 2, 2014

I've git bisected it to this commit: nodejs/node-v0.x-archive@c862c03

andrewrk referenced this issue in nodejs/node-v0.x-archive Jul 2, 2014
See https://github.com/orangemocha/node-connection-drop

I have pinpointed the performance degradation to
ac2263b

This change brings performance back to the original levels.

Signed-off-by: Fedor Indutny <[email protected]>

andrewrk commented Jul 2, 2014

I tested v0.10.29 with that commit reverted and it fixes it. I've also tried setting sslEnabled to false in the test case and it fixes it.
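
For reference, the sslEnabled change amounts to this in the test client (sslEnabled is an aws-sdk v2 client option; turning it off here is only a diagnostic step to bypass the TLS code path touched by that Node commit, not a recommendation):

var client = new AWS.S3({
  accessKeyId: process.env.S3_KEY,
  secretAccessKey: process.env.S3_SECRET,
  sslEnabled: false  // diagnostic only: requests go over plain HTTP
});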


lsegal commented Jul 3, 2014

Thanks for the deep dive @andrewrk! I will do some checking to see why that bit us and if we were incorrectly depending on API behavior. It looks like there is something related to timing (async vs. sync), but I'm not entirely sure what just yet.


lsegal commented Jul 9, 2014

FWIW the actual issue here seems to be that readable events fire off right before the end event fires. You will see this behavior every time the download fails (multiple readable events quickly followed by end). It seems like the end event kills the reads before the stream is fully flushed-- that would be my best guess explanation.

I'm not entirely sure how the commit you found affects this behavior, and also why that happens, since those readable events should fire synchronously before end, but I definitely think that's the root cause of this issue.
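
A quick way to observe that ordering on a failing run is to log the events on the stream returned by createReadStream (purely a diagnostic sketch; request is the same request object as in the test cases above):

// Diagnostic sketch: log the order of stream events for one download.
var response = request.createReadStream();
response.on('readable', function() { console.log('readable'); });
response.on('end', function() { console.log('end'); });
response.on('error', function(err) { console.log('error', err); });
// On a truncated download you would expect to see a burst of `readable`
// lines immediately followed by `end`, per the description above.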


andrewrk commented Jul 9, 2014

On Jul 8, 2014 5:50 PM, "Loren Segal" [email protected] wrote:

> @andrewrk I've been looking into this. Your patch could work for 0.10.x, but unfortunately PassThrough is not available in 0.8.x, so it's not possible to take that in wholesale.

This is what the readable-stream module is for. Code your app for 0.10.x and use readable-stream for PassThrough and friends in 0.8.x.

> I'm not convinced that the SDK is using the streams API incorrectly.

This statement concerns me.

Specifically you should not iterate over the events and blithely re-emit them.

If the tests are fixed now I can provide a patch that works for both 0.8 and 0.10 using readable-stream.


lsegal commented Jul 9, 2014

> Specifically you should not iterate over the events and blithely re-emit them.

The events aren't exactly being blindly emitted. You'll notice that in the sample, all of the events are actually listened to. In all usages of pipe(), both readable and end would need forwarding-- forwarding error should not have an effect, as it's only emitted if something actually goes wrong, which is exactly what we are trying to avoid. Also, changing this behavior has no effect on the outcome of the sample-- here's an updated gist that only tracks monitored events:

https://gist.github.com/lsegal/e6966a03bcbf077724d1 (diff)

If the tests are fixed now I can provide a patch that works for both 0.8 and 0.10 using readable-stream.

The tests have been working since Friday with the workaround provided in #313, and were fixed today. I'd be happy to take a patch, but note that working around this is not the only concern-- the other is understanding what the actual contract for Readable streams is. As I mentioned, it would be easy enough to switch to PassThrough and use something like readable-stream (which only became stable long after we implemented createReadStream).
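
For what it's worth, the PassThrough pattern under discussion looks roughly like this in plain Node (a sketch of the general idea only, not the SDK's actual implementation; readable-stream would be an npm dependency supplying PassThrough on 0.8.x):

var http = require('http');
// On 0.10.x PassThrough ships with core `stream`; the readable-stream
// module is the usual fallback for 0.8.x.
var PassThrough = require('stream').PassThrough;

// Sketch: pipe the raw response into a PassThrough and hand that stream
// to the caller, instead of re-emitting individual events by hand.
function getAsStream(url) {
  var out = new PassThrough();
  http.get(url, function(res) {
    if (res.statusCode < 300) {
      res.pipe(out);
    } else {
      out.emit('error', new Error('http status code ' + res.statusCode));
    }
  }).on('error', function(err) {
    out.emit('error', err);
  });
  return out;
}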


andrewrk commented Jul 9, 2014

I just looked at your gist. This code is no good:

      Object.keys(stream._events).forEach(function(event) {
        resp.on(event, function(arg) {
          console.log(event);
          stream.emit(event, arg);
        });
      });

Nowhere in the API does it say you can read the _events property. That's a private field.

  stream.on('newListener', function(event) {
    if (!sent && event === 'readable') {
      sent = true;
      process.nextTick(req.send);
    }
  });

This is scary, error prone, and unnecessary.

        // Taken from:
        // http://nodejs.org/api/stream.html#stream_example_simpleprotocol_v1_sub_optimal
        // if the source doesn't have data, we don't have data yet.
        if (chunk === null) stream.push('');

This hack is unnecessary if you use the API correctly.

Just pipe everything and let Node.js handle all the tricky details.

In all usages of pipe(), both readable and end would need forwarding

This is trivially proven wrong:

var fs = require('fs');
var inStream = fs.createReadStream("blah.txt");

var PassThrough = require('stream').PassThrough;

var outStream = new PassThrough();
outStream.on('readable', function() {
  console.log("got readable");
});
outStream.on('end', function() {
  console.log("got end");
});
inStream.pipe(outStream);
var writeStream = fs.createWriteStream("blah-out.txt");
outStream.pipe(writeStream);

Output:

got readable
got end

The code example I gave you earlier is simple and effective. Delete all the weird hacks, just use pipe, and everything is fixed.

There's no bug in Node.js. The mystery has been solved. The bug is in aws-sdk-js. Please fix your code.

lsegal closed this as completed in e0f3fce Jul 10, 2014
lsegal added a commit that referenced this issue Jul 10, 2014

lsegal commented Jul 10, 2014

Thanks for your patience on this @andrewrk and @jamessharp. The above commit is the shortest path fix to get things working. I plan on doing a refactor to use PassThrough with readable-stream for backward compatibility support for 0.8.x, but a first pass broke a few tests due to some of our mocks being a little too brittle for the change.

@andrewrk

Thanks for solving the issue :)

Looking forward to the next release.

lsegal added a commit that referenced this issue Jul 10, 2014
lsegal added a commit that referenced this issue Jul 10, 2014

lsegal commented Jul 10, 2014

v2.0.6 is now out with these changes.

@andrewrk

👍

@ybogdanov

[email protected] also breaks somewhere between node 0.10.26 and 0.10.31. I can't provide a lot of details on that; I didn't have time to reproduce it carefully. I just rolled back to 0.10.26 to keep everything working.
I'm doing HEAD requests on S3 (a few hundred in a row), limited to a maximum of 50 in parallel. Everything was smooth, without any errors or timeouts, until we updated Node.js to 0.10.31. Some of the requests (about 5%) started failing. Well, actually they were not failing - the callback simply never fired. However, as the S3 access logs showed, those requests did reach S3 and the response code was 200.

var params = {
  Bucket: 'my_bucket',
  Key: 'my_key'
}

s3.headObject(params, function(err, resp) {
  console.log('you will never see me!')
})

Maybe the problem is somehow related to the current issue, so I'll just leave it here.


lsegal commented Sep 24, 2014

@ybogdanov this issue was specifically related to the use of streaming APIs in the SDK (createReadStream in particular). The headObject operation doesn't use any streams, just the regular http interface. If callbacks aren't firing this would likely be related to something else, though off the top of my head I'm not sure what. Your report perhaps should be moved into its own issue so we can track it.

mrnugget added a commit to mrnugget/dcos that referenced this issue Feb 19, 2019
AWS S3, from which we download pre-built pkgpanda packages, might
sometimes end the connection before the package has been fully
downloaded.

(See aws/aws-sdk-js#312 and
boto/boto3#540 for more context on this)

In our case that leads to invalid `tar.xz` archives which cause
`unexpected end of input` errors and result in faulty builds of DC/OS
(`dcos_generate_config.sh`) which contain these invalid archives and
then result in install-time errors for users. See this comment for an
investigation into how this can affect installation errors:

https://jira.mesosphere.com/browse/DCOS_OSS-4097?focusedCommentId=219259&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-219259

This change here adds a sanity check to the `download` function used by
pkgpanda. It checks whether the final file size (after flushing the file
to disk) matches the `Content-Length` we get from S3. If it doesn't
match, it sleeps for 2, 4, 8, ... seconds and then retries until it
either downloads the file completely or the retries are exhausted. In
that case, it raises an exception.