Download Fails Gzip Decompression #1724
Whoa. That seems really not good. I'll try to reproduce. (We're slowly trying to move away from `httplib2`.) I'm going to upload some gzip'ed files and see if I can get the error. Can you also provide the exact command you used?
Sure.
I don't use the `gcloud` CLI.
Oh, there is a typo; it should be `gsutil`. Here is an excerpt from running the help dialog, which is what I used to figure out what all the flags did.
@jjangsangy Is this deterministic? I.e., if you upload the same file twice, does the newly created blob become a failed blob?
I used a script to generate a bunch of random text files:

```python
import string
import random

import six

MIN_SIZE = 1024
MAX_SIZE = 65536
NUM_FILES = 256


def main():
    for index in six.moves.xrange(NUM_FILES):
        file_size = random.randint(MIN_SIZE, MAX_SIZE)
        contents = [random.choice(string.printable)
                    for _ in six.moves.xrange(file_size)]
        filename = 'foo%04d.txt' % (index,)
        with open(filename, 'wb') as file_obj:
            file_obj.write(''.join(contents))


if __name__ == '__main__':
    main()
```

and then uploaded via ...
then tried to reproduce the failure via

```python
import gcloud.storage

client = gcloud.storage.Client(project='<foo>')
bucket = client.bucket('<bucket>')
for blob in bucket.list_blobs():
    just_check = blob.download_as_string()
```

but it succeeded 100% of the time. I didn't try to repro on a Mac, so maybe that's the problem? Or maybe it's the specific contents of 10% of your files?
Sure, I'll try to create a new batch of files and see if I can reproduce.
Yes, I was able to reproduce the result.
I also ran this through a ...
Could you provide an example file for me to reproduce with? You can email it to me if need be.
I would like to, but the nature of this data is sensitive. I can try reproducing the issue with some other files that do not have this restriction and send them to you.
OK thanks. Are the text files ASCII or UTF-8? You could do something like:

```python
with open('foo.txt', 'r') as file_obj:
    all_chars = set(file_obj.read())
```

on each file and then see if your failing files have novel characters that aren't in the others by taking a set difference.
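Concretely, that check might look like this (a sketch; the filenames are hypothetical stand-ins for one good and one failing file):

```python
def char_set(path):
    with open(path, 'r') as file_obj:
        return set(file_obj.read())

good_chars = char_set('good.txt')      # a file that downloads fine
bad_chars = char_set('failing.txt')    # a file that fails to decompress

# Characters that appear only in the failing file.
print(bad_chars - good_chars)
```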
@dhermes So I did find a file; it is public-domain data, so I can disclose it. What would be the best way of getting it to you?
You could upload it as a gist: https://gist.github.com/
I don't think it's an encoding issue; I would expect that downloading the files using ... Here is a link to that test case that failed.
@jjangsangy I'm getting a 404 when I try to open that link. Any issues just putting the contents in a GitHub gist?
Oh sorry, just updated it.
Got it. Testing now.
Also, here is the metadata from the copy that is currently in my bucket. The file that you upload should have the same MD5.
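(For anyone comparing by hand: GCS object metadata reports the MD5 as a base64-encoded digest, not a hex string. A quick way to compute the matching value locally, using a hypothetical filename:)

```python
import base64
import hashlib

# Compute the base64-encoded MD5 that GCS reports in object metadata.
with open('testcase.txt', 'rb') as file_obj:
    digest = hashlib.md5(file_obj.read()).digest()

print(base64.b64encode(digest).decode('ascii'))
```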
The file at the link you gave is ~5x the size:

Maybe the upload is failing partway through and getting corrupted? box.com says 8.9 MB and locally I see the same. Your length 1788589 is approx. 1.71 * 1024 * 1024, i.e. not enough characters to reach 9 MB.
@thobrla Care to take a look?
@jjangsangy NVM my 5x comment; that's because my upload never got gzipped. I'm not sure what the issue there might be. I'm executing:
So it looks like you need to upload it with the `-z` flag:
On it. Thanks.
Also, it looks like you guys might be salting your checksums? If I upload the same file twice in a row, the two uploads result in different md5/crc32 values:
OK, just like that I can reproduce it! The issue was my earlier upload not using `-z`. Our content lengths now agree, but the hash values are still different:

This code fails 100% of the time.
I can't say conclusively, but I think I found the issue. Running my example in IPython and then poking at the failure in the debugger: the encoding is correctly detected, but the response is only partial, so it can't be decompressed.
I see; is there a reason some files are flagged as 206 and some are not?
206 is the Partial Content status code, indicating that more content is coming. So one reason would be that files small enough (probably less than 1 MB) don't need more than one response to send all the contents.
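You can see the same behavior outside any client library: a ranged GET elicits a 206 whenever the server honors the `Range` header. A minimal sketch with `requests` (the URL is a placeholder; any server that supports byte-range requests will do):

```python
import requests

url = 'https://example.com/large-file.txt'

# Ask for only the first KiB of the object.
resp = requests.get(url, headers={'Range': 'bytes=0-1023'})

print(resp.status_code)                   # 206 if the range was honored, else 200
print(resp.headers.get('Content-Range'))  # e.g. 'bytes 0-1023/8912896'
```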
I added a test with the failing file. @tseaver / @craigcitro, it seems that ... @jjangsangy, do you mind if I add the file to our tests? /cc @jonparrott
Ya sure, no problem. Glad to be of help!
@thobrla is the one who knows most of the transfer code. it would help to have a simple code snippet that reproduces a failure. also, for confirmation, i assume it works fine if you download via gsutil directly?
I'm not sure how baked in `httplib2` is.
@craigcitro Clone my gist and run the scripts to repro:
apitools has always been wedded to httplib2 because oauth2client was. in principle ...
@craigcitro I confirmed it works fine with `gsutil`:
gsutil also uses apitools under the hood, so i suspect gsutil and gcloud are doing different things on top. 206 means there's still more content to fetch (which i'm sure you already know); are you calling through ...?
That'll be @tseaver's territory, since he ported ...
@tseaver PTAL.
Sorry for the slow reply; I was on vacation. As for why the hashes differ: I don't believe gzip compression is guaranteed to be deterministic.

For context, gsutil actually gets and handles a 206 when downloading this file. The request looks like this:

and the response looks like this:

gsutil gets all of the gzipped bytes, stores them locally on disk, then decompresses them. It performs the hash comparison against the bytes prior to decompression, since the service hashes correspond to the stored (gzip) encoding.
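On the non-determinism point: the gzip format embeds a modification timestamp in its header, so compressing identical bytes at different times yields different compressed bytes, and therefore different MD5/CRC32 values. A minimal illustration:

```python
import gzip
import hashlib
import time

data = b'the same contents'

first = gzip.compress(data)
time.sleep(1)  # let the mtime field embedded in the gzip header change
second = gzip.compress(data)

# Same input, different compressed bytes: only the header's mtime differs.
print(hashlib.md5(first).hexdigest())
print(hashlib.md5(second).hexdigest())
```

And a rough sketch of the download flow described above (store the gzip bytes, hash them, only then decompress). The function and parameter names here are hypothetical, not gsutil's actual code:

```python
import base64
import gzip
import hashlib
import shutil

def download_then_decompress(gzip_bytes, expected_md5_b64, raw_path, final_path):
    # 1. Store the still-compressed bytes to disk untouched.
    with open(raw_path, 'wb') as raw_file:
        raw_file.write(gzip_bytes)

    # 2. Hash the compressed bytes: the service-side MD5 covers the stored
    #    (gzip) encoding, not the decompressed contents.
    local_md5 = base64.b64encode(hashlib.md5(gzip_bytes).digest()).decode('ascii')
    if local_md5 != expected_md5_b64:
        raise ValueError('hash mismatch: %s != %s' % (local_md5, expected_md5_b64))

    # 3. Only after the hashes agree, decompress to the final destination.
    with gzip.open(raw_path, 'rb') as src, open(final_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
```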
Thanks @thobrla, we've clearly got an issue with how we're using `httplib2`.
You can modify a copy of the gsutil code to set a breakpoint, but the key for gsutil is overriding httplib2's decompression of downloaded bytes. Before we had this code, we had another approach where we hashed on the fly before the decompression occurred. We still need that code for on-the-fly hashing progress callbacks, but I think the comments around gzip there may be out of date now.
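For reference, one way to keep httplib2 from transparently gunzipping a response is to patch its decompression helper. `_decompressContent` is a private httplib2 function, so this is a fragile, illustrative sketch of the idea, not gsutil's actual code:

```python
import httplib2

# Keep a reference so the original behavior can be restored later.
_original_decompress = httplib2._decompressContent

def _no_decompress(response, content):
    # Hand back the body untouched so the caller can hash the stored
    # (gzip) bytes itself before decompressing.
    return content

# httplib2 calls this helper on every response; replacing it disables
# automatic gzip/deflate decoding process-wide.
httplib2._decompressContent = _no_decompress
```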
Awesome, thanks!
@tseaver Can you weigh in here?
Hello! As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates. My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request. Thank you!
I have files that were uploaded using the gcloud command-line interface with the `-z` flag, which applies gzip content-encoding to files during transfer:

```
gcloud -m cp -z text gs://<bucket>
```

It appears that running `blob.download_to_file()` and `blob.download_to_string()` raises an `httplib2.FailedToDecompressContent` exception on about 10% of the files. Here is some context on the platform I'm running this on.

The files that do decompress correctly always decompress successfully, but the ones that fail also fail every time. This also seems to be independent of file size, as I have both large and small files that successfully decompress on download.
Thanks!