
Download Fails Gzip Decompression #1724

Closed
jjangsangy opened this issue Apr 15, 2016 · 44 comments
Assignees
Labels
api: storage - Issues related to the Cloud Storage API.
priority: p2 - Moderately-important priority. Fix may not be included in next release.
type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@jjangsangy

jjangsangy commented Apr 15, 2016

I have files that have been uploaded using the gcloud command line interface with the -z flag, which applies gzip content-encoding to files during transfer.

gcloud -m cp -z text gs://<bucket>

It appears that blob.download_to_file() and blob.download_as_string() raise an httplib2.FailedToDecompressContent exception on about 10% of the files.
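
For reference, here is a minimal sketch of how the downloads are being invoked (the project, bucket, and object names are placeholders rather than my real values):

import gcloud.storage

client = gcloud.storage.Client(project='<project>')
bucket = client.bucket('<bucket>')
blob = bucket.blob('example.txt')  # an object uploaded with Content-Encoding: gzip
blob.download_to_filename('example.txt')  # raises FailedToDecompressContent for some blobs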

Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/sanghan/Projects/memexcadastre/download/management/commands/download.py", line 44, in handle
    file.download_to_filename(filename)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/storage/blob.py", line 329, in download_to_filename
    self.download_to_file(file_obj, client=client)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/storage/blob.py", line 314, in download_to_file
    download.initialize_download(request, client._connection.http)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/streaming/transfer.py", line 347, in initialize_download
    self.bytes_http or http, http_request)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/streaming/http_wrapper.py", line 405, in make_api_request
    check_response_func=check_response_func)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/streaming/http_wrapper.py", line 353, in _make_api_request_no_retry
    redirections=redirections, connection_type=connection_type)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/oauth2client/client.py", line 622, in new_request
    redirections, connection_type)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 1609, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 1351, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 1337, in _conn_request
    content = _decompressContent(response, content)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 410, in _decompressContent
    raise FailedToDecompressContent(_("Content purported to be compressed with %s but failed to decompress.") % response.get('content-encoding'), response, content)
httplib2.FailedToDecompressContent: Content purported to be compressed with gzip but failed to decompress.

Here is some context on the platform I'm running this on.

>>> import pkg_resources
>>> pkg_resources.get_build_platform()
'macosx-10.11-x86_64'
$ pip show gcloud

Metadata-Version: 2.0
Name: gcloud
Version: 0.12.0
Summary: API Client library for Google Cloud
Home-page: https://github.com/GoogleCloudPlatform/gcloud-python
Author: Google Cloud Platform
Author-email: [email protected]
Installer: pip
License: Apache 2.0
Location: /Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages
Requires: pyOpenSSL, six, httplib2, protobuf, oauth2client, googleapis-common-protos
$ python --version
Python 2.7.11

The files that decompress correctly do so every time, and the ones that fail also fail every time. This also seems to be independent of file size, since I have both large and small files that decompress successfully on download.

Thanks!

dhermes added the api: storage and type: bug labels Apr 15, 2016
@dhermes
Contributor

dhermes commented Apr 15, 2016

Whoa. That seems really not good. I'll try to reproduce.

We're slowly trying to move away from httplib2 (see #1214). You might try the httplib2shim suggested there and see if it resolves this issue.

I'm going to upload some gzip-ed files and see if I can get the error. Can you also provide gcloud --version, in case the issue is actually with the content of the blobs themselves?

@jjangsangy
Author

Sure

gcloud --version
Google Cloud SDK 105.0.0

alpha 2016.01.12
app-engine-java 1.9.34
app-engine-python 1.9.35
beta 2016.01.12
bq 2.0.24
bq-nix 2.0.24
core 2016.04.11
core-nix 2016.03.28
gcd-emulator v1beta3-1.0.0
gcloud
gsutil 4.18
gsutil-nix 4.18
kubectl
kubectl-darwin-x86_64 1.2.0

@dhermes
Contributor

dhermes commented Apr 15, 2016

I don't use the gcloud CLI very often; was gcloud -m cp -z text gs://<bucket> the actual command you used?

@jjangsangy
Author

jjangsangy commented Apr 15, 2016

Oh, there is a typo; it should be gsutil -m cp -z text *.txt gs://<bucket>, but it's more or less the same.

Here is an excerpt from the help output that I used to figure out what all the flags do.

$ gsutil cp --help


  -z <ext,...>   Applies gzip content-encoding to file uploads with the given
                 extensions. This is useful when uploading files with
                 compressible content (such as .js, .css, or .html files)
                 because it saves network bandwidth and space in Google Cloud
                 Storage, which in turn reduces storage costs.

                 When you specify the -z option, the data from your files is
                 compressed before it is uploaded, but your actual files are
                 left uncompressed on the local disk. The uploaded objects
                 retain the Content-Type and name of the original files but are
                 given a Content-Encoding header with the value "gzip" to
                 indicate that the object data stored are compressed on the
                 Google Cloud Storage servers.

                 For example, the following command:

                   gsutil cp -z html -a public-read cattypes.html gs://mycats

                 will do all of the following:

                 - Upload as the object gs://mycats/cattypes.html (cp command)
                 - Set the Content-Type to text/html (based on file extension)
                 - Compress the data in the file cattypes.html (-z option)
                 - Set the Content-Encoding to gzip (-z option)
                 - Set the ACL to public-read (-a option)
                 - If a user tries to view cattypes.html in a browser, the
                   browser will know to uncompress the data based on the
                   Content-Encoding header, and to render it as HTML based on
                   the Content-Type header.

                 Note that if you download an object with Content-Encoding:gzip
                 gsutil will decompress the content before writing the local
                 file.

@dhermes
Contributor

dhermes commented Apr 15, 2016

@jjangsangy Is this deterministic? I.e. if you upload the same file twice, does the newly created blob become a failed blob?

@dhermes
Contributor

dhermes commented Apr 15, 2016

I used a script to generate a bunch of random text files:

import string
import random

import six


MIN_SIZE = 1024
MAX_SIZE = 65536
NUM_FILES = 256


def main():
    # Write NUM_FILES files of random length filled with printable characters.
    for index in six.moves.xrange(NUM_FILES):
        file_size = random.randint(MIN_SIZE, MAX_SIZE)
        contents = [random.choice(string.printable)
                    for _ in six.moves.xrange(file_size)]
        filename = 'foo%04d.txt' % (index,)
        with open(filename, 'wb') as file_obj:
            file_obj.write(''.join(contents))


if __name__ == '__main__':
    main()

and then uploaded via

gsutil -m cp -z text *.txt gs://<bucket>

then tried to reproduce the failure via

import gcloud.storage
client = gcloud.storage.Client(project='<foo>')
bucket = client.bucket('<bucket>')
for blob in bucket.list_blobs():
    just_check = blob.download_as_string()

but it succeeded 100% of the time. I didn't try to repro on a Mac, so maybe that's the problem? Or maybe it's the specific contents of 10% of your files?

@jjangsangy
Author

Sure, I'll try to create a new batch of files and see if I can reproduce.

@jjangsangy
Author

Yes, I was able to reproduce the result.

  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 1337, in _conn_request
    content = _decompressContent(response, content)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 410, in _decompressContent
    raise FailedToDecompressContent(_("Content purported to be compressed with %s but failed to decompress.") % response.get('content-encoding'), response, content)
httplib2.FailedToDecompressContent: Content purported to be compressed with gzip but failed to decompress.

I also ran this through a try/except block, and it looks like it is deterministic: the same files fail to decompress every time.
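
This is roughly the loop I used to check, with the project and bucket names elided (the broad except clause is just for the experiment):

import gcloud.storage

client = gcloud.storage.Client(project='<project>')
bucket = client.bucket('<bucket>')
failed = []
for blob in bucket.list_blobs():
    try:
        blob.download_as_string()
    except Exception:  # httplib2.FailedToDecompressContent in practice
        failed.append(blob.name)
print(failed)  # the same blob names show up on every run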

@dhermes
Contributor

dhermes commented Apr 15, 2016

Could you provide an example file for me to reproduce with? You can email it to me if need be.

@jjangsangy
Author

I would like to, but the nature of this data is sensitive.

I can try reproducing the issue with some other files that do not have this restriction and send them to you.

@dhermes
Contributor

dhermes commented Apr 15, 2016

OK thanks. Are the text files ASCII or UTF-8? You could do something like:

with open('foo.txt', 'r') as file_obj:
    all_chars = set(file_obj.read())

on each file and then see if your failing files have novel characters that aren't in the others by taking a set difference.
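
Something like this, where the good/bad filename lists are hypothetical stand-ins for your own files:

GOOD_FILES = ['good1.txt', 'good2.txt']
BAD_FILES = ['bad1.txt']


def char_set(path):
    with open(path, 'r') as file_obj:
        return set(file_obj.read())


good_chars = set().union(*(char_set(name) for name in GOOD_FILES))
bad_chars = set().union(*(char_set(name) for name in BAD_FILES))
# characters that appear only in the failing files
print(bad_chars - good_chars)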

@jjangsangy
Author

jjangsangy commented Apr 15, 2016

@dhermes So I did find a file; it is public-domain data, so I can disclose it. What would be the best way of getting it to you?

@dhermes
Contributor

dhermes commented Apr 15, 2016

You could upload it as a gist: https://gist.github.com/

@jjangsangy
Author

jjangsangy commented Apr 15, 2016

I don't think it's an encoding issue; would downloading the files using blob.download_to_file() actually try to read the file contents?

Here is a link to that test case that failed.

commoncrawl.txt

@dhermes
Contributor

dhermes commented Apr 15, 2016

@jjangsangy I'm getting a 404 when I try to open that link. Any issues just putting the contents in a GitHub gist?

@jjangsangy
Author

Oh sorry, just updated it

@dhermes
Contributor

dhermes commented Apr 15, 2016

Got it. Testing now.

@jjangsangy
Author

Also, here is the metadata from the copy that is currently in my bucket. The file that you upload should have the same md5.

    Cache-Control:      no-transform
    Content-Encoding:       gzip
    Content-Length:     1788589
    Content-Type:       text/plain
    Hash (crc32c):      LAsg/A==
    Hash (md5):     cvuoX9bXonPb3UbSuP5Yeg==
    ETag:           CNCHzuCpkcwCEAE=
    Generation:     1460746765698000

@dhermes
Contributor

dhermes commented Apr 15, 2016

The file at the link you gave is ~5x the size:

$ gsutil ls -L gs://<bucket>/commoncrawl.txt
gs://<bucket>/commoncrawl.txt:
        Creation time:          Fri, 15 Apr 2016 20:19:12 GMT
        Content-Length:         9347637
        Content-Type:           text/plain
        Hash (crc32c):          4OTrVw==
        Hash (md5):             OJpsldHv6p5Kakvy8whqRQ==
        ETag:                   COCu+Mq7kcwCEAE=
        Generation:             1460751552092000
        Metageneration:         1
        ACL:            [
  {
    "entity": "project-owners-.....",

Maybe the upload is failing partway through and getting corrupted? box.com says 8.9MB and locally I see the same. Your length 1788589 is approx. 1.71 * 1024 * 1024, i.e. not enough characters to reach 9MB.

@dhermes
Contributor

dhermes commented Apr 15, 2016

@thobrla Care to take a look?

@dhermes
Contributor

dhermes commented Apr 15, 2016

@jjangsangy NVM my 5x comment; that's because my upload never got gzipped. I'm not sure what the issue there might be.

I'm executing

gsutil -m cp -z text commoncrawl.txt gs://<bucket>

@jjangsangy
Author

jjangsangy commented Apr 15, 2016

So it looks like you need to upload it with -Z (uppercase Z).

gsutil cp -Z commoncrawl.txt gs://<bucket-name>

@dhermes
Contributor

dhermes commented Apr 15, 2016

On it. Thanks.

@jjangsangy
Author

Also, it looks like you guys might be salting your checksums?

If I upload the same file twice in a row, the two copies end up with different md5/crc32 hashes.

$ gsutil cp -z txt commoncrawl.txt gs://<bucket>
Copying file://commoncrawl.txt [Content-Type=text/plain]...
Uploading   gs://<bucket>/commoncrawl.txt:                 1.71 MiB/1.71 MiB

$ gsutil ls -L gs://<bucket>/commoncrawl.txt | grep "Hash"
    Hash (crc32c):      cu+ayQ==
    Hash (md5):     SRYs2Dz624KUmbxclQ2UoA==

$ gsutil cp -z txt commoncrawl.txt gs://<bucket>
Copying file://commoncrawl.txt [Content-Type=text/plain]...
Uploading   gs://<bucket>/commoncrawl.txt:                 1.71 MiB/1.71 MiB

$ gsutil ls -L gs://<bucket>/commoncrawl.txt | grep "Hash"
    Hash (crc32c):      0tlu5w==
    Hash (md5):     EitPyF4/SAlQTXJc4aMQ8g==

@dhermes
Contributor

dhermes commented Apr 15, 2016

OK, just like that I can reproduce it! The issue was using -z text rather than -z txt, BTW.

Our content lengths now agree but the hash values are still different:

$ gsutil ls -L gs://<bucket>/commoncrawl.txt
gs://<bucket>/commoncrawl.txt:
        Creation time:          Fri, 15 Apr 2016 20:47:02 GMT
        Cache-Control:          no-transform
        Content-Encoding:               gzip
        Content-Length:         1788589
        Content-Type:           text/plain
        Hash (crc32c):          eG37zQ==
        Hash (md5):             YPXMpNfu/wkuuu6IqrIo8Q==
        ETag:                   CMDC1+fBkcwCEAE=
        Generation:             1460753222984000
        Metageneration:         1
        ACL:            [
  {
    "entity": "project-owners-...",

This code fails 100% of the time.

@dhermes
Contributor

dhermes commented Apr 15, 2016

I can't say conclusively, but I think I found the issue. Running my example in IPython and then using the %debug magic:

ipdb> locals().keys()
['content', 'new_content', 'response', 'encoding']
ipdb> response['status']
'206'
ipdb> import httplib
ipdb> httplib.PARTIAL_CONTENT
206
ipdb> len(content)
0
ipdb> len(new_content)
1048576
ipdb> len(new_content) < 1788589
True
ipdb> encoding
'gzip'

The encoding is correctly detected, but the response is only partial, so it can't be decompressed.
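
The same failure is easy to reproduce outside of gcloud entirely; a gzip stream cut off mid-body simply cannot be decompressed. This sketch uses zlib directly rather than httplib2, but the principle is the same:

import zlib

# Compress some bytes into the gzip container format (wbits = 16 + MAX_WBITS).
compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
payload = compressor.compress(b'x' * (4 * 1024 * 1024)) + compressor.flush()

# Decompressing only a prefix of the stream, which is effectively what a
# single 206 response body amounts to, raises an "incomplete stream" error.
try:
    zlib.decompress(payload[:len(payload) // 2], 16 + zlib.MAX_WBITS)
except zlib.error as exc:
    print(exc)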

@jjangsangy
Author

I see; is there a reason some files are flagged as 206 and some are not?

@dhermes
Contributor

dhermes commented Apr 15, 2016

206 is the Partial Content status code, indicating that more content is coming.

So one reason would be that files small enough (probably less than 1 MB) don't need more than one response to send all of their contents.

@dhermes
Contributor

dhermes commented Apr 15, 2016

I added a test with httplib2shim and it doesn't fail, but it doesn't seem to do the right thing either. It returns 5060440 chars when 9347637 are expected.

@tseaver / @craigcitro it seems that apitools is the culprit here. WDYT?

@jjangsangy Do you mind if I add the commoncrawl.txt contents to my test gist so it makes it easier for others to run the code?

/cc @jonparrott

@jjangsangy
Author

Ya sure no problem, glad to be of help!

@craigcitro
Contributor

@thobrla is the one who knows most of the transfer code. It would help to have a simple code snippet that reproduces a failure.

Also, for confirmation, I assume it works fine if you download via gsutil directly?

@dhermes
Contributor

dhermes commented Apr 15, 2016

I'm not sure how baked in httplib2 is to gcloud.streaming, but certain places are a bit worrisome.
(I ran git grep httplib2 -- '*.py' | grep import | egrep -v test_ and gcloud/streaming/http_wrapper.py is the only thing that really deeply relies on httplib2.)

@dhermes
Contributor

dhermes commented Apr 15, 2016

@craigcitro Clone my gist and run the scripts to repro:

$ git clone [email protected]:8eaa290ffc633cae06913428d5290c1b.git
$ cd 8eaa290ffc633cae06913428d5290c1b
$ python test.py \
> --repro-project <project> \
> --repro-bucket <bucket>

@craigcitro
Contributor

apitools has always been wedded to httplib2 because oauth2client was. In principle, http_wrapper.py should own all that info, but it's never been tested.

@dhermes
Contributor

dhermes commented Apr 15, 2016

@craigcitro I confirmed it works fine with gsutil:

$ gsutil cat gs://<bucket>/commoncrawl.txt > foo.txt.gz
$ gunzip foo.txt.gz
$ diff -s foo.txt commoncrawl.txt
Files foo.txt and commoncrawl.txt are identical

@craigcitro
Contributor

gsutil also uses apitools under the hood, so I suspect gsutil and gcloud are doing different things on top. 206 means there's still more content to fetch (which I'm sure you already know); are you calling through StreamInChunks? If so, it should be continuing until we hit a 200.
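
In other words, the expected behavior is roughly the loop below (only a sketch of the idea, not apitools' actual implementation; fetch_range is a stand-in for the HTTP range request):

def download_in_chunks(fetch_range, total_size, chunk_size=1024 * 1024):
    # fetch_range(start, end) is assumed to return the raw bytes for the
    # inclusive byte range; a 206 response just means there is more to fetch,
    # so keep looping until every byte has been retrieved.
    data = b''
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        data += fetch_range(start, end)
        start = end + 1
    return data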

@dhermes
Contributor

dhermes commented Apr 15, 2016

That'll be @tseaver's territory since he ported apitools in as gcloud.streaming. Looking at gcloud.streaming.http_wrapper it seems I'll be in there making the boundary between httplib2 and our library a lot cleaner so we can swap out for another transport layer.

@dhermes
Contributor

dhermes commented Apr 18, 2016

@tseaver PTAL.

@thobrla

thobrla commented Apr 19, 2016

Sorry for the slow reply; I was on vacation. As for why the hashes differ, I don't believe gzip compression is guaranteed to be deterministic.
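
As a standalone illustration (not gsutil's code): the gzip header embeds a modification time, so compressing identical bytes at different times produces different output and therefore different hashes.

import gzip
import hashlib
import io


def gzip_bytes(payload, mtime):
    buf = io.BytesIO()
    # mtime is written into the 4-byte MTIME field of the gzip header
    with gzip.GzipFile(fileobj=buf, mode='wb', mtime=mtime) as gz_file:
        gz_file.write(payload)
    return buf.getvalue()


payload = b'identical content both times'
first = gzip_bytes(payload, mtime=1460746765)
second = gzip_bytes(payload, mtime=1460746766)
print(hashlib.md5(first).hexdigest() == hashlib.md5(second).hexdigest())  # False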

For context, gsutil actually gets and handles a 206 when downloading this file. The request looks like this:

GET /download/storage/v1/b/bucket/o/commoncrawl.txt?generation=...&alt=media
HTTP/1.1
Host: www.googleapis.com
content-length: 0
accept-encoding: gzip
range: bytes=0-1788588

and the response looks like this:

reply: 'HTTP/1.1 206 Partial Content'
header: Content-Type: text/plain
header: Content-Disposition: attachment
header: Content-Encoding: gzip
header: Cache-Control: no-cache, no-store, max-age=0, must-revalidate
header: Content-Range: bytes 0-1788588/1788589
header: Transfer-Encoding: chunked

gsutil gets all of the gzipped bytes, stores them locally to disk, then decompresses them. It performs hash comparison against the bytes prior to decompression since the service hashes correspond to the stored (gzip) encoding.
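
In other words, the flow is roughly the sketch below (not gsutil's actual code; expected_md5_b64 stands in for the base64 md5 that gsutil ls -L reports for the object):

import base64
import gzip
import hashlib
import shutil


def finish_gzip_download(gz_path, out_path, expected_md5_b64):
    # Hash the stored (gzip-encoded) bytes first: the service-side md5/crc32c
    # values correspond to the compressed object data.
    with open(gz_path, 'rb') as gz_file:
        digest = hashlib.md5(gz_file.read()).digest()
    if base64.b64encode(digest) != expected_md5_b64.encode('ascii'):
        raise ValueError('md5 mismatch on downloaded gzip bytes')
    # Only after the hash check does decompression produce the local file.
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)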

@dhermes
Contributor

dhermes commented Apr 20, 2016

Thanks @thobrla, we've clearly got an issue with how we're using apitools / gcloud.streaming. Is there a way to trace the codepath within gsutil to see how it's using apitools (and how it's telling httplib2 to ignore the gzip header on the 206)?

@thobrla

thobrla commented Apr 20, 2016

You can modify a copy of the gsutil code to set a breakpoint, but the key for gsutil is overriding httplib2's decompression for downloaded bytes.
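
One hedged illustration of the idea (not the code gsutil actually uses): because httplib2 runs its module-level _decompressContent helper on every response body, a caller could swap that helper out so gzip-encoded bodies pass through untouched, then hash and decompress the complete stream itself.

import httplib2

_original_decompress = httplib2._decompressContent


def _passthrough_decompress(response, content):
    # Leave gzip-encoded bodies untouched so the download code can hash the
    # raw bytes and decompress once the full object has arrived.
    if response.get('content-encoding') == 'gzip':
        return content
    return _original_decompress(response, content)


httplib2._decompressContent = _passthrough_decompress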

Before we had this code, we had another approach where we hashed on the fly before the decompression occurred. We still need that code for on-the-fly hashing progress callbacks, but I think the comments around gzip there may be out of date now.

@dhermes
Contributor

dhermes commented Apr 20, 2016

Awesome, thanks!

@dhermes
Contributor

dhermes commented Apr 26, 2016

@tseaver Can you weigh in here?

lukesneeringer added the priority: p2 label Apr 19, 2017
@lukesneeringer
Contributor

Hello,
One of the challenges of maintaining a large open source project is that sometimes, you can bite off more than you can chew. As the lead maintainer of google-cloud-python, I can definitely say that I have let the issues here pile up.

As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates.

My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request.

Thank you!
