
Download Fails Gzip Decompression #1724

Closed
jjangsangy opened this issue Apr 15, 2016 · 44 comments
Assignees
Labels
api: storage - Issues related to the Cloud Storage API.
priority: p2 - Moderately-important priority. Fix may not be included in next release.
type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@jjangsangy

jjangsangy commented Apr 15, 2016

I have files that have been uploaded using the gcloud command line interface with the -z flag, which applies gzip content-encoding to files during transfer.

gcloud -m cp -z text gs://<bucket>

It appears that blob.download_to_file() and blob.download_as_string() raise an httplib2.FailedToDecompressContent exception on about 10% of the files.
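
For reference, here is a minimal sketch of how the downloads are being invoked (the project, bucket, and object names are placeholders rather than my real values):

import gcloud.storage

client = gcloud.storage.Client(project='<project>')
bucket = client.bucket('<bucket>')
blob = bucket.blob('example.txt')  # an object uploaded with Content-Encoding: gzip
blob.download_to_filename('example.txt')  # raises FailedToDecompressContent for some blobs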

Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/sanghan/Projects/memexcadastre/download/management/commands/download.py", line 44, in handle
    file.download_to_filename(filename)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/storage/blob.py", line 329, in download_to_filename
    self.download_to_file(file_obj, client=client)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/storage/blob.py", line 314, in download_to_file
    download.initialize_download(request, client._connection.http)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/streaming/transfer.py", line 347, in initialize_download
    self.bytes_http or http, http_request)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/streaming/http_wrapper.py", line 405, in make_api_request
    check_response_func=check_response_func)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/gcloud/streaming/http_wrapper.py", line 353, in _make_api_request_no_retry
    redirections=redirections, connection_type=connection_type)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/oauth2client/client.py", line 622, in new_request
    redirections, connection_type)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 1609, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 1351, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 1337, in _conn_request
    content = _decompressContent(response, content)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 410, in _decompressContent
    raise FailedToDecompressContent(_("Content purported to be compressed with %s but failed to decompress.") % response.get('content-encoding'), response, content)
httplib2.FailedToDecompressContent: Content purported to be compressed with gzip but failed to decompress.

Here is some context on the platform I'm running this on.

>>> import pkg_resources
>>> pkg_resources.get_build_platform()
'macosx-10.11-x86_64'
$ pip show gcloud

Metadata-Version: 2.0
Name: gcloud
Version: 0.12.0
Summary: API Client library for Google Cloud
Home-page: https://github.com/GoogleCloudPlatform/gcloud-python
Author: Google Cloud Platform
Author-email: [email protected]
Installer: pip
License: Apache 2.0
Location: /Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages
Requires: pyOpenSSL, six, httplib2, protobuf, oauth2client, googleapis-common-protos
$ python --version
Python 2.7.11

The files that decompress correctly do so every time, and the ones that fail also fail every time. This also seems to be independent of file size, since I have both large and small files that decompress successfully on download.

Thanks!

dhermes added the api: storage and type: bug labels Apr 15, 2016
@dhermes
Contributor

dhermes commented Apr 15, 2016

Whoa. That seems really not good. I'll try to reproduce.

We're slowly trying to move away from httplib2 (see #1214). You might try the httplib2shim suggested there and see if it resolves this issue.

I'm going to upload some gzip-ed files and see if I can get the error. Can you also provide gcloud --version, in case the issue is actually with the content of the blobs themselves?

@jjangsangy
Author

Sure

gcloud --version
Google Cloud SDK 105.0.0

alpha 2016.01.12
app-engine-java 1.9.34
app-engine-python 1.9.35
beta 2016.01.12
bq 2.0.24
bq-nix 2.0.24
core 2016.04.11
core-nix 2016.03.28
gcd-emulator v1beta3-1.0.0
gcloud
gsutil 4.18
gsutil-nix 4.18
kubectl
kubectl-darwin-x86_64 1.2.0

@dhermes
Contributor

dhermes commented Apr 15, 2016

I don't use the gcloud CLI very often; was gcloud -m cp -z text gs://<bucket> the actual command you used?

@jjangsangy
Author

jjangsangy commented Apr 15, 2016

Oh, there is a typo; it should be gsutil -m cp -z text *.txt gs://<bucket>, but it's more or less the same.

Here is an excerpt from the help output that I used to figure out what all the flags do.

$ gsutil cp --help


  -z <ext,...>   Applies gzip content-encoding to file uploads with the given
                 extensions. This is useful when uploading files with
                 compressible content (such as .js, .css, or .html files)
                 because it saves network bandwidth and space in Google Cloud
                 Storage, which in turn reduces storage costs.

                 When you specify the -z option, the data from your files is
                 compressed before it is uploaded, but your actual files are
                 left uncompressed on the local disk. The uploaded objects
                 retain the Content-Type and name of the original files but are
                 given a Content-Encoding header with the value "gzip" to
                 indicate that the object data stored are compressed on the
                 Google Cloud Storage servers.

                 For example, the following command:

                   gsutil cp -z html -a public-read cattypes.html gs://mycats

                 will do all of the following:

                 - Upload as the object gs://mycats/cattypes.html (cp command)
                 - Set the Content-Type to text/html (based on file extension)
                 - Compress the data in the file cattypes.html (-z option)
                 - Set the Content-Encoding to gzip (-z option)
                 - Set the ACL to public-read (-a option)
                 - If a user tries to view cattypes.html in a browser, the
                   browser will know to uncompress the data based on the
                   Content-Encoding header, and to render it as HTML based on
                   the Content-Type header.

                 Note that if you download an object with Content-Encoding:gzip
                 gsutil will decompress the content before writing the local
                 file.

@dhermes
Contributor

dhermes commented Apr 15, 2016

@jjangsangy Is this deterministic? I.e. if you upload the same file twice, does the newly created blob become a failed blob?

@dhermes
Contributor

dhermes commented Apr 15, 2016

I used a script to generate a bunch of random text files:

import string
import random

import six


MIN_SIZE = 1024
MAX_SIZE = 65536
NUM_FILES = 256


def main():
    # Write NUM_FILES files of random length filled with printable characters.
    for index in six.moves.xrange(NUM_FILES):
        file_size = random.randint(MIN_SIZE, MAX_SIZE)
        contents = [random.choice(string.printable)
                    for _ in six.moves.xrange(file_size)]
        filename = 'foo%04d.txt' % (index,)
        with open(filename, 'wb') as file_obj:
            file_obj.write(''.join(contents))


if __name__ == '__main__':
    main()

and then uploaded via

gsutil -m cp -z text *.txt gs://<bucket>

then tried to reproduce the failure via

import gcloud.storage
client = gcloud.storage.Client(project='<foo>')
bucket = client.bucket('<bucket>')
for blob in bucket.list_blobs():
    just_check = blob.download_as_string()

but it succeeded 100% of the time. I didn't try to repro on a Mac, so maybe that's the problem? Or maybe it's the specific contents of 10% of your files?

@jjangsangy
Author

Sure, I'll try to create a new batch of files and see if I can reproduce.

@jjangsangy
Author

Yes, I was able to reproduce the result.

  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 1337, in _conn_request
    content = _decompressContent(response, content)
  File "/Users/sanghan/Projects/memexcadastre/venv/lib/python2.7/site-packages/httplib2/__init__.py", line 410, in _decompressContent
    raise FailedToDecompressContent(_("Content purported to be compressed with %s but failed to decompress.") % response.get('content-encoding'), response, content)
httplib2.FailedToDecompressContent: Content purported to be compressed with gzip but failed to decompress.

I also ran this through a try/except block, and it looks like it is deterministic: the same files fail to decompress every time.
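
This is roughly the loop I used to check, with the project and bucket names elided (the broad except clause is just for the experiment):

import gcloud.storage

client = gcloud.storage.Client(project='<project>')
bucket = client.bucket('<bucket>')
failed = []
for blob in bucket.list_blobs():
    try:
        blob.download_as_string()
    except Exception:  # httplib2.FailedToDecompressContent in practice
        failed.append(blob.name)
print(failed)  # the same blob names show up on every run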

@dhermes
Contributor

dhermes commented Apr 15, 2016

Could you provide an example file for me to reproduce with? You can email it to me if need be.

@jjangsangy
Author

I would like to, but the nature of this data is sensitive.

I can try reproducing the issue with some other files that do not have this restriction and send them to you.

@dhermes
Contributor

dhermes commented Apr 15, 2016

OK thanks. Are the text files ASCII or UTF-8? You could do something like:

with open('foo.txt', 'r') as file_obj:
    all_chars = set(file_obj.read())

on each file and then see if your failing files have novel characters that aren't in the others by taking a set difference.
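
Something like this, where the good/bad filename lists are hypothetical stand-ins for your own files:

GOOD_FILES = ['good1.txt', 'good2.txt']
BAD_FILES = ['bad1.txt']


def char_set(path):
    with open(path, 'r') as file_obj:
        return set(file_obj.read())


good_chars = set().union(*(char_set(name) for name in GOOD_FILES))
bad_chars = set().union(*(char_set(name) for name in BAD_FILES))
# characters that appear only in the failing files
print(bad_chars - good_chars)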

@jjangsangy
Author

jjangsangy commented Apr 15, 2016

@dhermes So I did find a file; it is public-domain data, so I can disclose it. What would be the best way of getting it to you?

@dhermes
Contributor

dhermes commented Apr 15, 2016

You could upload it as a gist: https://gist.github.com/

@jjangsangy
Author

jjangsangy commented Apr 15, 2016

I don't think it's an encoding issue; would downloading the files using blob.download_to_file() actually try to read the file contents?

Here is a link to that test case that failed.

commoncrawl.txt

@dhermes
Contributor

dhermes commented Apr 15, 2016

@jjangsangy I'm getting a 404 when I try to open that link. Any issues just putting the contents in a GitHub gist?

@jjangsangy
Author

Oh sorry, just updated it

@dhermes
Contributor

dhermes commented Apr 15, 2016

Got it. Testing now.

@jjangsangy
Author

Also, here is the metadata from the copy that is currently in my bucket. The file that you upload should have the same md5.

    Cache-Control:      no-transform
    Content-Encoding:       gzip
    Content-Length:     1788589
    Content-Type:       text/plain
    Hash (crc32c):      LAsg/A==
    Hash (md5):     cvuoX9bXonPb3UbSuP5Yeg==
    ETag:           CNCHzuCpkcwCEAE=
    Generation:     1460746765698000

@dhermes
Contributor

dhermes commented Apr 15, 2016

The file at the link you gave is ~5x the size:

$ gsutil ls -L gs://<bucket>/commoncrawl.txt
gs://<bucket>/commoncrawl.txt:
        Creation time:          Fri, 15 Apr 2016 20:19:12 GMT
        Content-Length:         9347637
        Content-Type:           text/plain
        Hash (crc32c):          4OTrVw==
        Hash (md5):             OJpsldHv6p5Kakvy8whqRQ==
        ETag:                   COCu+Mq7kcwCEAE=
        Generation:             1460751552092000
        Metageneration:         1
        ACL:            [
  {
    "entity": "project-owners-.....",

Maybe the upload is failing partway through and getting corrupted? box.com says 8.9MB and locally I see the same. Your length 1788589 is approx. 1.71 * 1024 * 1024, i.e. not enough characters to reach 9MB.

@dhermes
Contributor

dhermes commented Apr 15, 2016

@thobrla Care to take a look?

@dhermes
Contributor

dhermes commented Apr 15, 2016

@jjangsangy NVM my 5x comment; that's because my upload never got gzipped. I'm not sure what the issue there might be.

I'm executing

gsutil -m cp -z text commoncrawl.txt gs://<bucket>

@jjangsangy
Author

jjangsangy commented Apr 15, 2016

So it looks like you need to upload it with -Z (uppercase Z).

gsutil cp -Z commoncrawl.txt gs://<bucket-name>

@dhermes
Contributor

dhermes commented Apr 15, 2016

On it. Thanks.

@jjangsangy
Author

Also, it looks like you guys might be salting your checksums?

If I upload the same file twice in a row, the two copies end up with different md5/crc32 hashes.

$ gsutil cp -z txt commoncrawl.txt gs://<bucket>
Copying file://commoncrawl.txt [Content-Type=text/plain]...
Uploading   gs://<bucket>/commoncrawl.txt:                 1.71 MiB/1.71 MiB

$ gsutil ls -L gs://<bucket>/commoncrawl.txt | grep "Hash"
    Hash (crc32c):      cu+ayQ==
    Hash (md5):     SRYs2Dz624KUmbxclQ2UoA==

$ gsutil cp -z txt commoncrawl.txt gs://<bucket>
Copying file://commoncrawl.txt [Content-Type=text/plain]...
Uploading   gs://<bucket>/commoncrawl.txt:                 1.71 MiB/1.71 MiB

$ gsutil ls -L gs://<bucket>/commoncrawl.txt | grep "Hash"
    Hash (crc32c):      0tlu5w==
    Hash (md5):     EitPyF4/SAlQTXJc4aMQ8g==

@dhermes
Contributor

dhermes commented Apr 15, 2016

OK, just like that I can reproduce it! The issue was using -z text rather than -z txt, BTW.

Our content lengths now agree but the hash values are still different:

$ gsutil ls -L gs://<bucket>/commoncrawl.txt
gs://<bucket>/commoncrawl.txt:
        Creation time:          Fri, 15 Apr 2016 20:47:02 GMT
        Cache-Control:          no-transform
        Content-Encoding:               gzip
        Content-Length:         1788589
        Content-Type:           text/plain
        Hash (crc32c):          eG37zQ==
        Hash (md5):             YPXMpNfu/wkuuu6IqrIo8Q==
        ETag:                   CMDC1+fBkcwCEAE=
        Generation:             1460753222984000
        Metageneration:         1
        ACL:            [
  {
    "entity": "project-owners-...",

This code fails 100% of the time.

@dhermes
Contributor

dhermes commented Apr 15, 2016

I can't say conclusively, but I think I found the issue. Running my example in IPython and then using the %debug magic:

ipdb> locals().keys()
['content', 'new_content', 'response', 'encoding']
ipdb> response['status']
'206'
ipdb> import httplib
ipdb> httplib.PARTIAL_CONTENT
206
ipdb> len(content)
0
ipdb> len(new_content)
1048576
ipdb> len(new_content) < 1788589
True
ipdb> encoding
'gzip'

The encoding is correctly detected, but the response is only partial, so it can't be decompressed.
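
The same failure is easy to reproduce outside of gcloud entirely; a gzip stream cut off mid-body simply cannot be decompressed. This sketch uses zlib directly rather than httplib2, but the principle is the same:

import zlib

# Compress some bytes into the gzip container format (wbits = 16 + MAX_WBITS).
compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
payload = compressor.compress(b'x' * (4 * 1024 * 1024)) + compressor.flush()

# Decompressing only a prefix of the stream, which is effectively what a
# single 206 response body amounts to, raises an "incomplete stream" error.
try:
    zlib.decompress(payload[:len(payload) // 2], 16 + zlib.MAX_WBITS)
except zlib.error as exc:
    print(exc)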

@jjangsangy
Author

I see; is there a reason some files are flagged as 206 and some are not?

@dhermes
Contributor

dhermes commented Apr 15, 2016

206 is the Partial Content status code, indicating that more content is coming.

So one reason would be that files small enough (probably less than 1 MB) don't need more than one response to send all of their contents.

@dhermes
Contributor

dhermes commented Apr 15, 2016

I added a test with httplib2shim and it doesn't fail, but it doesn't seem to do the right thing either. It returns 5060440 chars when 9347637 are expected.

@tseaver / @craigcitro it seems that apitools is the culprit here. WDYT?

@jjangsangy Do you mind if I add the commoncrawl.txt contents to my test gist so it makes it easier for others to run the code?

/cc @jonparrott

@jjangsangy
Author

Ya sure no problem, glad to be of help!

@craigcitro
Contributor

@thobrla is the one who knows most of the transfer code. It would help to have a simple code snippet that reproduces a failure.

Also, for confirmation, I assume it works fine if you download via gsutil directly?

@dhermes
Contributor

dhermes commented Apr 15, 2016

I'm not sure how baked in httplib2 is to gcloud.streaming, but certain places are a bit worrisome.
(I ran git grep httplib2 -- '*.py' | grep import | egrep -v test_ and gcloud/streaming/http_wrapper.py is the only thing that really deeply relies on httplib2.)

@dhermes
Contributor

dhermes commented Apr 15, 2016

@craigcitro Clone my gist and run the scripts to repro:

$ git clone [email protected]:8eaa290ffc633cae06913428d5290c1b.git
$ cd 8eaa290ffc633cae06913428d5290c1b
$ python test.py \
> --repro-project <project> \
> --repro-bucket <bucket>

@craigcitro
Contributor

apitools has always been wedded to httplib2 because oauth2client was. In principle, http_wrapper.py should own all that info, but it's never been tested.

@dhermes
Contributor

dhermes commented Apr 15, 2016

@craigcitro I confirmed it works fine with gsutil:

$ gsutil cat gs://<bucket>/commoncrawl.txt > foo.txt.gz
$ gunzip foo.txt.gz
$ diff -s foo.txt commoncrawl.txt
Files foo.txt and commoncrawl.txt are identical

@craigcitro
Contributor

gsutil also uses apitools under the hood, so I suspect gsutil and gcloud are doing different things on top. 206 means there's still more content to fetch (which I'm sure you already know); are you calling through StreamInChunks? If so, it should be continuing until we hit a 200.
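
In other words, the expected behavior is roughly the loop below (only a sketch of the idea, not apitools' actual implementation; fetch_range is a stand-in for the HTTP range request):

def download_in_chunks(fetch_range, total_size, chunk_size=1024 * 1024):
    # fetch_range(start, end) is assumed to return the raw bytes for the
    # inclusive byte range; a 206 response just means there is more to fetch,
    # so keep looping until every byte has been retrieved.
    data = b''
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        data += fetch_range(start, end)
        start = end + 1
    return data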

@dhermes
Contributor

dhermes commented Apr 15, 2016

That'll be @tseaver's territory since he ported apitools in as gcloud.streaming. Looking at gcloud.streaming.http_wrapper it seems I'll be in there making the boundary between httplib2 and our library a lot cleaner so we can swap out for another transport layer.

@dhermes
Contributor

dhermes commented Apr 18, 2016

@tseaver PTAL.

@thobrla

thobrla commented Apr 19, 2016

Sorry for the slow reply; I was on vacation. As for why the hashes differ, I don't believe gzip compression is guaranteed to be deterministic.
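
As a standalone illustration (not gsutil's code): the gzip header embeds a modification time, so compressing identical bytes at different times produces different output and therefore different hashes.

import gzip
import hashlib
import io


def gzip_bytes(payload, mtime):
    buf = io.BytesIO()
    # mtime is written into the 4-byte MTIME field of the gzip header
    with gzip.GzipFile(fileobj=buf, mode='wb', mtime=mtime) as gz_file:
        gz_file.write(payload)
    return buf.getvalue()


payload = b'identical content both times'
first = gzip_bytes(payload, mtime=1460746765)
second = gzip_bytes(payload, mtime=1460746766)
print(hashlib.md5(first).hexdigest() == hashlib.md5(second).hexdigest())  # False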

For context, gsutil actually gets and handles a 206 when downloading this file. The request looks like this:

GET /download/storage/v1/b/bucket/o/commoncrawl.txt?generation=...&alt=media
HTTP/1.1
Host: www.googleapis.com
content-length: 0
accept-encoding: gzip
range: bytes=0-1788588

and the response looks like this:

reply: 'HTTP/1.1 206 Partial Content'
header: Content-Type: text/plain
header: Content-Disposition: attachment
header: Content-Encoding: gzip
header: Cache-Control: no-cache, no-store, max-age=0, must-revalidate
header: Content-Range: bytes 0-1788588/1788589
header: Transfer-Encoding: chunked

gsutil gets all of the gzipped bytes, stores them locally to disk, then decompresses them. It performs hash comparison against the bytes prior to decompression since the service hashes correspond to the stored (gzip) encoding.
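
In other words, the flow is roughly the sketch below (not gsutil's actual code; expected_md5_b64 stands in for the base64 md5 that gsutil ls -L reports for the object):

import base64
import gzip
import hashlib
import shutil


def finish_gzip_download(gz_path, out_path, expected_md5_b64):
    # Hash the stored (gzip-encoded) bytes first: the service-side md5/crc32c
    # values correspond to the compressed object data.
    with open(gz_path, 'rb') as gz_file:
        digest = hashlib.md5(gz_file.read()).digest()
    if base64.b64encode(digest) != expected_md5_b64.encode('ascii'):
        raise ValueError('md5 mismatch on downloaded gzip bytes')
    # Only after the hash check does decompression produce the local file.
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)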

@dhermes
Contributor

dhermes commented Apr 20, 2016

Thanks @thobrla, we've clearly got an issue with how we're using apitools / gcloud.streaming. Is there a way to trace the codepath within gsutil to see how it's using apitools (and how it's telling httplib2 to ignore the gzip header on the 206)?

@thobrla

thobrla commented Apr 20, 2016

You can modify a copy of the gsutil code to set a breakpoint, but the key for gsutil is overriding httplib2's decompression for downloaded bytes.
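
One hedged illustration of the idea (not the code gsutil actually uses): because httplib2 runs its module-level _decompressContent helper on every response body, a caller could swap that helper out so gzip-encoded bodies pass through untouched, then hash and decompress the complete stream itself.

import httplib2

_original_decompress = httplib2._decompressContent


def _passthrough_decompress(response, content):
    # Leave gzip-encoded bodies untouched so the download code can hash the
    # raw bytes and decompress once the full object has arrived.
    if response.get('content-encoding') == 'gzip':
        return content
    return _original_decompress(response, content)


httplib2._decompressContent = _passthrough_decompress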

Before we had this code, we had another approach where we hashed on the fly before the decompression occurred. We still need that code for on-the-fly hashing progress callbacks, but I think the comments around gzip there may be out of date now.

@dhermes
Contributor

dhermes commented Apr 20, 2016

Awesome, thanks!

@dhermes
Contributor

dhermes commented Apr 26, 2016

@tseaver Can you weigh in here?

lukesneeringer added the priority: p2 label Apr 19, 2017
@lukesneeringer
Contributor

Hello,
One of the challenges of maintaining a large open source project is that sometimes, you can bite off more than you can chew. As the lead maintainer of google-cloud-python, I can definitely say that I have let the issues here pile up.

As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates.

My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request.

Thank you!
