
Setting Content-Encoding header for Cloud Storage Uploads with upload_from_file #3099

Closed
brianjpetersen opened this issue Mar 5, 2017 · 15 comments

@brianjpetersen

I'm on Python 3.5.2 with google.cloud.storage.__version__ = '0.23.0'.

I'm attempting to upload objects to a bucket so that they support decompressive gzip transcoding. I haven't been able to figure out how to accomplish this after searching the documentation and the code, and after reviewing existing issues. My most promising attempt was setting the blob.content_encoding property, which seems like it should work but doesn't. See below for an example.

Does/can the API support this?

import google.cloud.storage
import gzip
import os
import requests
import datetime
import io


BUCKET_NAME = ...
GOOGLE_APPLICATION_CREDENTIALS_PATH = ...


os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = GOOGLE_APPLICATION_CREDENTIALS_PATH
client = google.cloud.storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

blob = bucket.blob('plaintext')
blob.content_type = 'text/plain'
with io.BytesIO() as f:
    f.write(b' '.join(100 * (b'plaintext',)))
    blob.upload_from_file(f, size=f.tell(), rewind=True)
url = blob.generate_signed_url(datetime.datetime.max)
"""
This prints:

b'plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext plaintext' None
"""
response = requests.get(url)
print(response.content, response.headers.get('Content-Encoding', None))


blob = bucket.blob('compressed')
blob.content_type = 'text/plain'
blob.content_encoding = 'gzip'
with io.BytesIO() as f:
    with gzip.GzipFile(fileobj=f, mode='wb', compresslevel=9) as fgz:
        fgz.write(b' '.join(100*(b'compressed', )))
    blob.upload_from_file(f, size=f.tell(), rewind=True)
url = blob.generate_signed_url(datetime.datetime.max)
"""
This prints:

b'\x1f\x8b\x08\x00\xac}\xbbX\x02\xffK\xce\xcf-(J-.NMQH\x1ee\x8e2G\x99\xa3L2\x99\x00/\x80\x15\xa7K\x04\x00\x00' None
"""
response = requests.get(url)
print(response.content, response.headers.get('Content-Encoding', None))


"""
If I manually set the content-encoding header through the metadata option on this object in the console, I get the appropriate response:

b'compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed compressed' gzip
"""
response = requests.get(url)
print(response.content, response.headers.get('Content-Encoding', None))
@daspecster added the api: storage label on Mar 6, 2017
@lukesneeringer (Contributor)

Hi @brianjpetersen,
Thanks for raising this, and sorry it took a couple of days for us to respond.

Let me summarize to make sure I understand the problem: there seems to be no obvious way to set the Content-Encoding metadata to gzip and have it stick in storage. Is that correct?

@lukesneeringer added the priority: p2 and type: bug labels on Mar 9, 2017
@brianjpetersen (Author)

That's right. Setting the content_encoding attribute to 'gzip' on the object before uploading doesn't actually result in proper transcoding on subsequent GETs to Cloud Storage. Furthermore, the metadata on the uploaded object (viewed in the Cloud Storage web console) doesn't reflect that the Content-Encoding was set to 'gzip' (see below).

[Screenshot: object metadata in the Cloud Storage web console, with no Content-Encoding set]
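
The same thing can be checked from Python without the console; here's a quick sketch using the bucket handle from my original example (bucket.get_blob re-fetches the stored metadata):

blob = bucket.get_blob('compressed')  # re-fetch the object's stored metadata
print(blob.content_type)              # 'text/plain'
print(blob.content_encoding)          # None, despite being set before upload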

@lukesneeringer (Contributor)

Thanks. We will look into it.

@pdknsk commented Mar 18, 2017

This is a duplicate of several bugs, summed up in this comment, which also has a workaround.
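
The gist of that workaround: upload first, then set the property with a second, separate request. A rough sketch (assuming blob.patch() sends the metadata update; note this isn't atomic, so the object is briefly served without the gzip Content-Encoding between the two calls):

blob = bucket.blob('file.txt')
blob.upload_from_string(gzipped_bytes)  # gzip-compressed payload, prepared up front
blob.content_encoding = 'gzip'
blob.patch()  # second request: PATCH the stored metadata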

@brianjpetersen (Author)

Many thanks @pdknsk.

This workaround fixes the problem, although as noted in #754, the property update isn't atomic, which has all sorts of nasty implications. As another commenter noted in a linked thread, this unfortunately prevents me from using gcloud-python (and Google Cloud Platform) at this time.

@pdknsk commented Mar 19, 2017

I remembered a patch I had once used, which I've updated now. An alternative is to use the API directly, which is more complex.
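
By "use the API directly" I mean a multipart insert against the JSON API, with contentEncoding carried in the metadata part. A hypothetical sketch with requests (you supply your own access token and bucket name):

import gzip
import json
import requests

TOKEN = '...'         # OAuth2 access token with a storage scope (hypothetical)
BUCKET = 'my-bucket'  # hypothetical bucket name

text_gzip = gzip.compress(100 * b'text ')  # Python 3: the gzipped payload

metadata = {
    'name': 'file.txt',
    'contentType': 'text/plain',
    'contentEncoding': 'gzip',
}
boundary = b'====gcs-multipart===='
body = b'\r\n'.join([
    b'--' + boundary,
    b'Content-Type: application/json; charset=UTF-8',
    b'',
    json.dumps(metadata).encode('utf-8'),
    b'--' + boundary,
    b'Content-Type: text/plain',
    b'',
    text_gzip,
    b'--' + boundary + b'--',
])
response = requests.post(
    'https://www.googleapis.com/upload/storage/v1/b/%s/o' % BUCKET,
    params={'uploadType': 'multipart'},
    headers={'Authorization': 'Bearer %s' % TOKEN,
             'Content-Type': 'multipart/related; boundary=' + boundary.decode()},
    data=body,
)
response.raise_for_status()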

@brianjpetersen (Author)

This unfortunately didn't seem to do the trick for the content_encoding property.

@pdknsk commented Mar 19, 2017

Works for me.

>>> compressobj = zlib.compressobj(9, zlib.DEFLATED, 31) # 31 = gzip
>>> text_gzip = compressobj.compress('text') + compressobj.flush()
>>> len(text_gzip)
24
>>> text = bucket.blob('file.txt')
>>> text.cache_control = 'no-cache'
>>> text.content_encoding = 'gzip'
>>> text.upload_from_string(text_gzip)
>>> text.reload()
>>> text.size
24
>>> req = requests.get(text.public_url)
>>> req.content
'text'
>>> req.headers.get('Content-Encoding')
'gzip'

In the browser too.

@brianjpetersen (Author)

Apologies @pdknsk, pip and I weren't getting along last night. This does indeed address my need. You've been super helpful - thanks.

@brianjpetersen (Author) commented Mar 19, 2017

Although this contrived example works, I'm now getting a gzip-decoding error from requests with larger payloads. Using your example (slightly modified for Python 3):

>>> compressobj = zlib.compressobj(9, zlib.DEFLATED, 31) # 31 = gzip
>>> text_gzip = compressobj.compress(100*b'text') + compressobj.flush()
>>> text = bucket.blob('file.txt')
>>> text.cache_control = 'no-cache'
>>> text.content_encoding = 'gzip'
>>> text.upload_from_string(text_gzip)
>>> req = requests.get(text.public_url)

Traceback (most recent call last):
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/packages/urllib3/response.py", line 192, in _decode
    data = self._decoder.decompress(data)
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/packages/urllib3/response.py", line 58, in decompress
    return self._obj.decompress(data)
zlib.error: Error -3 while decompressing data: invalid distance too far back

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/models.py", line 664, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/packages/urllib3/response.py", line 349, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/packages/urllib3/response.py", line 503, in read_chunked
    flush_decoder=False)
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/packages/urllib3/response.py", line 197, in _decode
    "failed to decode it." % content_encoding, e)
requests.packages.urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: invalid distance too far back',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 47, in <module>
    response = requests.get(url)
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/api.py", line 71, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/sessions.py", line 617, in send
    r.content
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/models.py", line 741, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/Users/brianjpetersen/Anaconda/python3/anaconda/lib/python3.5/site-packages/requests/models.py", line 669, in generate
    raise ContentDecodingError(e)
requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: invalid distance too far back',))

Is this possibly related to #1724?

@daspecster (Contributor)

@brianjpetersen Googling that error for requests got me to this SO question, which led me to the following issue.

See: https://bugs.python.org/issue27164

It sounds like there's an issue with Python 3.5.2. If you upgrade, do you still have the same issue?
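
If upgrading isn't an option right away, one way to sidestep the client-side decode entirely is to not advertise gzip support, relying on GCS's decompressive transcoding to serve the payload already decompressed. A sketch, reusing the text blob from the snippets above:

# The buggy local zlib never runs: GCS transcodes server-side instead.
req = requests.get(text.public_url, headers={'Accept-Encoding': 'identity'})
print(req.content)                          # the plain, decompressed payload
print(req.headers.get('Content-Encoding'))  # no gzip encoding on the wire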

@brianjpetersen (Author)

That seems to be it. It's working on my 2.7 binary. Thanks.

@daspecster (Contributor)

OK great! I'm going to close this then.

@danielguardicore

Using Python 2.7 and the latest (as of this writing) google-cloud module, this problem still occurs when using upload_from_string.

@allardhoeve

To make it even stranger, I get this intermittently on Python 3.7.1 and latest google-cloud-python.

