
fix: add correct support for compressing file-like objects #174

Merged · 9 commits · Oct 4, 2023

Conversation

@pyrooka (Member) commented Sep 25, 2023

This PR contains the fix for a bug in the request preparation step.
Previously, file-like objects couldn't be compressed and the code
threw an exception if the user tried to do so. The solution in this
PR is a helper class called GzipStream, which acts as an intermediary
between the reader and the writer and compresses the data on the fly.
More details can be found in the comments.

# Handle the compression for file-like objects.
# We need to use a custom stream/pipe method to prevent
# reading the whole file into the memory.
request['data'] = GzipStream(raw) if isinstance(raw, io.IOBase) else gzip.compress(raw)
@pyrooka (Member, Author) commented:
I used this if-else format to avoid a too-many-branches linter error, then renamed the variable to raw so the line wouldn't exceed the max row length... :)

@padamstx (Member) commented:

I think it would add some value to extend the GzipStream() constructor so that you could also pass in a non-IOBase-type value for raw and it would do the right thing (in that case, perhaps just wrap raw in an appropriate IOBase-type class that can stream its bytes; in other words, beef up GzipStream a bit so it can handle both file-like objects and static strings/buffers).
This way, the details of how a particular requestBody is gzip-encoded are hidden inside GzipStream rather than exposed up in this BaseService method, AND we also get stream-based gzip compression, which might help with large JSON requestBodies.
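As a rough illustration of that suggestion, here is a minimal sketch (the class name and structure are illustrative, not the SDK's actual implementation) that accepts either a file-like object or a static str/bytes value and compresses on the fly, pointing gzip.GzipFile's fileobj hook back at the wrapper to collect the compressed output:

```python
import gzip
import io


class GzipStreamSketch(io.RawIOBase):
    """Hypothetical wrapper: accepts a file-like object or a static
    str/bytes value and yields gzip-compressed bytes on read()."""

    def __init__(self, source):
        # Normalize static values into an in-memory stream so the
        # rest of the class only ever deals with file-like objects.
        if isinstance(source, str):
            source = source.encode()
        if isinstance(source, bytes):
            source = io.BytesIO(source)
        self.uncompressed = source
        self.buffer = b''
        # GzipFile delivers its compressed output to this object's
        # write() method via fileobj=self.
        self.compressor = gzip.GzipFile(fileobj=self, mode='wb')

    def write(self, data):
        # Called by the compressor, not by users: collect its output.
        self.buffer += data
        return len(data)

    def read(self, size=-1):
        # Top up the buffer until we have enough compressed bytes,
        # or the source is exhausted.
        while size < 0 or len(self.buffer) < size:
            chunk = self.uncompressed.read(io.DEFAULT_BUFFER_SIZE)
            if not chunk:
                self.compressor.close()  # writes the trailer; idempotent
                break
            if isinstance(chunk, str):
                # Text-mode streams (e.g. TextIOWrapper) yield str.
                chunk = chunk.encode()
            self.compressor.write(chunk)
        if size < 0:
            out, self.buffer = self.buffer, b''
        else:
            out, self.buffer = self.buffer[:size], self.buffer[size:]
        return out
```

A caller such as requests can then stream this object for a chunked upload without the whole uncompressed file ever being held in memory.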

@ricellis (Member) commented Sep 26, 2023

FYI, when I was originally investigating this in our SDK I noticed this in the gzip.compress() docs:

Changed in version 3.11: Speed is improved by compressing all data at once instead of in a streamed fashion.

So there might not be much value in trying to be fully stream-based.

It also says:

Calls with mtime set to 0 are delegated to zlib.compress() for better speed.

but it isn't immediately obvious to me whether that delegation makes any difference to the streaming behaviour. In any case, mtime defaults to the current time, so the delegation won't happen at present anyway (nb. the mtime parameter is only available from 3.8).

@pyrooka (Member, Author) commented:

So there might not be much value in trying to be fully stream-based.

In my opinion it would make the code cleaner (to have all compression-related steps in the helper class), and although the performance is improved in 3.11, the memory usage could still be a problem when the file is large, right? I think that's Phil's main point.

mtime only available from 3.8

We still support 3.7, so that's not an option - at least for the gzip.compress function. From my understanding, the mtime is only used during decompression, so - I think - the streaming shouldn't affect the result.
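As an aside, the mtime effect is easy to demonstrate with the stdlib alone (this assumes Python 3.8+, where gzip.compress gained the mtime keyword):

```python
import gzip

data = b'{"docs": []}' * 100

# The default mtime embeds the current time in the gzip header, so
# repeated calls can produce different bytes; mtime=0 makes the
# output deterministic.
deterministic = gzip.compress(data, mtime=0)
assert deterministic == gzip.compress(data, mtime=0)

# mtime lives only in the header; decompression returns the same
# payload regardless of its value.
assert gzip.decompress(deterministic) == data
```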

@ricellis (Member) commented:

the memory usage could still be a problem when the file is large, right? I think that's Phil's main point

My understanding is that in 3.11 the entire contents will be in memory in gzip anyway. So my point was that it might not be worth making changes only for that reason, but as you say there are other benefits.

@@ -647,6 +647,34 @@ def test_gzip_compression():
assert prepped['headers'].get('content-encoding') == 'gzip'


def test_gzip_compression_file_input():
@pyrooka (Member, Author) commented:

Had to put the new test cases into a separate function to avoid the too-many-branches linter error.


Signed-off-by: Norbert Biczo <[email protected]>
@pyrooka (Member, Author) commented Sep 26, 2023

I have updated the PR based on @padamstx's suggestion. Let me know what you think; I am open to changing it if you have major concerns or better ideas!

@padamstx (Member) left a review:

LGTM
If possible, it probably wouldn't hurt to try to test these changes with a much larger payload, perhaps by using some examples or integration tests from the platform-services python SDK or perhaps cloudant.
@ricellis do you by chance have a scenario that might be a good test prior to merging?

@ricellis (Member) commented:

do you by chance have a scenario that might be a good test prior to merging?

I'll try and run it through our suites tomorrow.

@ricellis (Member) commented:

No good I'm afraid: 167 failures. I didn't check each one individually, but they seem to be of two types.
Firstly,

    response = self.send(request, **kwargs)
../../../pythonvenv/lib64/python3.11/site-packages/ibm_cloud_sdk_core/base_service.py:313: in send
    response = self.http_client.request(**request, cookies=self.jar, **kwargs)
../../../pythonvenv/lib64/python3.11/site-packages/requests/sessions.py:589: in request
    resp = self.send(prep, **send_kwargs)
../../../pythonvenv/lib64/python3.11/site-packages/requests/sessions.py:703: in send
    r = adapter.send(request, **kwargs)
../../../pythonvenv/lib64/python3.11/site-packages/requests/adapters.py:486: in send
    resp = conn.urlopen(
../../../pythonvenv/lib64/python3.11/site-packages/urllib3/connectionpool.py:714: in urlopen
    httplib_response = self._make_request(
../../../pythonvenv/lib64/python3.11/site-packages/urllib3/connectionpool.py:413: in _make_request
    conn.request_chunked(method, url, **httplib_request_kw)
../../../pythonvenv/lib64/python3.11/site-packages/urllib3/connection.py:270: in request_chunked
    for chunk in body:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ibm_cloud_sdk_core.utils.GzipStream object at 0x7f4cf581eb60>, size = 1

    def read(self, size: int = -1):
        """Compresses and returns the requested size of data.

        Args:
            size: how many bytes to return. -1 to read and compress the whole file
        """
        if (size < 0) or (len(self.buffer) < size):
            for raw in self.uncompressed:
                # We need to encode text like streams (e.g. TextIOWrapper) to bytes.
                if isinstance(raw, str):
                    raw = raw.encode()

                self.compressor.write(raw)

                # Stop compressing if we reached the max allowed size.
                if 0 < size < len(self.buffer):
                    self.compressor.flush()
                    break
            else:
                self.compressor.close()

            if size < 0:
                # Return all data from the buffer.
                compressed = self.buffer
                self.buffer = b''
        else:
            # If we already have enough data in our buffer
            # return the desired chunk of bytes
            compressed = self.buffer[:size]
            # then remove them from the buffer.
            self.buffer = self.buffer[size:]

>       return compressed
E       UnboundLocalError: cannot access local variable 'compressed' where it is not associated with a value

and secondly:

../../../pythonvenv/lib64/python3.11/site-packages/responses/__init__.py:229: in wrapper
    return func(*args, **kwargs)
test/unit/test_cloudant_v1.py:7932: in test_post_explain_all_params
    responses.calls[0].request.body = gzip.decompress(responses.calls[0].request.body)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

data = <ibm_cloud_sdk_core.utils.GzipStream object at 0x7f4cf2b7c9d0>

    def decompress(data):
        """Decompress a gzip compressed string in one shot.
        Return the decompressed string.
        """
        decompressed_members = []
        while True:
>           fp = io.BytesIO(data)
E           TypeError: a bytes-like object is required, not 'GzipStream'

/usr/lib64/python3.11/gzip.py:600: TypeError

@ricellis (Member) commented Sep 27, 2023

FWIW the first is also what I saw testing locally when I tried to validate this change solved the issue reported in IBM/cloudant-python-sdk#554.
At first glance, the second might be related to the generated test code's expectations about the request body.
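Stripped of the gzip machinery, the first failure reduces to a plain control-flow bug in the quoted read(): compressed is only assigned when size < 0 or when enough data was already buffered, so a bounded read that has to compress more input falls through unassigned. An illustrative reduction (not the SDK's actual patch):

```python
def broken_read(buffer: bytes, size: int) -> bytes:
    # Mirrors the structure of the quoted method.
    if size < 0 or len(buffer) < size:
        # ...compress more input into buffer here...
        if size < 0:
            compressed = buffer
    else:
        compressed = buffer[:size]
    # A bounded read (size >= 0) with a short buffer never assigns
    # 'compressed' -> UnboundLocalError, exactly as in the CI log.
    return compressed


def fixed_read(buffer: bytes, size: int) -> tuple[bytes, bytes]:
    # Fix: do the slice-and-return on every path, after any topping
    # up of the buffer, so bounded and unbounded reads both work.
    # Returns (chunk, remaining buffer).
    if size < 0:
        return buffer, b''
    return buffer[:size], buffer[size:]
```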

@pyrooka (Member, Author) commented Sep 27, 2023

Thanks a lot @ricellis, I will take a look! At first glance these issues should be reproducible in unit tests, but we'll see.

@pyrooka (Member, Author) commented Oct 4, 2023

@ricellis Could you do another test run when you have some time? I've fixed the issues I found and tweaked the generated unit tests to handle GzipStream bodies, so please use the generator built from the main branch.
(I've tested the changes with all our APIs and found no issues, hopefully you will get the same result.)
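The generated-test tweak amounts to reading the stream to bytes before decompressing. A stand-in sketch (the BytesIO here stands in for a GzipStream request body; names are illustrative):

```python
import gzip
import io

# Stand-in for a request body that is now a file-like object
# (as GzipStream is after this PR) instead of a bytes value.
body = io.BytesIO(gzip.compress(b'{"selector": {}}'))

# gzip.decompress(body) would raise "TypeError: a bytes-like object
# is required"; the generated test must read the stream first.
assert gzip.decompress(body.read()) == b'{"selector": {}}'
```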

@ricellis (Member) commented Oct 4, 2023

Re-tested with new generated test fixes, all passing now. Thanks!

@pyrooka (Member, Author) commented Oct 4, 2023

@ricellis Thanks for the testing (and finding the bugs)!

@dpopp07 (Member) left a review:

LGTM!

@pyrooka pyrooka merged commit 2f91105 into main Oct 4, 2023
4 checks passed
@pyrooka pyrooka deleted the nb/fix-file-compression branch October 4, 2023 19:00
ibm-devx-sdk pushed a commit that referenced this pull request Oct 4, 2023
## [3.17.1](v3.17.0...v3.17.1) (2023-10-04)

### Bug Fixes

* add correct support for compressing file-like objects (#174) (2f91105)
@ibm-devx-sdk commented:

🎉 This PR is included in version 3.17.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
