
fix: add correct support for compressing file-like objects #174

Merged · 9 commits · Oct 4, 2023

Conversation

@pyrooka (Member) commented Sep 25, 2023

This PR contains the fix for a bug in the request preparation step.
Previously, file-like objects couldn't be compressed and the code
threw an exception if the user tried to do so. The solution in this
PR is a helper class called GzipStream, which acts as an intermediary
between the reader and the writer and compresses the data on the fly.
More details can be found in the comments.

# Handle the compression for file-like objects.
# We need to use a custom stream/pipe method to prevent
# reading the whole file into the memory.
request['data'] = GzipStream(raw) if isinstance(raw, io.IOBase) else gzip.compress(raw)
@pyrooka (Member, Author) commented:
I used this if-else format to avoid a too-many-branches linter error, then renamed the variable to raw so the line wouldn't exceed the max row length... :)

@padamstx (Member) commented:

I think it would add some value to extend the GzipStream() constructor so that you could also pass in a non-IOBase-type value for raw and it would do the right thing (in that case, perhaps just wrap raw in an appropriate IOBase-type class that can stream its bytes; in other words, beef up GzipStream a bit so it can handle both file-like objects and static strings/buffers).
This way, the details of how a particular requestBody is gzip-encoded are hidden inside GzipStream rather than exposed up in this BaseService method, AND we also get stream-based gzip compression, which might help with large JSON requestBodies.
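As a rough illustration of that suggestion, here is a minimal sketch (the class name and structure are illustrative, not the SDK's actual implementation) that accepts either a file-like object or a static str/bytes value and compresses on the fly, pointing gzip.GzipFile's fileobj hook back at the wrapper to collect the compressed output:

```python
import gzip
import io


class GzipStreamSketch(io.RawIOBase):
    """Hypothetical wrapper: accepts a file-like object or a static
    str/bytes value and yields gzip-compressed bytes on read()."""

    def __init__(self, source):
        # Normalize static values into an in-memory stream so the
        # rest of the class only ever deals with file-like objects.
        if isinstance(source, str):
            source = source.encode()
        if isinstance(source, bytes):
            source = io.BytesIO(source)
        self.uncompressed = source
        self.buffer = b''
        # GzipFile delivers its compressed output to this object's
        # write() method via fileobj=self.
        self.compressor = gzip.GzipFile(fileobj=self, mode='wb')

    def write(self, data):
        # Called by the compressor, not by users: collect its output.
        self.buffer += data
        return len(data)

    def read(self, size=-1):
        # Top up the buffer until we have enough compressed bytes,
        # or the source is exhausted.
        while size < 0 or len(self.buffer) < size:
            chunk = self.uncompressed.read(io.DEFAULT_BUFFER_SIZE)
            if not chunk:
                self.compressor.close()  # writes the trailer; idempotent
                break
            if isinstance(chunk, str):
                # Text-mode streams (e.g. TextIOWrapper) yield str.
                chunk = chunk.encode()
            self.compressor.write(chunk)
        if size < 0:
            out, self.buffer = self.buffer, b''
        else:
            out, self.buffer = self.buffer[:size], self.buffer[size:]
        return out
```

A caller such as requests can then stream this object for a chunked upload without the whole uncompressed file ever being held in memory.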

@ricellis (Member) commented Sep 26, 2023

FYI, when I was originally investigating this in our SDK I noticed this in the gzip.compress() docs:

Changed in version 3.11: Speed is improved by compressing all data at once instead of in a streamed fashion.

So there might not be much value in trying to be fully stream-based.

It also says:

Calls with mtime set to 0 are delegated to zlib.compress() for better speed.

but it isn't immediately obvious to me whether that delegation makes any difference to the streaming behaviour. In any case, mtime defaults to the current time, so the delegation won't happen at present anyway (nb. the mtime parameter is only available from 3.8).

@pyrooka (Member, Author) commented:

So there might not be much value in trying to be fully stream-based.

In my opinion it would make the code cleaner (to have all compression-related steps in the helper class), and although the performance is improved in 3.11, the memory usage could still be a problem when the file is large, right? I think that's Phil's main point.

mtime only available from 3.8

We still support 3.7, so that's not an option - at least for the gzip.compress function. From my understanding, the mtime is only used during decompression, so - I think - the streaming shouldn't affect the result.
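As an aside, the mtime effect is easy to demonstrate with the stdlib alone (this assumes Python 3.8+, where gzip.compress gained the mtime keyword):

```python
import gzip

data = b'{"docs": []}' * 100

# The default mtime embeds the current time in the gzip header, so
# repeated calls can produce different bytes; mtime=0 makes the
# output deterministic.
deterministic = gzip.compress(data, mtime=0)
assert deterministic == gzip.compress(data, mtime=0)

# mtime lives only in the header; decompression returns the same
# payload regardless of its value.
assert gzip.decompress(deterministic) == data
```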

@ricellis (Member) commented:

the memory usage could still be a problem when the file is large, right? I think that's Phil's main point

My understanding is that in 3.11 the entire contents will be in memory in gzip anyway. So my point was that it might not be worth making changes only for that reason, but as you say there are other benefits.

@@ -647,6 +647,34 @@ def test_gzip_compression():
assert prepped['headers'].get('content-encoding') == 'gzip'


def test_gzip_compression_file_input():
@pyrooka (Member, Author) commented:

Had to put the new test cases into a separate function to avoid the too-many-branches linter error.


Signed-off-by: Norbert Biczo <[email protected]>
@pyrooka (Member, Author) commented Sep 26, 2023

I have updated the PR based on @padamstx's suggestion. Let me know what you think; I am open to changing it if you have major concerns or better ideas!

@padamstx (Member) left a review:

LGTM
If possible, it probably wouldn't hurt to try to test these changes with a much larger payload, perhaps by using some examples or integration tests from the platform-services python SDK or perhaps cloudant.
@ricellis do you by chance have a scenario that might be a good test prior to merging?

@ricellis (Member) commented:

do you by chance have a scenario that might be a good test prior to merging?

I'll try and run it through our suites tomorrow.

@ricellis (Member) commented:

No good I'm afraid: 167 failures. I didn't check each one individually, but they seem to be of two types.
Firstly,

    response = self.send(request, **kwargs)
../../../pythonvenv/lib64/python3.11/site-packages/ibm_cloud_sdk_core/base_service.py:313: in send
    response = self.http_client.request(**request, cookies=self.jar, **kwargs)
../../../pythonvenv/lib64/python3.11/site-packages/requests/sessions.py:589: in request
    resp = self.send(prep, **send_kwargs)
../../../pythonvenv/lib64/python3.11/site-packages/requests/sessions.py:703: in send
    r = adapter.send(request, **kwargs)
../../../pythonvenv/lib64/python3.11/site-packages/requests/adapters.py:486: in send
    resp = conn.urlopen(
../../../pythonvenv/lib64/python3.11/site-packages/urllib3/connectionpool.py:714: in urlopen
    httplib_response = self._make_request(
../../../pythonvenv/lib64/python3.11/site-packages/urllib3/connectionpool.py:413: in _make_request
    conn.request_chunked(method, url, **httplib_request_kw)
../../../pythonvenv/lib64/python3.11/site-packages/urllib3/connection.py:270: in request_chunked
    for chunk in body:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ibm_cloud_sdk_core.utils.GzipStream object at 0x7f4cf581eb60>, size = 1

    def read(self, size: int = -1):
        """Compresses and returns the requested size of data.

        Args:
            size: how many bytes to return. -1 to read and compress the whole file
        """
        if (size < 0) or (len(self.buffer) < size):
            for raw in self.uncompressed:
                # We need to encode text like streams (e.g. TextIOWrapper) to bytes.
                if isinstance(raw, str):
                    raw = raw.encode()

                self.compressor.write(raw)

                # Stop compressing if we reached the max allowed size.
                if 0 < size < len(self.buffer):
                    self.compressor.flush()
                    break
            else:
                self.compressor.close()

            if size < 0:
                # Return all data from the buffer.
                compressed = self.buffer
                self.buffer = b''
        else:
            # If we already have enough data in our buffer
            # return the desired chunk of bytes
            compressed = self.buffer[:size]
            # then remove them from the buffer.
            self.buffer = self.buffer[size:]

>       return compressed
E       UnboundLocalError: cannot access local variable 'compressed' where it is not associated with a value

and secondly:

../../../pythonvenv/lib64/python3.11/site-packages/responses/__init__.py:229: in wrapper
    return func(*args, **kwargs)
test/unit/test_cloudant_v1.py:7932: in test_post_explain_all_params
    responses.calls[0].request.body = gzip.decompress(responses.calls[0].request.body)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

data = <ibm_cloud_sdk_core.utils.GzipStream object at 0x7f4cf2b7c9d0>

    def decompress(data):
        """Decompress a gzip compressed string in one shot.
        Return the decompressed string.
        """
        decompressed_members = []
        while True:
>           fp = io.BytesIO(data)
E           TypeError: a bytes-like object is required, not 'GzipStream'

/usr/lib64/python3.11/gzip.py:600: TypeError

@ricellis (Member) commented Sep 27, 2023

FWIW the first is also what I saw testing locally when I tried to validate this change solved the issue reported in IBM/cloudant-python-sdk#554.
At first glance, the second might be related to the generated test code's expectations about the request body.
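Stripped of the gzip machinery, the first failure reduces to a plain control-flow bug in the quoted read(): compressed is only assigned when size < 0 or when enough data was already buffered, so a bounded read that has to compress more input falls through unassigned. An illustrative reduction (not the SDK's actual patch):

```python
def broken_read(buffer: bytes, size: int) -> bytes:
    # Mirrors the structure of the quoted method.
    if size < 0 or len(buffer) < size:
        # ...compress more input into buffer here...
        if size < 0:
            compressed = buffer
    else:
        compressed = buffer[:size]
    # A bounded read (size >= 0) with a short buffer never assigns
    # 'compressed' -> UnboundLocalError, exactly as in the CI log.
    return compressed


def fixed_read(buffer: bytes, size: int) -> tuple[bytes, bytes]:
    # Fix: do the slice-and-return on every path, after any topping
    # up of the buffer, so bounded and unbounded reads both work.
    # Returns (chunk, remaining buffer).
    if size < 0:
        return buffer, b''
    return buffer[:size], buffer[size:]
```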

@pyrooka (Member, Author) commented Sep 27, 2023

Thanks a lot @ricellis, I will take a look! At first glance these issues should be reproducible in unit tests, but we'll see.

@pyrooka (Member, Author) commented Oct 4, 2023

@ricellis Could you do another test run when you have some time? I've fixed the issues I found and tweaked the generated unit tests to handle GzipStream bodies, so please use the generator built from the main branch.
(I've tested the changes with all our APIs and found no issues, hopefully you will get the same result.)
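The generated-test tweak amounts to reading the stream to bytes before decompressing. A stand-in sketch (the BytesIO here stands in for a GzipStream request body; names are illustrative):

```python
import gzip
import io

# Stand-in for a request body that is now a file-like object
# (as GzipStream is after this PR) instead of a bytes value.
body = io.BytesIO(gzip.compress(b'{"selector": {}}'))

# gzip.decompress(body) would raise "TypeError: a bytes-like object
# is required"; the generated test must read the stream first.
assert gzip.decompress(body.read()) == b'{"selector": {}}'
```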

@ricellis (Member) commented Oct 4, 2023

Re-tested with new generated test fixes, all passing now. Thanks!

@pyrooka (Member, Author) commented Oct 4, 2023

@ricellis Thanks for the testing (and finding the bugs)!

@dpopp07 (Member) left a review:

LGTM!

@pyrooka pyrooka merged commit 2f91105 into main Oct 4, 2023
4 checks passed
@pyrooka pyrooka deleted the nb/fix-file-compression branch October 4, 2023 19:00
ibm-devx-sdk pushed a commit that referenced this pull request Oct 4, 2023
## [3.17.1](v3.17.0...v3.17.1) (2023-10-04)

### Bug Fixes

* add correct support for compressing file-like objects (#174) (2f91105)
@ibm-devx-sdk commented:

🎉 This PR is included in version 3.17.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
