Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup old compression workarounds #6259

Merged
merged 8 commits into from
May 4, 2022
Merged

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented May 3, 2022

Various workarounds have accumulated in distributed.protocol.compression to handle issues with prior versions of different libraries. However these libraries have since fixed these various issues and made several releases since. Given this, clean out some of these old workarounds. This should simplify the code here (and avoid some no longer needed copies). Also make sure to clarify version expectations and provide context.


  • Tests added / passed
  • Passes pre-commit run --all-files

These have been around since python-lz4's 0.23.1 release, which came out
in February 2018. Given these have been around for a while, it seems
reasonable to just require them and drop the workarounds.
In addition to including the `lz4.block` API, the python-lz4 0.23.1
release included proper support for the Python Buffer Protocol. So if we
are going to require the new API, it seems reasonable to drop these
workarounds as well.
Raise an `ImportError` if python-lz4 is too old. Since this is in a
`suppress` block, the `ImportError` will be captured, but the effect
will be python-lz4 won't be used in this case.
As of python-snappy 0.5.3 (released July 2018), any Python Buffer
Protocol object (like `memoryview`s) is supported. Given how long this
release has been out, it seems reasonable to drop the prior workaround.
Should avoid unneeded copying for other objects (like `memoryview`s).
Copy link
Member Author

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided a bit more context about when things were fixed and why the workarounds can be dropped below

Comment on lines -42 to -46
def _fixed_snappy_decompress(data):
# snappy.decompress() doesn't accept memoryviews
if isinstance(data, (memoryview, bytearray)):
data = bytes(data)
return snappy.decompress(data)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support for the Python Buffer Protocol was added in 0.5.3 with PR ( intake/python-snappy#72 ). This version was released Jul 2018 so should be old enough to rely upon. With Python Buffer Protocol support, there is no longer a need to copy too bytes here and we can just hand data to snappy.decompress.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines +50 to +52
try:
snappy.compress(memoryview(b""))
except TypeError:
raise ImportError("Need snappy >= 0.5.3")
Copy link
Member Author

@jakirkham jakirkham May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recently snappy.__version__ was added with PR ( intake/python-snappy#119 ). Though it is not currently in a release and hasn't been around long enough to rely upon.

So just do a simple test that will fail without python-snappy version 0.5.3 or later. If python-snappy is not new enough (unlikely given how long this release has been out), simply skip using python-snappy. Test results on 0.5.2 & 0.5.3 below:

python-snappy 0.5.2:

In [1]: import snappy

In [2]: snappy.compress(memoryview(b""))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-077d330e1c70> in <module>
----> 1 snappy.compress(memoryview(b""))

~/miniconda/envs/snap52/lib/python3.6/site-packages/snappy/snappy.py in compress(data, encoding)
     82         data = data.encode(encoding)
     83 
---> 84     return _compress(data)
     85 
     86 def uncompress(data, decoding=None):

TypeError: argument 1 must be read-only bytes-like object, not memoryview

python-snappy 0.5.3:

In [1]: import snappy

In [2]: snappy.compress(memoryview(b""))
Out[2]: b'\x00'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

(do we want to use cramjam for all of the de/compressors??)

Comment on lines -57 to -66
try:
# try using the new lz4 API
import lz4.block

lz4_compress = lz4.block.compress
lz4_decompress = lz4.block.decompress
except ImportError:
# fall back to old one
lz4_compress = lz4.LZ4_compress
lz4_decompress = lz4.LZ4_uncompress
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lz4.block APIs were added in commit ( python-lz4/python-lz4@bb2c7ea ) and have been around since 0.23.1, which was released Feb 2018. So should be old enough to rely upon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines -68 to -87
# helper to bypass missing memoryview support in current lz4
# (fixed in later versions)

def _fixed_lz4_compress(data):
try:
return lz4_compress(data)
except TypeError:
if isinstance(data, (memoryview, bytearray)):
return lz4_compress(bytes(data))
else:
raise

def _fixed_lz4_decompress(data):
try:
return lz4_decompress(data)
except (ValueError, TypeError):
if isinstance(data, (memoryview, bytearray)):
return lz4_decompress(bytes(data))
else:
raise
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also Python Buffer Protocol support was added in PR ( python-lz4/python-lz4#38 ), which was also included in 0.23.1. So drop these workarounds as these objects should work without copying to bytes first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines +64 to +65
# Required to use `lz4.block` APIs and Python Buffer Protocol support.
if parse_version(lz4.__version__) < parse_version("0.23.1"):
raise ImportError("Need lz4 >= 0.23.1")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is a bit more explicit, but this arguably would happen by just trying the import line below.

Ideally we would test `__version__` however not in a release atm (though
it was added recently). In any even it can't reliably be checked for
current versions. Instead just test some fast code that reliably fails
for versions prior to `0.5.3`. If the code fails, simply raise an
`ImportError` noting this requirement. As this is in a `suppress` block,
the `ImportError` will be caught. Though the net effect will be
disabling python-snappy use if it is too old.
These should already be satisfied when installed. Though this makes it
clearer that these are the required versions (particularly for any users
poking around trying to understand things).
In `zstandard`'s `0.9.0` release, support for the Python Buffer Protocol
was added. As we are supporting this functionality with other
compressors, go ahead and support it here too.
Comment on lines +80 to +82
# Required for Python Buffer Protocol support.
if parse_version(zstandard.__version__) < parse_version("0.9.0"):
raise ImportError("Need zstandard >= 0.9.0")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the other compressors already support the Python Buffer Protocol, make sure we support it here too. This was added in a series of commits referenced in issue ( indygreg/python-zstandard#26 ) that were included in 0.9.0, which came out Apr 2018. Thus has been around as long as the other releases being required here. So seems like a reasonable minimum.

@github-actions
Copy link
Contributor

github-actions bot commented May 3, 2022

Unit Test Results

       16 files  ±0         16 suites  ±0   7h 43m 55s ⏱️ - 21m 57s
  2 745 tests ±0    2 660 ✔️  - 1       81 💤 +1  4 ±0 
21 861 runs  ±0  20 808 ✔️  - 3  1 046 💤 ±0  7 +3 

For more details on these failures, see this check.

Results for commit 3aa8e2f. ± Comparison against base commit c11c8ee.

@jakirkham
Copy link
Member Author

@dask/maintenance , thoughts? 🙂

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good

Repeating comment: any thoughts on using cramjam instead of all these specific packages?

Comment on lines -42 to -46
def _fixed_snappy_decompress(data):
# snappy.decompress() doesn't accept memoryviews
if isinstance(data, (memoryview, bytearray)):
data = bytes(data)
return snappy.decompress(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines +50 to +52
try:
snappy.compress(memoryview(b""))
except TypeError:
raise ImportError("Need snappy >= 0.5.3")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

(do we want to use cramjam for all of the de/compressors??)

Comment on lines -57 to -66
try:
# try using the new lz4 API
import lz4.block

lz4_compress = lz4.block.compress
lz4_decompress = lz4.block.decompress
except ImportError:
# fall back to old one
lz4_compress = lz4.LZ4_compress
lz4_decompress = lz4.LZ4_uncompress
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines -68 to -87
# helper to bypass missing memoryview support in current lz4
# (fixed in later versions)

def _fixed_lz4_compress(data):
try:
return lz4_compress(data)
except TypeError:
if isinstance(data, (memoryview, bytearray)):
return lz4_compress(bytes(data))
else:
raise

def _fixed_lz4_decompress(data):
try:
return lz4_decompress(data)
except (ValueError, TypeError):
if isinstance(data, (memoryview, bytearray)):
return lz4_decompress(bytes(data))
else:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@jakirkham
Copy link
Member Author

jakirkham commented May 3, 2022

Thanks Martin! 😄

Repeating comment: any thoughts on using cramjam instead of all these specific packages?

No strong feelings. Mostly was looking to simplify the existing code. Maybe we can raise this idea as a new issue and see if others agree? If so, guessing it is simple to swap in.

Edit: Raised issue ( #6265 )

@jakirkham
Copy link
Member Author

Planning on merging EOD tomorrow if no comments

@jakirkham jakirkham merged commit f779854 into dask:main May 4, 2022
@jakirkham jakirkham deleted the cleanup_compress branch May 4, 2022 23:42
@jakirkham
Copy link
Member Author

With these changes in place, cleaning up some unneeded copies in the compression code paths with PR ( #6273 ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants