Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support arbitrary buffer-like objects #26

Closed
pitrou opened this issue Jun 9, 2017 · 5 comments
Closed

Support arbitrary buffer-like objects #26

pitrou opened this issue Jun 9, 2017 · 5 comments

Comments

@pitrou
Copy link

pitrou commented Jun 9, 2017

The methods in this package (such as ZStdCompressor.compress() and ZStdDecompressor.decompress()) typically don't support bytearray or memoryview inputs. Though for some strange reason they seem to support Numpy arrays:

>>> c = zstd.ZstdCompressor(level=1, write_content_size=True)
>>> b = c.compress(b'12345678')
>>> b == c.compress(np.frombuffer(b'12345678', dtype='int8'))
True
>>> b == c.compress(memoryview(b'12345678'))
Traceback (most recent call last):
  File "<ipython-input-40-9ae8177a1d5c>", line 1, in <module>
    b == c.compress(memoryview(b'12345678'))
TypeError: compress() argument 1 must be read-only bytes-like object, not memoryview

>>> b == c.compress(bytearray(b'12345678'))
Traceback (most recent call last):
  File "<ipython-input-41-ddc68fad54ac>", line 1, in <module>
    b == c.compress(bytearray(b'12345678'))
TypeError: compress() argument 1 must be read-only bytes-like object, not bytearray

Note that other compression libraries typically accept arbitrary buffer-like objects (which allows passing them without copying to bytes before):

>>> zlib.compress(memoryview(b'12345678'))
b'x\x9c3426153\xb7\x00\x00\x07@\x01\xa5'
>>> bz2.compress(memoryview(b'12345678'))
b'BZh91AY&SY\xb6\x1c=\x04\x00\x00\x00\x08\x00?\xc0 \x001\x0c\x08\x19\x1ai\x935s\xf9E\xdc\x91N\x14$-\x87\x0fA\x00'
>>> lzma.compress(memoryview(b'12345678'))
b'\xfd7zXZ\x00\x00\x04\xe6\xd6\xb4F\x02\x00!\x01\x16\x00\x00\x00t/\xe5\xa3\x01\x00\x0712345678\x00\tx\xac+H\x80\x8b\\\x00\x01 \x08\xbb\x19\xd9\xbb\x1f\xb6\xf3}\x01\x00\x00\x00\x00\x04YZ'
>>> blosc.compress(memoryview(b'12345678'))
b'\x02\x01\x13\x08\x08\x00\x00\x00\x08\x00\x00\x00\x18\x00\x00\x0012345678'
@indygreg
Copy link
Owner

indygreg commented Jun 9, 2017

Eh? I assume you are talking about Python 3, because zlib.compress() doesn't accept memoryview in Python 2 (sadly).

Anyway, it is my intent for this to work. I /thought/ that using the y# format unit accepted anything conforming to the buffer protocol, as https://docs.python.org/3/c-api/arg.html#strings-and-buffers says it will accept any "bytes-like object" and anything implementing the buffer protocol is a "bytes-like object." I see that CPython uses y* for these functions. While the docs say y* is recommended, they don't say there are any feature differences. But I guess there must be. Boo.

Anyway, y* is the superior mechanism. So the code should be changed to that. This will require changing Python 2 to s* so we don't have multiple code paths for a Py_buffer and char * source.

@pitrou
Copy link
Author

pitrou commented Jun 10, 2017

Yes, I would recommend to use y* on Python 3 and s* on Python 2. Should I take a look or do you think it'll be much quicker for you to do? I'm not familiar with this source base.

(for the record, y# tries the buffer API but refuses any object that has a buffer-release hook, because the buffer is released before PyArg_ParseTuple returns -- in practice it probably means it works only with bytes objects).

@indygreg
Copy link
Owner

FWIW, I filed a docs bug at http://bugs.python.org/issue30625 because this limitation of y# was not obvious from the docs. In fact, the docs seem contradictory in saying that a "read only bytes-like object" (which they define memoryview(bytes) to be) should work for y#!

@indygreg
Copy link
Owner

I'm actively working on this.

indygreg added a commit that referenced this issue Jun 11, 2017
Without this, we may not accept all objects implementing the
buffer protocol.
indygreg added a commit that referenced this issue Jun 11, 2017
The y#/s# format unit don't appear to accept all types
exposing a read-only bytes-like object. The Python docs
do say that y*/s* is the recommended way to accept binary
data. So we switch to that.
indygreg added a commit that referenced this issue Jun 11, 2017
This is slightly more involved than other changes because of
existing checking for unicode types. Python 2 happily allows
unicode to be cast to a Py_buffer. Python 3 doesn't. So we
need to update the test accordingly.

The CFFI code has also been updated to use ffi.from_buffer()
so it is consistent with the C code.

And since this is the last function using y#/s# for argument
parsing, update the NEWS.
@indygreg
Copy link
Owner

And I pushed those commits to the master branch. Calling this one done.

Thanks for the bug report and the background on y*/y#.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants