Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can snappy.decompress(buffer) but cannot snappy.decompress(memoryview) #65

Closed
cclauss opened this issue Mar 21, 2018 · 7 comments
Closed

Comments

@cclauss
Copy link
Contributor

cclauss commented Mar 21, 2018

Is there a way to make snappy.decompress() work directly on a memoryview without casting it to another datatype? Casting makes a copy of the data which is undesirable in our case.

compressed = snappy.compress('The quick brown fox jumped over the lazy dog.')
print(snappy.decompress(buffer(compressed)))      # works on Python 2.7
print(snappy.decompress(memoryview(compressed)))  # fails on Python 2.7 and Python 3
# Python 2: TypeError: argument 1 must be string or read-only buffer, not memoryview
# Python 3: TypeError: argument 1 must be read-only bytes-like object, not memoryview

https://github.com/cclauss/snappy-hack/blob/master/snappy_hack.py

@cclauss cclauss changed the title Can snappy.decompress(buffer)but cannot snappy.decompress(memoryview) Can snappy.decompress(buffer) but cannot snappy.decompress(memoryview) Mar 21, 2018
@martindurant
Copy link
Member

@andrix , given that this code was written a good while ago, is there now scope for supporting memoryviews directly?

@andrix
Copy link
Collaborator

andrix commented Mar 21, 2018

hey @martindurant - I would need to research, it has been a long time. Currently, it expects a const char: https://github.com/andrix/python-snappy/blob/master/snappy/snappymodule.cc#L155.

I don't have enough time atm to research on how memoryviews are implemented, but it shouldn't be complex I imagine.

@himikof
Copy link

himikof commented Mar 25, 2018

The memoryview support should be a simple matter of switching from y# to y* (using Py_buffer structure) format during argument parsing (and, likely, switching from s# to s* on Python 2 to unify the code paths).
See a very similar issue in another library for the reference: indygreg/python-zstandard#26.

@martindurant
Copy link
Member

@himikof , would you like to put this into a PR?

@cclauss
Copy link
Contributor Author

cclauss commented Apr 5, 2018

@himikof @pitrou @indygreg Any help on this one? It is beyond my skills.

@indygreg
Copy link

indygreg commented Apr 5, 2018

Sure.

You'll want to use the buffer protocol for everything that is passing raw binary data around. This ensures the maximum compatibility with all Python types that deal with binary data (bytes, bytearray, memoryview, etc).

See https://github.com/indygreg/python-zstandard/blob/52fb4fba88993ffaf983da109b2696d07f14074d/c-ext/compressor.c#L491 for how a function should accept an object conforming to the buffer protocol.

When you use y* (Python 3) or s* (Python 2) for your value specified, CPython automagically coerces that into a Py_buffer. See https://docs.python.org/3/c-api/buffer.html. You can get a pointer to the raw data via Py_buffer.buf, which is of size Py_buffer.len. But there's a catch: a Py_buffer may not represent contiguous memory! So you need to validate the backing memory with PyBuffer_IsContiguous(&buf, 'C'). Additionally, the Py_buffer may represent data that has multiple dimensions. I don't think you'll find any multidimensional data in CPython types. It is used for things like numpy matrices. But maybe there is something in CPython these days. I'm not sure. However, types like array will have ndim=1 IIRC. Anyway, if you want to add additional checks for the shape of data within the buffer, you can use Py_buffer.ndim, ignoring values greater than 0, assuming you only want scalar data. (FWIW python-zstandard doesn't yet do this Py_buffer validation consistently.)

When you use y* or s* to get a Py_buffer, you need to call PyBuffer_Release() or you will leak that struct.

Note that Py_buffer.obj is a refcounted object. So if your function needs a handle on the underlying data pointed to be the Py_buffer, you need to increment its reference count until you are done using it, or another person with a handle on the Py_buffer may call PyBuffer_Release() and free the memory from under you.

If you aren't sure your function is receiving an object that conforms to the buffer protocol, you can use O and dynamically sniff the object type within your function with PyObject_CheckBuffer() then use PyObject_GetBuffer() to create a Py_buffer. See https://github.com/indygreg/python-zstandard/blob/52fb4fba88993ffaf983da109b2696d07f14074d/c-ext/decompressor.c#L532 for an example.

That's the reading side.

You can also have your custom types implement the buffer protocol so other consumers can use Py_buffer to access their raw data. This requires that the type define the tp_as_buffer field. That is a function returning a PyBufferProcs defining functions for populating a new Py_buffer, reading from it, releasing it, etc. I've implemented this on a few types in https://github.com/indygreg/python-zstandard/blob/52fb4fba88993ffaf983da109b2696d07f14074d/c-ext/bufferutil.c. I'm not 100% sure I got it correct. I need to audit that code. Note that those buffer types store vectors of raw data - they aren't a simple scalar. This is a case where you use ndim=1.

Unless you need to invent a custom type for representing binary data, you probably don't need to implement the buffer protocol on any of your types though. python-zstandard has some experimental APIs that allow us to do things like take a memory mapped memory segment containing N targets for decompression and decompress them into a single output buffer with multiple threads. The custom type implementing the buffer protocol allows us to effectively create an array of N objects without the GIL held.

I hope this was useful.

@cclauss
Copy link
Contributor Author

cclauss commented Jun 1, 2018

I would be grateful if someone please submit a pull request on this one. I do not understand this area well enough to create such a PR myself but would be very interested to see a fix put in place.

martindurant added a commit that referenced this issue Jul 2, 2018
Fix #65: support new buffer API
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants