Can snappy.decompress(buffer) but cannot snappy.decompress(memoryview) #65

cclauss · 2018-03-21T04:06:03Z

Is there a way to make snappy.decompress() work directly on a memoryview without casting it to another datatype? Casting makes a copy of the data which is undesirable in our case.

compressed = snappy.compress('The quick brown fox jumped over the lazy dog.')
print(snappy.decompress(buffer(compressed)))      # works on Python 2.7
print(snappy.decompress(memoryview(compressed)))  # fails on Python 2.7 and Python 3
# Python 2: TypeError: argument 1 must be string or read-only buffer, not memoryview
# Python 3: TypeError: argument 1 must be read-only bytes-like object, not memoryview

https://github.com/cclauss/snappy-hack/blob/master/snappy_hack.py

martindurant · 2018-03-21T13:36:34Z

@andrix , given that this code was written a good while ago, is there now scope for supporting memoryviews directly?

andrix · 2018-03-21T15:02:02Z

hey @martindurant - I would need to research, it has been a long time. Currently, it expects a const char: https://github.com/andrix/python-snappy/blob/master/snappy/snappymodule.cc#L155.

I don't have enough time atm to research on how memoryviews are implemented, but it shouldn't be complex I imagine.

himikof · 2018-03-25T16:07:56Z

The memoryview support should be a simple matter of switching from y# to y* (using Py_buffer structure) format during argument parsing (and, likely, switching from s# to s* on Python 2 to unify the code paths).
See a very similar issue in another library for the reference: indygreg/python-zstandard#26.

martindurant · 2018-03-25T16:17:38Z

@himikof , would you like to put this into a PR?

cclauss · 2018-04-05T13:53:14Z

@himikof @pitrou @indygreg Any help on this one? It is beyond my skills.

indygreg · 2018-04-05T20:00:30Z

Sure.

You'll want to use the buffer protocol for everything that is passing raw binary data around. This ensures the maximum compatibility with all Python types that deal with binary data (bytes, bytearray, memoryview, etc).

See https://github.com/indygreg/python-zstandard/blob/52fb4fba88993ffaf983da109b2696d07f14074d/c-ext/compressor.c#L491 for how a function should accept an object conforming to the buffer protocol.

When you use y* (Python 3) or s* (Python 2) for your value specified, CPython automagically coerces that into a Py_buffer. See https://docs.python.org/3/c-api/buffer.html. You can get a pointer to the raw data via Py_buffer.buf, which is of size Py_buffer.len. But there's a catch: a Py_buffer may not represent contiguous memory! So you need to validate the backing memory with PyBuffer_IsContiguous(&buf, 'C'). Additionally, the Py_buffer may represent data that has multiple dimensions. I don't think you'll find any multidimensional data in CPython types. It is used for things like numpy matrices. But maybe there is something in CPython these days. I'm not sure. However, types like array will have ndim=1 IIRC. Anyway, if you want to add additional checks for the shape of data within the buffer, you can use Py_buffer.ndim, ignoring values greater than 0, assuming you only want scalar data. (FWIW python-zstandard doesn't yet do this Py_buffer validation consistently.)

When you use y* or s* to get a Py_buffer, you need to call PyBuffer_Release() or you will leak that struct.

Note that Py_buffer.obj is a refcounted object. So if your function needs a handle on the underlying data pointed to be the Py_buffer, you need to increment its reference count until you are done using it, or another person with a handle on the Py_buffer may call PyBuffer_Release() and free the memory from under you.

If you aren't sure your function is receiving an object that conforms to the buffer protocol, you can use O and dynamically sniff the object type within your function with PyObject_CheckBuffer() then use PyObject_GetBuffer() to create a Py_buffer. See https://github.com/indygreg/python-zstandard/blob/52fb4fba88993ffaf983da109b2696d07f14074d/c-ext/decompressor.c#L532 for an example.

That's the reading side.

You can also have your custom types implement the buffer protocol so other consumers can use Py_buffer to access their raw data. This requires that the type define the tp_as_buffer field. That is a function returning a PyBufferProcs defining functions for populating a new Py_buffer, reading from it, releasing it, etc. I've implemented this on a few types in https://github.com/indygreg/python-zstandard/blob/52fb4fba88993ffaf983da109b2696d07f14074d/c-ext/bufferutil.c. I'm not 100% sure I got it correct. I need to audit that code. Note that those buffer types store vectors of raw data - they aren't a simple scalar. This is a case where you use ndim=1.

Unless you need to invent a custom type for representing binary data, you probably don't need to implement the buffer protocol on any of your types though. python-zstandard has some experimental APIs that allow us to do things like take a memory mapped memory segment containing N targets for decompression and decompress them into a single output buffer with multiple threads. The custom type implementing the buffer protocol allows us to effectively create an array of N objects without the GIL held.

I hope this was useful.

cclauss · 2018-06-01T13:18:50Z

I would be grateful if someone please submit a pull request on this one. I do not understand this area well enough to create such a PR myself but would be very interested to see a fix put in place.

Fix #65: support new buffer API

cclauss changed the title ~~Can snappy.decompress(buffer)but cannot snappy.decompress(memoryview)~~ Can snappy.decompress(buffer) but cannot snappy.decompress(memoryview) Mar 21, 2018

cclauss mentioned this issue Mar 21, 2018

[BEAM-1251] Upgrade from buffer to memoryview for Python 3 apache/beam#4820

Closed

10 tasks

martindurant closed this as completed in 29847ae Jul 2, 2018

martindurant added a commit that referenced this issue Jul 2, 2018

Merge pull request #72 from pitrou/buffer_api

e7b2c4d

Fix #65: support new buffer API

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can snappy.decompress(buffer) but cannot snappy.decompress(memoryview) #65

Can snappy.decompress(buffer) but cannot snappy.decompress(memoryview) #65

cclauss commented Mar 21, 2018 •

edited

Loading

martindurant commented Mar 21, 2018

andrix commented Mar 21, 2018

himikof commented Mar 25, 2018

martindurant commented Mar 25, 2018

cclauss commented Apr 5, 2018

indygreg commented Apr 5, 2018

cclauss commented Jun 1, 2018

Can snappy.decompress(buffer) but cannot snappy.decompress(memoryview) #65

Can snappy.decompress(buffer) but cannot snappy.decompress(memoryview) #65

Comments

cclauss commented Mar 21, 2018 • edited Loading

martindurant commented Mar 21, 2018

andrix commented Mar 21, 2018

himikof commented Mar 25, 2018

martindurant commented Mar 25, 2018

cclauss commented Apr 5, 2018

indygreg commented Apr 5, 2018

cclauss commented Jun 1, 2018

cclauss commented Mar 21, 2018 •

edited

Loading