Replace libsnappy with cramjam #130

gliptak · 2024-02-05T15:31:09Z

the semantics need to be lined up

martindurant · 2024-02-05T15:32:46Z

cc @milesgranger , a long time coming!

martindurant · 2024-02-05T15:34:35Z

src/snappy/snappy.py

+
+
+class UncompressError(Exception):
+    pass


 py3k = False


py2 stuff is for pypy only - maybe that should be dropped too

I was thinking a followup PR for current Python versions (and dropping Python2). I actually started with #129

martindurant · 2024-02-05T15:36:48Z

Looks like the pypy tests need an explicit cast from cramjam buffer to bytes (i.e., bytes(...)).

gliptak · 2024-02-05T15:48:34Z

@martindurant I had that earlier and it broke in a different way. let me restore

gliptak · 2024-02-05T15:54:50Z

https://github.com/andrix/python-snappy/actions/runs/7787155361/job/21233634917?pr=130

FAILED test_snappy.py::SnappyCompressionTest::test_uncompress_error - cramjam.DecompressionError: failed to fill whole buffer
FAILED test_snappy.py::SnappyValidBufferTest::test_invalid_compressed_buffer - cramjam.DecompressionError: snappy: corrupt input (expected stream header but got unexpected chunk type byte 110)
FAILED test_snappy.py::SnappyStreaming::test_compression - AssertionError: 24 != 6
FAILED test_snappy.py::SnappyStreaming::test_decompression - cramjam.DecompressionError: failed to fill whole buffer

https://github.com/andrix/python-snappy/actions/runs/7787155357/job/21233634312?pr=130

packages/snappy/snappy.py:47: in <module>
    import cramjam
E   ModuleNotFoundError: No module named 'cramjam'

martindurant · 2024-02-05T15:58:27Z

For the sdist (and actually any real install), will need to actually add cramjam to required deps in https://github.com/andrix/python-snappy/blob/master/setup.py#L66

martindurant · 2024-02-05T16:00:18Z

For pypy, we might need @milesgranger 's help, perhaps the errors ring a bell. Is it possible we are confusing framed Vs raw?

gliptak · 2024-02-05T16:28:12Z

https://github.com/andrix/python-snappy/actions/runs/7787502576/job/21234796426?pr=130

    def test_compress_memoryview(self):
        data = b"hello world!"
        expected = snappy.compress(data)
        actual = snappy.compress(memoryview(data))
>       self.assertEqual(actual, expected)
E       AssertionError: cramjam.Buffer<len=30> != cramjam.Buffer<len=30>

milesgranger · 2024-02-05T17:13:22Z

Ah, that's a bit unfortunate w/ the buffer comparisons not working. Can just cast by bytes(...) == bytes(...) if you want in the meantime.

I'll try to look more deeply at the other errors. But the De/Compressor classes work on framed snappy format FWIW.

milesgranger · 2024-02-05T17:20:34Z

_________________ SnappyCompressionTest.test_uncompress_error __________________

self = <test_snappy.SnappyCompressionTest testMethod=test_uncompress_error>

    def test_uncompress_error(self):
>       self.assertRaises(snappy.UncompressError, snappy.uncompress,
                          "hoa".encode('utf-8'))

test_snappy.py:70:

This is just a matter of catching cramjam's error and re-raising as python-snappy's error, no?

    def test_compression(self):
        # test that we can add compressed chunks
        compressor = snappy.StreamCompressor()
        data = b"\0" * 50
        compressed_data = snappy.compress(data)
        crc = struct.pack("<L", snappy.snappy._masked_crc32c(data))
        self.assertEqual(crc, b"\x8f)H\xbd")
>       self.assertEqual(len(compressed_data), 6)
E       AssertionError: 24 != 6

This is compressing with framed and comparing against what snappy raw would output (checked, this will be 6 w/ raw format)

And ya, I think the other two are raw vs framed comparison errors.

gliptak · 2024-02-05T18:29:02Z

thank you @milesgranger I added bytes cast and exception wrapping

would you have guidance on correcting "framed vs. raw" errors?

milesgranger · 2024-02-06T06:21:39Z

would you have guidance on correcting "framed vs. raw" errors?

You bet! python-snappy's compress function is the snappy raw format. In the specification / Rust snappy it's preferred to use the framed format, therefore cramjam's snappy.compress is the framed format.

That ends up looking like this:

In [1]: import cramjam

In [2]: import snappy

In [3]: data = b"\0" * 50

In [4]: snappy.compress(data)
Out[4]: b'2\x00\x00\xc2\x01\x00'

In [5]: bytes(cramjam.snappy.compress(data))
Out[5]: b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [6]: bytes(cramjam.snappy.compress_raw(data))
Out[6]: b'2\x00\x00\xc2\x01\x00'

So it appears, if python-snappy is to keep it's current API, then behind the covers, snappy.compress ought to be calling cramjam.snappy.compress_raw.

For the De/Compressor classes, I'm less sure how to maintain the current API. I think I took inspiration from lz4 or something, can't really remember. But effectively cramjam's all have a .compress method which adds chunks to the inner compressed stream. Then .finish() to add any header/tail information and return the completed compressed stream. This seems different for python-snappy's impl. Which Compressor.compress returns the head, then subsequent calls add leave out the header:

In [1]: import snappy

In [2]: import cramjam

In [3]: data = b"\0" * 50

In [4]: compressor = snappy.StreamCompressor()

In [5]: compressor.compress(data)
Out[5]: b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [6]: compressor.compress(data)
Out[6]: b'\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [7]: c = cramjam.snappy.Compressor()

In [8]: c.compress(data)
Out[8]: 50

In [9]: c.compress(data)
Out[9]: 50

In [10]: bytes(c.finish())
Out[10]: b'\xff\x06\x00\x00sNaPpY\x00\r\x00\x00o\xfan\xe2d\x00\x00\xfe\x01\x00\x8a\x01\x00'

In [11]: python_snappy = b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00' + b'\x00\n\x00
    ...: \x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [12]: cramjam.snappy.decompress(python_snappy)
Out[12]: cramjam.Buffer<len=100>

You can see cramjam's is different, but still able to decompress python-snappy's so it seems okay just adding more padding it seems around each chunk and then it's more on the user to concatenate the total stream. Whereas with cramjam it's a bit more "managed". Also I note that add_chunk and compress returns the same data after the first call to compress fyi.

So, that's my understanding. If you wanted to maintain the current API that might be as easy (hacky?) as having the first call to Compressor.compress return exactly what cramjam.snappy.compress gives, then all subsequent calls chop off the header (10 bytes), that seems to produce exactly the same as python-snappy's compressor:

In [1]: import cramjam

In [2]: import snappy

In [3]: data = b"\0" * 50

In [4]: compressor = snappy.StreamCompressor()

In [5]: compressor.compress(data)
Out[5]: b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [6]: compressor.compress(data)
Out[6]: b'\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [7]: bytes(cramjam.snappy.compress(data))
Out[7]: b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [8]: bytes(cramjam.snappy.compress(data))[10:]
Out[8]: b'\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

As a separate note, the first suggestion to cast cramjam.Buffer using bytes works for the test, but maybe it's worthwhile to basically do that everywhere python-snappy returns data? That way current users upgrading won't suddenly see "cramjam.Buffer" showing up when python-snappy always use to return bytes.

Signed-off-by: Gábor Lipták <[email protected]>

gliptak · 2024-02-06T16:10:53Z

seems to be new python2.7 related error https://github.com/andrix/python-snappy/actions/runs/7802511369/job/21280133020?pr=130

  DEPRECATION: pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
  ERROR: Could not find a version that satisfies the requirement maturin<0.14,>=0.13 (from versions: 0.7.1, 0.7.2, 0.7.6, 0.11.0, 0.11.2, 0.11.4)
  ERROR: No matching distribution found for maturin<0.14,>=0.13

gliptak · 2024-02-06T16:22:52Z

thank you @milesgranger I updated with *_raw and bytes() and build is green with no tests updated

@martindurant do you see current tests providing strong coverage for Stream*?

PS also dropped pypy2.7 for above unrelated/environment error

martindurant · 2024-02-06T16:24:54Z

do you see current tests providing strong coverage for Stream*?

well apparently not! I am not surprised, and I am prepared to provide only "best effort" here; it would be best if docstrings are updated to say so, however. Were you thinking to drop them altogether?

gliptak · 2024-02-06T17:09:05Z

I wasn't thinking of removing functionality. https://github.com/andrix/python-snappy/blob/master/test_snappy.py#L119-L259 seems to be exercising stream functionality to good extent with asserts. one difference I see that tests use add_chunk (which seems to deal with CRC and headers?)

please let me know how you would like to proceed @martindurant (and provide "best effort" docstrings if appropriate)

martindurant · 2024-02-06T18:41:51Z

StreamCompressor.compress is an alias for .add_chunk , so that's fine.

martindurant · 2024-02-06T18:45:04Z

I believe StreamDe/Compressor could be replaced by cramjam.snappy.De/Compressor. The current code is doing the same framing as before, but calling cramjam's _raw codecs internally, so the exact locations of the chunk boundaries might change, but it produces valid output.

In other words, since there are after all some tests, I think all is well!

Any remaining reservations preventing merge?

gliptak · 2024-02-06T19:48:52Z

thank you @martindurant no reservations from my side

gliptak · 2024-02-06T20:13:05Z

#111

gliptak · 2024-02-20T19:59:36Z

#123

martindurant reviewed Feb 5, 2024

View reviewed changes

milesgranger mentioned this pull request Feb 5, 2024

Equality check on values for Buffer milesgranger/cramjam#134

Closed

Replace libsnappy with cramjam

277fb4a

Signed-off-by: Gábor Lipták <[email protected]>

gliptak force-pushed the cram1 branch from 2f74908 to 277fb4a Compare February 6, 2024 16:00

Drop pypy2.7

cabfd08

gliptak requested a review from martindurant February 6, 2024 16:22

martindurant mentioned this pull request Feb 7, 2024

Change the window size #131

Open

martindurant merged commit 73c7f55 into intake:master Feb 7, 2024
9 checks passed

gliptak deleted the cram1 branch February 7, 2024 15:10

martindurant mentioned this pull request Feb 19, 2024

support pep517 builds - poetry install on Mac fails #133

Open

Skully17 mentioned this pull request Feb 22, 2024

Error installing python-snappy with python 3.11 #134

Closed

gliptak mentioned this pull request Jun 10, 2024

Add Python 3.11 to GHA capitalone/DataProfiler#1090

Merged

martindurant mentioned this pull request Sep 16, 2024

0.7.3 has breaking changes to 0.7.2 #147

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace libsnappy with cramjam #130

Replace libsnappy with cramjam #130

gliptak commented Feb 5, 2024

martindurant commented Feb 5, 2024

martindurant Feb 5, 2024

gliptak Feb 5, 2024

martindurant commented Feb 5, 2024

gliptak commented Feb 5, 2024

gliptak commented Feb 5, 2024

martindurant commented Feb 5, 2024

martindurant commented Feb 5, 2024

gliptak commented Feb 5, 2024

milesgranger commented Feb 5, 2024

milesgranger commented Feb 5, 2024

gliptak commented Feb 5, 2024

milesgranger commented Feb 6, 2024

gliptak commented Feb 6, 2024

gliptak commented Feb 6, 2024

martindurant commented Feb 6, 2024

gliptak commented Feb 6, 2024

martindurant commented Feb 6, 2024

martindurant commented Feb 6, 2024

gliptak commented Feb 6, 2024

gliptak commented Feb 6, 2024

gliptak commented Feb 20, 2024

Replace libsnappy with cramjam #130

Replace libsnappy with cramjam #130

Conversation

gliptak commented Feb 5, 2024

martindurant commented Feb 5, 2024

martindurant Feb 5, 2024

Choose a reason for hiding this comment

gliptak Feb 5, 2024

Choose a reason for hiding this comment

martindurant commented Feb 5, 2024

gliptak commented Feb 5, 2024

gliptak commented Feb 5, 2024

martindurant commented Feb 5, 2024

martindurant commented Feb 5, 2024

gliptak commented Feb 5, 2024

milesgranger commented Feb 5, 2024

milesgranger commented Feb 5, 2024

gliptak commented Feb 5, 2024

milesgranger commented Feb 6, 2024

gliptak commented Feb 6, 2024

gliptak commented Feb 6, 2024

martindurant commented Feb 6, 2024

gliptak commented Feb 6, 2024

martindurant commented Feb 6, 2024

martindurant commented Feb 6, 2024

gliptak commented Feb 6, 2024

gliptak commented Feb 6, 2024

gliptak commented Feb 20, 2024