Linewise iteration over compressed input #13

dreamflasher · 2017-03-02T09:59:10Z

Currently it's only possible to iterate over chunks:

dctx = zstd.ZstdDecompressor()
for chunk in dctx.read_from(fh):
    ...  # each chunk is a bytes object of decompressed data

How can I iterate line-by-line as it is possible with gzip.open()?

indygreg · 2017-03-02T17:15:03Z

You can instantiate an io.BytesIO() from the decompressed output and then call readline() on that. But that requires buffering the entire output in memory, which is costly for large outputs.

The proper solution is for python-zstandard to provide an API to obtain an object that conforms to the io.RawIOBase interface. This is how gzip.open() and friends work. There is definite value in providing that API out of the box. I'll add it to the TODO list.

dreamflasher · 2017-03-02T18:02:32Z

Thank you for adding it to the TODO, as I am especially looking for performance.
Nevertheless, I wanted to try out your solution. What am I doing wrong?

        dctx = zstd.ZstdDecompressor()
        bytesio = BytesIO(dctx.read_from(fh))
        header = bytesio.readline().decode()

TypeError: a bytes-like object is required, not 'zstd.ZstdDecompressorIterator'

indygreg · 2017-03-02T18:09:10Z

That TypeError is raised by the BytesIO constructor, which accepts only a bytes-like object, not an iterator. BytesIO(b''.join(dctx.read_from(fh))) should work.
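
A minimal sketch of that buffering approach, written against any iterable of bytes chunks so that it runs without the library installed. With python-zstandard the chunks would come from dctx.read_from(fh); the helper name buffered_lines and the sample chunks are made up for illustration.

```python
import io


def buffered_lines(chunks):
    """Join all chunks in memory, then yield decoded lines.

    `chunks` is any iterable of bytes objects. The whole payload is held
    in memory at once, so this only suits inputs that fit comfortably in RAM.
    """
    buf = io.BytesIO(b"".join(chunks))
    for raw_line in buf:
        yield raw_line.decode("utf-8")


# Stand-in for dctx.read_from(fh): a line may span two chunks.
sample_chunks = [b"id,na", b"me\n1,al", b"ice\n"]
lines = list(buffered_lines(sample_chunks))
```

Because the chunks are joined before any line is read, chunk boundaries that fall in the middle of a line do not matter here.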

indygreg · 2017-03-02T18:11:40Z

Also, it is possible to write a wrapper class until the API is added to python-zstandard. See https://github.com/python/cpython/blob/3.6/Lib/_compression.py and https://github.com/python/cpython/blob/3.6/Lib/gzip.py for how the Python standard library does it.
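
A rough sketch of what such a wrapper could look like, assuming the only thing available is an iterator of decompressed bytes chunks (for example dctx.read_from(fh)). The class name ChunkReader is hypothetical and this is not how the standard library or python-zstandard implement it; it only shows the io.RawIOBase shape that io.BufferedReader expects.

```python
import io


class ChunkReader(io.RawIOBase):
    """Expose an iterator of bytes chunks as a readable raw stream."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._pending = b""  # bytes received but not yet handed out

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill from the iterator, skipping any empty chunks.
        while not self._pending:
            try:
                self._pending = next(self._chunks)
            except StopIteration:
                return 0  # zero bytes read signals end of stream
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


# Stand-in chunks; with python-zstandard these would be decompressed output.
chunks = [b"first li", b"ne\nsecond", b" line\n"]
reader = io.BufferedReader(ChunkReader(chunks))
first = reader.readline()
rest = reader.read()
```

Only one chunk is held at a time, so memory use stays bounded by the chunk size plus the BufferedReader buffer.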

indygreg · 2017-03-10T06:53:41Z

I started working on this today. I intend for it to be in the next release, which will be 0.9, though there is no ETA for that release. I view an io.RawIOBase-compliant API for compression and decompression streams as the single biggest feature left to implement, so this is a high priority for me.

indygreg · 2017-03-16T02:22:50Z

I just pushed the beginnings of a new stream API. It doesn't yet support readline(). While I haven't tried it yet, you should be able to construct an io.BufferedReader() with the result of dctx.stream_reader(fh) and call readline() on that. Performance won't be great, but it may be good enough. If it doesn't work, the likely cause is that the classes I implemented do not yet fully implement the appropriate interfaces for I/O streams. It is tedious work...

stmlange · 2018-06-13T20:40:32Z

Hi,
first of all thanks for the hard work that went into the library....
I happen to need this functionality and unfortunately couldn't get it working with the io.BufferedReader() and dctx.stream_reader(fh).

The general concept looks like this:

import zstd
import io

path = '/tmp/foo.zst'

with open(path, 'rb') as fh:
        dctx = zstd.ZstdDecompressor()
        with dctx.stream_reader(fh) as reader:
                wrap = io.TextIOWrapper(io.BufferedReader(reader), encoding='utf8')
                print(wrap.readline())

However, this fails with ValueError: I/O operation on closed file.
Is it just me being unable to use the library properly? :')

paulhoule · 2018-08-06T17:40:04Z

An "open" method would be a big help. Ultimately I want to write my own open function that will automatically use the right compressor or decompressor based on the file name and file signature. Something that behaves the way all of the other open functions do would be a big help.
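
One possible shape for such a function, as a sketch only: it covers reading, picks the decompressor from the file's leading magic bytes, and assumes a python-zstandard release that ships a zstandard.open() function (which did not exist when this comment was written). The names sniff_format and open_any are hypothetical.

```python
import bz2
import gzip
import lzma

# Leading magic bytes for each container format.
_SIGNATURES = (
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
)


def sniff_format(path):
    """Return a format name based on the file's first bytes, or None."""
    with open(path, "rb") as fh:
        head = fh.read(6)
    for magic, name in _SIGNATURES:
        if head.startswith(magic):
            return name
    return None


def open_any(path, encoding="utf-8"):
    """Open a possibly compressed file for line-oriented text reading."""
    fmt = sniff_format(path)
    if fmt == "zstd":
        import zstandard  # third-party; imported lazily (assumed installed)
        return zstandard.open(path, "rt", encoding=encoding)
    if fmt == "gzip":
        return gzip.open(path, "rt", encoding=encoding)
    if fmt == "bz2":
        return bz2.open(path, "rt", encoding=encoding)
    if fmt == "xz":
        return lzma.open(path, "rt", encoding=encoding)
    # No known signature: treat the file as plain text.
    return open(path, "r", encoding=encoding)
```

Every branch returns a text-mode file object, so callers can iterate over lines the same way regardless of the format on disk.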

geertj · 2018-08-19T13:29:30Z

The problem appears to be that ZstdDecompressionReader.closed is defined as a method while io.BufferedReader expects it to be a property.
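
To illustrate why that matters, here is a small sketch with two made-up classes (not the library's code). A caller that reads closed as an attribute and tests its truthiness sees a bound method, which is always truthy, so the stream looks closed even though it is not.

```python
class MethodClosed:
    """Hypothetical reader that exposes `closed` as a method."""

    def __init__(self):
        self._closed = False

    def closed(self):
        return self._closed


class PropertyClosed:
    """Hypothetical reader that exposes `closed` as a property."""

    def __init__(self):
        self._closed = False

    @property
    def closed(self):
        return self._closed


def looks_closed(stream):
    # Attribute-style callers read `stream.closed` and test its
    # truthiness without calling it.
    return bool(stream.closed)


method_result = looks_closed(MethodClosed())      # bound method is truthy
property_result = looks_closed(PropertyClosed())  # property yields False
```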

indygreg · 2018-10-23T17:23:43Z

Following up on this, closed is now a property in the 0.10 release.

aawise · 2018-11-05T23:40:44Z

I tried to use exactly this pattern today:

with open(path, 'rb') as f:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(f) as reader:
        wrap = io.BufferedReader(reader)
        line = wrap.readline()

and received an error during the readline call:
AttributeError: 'DecompressionReader' object has no attribute 'readinto'

This was with zstandard 0.10.1 on pypy3

indygreg · 2019-02-24T17:35:42Z

readinto() was implemented on DecompressionReader in 54f4d2a. Once I add test coverage and documentation for line-based reading, I think I'll close this issue. While I concede that implementing readline() directly would be useful, I'd rather not introduce the complexity of text-based I/O on the low-level (de)compression APIs. Instead, I think time would be better spent introducing higher-level APIs (e.g. an open(), issue #64) that plumb together, for example, an io.BufferedReader() to accomplish the same result.

If you feel differently, please let me know.

Now that io.RawIOBase is implemented, we can properly chain a ZstdDecompressionReader to an io.BufferedReader and io.TextIOWrapper to achieve buffering and line-based reading. This closes #13.
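
A sketch of that chaining, under the assumption that the raw stream behaves like any other io.RawIOBase object. The helper text_lines works with any readable binary stream, which is why the demo uses io.BytesIO; zstd_lines shows where dctx.stream_reader(fh) would plug in and needs python-zstandard installed. Both function names are made up for illustration.

```python
import io


def text_lines(raw, encoding="utf-8"):
    """Yield decoded lines from a readable binary stream, one at a time."""
    buffered = io.BufferedReader(raw)  # adds buffering over the raw stream
    text = io.TextIOWrapper(buffered, encoding=encoding)  # decoding + lines
    yield from text


def zstd_lines(path, encoding="utf-8"):
    """Yield text lines from a zstd-compressed file."""
    import zstandard  # third-party; imported lazily (assumed installed)

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from text_lines(reader, encoding)


# Demo with an in-memory stream standing in for the decompression reader.
demo = list(text_lines(io.BytesIO(b"one\ntwo\n")))
```

Because both helpers are generators, lines are decoded lazily and the whole output never has to sit in memory.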

flyser · 2020-12-29T10:25:07Z

The documentation was removed in commit d613a5f. Was that intentional? If so, why?

indygreg · 2020-12-29T16:42:30Z

If the documentation was removed, it was accidental.

I believe the documentation now exists at https://github.com/indygreg/python-zstandard/blob/main/zstandard/backend_cffi.py#L3048?

flyser · 2020-12-29T16:45:04Z

@indygreg I was looking for https://python-zstandard.readthedocs.io/en/latest/search.html?q=readlines&check_keywords=yes&area=default which found nothing. Seems like the generated documentation is partially broken: https://python-zstandard.readthedocs.io/en/latest/decompressor.html

indygreg · 2020-12-29T17:28:09Z

Yeah, the Sphinx docs on readthedocs aren't working. I'll try to get that fixed in the next few hours.

indygreg · 2020-12-29T17:32:11Z

It should be working now.

lachesis · 2023-05-11T16:34:09Z

For anyone who finds this after me, the correct way to do this (copied from the link above) is:

    >>> with open(path, 'rb') as fh:
    ...     dctx = zstandard.ZstdDecompressor()
    ...     stream_reader = dctx.stream_reader(fh)
    ...     text_stream = io.TextIOWrapper(stream_reader, encoding='utf-8')
    ...     for line in text_stream:
    ...         print(line, end='')  # placeholder: handle each decoded line

indygreg added a commit that referenced this issue Mar 2, 2017

readme: add todo for an io.RawIOBase interface (#13)

833bf0b

indygreg added a commit that referenced this issue Mar 16, 2017

decompressionreader: implement i/o stream class and API for decompression

Like we just did for compression. This is a precursor to #13.

6f3b613

indygreg mentioned this issue Mar 26, 2018

ZstdDecompressor stream_reader doesn't support readline() #39

Closed

indygreg closed this as completed in 3d855c7 Feb 24, 2019

hauntsaninja mentioned this issue Apr 22, 2024

Return True from ZstdDecompressionReader.seekable #222

Open
