Linewise iteration over compressed input #13

Closed
dreamflasher opened this issue Mar 2, 2017 · 18 comments

@dreamflasher

dreamflasher commented Mar 2, 2017

Currently it's only possible to iterate over chunks:

dctx = zstd.ZstdDecompressor()
for chunk in dctx.read_from(fh):
    ...  # each chunk is a bytes object of decompressed data

How can I iterate line by line, as is possible with gzip.open()?

@indygreg
Owner

indygreg commented Mar 2, 2017

You can instantiate an io.BytesIO() from the decompressed output and then call readline() on that. But that requires buffering the output, which of course has performance issues for large outputs.

The proper solution is for python-zstandard to provide an API to obtain an object that conforms to the io.RawIOBase interface. This is how gzip.open() and friends work. There is definite value in providing that API out of the box. I'll add it to the TODO list.

@dreamflasher
Author

Thank you for adding it to the TODO list, as performance is what I am especially looking for.
Nevertheless, I wanted to try out your solution. What am I doing wrong?

dctx = zstd.ZstdDecompressor()
bytesio = BytesIO(dctx.read_from(fh))
header = bytesio.readline().decode()

TypeError: a bytes-like object is required, not 'zstd.ZstdDecompressorIterator'

@indygreg
Owner

indygreg commented Mar 2, 2017

That TypeError is raised by the BytesIO constructor, since it doesn't accept an iterator, only a bytes-like object. BytesIO(b''.join(dctx.read_from(fh))) should work.
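
A minimal sketch of that join-based workaround. To keep it runnable without a compressed file, a plain generator stands in for dctx.read_from(fh); the generator name and the sample data are made up for illustration. Note that the whole decompressed output is held in memory at once.

```python
import io


def fake_chunks():
    """Hypothetical stand-in for dctx.read_from(fh): yields bytes chunks."""
    # Chunk boundaries do not line up with line boundaries.
    yield b"id,name\n1,al"
    yield b"ice\n2,bob\n"


# BytesIO wants a single bytes-like object, so join the chunks first.
buffered = io.BytesIO(b"".join(fake_chunks()))

header = buffered.readline().decode()
rows = [line.decode() for line in buffered]
```

After this runs, header holds the first line and rows holds the remaining lines, each with its trailing newline.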

@indygreg
Owner

indygreg commented Mar 2, 2017

Also, it is possible to write a wrapper class until the API is added to python-zstandard. See https://github.com/python/cpython/blob/3.6/Lib/_compression.py and https://github.com/python/cpython/blob/3.6/Lib/gzip.py for how the Python standard library does it.
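
A rough sketch of such a wrapper class, loosely modelled on the approach in _compression.py. It is not python-zstandard code: zlib from the standard library stands in for the zstd decompressor so the example is self-contained, and the class name, constructor arguments, and chunk size are all assumptions made for illustration.

```python
import io
import zlib


class DecompressReader(io.RawIOBase):
    """Read-only raw stream that decompresses data pulled from `fp`.

    `decompressor` is assumed to offer decompress(data) and flush(),
    as zlib.decompressobj() does.
    """

    def __init__(self, fp, decompressor, chunk_size=8192):
        self._fp = fp
        self._decomp = decompressor
        self._chunk_size = chunk_size
        self._pending = b""  # decompressed bytes not yet handed out
        self._eof = False

    def readable(self):
        return True

    def readinto(self, buf):
        # Keep feeding compressed input until some output is available
        # or the underlying file is exhausted.
        while not self._pending and not self._eof:
            raw = self._fp.read(self._chunk_size)
            if raw:
                self._pending = self._decomp.decompress(raw)
            else:
                self._eof = True
                self._pending = self._decomp.flush()
        n = min(len(buf), len(self._pending))
        buf[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 signals end of stream


payload = zlib.compress(b"first line\nsecond line\n")
raw = DecompressReader(io.BytesIO(payload), zlib.decompressobj())
text = io.TextIOWrapper(io.BufferedReader(raw), encoding="utf-8")
lines = list(text)
```

Only readable() and readinto() are implemented here; io.RawIOBase supplies read() and readall() on top of readinto(), and io.BufferedReader adds readline() and buffering.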

@indygreg
Owner

I started working on this today. I intend for it to be in the next release, which will be 0.9, though there is no ETA for that release. I view an io.RawIOBase-compliant API for compression and decompression streams as the single biggest feature left to implement, so this is a high priority for me.

indygreg added a commit that referenced this issue Mar 16, 2017
…sion

Like we just did for compression.

This is a precursor to #13.
@indygreg
Owner

I just pushed the beginnings of a new stream API. It doesn't yet support readline(). While I haven't tried it yet, you should be able to construct an io.BufferedReader() from the result of dctx.stream_reader(fh) and call readline() on that. Performance won't be great, but it may be good enough. If it doesn't work, the classes I implemented probably don't yet fully implement the appropriate interfaces for I/O streams. It is tedious work...

@stmlange

Hi,
first of all thanks for the hard work that went into the library....
I happen to need this functionality and unfortunately couldn't get it working with the io.BufferedReader() and dctx.stream_reader(fh).

The general concept looks like this:

import zstd
import io

path = '/tmp/foo.zst'

with open(path, 'rb') as fh:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(fh) as reader:
        wrap = io.TextIOWrapper(io.BufferedReader(reader), encoding='utf8')
        print(wrap.readline())

However this fails with ValueError: I/O operation on closed file..
Is it just me being unable to use the library properly? :')

@paulhoule

An "open" method would be a big help. Ultimately I want to write my own open function that will automatically use the right compressor or decompressor based on the file name and file signature. Something that behaves the way all of the other open functions do would be a big help.
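
A sketch of what such a signature-based open function might look like, limited to the compressors that ship with the Python standard library. The function name and the dispatch table are hypothetical. A zstd entry keyed on the zstd frame magic b"\x28\xb5\x2f\xfd" could presumably be added the same way once a file-like opener for it is available.

```python
import bz2
import gzip
import lzma

# Leading magic bytes of each format, paired with the matching opener.
_OPENERS = [
    (b"\x1f\x8b", gzip.open),
    (b"BZh", bz2.open),
    (b"\xfd7zXZ\x00", lzma.open),
]


def open_by_signature(path, mode="rt", **kwargs):
    """Open `path`, transparently decompressing if a known magic matches."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, opener in _OPENERS:
        if head.startswith(magic):
            return opener(path, mode, **kwargs)
    # No known signature: treat the file as uncompressed.
    return open(path, mode, **kwargs)
```

This sniffs only the file signature and ignores the file name; a fuller version could also fall back to the extension when writing, where there is no signature to inspect yet.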

@geertj

geertj commented Aug 19, 2018

The problem appears to be that ZstdDecompressionReader.closed is defined as a method while io.BufferedReader expects it to be a property.
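
A small illustration of why that matters, using made-up classes rather than the real ZstdDecompressionReader. When closed is a method, reading the attribute without calling it yields a bound method object, which is always truthy, so io.BufferedReader concludes the stream is closed. This describes CPython's behavior as I understand it; other implementations may differ in the details.

```python
import io


class ChunkReader(io.RawIOBase):
    """Hypothetical raw reader serving bytes from memory.

    It inherits `closed` as a property from the io base classes.
    """

    def __init__(self, payload):
        self._src = io.BytesIO(payload)

    def readable(self):
        return True

    def readinto(self, buf):
        return self._src.readinto(buf)


class MethodClosedReader(ChunkReader):
    def closed(self):  # bug: shadows the `closed` property with a method
        return False


def first_line(raw):
    """Return the first line, or the ValueError raised while reading."""
    try:
        return io.BufferedReader(raw).readline()
    except ValueError as exc:
        return exc


good = first_line(ChunkReader(b"a\nb\n"))
bad = first_line(MethodClosedReader(b"a\nb\n"))

# The attribute itself is a bound method, which is truthy.
looks_closed = bool(MethodClosedReader(b"").closed)
```

Here good is the first line of the payload, while bad is the ValueError raised because the reader appears closed.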

@indygreg
Owner

Following up on this, closed is now a property in the 0.10 release.

@aawise

aawise commented Nov 5, 2018

I tried to use this pattern exactly today:

with open(path, 'rb') as f:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(f) as reader:
        wrap = io.BufferedReader(reader)
        line = wrap.readline()

and received an error during the readline call:
AttributeError: 'DecompressionReader' object has no attribute 'readinto'

This was with zstandard 0.10.1 on pypy3

@indygreg
Owner

readinto() was implemented on DecompressionReader in 54f4d2a. Once I add test coverage and documentation for line-based reading, I think I'll close this issue. While I concede that implementing readline() directly would be useful, I'd rather not introduce the complexity of text-based I/O on the low-level (de)compression APIs. Instead, I think time would be better spent introducing higher-level APIs (e.g. an open() - issue #64) that plumb together e.g. an io.BufferedReader() to accomplish the same result.

If you feel differently, please let me know.

indygreg added a commit that referenced this issue Feb 24, 2019
Now that io.RawIOBase is implemented, we can properly chain a
ZstdDecompressionReader to an io.BufferedReader and
io.TextIOWrapper to achieve buffering and line-based reading.

This closes #13.
@flyser

flyser commented Dec 29, 2020

The documentation was removed in commit d613a5f. Was that intentional? If so, why?

@indygreg
Owner

If the documentation was removed, it was accidental.

I believe the documentation now exists at https://github.com/indygreg/python-zstandard/blob/main/zstandard/backend_cffi.py#L3048?

@flyser

flyser commented Dec 29, 2020

@indygreg
Owner

Yeah, the Sphinx docs on readthedocs aren't working. I'll try to get that fixed in the next few hours.

@indygreg
Owner

It should be working now.

@lachesis

For anyone who finds this after me, the correct way to do this (copied from the link above) is:

>>> with open(path, 'rb') as fh:
...     dctx = zstandard.ZstdDecompressor()
...     stream_reader = dctx.stream_reader(fh)
...     text_stream = io.TextIOWrapper(stream_reader, encoding='utf-8')
...     for line in text_stream:
...         pass  # do something with line
