Add support for snappy text decompression #22298 #22486
Conversation
I just realized something that should be really nice! Since the py-snappy library can do streaming decompression on an Arrow NativeFile, we could push all of this down into BinaryDatasource as an implementation detail that switches on whether Snappy decompression is needed, so we don't even need the new SnappyTextDatasource.

Much as I did in the other PR, we can check for Snappy compression in FileBasedDatasource.prepare_read() and, if found, make sure that we don't open the Arrow stream with that compression and instead pass that compression arg to FileBasedDatasource._read_file() as one of the **reader_args, allowing subclasses of FileBasedDatasource (such as BinaryDatasource) to implement their own manual streaming decompression of Snappy-compressed files:
# file_based_datasource.py
class FileBasedDatasource(Datasource):
    def prepare_read(...):
        # ...
        def read_files(
            read_paths: List[str],
            fs: Union["pyarrow.fs.FileSystem", _S3FileSystemWrapper],
        ) -> Iterable[Block]:
            import pyarrow as pa

            # ...
            for read_path in read_paths:
                compression = open_stream_args.pop("compression", None)
                if compression is None:
                    try:
                        # If no compression manually given, try to detect the
                        # compression codec from the path.
                        compression = pa.Codec.detect(read_path).name
                    except (ValueError, TypeError):
                        compression = None
                if compression == "snappy":
                    # Pass Snappy compression as a reader arg, so datasource
                    # subclasses can manually handle streaming decompression
                    # in self._read_stream().
                    reader_args["compression"] = compression
                elif compression is not None:
                    # Non-Snappy compression: pass as an open_input_stream()
                    # arg so Arrow can take care of streaming decompression
                    # for us.
                    open_stream_args["compression"] = compression
                with fs.open_input_stream(read_path, **open_stream_args) as f:
                    for data in read_stream(f, read_path, filesystem=fs, **reader_args):
                        output_buffer.add_block(data)
                        if output_buffer.has_next():
                            yield output_buffer.next()
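As a side note on the detection step, here is a minimal sketch of how the try/except above behaves (an illustration only, assuming pyarrow is installed; which extensions Codec.detect() recognizes depends on the Arrow version, hence the fallback to None):

# Minimal sketch: extension-based codec detection, mirroring the try/except above.
import pyarrow as pa

def detect_compression(path: str):
    try:
        # Codec.detect() infers the codec from the file extension, e.g. ".gz" -> "gzip".
        return pa.Codec.detect(path).name
    except (ValueError, TypeError):
        # Unrecognized or missing extension: fall back to no compression.
        return None

print(detect_compression("events.gz"))   # "gzip"
print(detect_compression("events.txt"))  # None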
The BinaryDatasource can then check for compression == "snappy" as well and, if found, use py-snappy for streaming decompression using your current implementation in SnappyTextDatasource:
class BinaryDatasource(FileBasedDatasource):
    def _read_file(
        self,
        f: "pyarrow.NativeFile",
        path: str,
        filesystem: "pyarrow.fs.FileSystem",
        **reader_args,
    ):
        import io

        import pyarrow as pa
        import pyarrow.fs  # Makes pa.fs.HadoopFileSystem accessible below.

        include_paths = reader_args.pop("include_paths", False)
        if reader_args.get("compression") == "snappy":
            import snappy

            rawbytes = io.BytesIO()
            if isinstance(filesystem, pa.fs.HadoopFileSystem):
                # Hadoop Snappy streams use a different framing format.
                snappy.hadoop_snappy.stream_decompress(src=f, dst=rawbytes)
            else:
                snappy.stream_decompress(src=f, dst=rawbytes)
            data = rawbytes.getvalue()
        else:
            data = f.readall()
        if include_paths:
            return [(path, data)]
        else:
            return [data]
This should all happen transparently to the user: if manually given compression="snappy", or if we infer Snappy compression from the file path, we'll switch to py-snappy-based streaming decompression within the BinaryDatasource implementation.
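For example, a hypothetical usage sketch (not part of this PR; the path and reader are illustrative, and it assumes the .snappy extension or an explicit compression arg triggers the behavior described above):

# Hypothetical usage sketch: reading a Snappy-compressed file with Ray Datasets.
# The path is a placeholder; no Snappy-specific arguments should be needed if
# compression is inferred from the ".snappy" extension as described above.
import ray

ds = ray.data.read_binary_files("/data/logs/events.snappy")
print(ds.take(1))  # Decompressed raw bytes of the file.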
What do you think? Am I missing anything here, or do you think this could work?
This is looking really good! In addition to the review comments, a few other things:
- The lint is failing; could you run ./ci/travis/format.sh and fix the remaining failures?
- It looks like the tests are failing since the snappy package is missing. Could you add the python-snappy dependency to this Datasets testing requirements manifest (sketched below)?
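A minimal sketch of what that addition might look like, assuming a pip-style requirements file; the actual manifest is the one linked above and the file name here is hypothetical:

# requirements_dataset_test.txt (hypothetical name for the Datasets testing manifest)
python-snappy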
Inline comment on the following lines of the PR:

filesystem = reader_args.get("filesystem", None)
rawbytes = BytesIO()

if isinstance(filesystem, pyarrow.fs.HadoopFileSystem):

Suggested change:
- if isinstance(filesystem, pyarrow.fs.HadoopFileSystem):
+ if isinstance(filesystem, HadoopFileSystem):
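A minimal sketch of the import the shorter name assumes (hypothetical placement; the module may already import it elsewhere):

from pyarrow.fs import HadoopFileSystem

# ...
if isinstance(filesystem, HadoopFileSystem):
    ...  # hadoop_snappy streaming decompression branch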
@siddgoel Ping on this!
… data ingestion ray-project#22479" This reverts commit 62bbaad.
LGTM, if the tests pass this looks good to merge. Great work!
Looks like a bunch of tests failed. Could you merge master and trigger the tests again?
@scv119 Done, we should be good to go once the tests are finished.
Why are these changes needed?
Adds a streaming-based reading option for Snappy-compressed files. Arrow doesn't support streaming Snappy decompression since the canonical C++ Snappy library doesn't natively support streaming decompression. This PR works around this by doing streaming reads of Snappy-compressed files using the streaming decompression API provided in the python-snappy package.
This commit supplies a custom datasource that uses Arrow + python-snappy to read and decompress Snappy-compressed files.
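For reference, a minimal sketch of python-snappy's streaming API (the framing-format variant) that the datasource relies on; this is an illustration of the library API, not code from this PR:

# Minimal sketch of python-snappy streaming (de)compression on file-like objects.
import io

import snappy

compressed = io.BytesIO()
snappy.stream_compress(io.BytesIO(b"hello snappy"), compressed)  # framed Snappy stream
compressed.seek(0)

decompressed = io.BytesIO()
snappy.stream_decompress(src=compressed, dst=decompressed)
assert decompressed.getvalue() == b"hello snappy"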
Related issue number
Closes #22023
Checks
- I've run scripts/format.sh to lint the changes in this PR.