
Rewrite the Polars datasets to not rely on fsspec unnecessarily #625

Open
astrojuanlu opened this issue Mar 24, 2024 · 8 comments
Labels: bug, datasets

@astrojuanlu (Member) commented Mar 24, 2024

Description

Rewrite the Polars datasets (https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets/polars) so that they don't rely on fsspec, because they don't need it.

Context

Polars can read from remote filesystems just fine thanks to https://docs.rs/object_store/latest/object_store/, but the kedro-datasets version asks me for fsspec dependencies anyway:

In [13]: ds = EagerPolarsDataset(filepath="s3://reddit-submissions/submissions-raw", file_format="delta")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File ~/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/site-packages/fsspec/registry.py:238, in get_filesystem_class(protocol)
    237 try:
--> 238     register_implementation(protocol, _import_class(bit["class"]))
    239 except ImportError as e:

File ~/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/site-packages/fsspec/registry.py:273, in _import_class(cls, minv)
    272 s3 = mod == "s3fs"
--> 273 mod = importlib.import_module(mod)
    274 if s3 and mod.__version__.split(".") < ["0", "5"]:

File /opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py:126, in import_module(name, package)
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1204, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1176, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1140, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 's3fs'
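
For contrast, a minimal sketch of reading the same data without fsspec in the loop, using Polars' native readers (the Delta path goes through the deltalake package, which uses object_store underneath). The storage_options contents here are illustrative assumptions:

import polars as pl

# Illustrative options; in practice credentials come from the environment
# or from configuration.
storage_options = {"aws_region": "us-east-1"}

# Delta: handled by deltalake on top of object_store, no s3fs required.
df = pl.read_delta(
    "s3://reddit-submissions/submissions-raw", storage_options=storage_options
)

# Parquet: Polars' native scan also talks to S3 through object_store.
lf = pl.scan_parquet(
    "s3://reddit-submissions/submissions-raw/*.parquet",
    storage_options=storage_options,
)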

Related: #590

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V):
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow):
  • Python version used (python -V):
  • Operating system and version:
astrojuanlu added the bug and datasets labels on Mar 24, 2024
@noklam (Contributor) commented Mar 24, 2024

Interesting to learn why Rust chose an object store instead of a general filesystem interface. One thing that comes to mind is how to make Kedro versioning work with this, as we use fs.glob almost everywhere. We may need a new mechanism for handling object stores.

To be fair, local + object store covers the majority of cases. We also handle some niche filesystems like HDFS (common 10 years ago, almost non-existent now), SFTP, etc.
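
As a simplified sketch of why that matters (bucket and paths hypothetical): Kedro's versioned datasets store each version under <filepath>/<timestamp>/<filename> and resolve the latest one by globbing through fsspec, roughly:

import fsspec

fs = fsspec.filesystem("s3")  # this is exactly where s3fs gets pulled in
# Timestamps sort lexicographically, so the last match is the newest version.
candidates = fs.glob("my-bucket/data/table.parquet/*/table.parquet")
latest = sorted(candidates)[-1]

An object_store-based implementation would need an equivalent listing step, which is what makes the switch non-trivial.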

@astrojuanlu (Member, Author) commented

The good thing is that, if you drop fsspec and leverage the underlying mechanism of the target library, writing a new custom dataset is trivial:

https://github.com/astrojuanlu/workshop-kedro-huggingface/blob/7666a33/delta_polars_dataset.py

@astrojuanlu (Member, Author) commented

Today I showed this fsspec-free dataset to a user and they were happy to see how easy it is to write:

from __future__ import annotations  # allows `X | None` annotations on Python < 3.10

import typing as t

import polars as pl
from kedro.io import AbstractDataset


class DeltaPolarsDataset(AbstractDataset[pl.DataFrame, pl.DataFrame]):
    """``DeltaDataset`` loads/saves data from/to a Delta Table using an underlying
    filesystem (e.g.: local, S3, GCS). It returns a Polars dataframe.
    """

    DEFAULT_LOAD_ARGS: dict[str, t.Any] = {}
    DEFAULT_SAVE_ARGS: dict[str, t.Any] = {}

    def __init__(
        self,
        filepath: str,
        load_args: dict[str, t.Any] | None = None,
        save_args: dict[str, t.Any] | None = None,
        credentials: dict[str, t.Any] | None = None,
        storage_options: dict[str, t.Any] | None = None,
        metadata: dict[str, t.Any] | None = None,
    ):
        self._filepath = filepath
        self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}
        self._save_args = {**self.DEFAULT_SAVE_ARGS, **(save_args or {})}
        self._credentials = credentials or {}
        self._storage_options = storage_options or {}

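        # Merge credentials into storage_options; Polars forwards these to
        # its object_store-backed readers/writers when resolving remote paths.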
        self._storage_options.update(self._credentials)
        self._metadata = metadata or {}

    def _load(self) -> pl.DataFrame:
        return pl.read_delta(
            self._filepath, storage_options=self._storage_options, **self._load_args
        )

    def _save(self, data: pl.DataFrame) -> None:
        data.write_delta(
            self._filepath, storage_options=self._storage_options, **self._save_args
        )

    def _describe(self) -> dict[str, t.Any]:
        return dict(
            filepath=self._filepath,
            load_args=self._load_args,
            save_args=self._save_args,
            storage_options=self._storage_options,
            metadata=self._metadata,
        )
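
A hypothetical usage example (bucket name and credential values are placeholders):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

ds = DeltaPolarsDataset(
    filepath="s3://my-bucket/my-delta-table",
    credentials={"aws_access_key_id": "...", "aws_secret_access_key": "..."},
)
ds.save(df)           # delegates to DataFrame.write_delta
reloaded = ds.load()  # delegates to pl.read_delta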

@merelcht changed the title from "Polars dataset relies on fsspec unnecessarily" to "Rewrite the Polars datasets to not rely on fsspec unnecessarily" on Apr 8, 2024
@astrojuanlu (Member, Author) commented

It was noted in backlog grooming today by @noklam that fsspec is tied to our versioning, so ditching it might be more complicated than expected and not worth the effort.

For the particular case of the Delta format, let's:

  1. Look at "Cannot use file_format: delta with polars.EagerPolarsDataset" #444, and
  2. If the above cannot be easily fixed, contribute a custom DeltaPolarsDataset like the one I wrote in #625 (comment) above.

@noklam (Contributor) commented Apr 8, 2024

From backlog grooming:

  • If we consider versioning, switching to object_store is a bigger change, because we need a new mechanism to figure out which version is the latest.
  • The Delta dataset itself doesn't support versioning yet, so maybe we can add a new dataset instead of rewriting all the Polars datasets for now.

@MatthiasRoels (Contributor) commented

Interesting to learn why Rust chose an object store instead of a general filesystem interface.

@noklam: It is explained in the docs.

@MatthiasRoels (Contributor) commented

I did some experiments to see where we stand with plain vanilla read_*, scan_* and write_* operations on object stores (on a local filesystem, they work as expected).

  • read_parquet and scan_parquet work as expected (sketched after this list). It is even possible to read Parquet output generated by Spark (i.e. a directory containing Parquet files) using a glob pattern. If, however, your dataset uses Hive partitioning, e.g. .../year=2024/month=6/day=26/*, you still have to use pyarrow...
  • write_parquet doesn't work directly unless you set use_pyarrow=True. Otherwise, you have to use e.g. s3fs to write to S3 (basically how we currently have it implemented).
  • read_csv is the only CSV method that currently supports reading from S3; both scan_csv and write_csv fail. So to write CSVs, you'll have to fall back to our current implementation using s3fs.
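
A sketch of the Parquet behaviour described above (paths and storage options are hypothetical):

import polars as pl

opts = {"aws_region": "us-east-1"}  # illustrative only

# Reading works natively (object_store), including glob patterns:
lf = pl.scan_parquet("s3://my-bucket/spark-output/*.parquet", storage_options=opts)
df = lf.collect()

# Writing needs the pyarrow fallback, per the findings above:
df.write_parquet("s3://my-bucket/out/data.parquet", use_pyarrow=True)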

So if we make any change, we could split the logic based on file_format: if Parquet, use the native methods (unless you want to read data with Hive partitioning), and use the current implementation otherwise.

Any thoughts?

@astrojuanlu (Member, Author) commented

So if we make any change, we could split the logic based on file_format: if Parquet, use the native methods (unless you want to read data with Hive partitioning), and use the current implementation otherwise.

Any thoughts?

At this point, we should start by documenting what the "social contract" of Kedro datasets is; in other words, how far we go in filling gaps like the ones you describe.

Somewhat related: kedro-org/kedro#1936
