
Support Polars lazy evaluation #224

Closed
noklam opened this issue Jun 2, 2023 · 8 comments · Fixed by #350

@noklam
Contributor

noklam commented Jun 2, 2023

Description

Polars is efficient and supports a lazy evaluation mode, which could be useful for memory-hungry pipelines.

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
Maybe we could support a lazy flag that switches to pl.scan_csv?
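For illustration, a minimal sketch of the difference between eager and lazy reading in Polars (the file and column names are placeholders):

import polars as pl

lf = pl.scan_csv("data.csv")               # LazyFrame: nothing is read yet
df = lf.filter(pl.col("a") > 0).collect()  # plan is optimised, then executed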

Context

Why is this change important to you? How would you use it? How can it benefit other users?

Possible Implementation

(Optional) Suggest an idea for implementing the addition or change.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

@astrojuanlu
Member

Tangentially related to kedro-org/kedro#2374?

@noklam
Contributor Author

noklam commented Jun 2, 2023

I think it's different. Polars already supports this out of the box, so there isn't much implementation overhead. Something like this:

import polars as pl

if lazy:
    data = pl.scan_csv(xxx)  # returns a LazyFrame; nothing is read yet
else:
    data = pl.read_csv(xxx)  # reads eagerly into a DataFrame

@MatthiasRoels
Contributor

MatthiasRoels commented Jun 12, 2023

Yes, but I do see another issue in the current implementation of PolarsDatasets... Currently, polars cannot read from (remote) object stores such as S3 natively. There are several solutions for this:

  • Make use of fsspec, which is what is currently implemented in kedro-datasets. The drawback is that we have to load everything into memory first (using pure Python code, as fsspec does all the work) before we can create a pl.LazyFrame object. This is not ideal for larger-than-memory datasets, and it's also not very efficient (you rely on Python to do the heavy lifting).
  • First download the file from object storage (e.g. S3) to local disk and then use a read_*/scan_* operation. That way we can take full advantage of Polars' Rust I/O implementation, so you should get better performance. You do, however, introduce some overhead with the initial download (a rough sketch of this option follows the snippet below).
  • Use pyarrow to do the heavy lifting and convert the resulting pyarrow dataset to a Polars LazyFrame with pl.scan_pyarrow_dataset. The implementation would then be something like the snippet below. The drawback of this approach is that the API is experimental...
import fsspec
import polars as pl
import pyarrow.dataset as ds

def lazy_load_dataset(file_uri: str, format: str = "parquet") -> pl.LazyFrame:
    # fsspec provides the S3 filesystem; pyarrow builds a (lazy) dataset over it
    fs = fsspec.filesystem("s3")
    dataset = ds.dataset(file_uri, filesystem=fs, format=format)
    return pl.scan_pyarrow_dataset(dataset)
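For comparison, a rough sketch of the second option (download first, then scan); the bucket and local paths are placeholders:

import fsspec
import polars as pl

fs = fsspec.filesystem("s3")
fs.get("bucket/data.parquet", "/tmp/data.parquet")  # download the object to local disk
lf = pl.scan_parquet("/tmp/data.parquet")           # Polars' Rust I/O scans the local copy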

Any suggestions on which of these three approaches would be best?

@astrojuanlu
Member

The third one looks the most promising to me! Many things in Polars are experimental and the project evolves fast, but I think it's worth trying.

@noklam
Contributor Author

noklam commented Jun 12, 2023

Why does using fsspec contradict lazy loading?

@MatthiasRoels
Contributor

MatthiasRoels commented Jun 12, 2023

Why does using fsspec contradict lazy loading?

It does not contradict it per se. It's just that when you use fsspec, Python does the I/O, which means you lose all the Polars-specific I/O goodies (read: the Rust implementation). Hence, this will negatively impact performance.

What happens under the hood is that Python code (from fsspec) is responsible for reading the data into memory, at which point Rust (Polars) takes over for the actual processing. This means that you can't benefit from things like predicate pushdown, etc.

On the other hand, you can still benefit from the optimised query plan of the LazyFrame API. You could almost think of this scenario as converting a Polars DataFrame to a LazyFrame at some point during processing.
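As a rough illustration of that scenario (the bucket path is a placeholder): fsspec reads the file into memory, after which the lazy API still gives you an optimised plan for everything downstream:

import fsspec
import polars as pl

with fsspec.open("s3://bucket/data.csv", "rb") as f:
    df = pl.read_csv(f)  # Python (fsspec) does the I/O; the data is fully in memory
lf = df.lazy()           # downstream transformations still get an optimised plan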

@inigohidalgo

I've been playing around with a similar implementation for partitioned parquet files on Azure Blob using adlfs. Having pyarrow as the common interface for loading data into memory, thanks to its interoperability with pandas (2.0), polars, duckdb..., seems like it could be an interesting proposition.
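A minimal sketch of that idea, assuming adlfs and a hypothetical storage account and container path:

import adlfs
import polars as pl
import pyarrow.dataset as ds

fs = adlfs.AzureBlobFileSystem(account_name="myaccount")  # hypothetical account
dataset = ds.dataset("container/partitioned_table", filesystem=fs, format="parquet")
lf = pl.scan_pyarrow_dataset(dataset)  # the same dataset could also feed pandas or duckdb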

@astrojuanlu
Member

IIUC, pyarrow.dataset https://arrow.apache.org/docs/python/dataset.html provides lazy/deferred reading capabilities in general, am I right?
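For reference, a minimal sketch of that deferred behaviour; the path and column names are placeholders. ds.dataset only inspects the schema, and data is read when to_table() is called, with projection and predicate pushdown:

import pyarrow.dataset as ds

dataset = ds.dataset("path/to/table", format="parquet")  # schema only; no data read yet
table = dataset.to_table(
    columns=["id", "value"],       # column projection pushed down to the scan
    filter=ds.field("value") > 0,  # predicate pushdown
)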

I don't think Arrow would work for all datasets (think of weird stuff like video, HDF5 and so on), but it's definitely worth giving it some thought for tabular datasets.

In any case, for a first iteration maybe we don't need to strive for consistency, and just focus on Polars lazy evaluation.
