
Support Polars lazy evaluation #224

Closed
noklam opened this issue Jun 2, 2023 · 8 comments · Fixed by #350

@noklam
Contributor

noklam commented Jun 2, 2023

Description

Polars is efficient and supports a lazy evaluation mode, which could be useful for memory-hungry pipelines.

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
Maybe we could support a lazy flag that switches to pl.scan_csv?
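For illustration, a minimal sketch of the difference between eager and lazy reading in Polars (the file and column names are placeholders):

import polars as pl

lf = pl.scan_csv("data.csv")               # LazyFrame: nothing is read yet
df = lf.filter(pl.col("a") > 0).collect()  # plan is optimised, then executed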

Context

Why is this change important to you? How would you use it? How can it benefit other users?

Possible Implementation

(Optional) Suggest an idea for implementing the addition or change.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

@astrojuanlu
Member

Tangentially related to kedro-org/kedro#2374?

@noklam
Contributor Author

noklam commented Jun 2, 2023

I think it's different. Polars already supports this out of the box, so there isn't much implementation overhead. Something like this:

import polars as pl

if lazy:
    data = pl.scan_csv(xxx)  # returns a LazyFrame; nothing is read yet
else:
    data = pl.read_csv(xxx)  # reads eagerly into a DataFrame

@MatthiasRoels
Contributor

MatthiasRoels commented Jun 12, 2023

Yes, but I do see another issue in the current implementation of PolarsDatasets... Currently, polars cannot read from (remote) object stores such as S3 natively. There are several solutions for this:

  • Make use of fsspec, which is what is currently implemented in kedro-datasets. The drawback is that we have to load everything into memory first (using pure Python code, as fsspec does all the work) before we can create a pl.LazyFrame object. This is not ideal for larger-than-memory datasets, and it's also not very efficient (you rely on Python to do the heavy lifting).
  • First download the file from object storage (e.g. S3) to local disk and then use a read_*/scan_* operation. That way we can take full advantage of Polars' Rust I/O implementation, so you should get better performance. You do, however, introduce some overhead with the initial download (a rough sketch of this option follows the snippet below).
  • Use pyarrow to do the heavy lifting and convert the resulting pyarrow dataset to a Polars LazyFrame with pl.scan_pyarrow_dataset. The implementation would then be something like the snippet below. The drawback of this approach is that the API is experimental...
import fsspec
import polars as pl
import pyarrow.dataset as ds

def lazy_load_dataset(file_uri: str, format: str = "parquet") -> pl.LazyFrame:
    # fsspec provides the S3 filesystem; pyarrow builds a (lazy) dataset over it
    fs = fsspec.filesystem("s3")
    dataset = ds.dataset(file_uri, filesystem=fs, format=format)
    return pl.scan_pyarrow_dataset(dataset)
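For comparison, a rough sketch of the second option (download first, then scan); the bucket and local paths are placeholders:

import fsspec
import polars as pl

fs = fsspec.filesystem("s3")
fs.get("bucket/data.parquet", "/tmp/data.parquet")  # download the object to local disk
lf = pl.scan_parquet("/tmp/data.parquet")           # Polars' Rust I/O scans the local copy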

Any suggestions on which of these three approaches would be best?

@astrojuanlu
Member

The third one looks the most promising to me! Many things in Polars are experimental and the project evolves fast, but I think it's worth trying.

@noklam
Contributor Author

noklam commented Jun 12, 2023

Why does using fsspec contradict lazy loading?

@MatthiasRoels
Contributor

MatthiasRoels commented Jun 12, 2023

Why does using fsspec contradict lazy loading?

It does not contradict it per se. It's just that when you use fsspec, Python does the I/O, which means you lose all the Polars-specific I/O goodies (read: the Rust implementation). Hence, this will negatively impact performance.

What happens under the hood is that Python code (from fsspec) is responsible for reading the data into memory, at which point Rust (Polars) takes over for the actual processing. This means that you can't benefit from things like predicate pushdown, etc.

On the other hand, you can still benefit from the optimised query plan of the LazyFrame API. You could almost think of this scenario as converting a Polars DataFrame to a LazyFrame at some point during processing.
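As a rough illustration of that scenario (the bucket path is a placeholder): fsspec reads the file into memory, after which the lazy API still gives you an optimised plan for everything downstream:

import fsspec
import polars as pl

with fsspec.open("s3://bucket/data.csv", "rb") as f:
    df = pl.read_csv(f)  # Python (fsspec) does the I/O; the data is fully in memory
lf = df.lazy()           # downstream transformations still get an optimised plan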

@inigohidalgo

I've been playing around with a similar implementation for partitioned parquet files on Azure Blob using adlfs. Having pyarrow as the common interface for loading data into memory, thanks to its interoperability with pandas (2.0), polars, duckdb..., seems like it could be an interesting proposition.
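A minimal sketch of that idea, assuming adlfs and a hypothetical storage account and container path:

import adlfs
import polars as pl
import pyarrow.dataset as ds

fs = adlfs.AzureBlobFileSystem(account_name="myaccount")  # hypothetical account
dataset = ds.dataset("container/partitioned_table", filesystem=fs, format="parquet")
lf = pl.scan_pyarrow_dataset(dataset)  # the same dataset could also feed pandas or duckdb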

@astrojuanlu
Member

IIUC, pyarrow.dataset https://arrow.apache.org/docs/python/dataset.html provides lazy/deferred reading capabilities in general, am I right?
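For reference, a minimal sketch of that deferred behaviour; the path and column names are placeholders. ds.dataset only inspects the schema, and data is read when to_table() is called, with projection and predicate pushdown:

import pyarrow.dataset as ds

dataset = ds.dataset("path/to/table", format="parquet")  # schema only; no data read yet
table = dataset.to_table(
    columns=["id", "value"],       # column projection pushed down to the scan
    filter=ds.field("value") > 0,  # predicate pushdown
)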

I don't think Arrow would work for all datasets (think of weird stuff like video, HDF5 and so on), but it's definitely worth giving it some thought for tabular datasets.

In any case, for a first iteration maybe we don't need to strive for consistency, and just focus on Polars lazy evaluation.
