Support Polars lazy evaluation #224
Comments
Tangentially related to kedro-org/kedro#2374?
I think it's different. Polars already supports this out of the box, so there isn't much implementation overhead. Something like this:

```python
if lazy:
    pl.scan_csv(xxx)
else:
    pl.read_csv(xxx)
```
Yes, but I do see another issue in the current implementation of the Polars datasets: currently, Polars cannot read from (remote) object stores such as S3 natively. There are several solutions for this:
```python
import fsspec
import polars as pl
import pyarrow.dataset as ds


def lazy_load_dataset(file_uri: str, format: str = "parquet"):
    fs = fsspec.filesystem("s3")
    dataset = ds.dataset(file_uri, filesystem=fs, format=format)
    return pl.scan_pyarrow_dataset(dataset)
```

Any suggestions on which of these three would be the best approach?
The third one looks the most promising to me! Many things in Polars are experimental and the project evolves fast, but I think it's worth trying.
Why does using fsspec contradict lazy loading?
It does not contradict it per se. What happens under the hood is that Python code (from fsspec) does the actual reading rather than Polars' native Rust readers, so the I/O itself is not lazy. On the other hand, you can still benefit from the optimised query plan because of the LazyFrame API. You could almost think of this scenario as converting a Polars DataFrame to a LazyFrame at some point during processing.
I've been playing around with a similar implementation for partitioned parquet files on Azure Blob using adlfs. Having pyarrow as the common interface for loading data into memory thanks to its interoperability with pandas (2.0), polars, duckdb... seems like it could be an interesting proposition. |
I don't think Arrow would work for all datasets (think of weird stuff like video, HDF5 and so on), but it's definitely worth giving it some thought for tabular datasets. In any case, for a first iteration maybe we don't need to strive for consistency, and just focus on Polars lazy evaluation.
Description
Polars is efficient and supports a lazy evaluation mode, which could be useful for memory-hungry pipelines.
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
Maybe support a `lazy` flag to use `pl.scan_csv`?
Context
Why is this change important to you? How would you use it? How can it benefit other users?
Possible Implementation
(Optional) Suggest an idea for implementing the addition or change.
Possible Alternatives
(Optional) Describe any alternative solutions or features you've considered.