Polars lazyframe #2695

Merged (12 commits) on Aug 28, 2024

cosmicBboy
Contributor

Tracking issue

Fixes flyteorg/flyte#5678

Why are the changes needed?

The flytekit plugin for polars only supports polars.DataFrame. Many polars users rely on the LazyFrame API to make data handling more memory efficient.

What changes were proposed in this pull request?

This PR:

  • Adds and registers a LazyFrame encoder/decoder to the StructuredDataset type.
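
As a rough illustration of what "registering an encoder/decoder for a type" means, here is a minimal, standard-library-only sketch. It is a toy model, not flytekit's API: `HandlerRegistry`, `LazyRows`, `encode_lazy_rows`, and `decode_lazy_rows` are hypothetical names, and JSON stands in for the storage format purely to keep the example dependency-free.

```python
import json
from typing import Any, Callable, Dict, List, Tuple


class LazyRows:
    """Hypothetical lazy container: rows are produced only when collect() is called."""

    def __init__(self, producer: Callable[[], List[dict]]) -> None:
        self._producer = producer

    def collect(self) -> List[dict]:
        return self._producer()


class HandlerRegistry:
    """Toy registry keyed by (Python type, storage format). Not flytekit's API."""

    def __init__(self) -> None:
        self._encoders: Dict[Tuple[type, str], Callable[[Any], bytes]] = {}
        self._decoders: Dict[Tuple[type, str], Callable[[bytes], Any]] = {}

    def register(
        self,
        py_type: type,
        fmt: str,
        encoder: Callable[[Any], bytes],
        decoder: Callable[[bytes], Any],
    ) -> None:
        key = (py_type, fmt)
        if key in self._encoders:
            raise ValueError(f"handlers already registered for {key}")
        self._encoders[key] = encoder
        self._decoders[key] = decoder

    def encode(self, value: Any, fmt: str) -> bytes:
        # Dispatch on the runtime type of the value being written.
        return self._encoders[(type(value), fmt)](value)

    def decode(self, payload: bytes, py_type: type, fmt: str) -> Any:
        # Dispatch on the type the consumer asked for.
        return self._decoders[(py_type, fmt)](payload)


def encode_lazy_rows(value: LazyRows) -> bytes:
    # In this toy model, writing to storage forces the lazy value to materialize.
    return json.dumps(value.collect()).encode("utf-8")


def decode_lazy_rows(payload: bytes) -> LazyRows:
    # Parsing is deferred until the consumer calls collect().
    return LazyRows(lambda: json.loads(payload.decode("utf-8")))


registry = HandlerRegistry()
registry.register(LazyRows, "json", encode_lazy_rows, decode_lazy_rows)

original = LazyRows(lambda: [{"col1": 1}, {"col1": 3}, {"col1": 2}])
restored = registry.decode(registry.encode(original, "json"), LazyRows, "json")
total = sum(row["col1"] for row in restored.collect())
```

The point of the sketch is only the shape of the mechanism: once a handler pair is registered for a type, a value of that type can be written by one task and handed to the next as the same lazy type.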

How was this patch tested?

Unit tests run on CI.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Niels Bantilan <[email protected]>

codecov bot commented Aug 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 46.14%. Comparing base (a8f68d7) to head (56662a6).
Report is 24 commits behind head on master.

❗ BASE (a8f68d7) and HEAD (56662a6) have a different number of uploaded reports: BASE has 32 uploads and HEAD has 1, so HEAD has 31 fewer uploads than BASE.
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2695       +/-   ##
===========================================
- Coverage   78.91%   46.14%   -32.78%     
===========================================
  Files         316      187      -129     
  Lines       24965    19249     -5716     
  Branches     4012     2790     -1222     
===========================================
- Hits        19702     8882    -10820     
- Misses       4548     9932     +5384     
+ Partials      715      435      -280     


eapolinario previously approved these changes Aug 22, 2024
Member

@thomasjpfan thomasjpfan left a comment

On the flyte sandbox, the following stalls for me:

from flytekit import task, workflow, ImageSpec
import polars as pl

polars_plugin = "git+https://github.com/flyteorg/flytekit@55db243b51bfe5137942b68aa212ecb8edc27bcd#subdirectory=plugins/flytekit-polars"

image = ImageSpec(
    name="polars",
    packages=[polars_plugin, "flytekit"],
    registry="localhost:30000",
    apt_packages=["git"],
)


@task(container_image=image)
def create_lazy_frame() -> pl.LazyFrame:
    return pl.LazyFrame({"col1": [1, 3, 2], "col2": list("abc")})


@task(container_image=image)
def read_pl(df: pl.LazyFrame) -> int:
    return df.select(pl.col("col1").sum()).collect().item()


@workflow
def main() -> int:
    df = create_lazy_frame()
    return read_pl(df=df)

# Register the Polars LazyFrame handlers
StructuredDatasetTransformerEngine.register(PolarsLazyFrameToParquetEncodingHandler())
StructuredDatasetTransformerEngine.register(ParquetToPolarsLazyFrameDecodingHandler())
StructuredDatasetTransformerEngine.register_renderer(pl.LazyFrame, PolarsDataFrameRenderer())
Member

Not actionable for this PR: I do not like how a possibly computationally expensive renderer gets attached just by using the Polars DataFrame plugin.

Contributor Author

Should we make the renderer for this explicit (either using an Annotated type or in the task function body)?

Contributor Author

@thomasjpfan I removed register_renderer for the LazyFrame. I think it's reasonable to assume that anyone using LazyFrame doesn't want a potentially computationally intensive operation performed at the end of the task unless they do it explicitly.

@cosmicBboy
Contributor Author

failing greatexpectations plugin should be fixed in a separate PR

@eapolinario
Collaborator

@cosmicBboy , @thomasjpfan , can we merge this?

@cosmicBboy
Contributor Author

one sec @eapolinario let's not merge yet, still testing on flyte sandbox and serverless

@eapolinario eapolinario merged commit 9a08c1a into master Aug 28, 2024
101 of 103 checks passed
@cosmicBboy cosmicBboy deleted the polars-lazyframe branch August 29, 2024 14:17
Development

Successfully merging this pull request may close these issues.

[Core feature] Add LazyFrame support for polars plugin
4 participants