-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Polars lazyframe #2695
Polars lazyframe #2695
Conversation
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
plugins/flytekit-polars/flytekitplugins/polars/sd_transformers.py
Outdated
Show resolved
Hide resolved
plugins/flytekit-polars/flytekitplugins/polars/sd_transformers.py
Outdated
Show resolved
Hide resolved
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2695 +/- ##
===========================================
- Coverage 78.91% 46.14% -32.78%
===========================================
Files 316 187 -129
Lines 24965 19249 -5716
Branches 4012 2790 -1222
===========================================
- Hits 19702 8882 -10820
- Misses 4548 9932 +5384
+ Partials 715 435 -280 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the flyte sandbox, the following stalls for me:
from flytekit import task, workflow, ImageSpec
import polars as pl
polars_plugin = "git+https://github.com/flyteorg/flytekit@55db243b51bfe5137942b68aa212ecb8edc27bcd#subdirectory=plugins/flytekit-polars"
image = ImageSpec(
name="polars",
packages=[polars_plugin, "flytekit"],
registry="localhost:30000",
apt_packages=["git"],
)
@task(container_image=image)
def create_lazy_frame() -> pl.LazyFrame:
return pl.LazyFrame({"col1": [1, 3, 2], "col2": list("abc")})
@task(container_image=image)
def read_pl(df: pl.LazyFrame) -> int:
return df.select(pl.col("col1").sum()).collect().item()
@workflow
def main() -> int:
df = create_lazy_frame()
return read_pl(df=df)
# Register the Polars LazyFrame handlers | ||
StructuredDatasetTransformerEngine.register(PolarsLazyFrameToParquetEncodingHandler()) | ||
StructuredDatasetTransformerEngine.register(ParquetToPolarsLazyFrameDecodingHandler()) | ||
StructuredDatasetTransformerEngine.register_renderer(pl.LazyFrame, PolarsDataFrameRenderer()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not actionable for this PR: I do not like how an possibility computability expensive renderer is attach to using a Polars DataFrames plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make the renderer for this explicit? (either using an Annotated
type or in the task function body?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan I removed register_renderer
for the LazyFrame... I think it's reasonable to think that any user who uses this doesn't want to perform potentially computationally intensive operation at the end of the task without doing it explicitly
Signed-off-by: Niels Bantilan <[email protected]>
failing greatexpectations plugin should be fixed in a separate PR |
@cosmicBboy , @thomasjpfan , can we merge this? |
one sec @eapolinario let's not merge yet, still testing on flyte sandbox and serverless |
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Tracking issue
Fixes flyteorg/flyte#5678
Why are the changes needed?
The flytekit plugin for polars only supports
polars.DataFrame
. Many polars users rely on theLazyFrame
API to make data handling more memory efficient.What changes were proposed in this pull request?
This PR:
LazyFrame
encoder/decoder to theStructuredDataset
type.How was this patch tested?
Unit tests run on CI.
Check all the applicable boxes