Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Extrapolate_flat parameter in interpolate_by #18355

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

agossard
Copy link
Contributor

Adds "extrapolate_flat" as a parameter to interpolate_by. If True, we will extrapolate values outside the range of the min/max of by column with the associated values of the min/max by positions in the expression column. This seeks to match the default behavior of numpy.interp.

This parameter defaults to False, in order to preserve existing behavior. In a vacuum, I would consider having this parameter default to True.

@agossard agossard changed the title Feature/interpolate by enhancements feat: Extrapolate_flat parameter in interpolate_by Aug 24, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Aug 24, 2024
Copy link

codecov bot commented Aug 25, 2024

Codecov Report

Attention: Patch coverage is 97.66082% with 4 lines in your changes missing coverage. Please review.

Project coverage is 79.83%. Comparing base (dd1fc86) to head (3a89261).

Files Patch % Lines
crates/polars-plan/src/dsl/function_expr/mod.rs 50.00% 2 Missing ⚠️
.../polars-python/src/lazyframe/visitor/expr_nodes.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18355      +/-   ##
==========================================
- Coverage   79.84%   79.83%   -0.01%     
==========================================
  Files        1496     1496              
  Lines      200333   200456     +123     
  Branches     2841     2841              
==========================================
+ Hits       159952   160044      +92     
- Misses      39856    39887      +31     
  Partials      525      525              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MarcoGorelli
Copy link
Collaborator

thanks @agossard ! will take a look

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice, thanks @agossard !

I've just left some minor comments

Is the name extrapolate_flat taken from anywhere?

assert_frame_equal(result, expected)
# assert_frame_equal(result, expected)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably by mistake. Good catch. I'll do the other clean ups (though what do you mean by missing full stop?)

Thanks. And let me know what you want to do with the name of the parameter.

Comment on lines +6170 to +6172
extrapolate_flat
If True, extrapolate the highest and lowest values of the expression in
the regions below and above the highest/lowest by values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we include an example which uses extrapolate_flat? users tend to learn better from examples than from descriptions

Comment on lines +6170 to +6172
extrapolate_flat
If True, extrapolate the highest and lowest values of the expression in
the regions below and above the highest/lowest by values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing full stop

Comment on lines +6134 to +6136
extrapolate_flat
If True, extrapolate the highest and lowest values of the expression in
the regions below and above the highest/lowest by values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Comment on lines +1036 to +1037
map_as_slice!(dispatch::interpolate_by, extrapolate_flat)
//map_as_slice!(dispatch::interpolate_by, extrapolate_flat)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented out

@agossard
Copy link
Contributor Author

Really nice, thanks @agossard !

I've just left some minor comments

Is the name extrapolate_flat taken from anywhere?

I'm not at all wedded to it. And no, it didn't come from anywhere. I feel like "extrapolate" could be fine and maybe better. I just added the "flat" to try to be extra pedantic about how it works... i.e. it's not like it's going to extrapolate base on the relationship between the two series or anything... or the slope of the last two points or something... it's just going to be flat (in the y-axis)

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Aug 28, 2024

ok thanks

going to think about this more and look more what other tools do (because a counterargument to adding this could be: just do .interpolate_by(...).forward_fill().backward_fill())

@agossard
Copy link
Contributor Author

ok thanks

going to think about this more and look more what other tools do (because a counterargument to adding this could be: just do .interpolate_by(...).forward_fill().backward_fill())

Doesn’t that rely on the by column being sorted?

The PR behavior (with True) is the default behavior of np.interp. https://numpy.org/doc/stable/reference/generated/numpy.interp.html

They allow you to specify the values to use above and below the range, but default is to use the max and min.

@agossard
Copy link
Contributor Author

Ok, did a quick survey of a few more libraries. Some of them (Matlab, scipy) provide all sorts of fancy fitting functionality (cubic splines, etc, etc). If you’re doing that, then it is possible to “extrapolate” with this actual functional form. These libraries call that operation “extrapolate” and refer to the thing I (and numpy is doing) with something more like “fill value.” Based on the fact that we are implanting a simple linear interpolation process, and not offering curve fitting, I continue to think that providing this option and matching the numpy functionality makes sense. However, I don’t necessarily think the parameter should involve the word “extrapolate” anymore.

@MarcoGorelli
Copy link
Collaborator

Doesn’t that rely on the by column being sorted?

ah you're right thanks, I'd forgotten that interpolate_by already doesn't assume sortedness

However, I don’t necessarily think the parameter should involve the word “extrapolate” anymore.

ok let's bikeshed on the name a bit longer then 😄

@agossard
Copy link
Contributor Author

agossard commented Sep 1, 2024

One thing to consider is how strongly do we feel about maintaining the existing behavior, even as an option. If we want to make a breaking change, we can simple match the interface of numpy interp exactly. In other words, you can pass “left” and “right” (not that I love THOSE parameter names) and it will use those as the fill values above and below the range. If they are null, it will use fp[0] and fp[-1]. So there is no way to end up with your existing behavior in that scenario.

I would lean this way personally. When I reach for interpolation, it’s generally because I have some null values and I want to end up with no null values. That can be accomplished with different levels of complexity and fanciness… but the goal is always “get rid of the missing values” so the existing polars behavior means I’m not done. If I don’t like linear interpolation with values above the range filled in with a scalar… well… then maybe I should reach for a more complicated science package outside of polars. “Simple linear interpolation in polars in the range, and then something custom above/below the range” doesn’t seem like all that common of a workflow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants