-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Extrapolate_flat parameter in interpolate_by #18355
base: main
Are you sure you want to change the base?
feat: Extrapolate_flat parameter in interpolate_by #18355
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #18355 +/- ##
==========================================
- Coverage 79.84% 79.83% -0.01%
==========================================
Files 1496 1496
Lines 200333 200456 +123
Branches 2841 2841
==========================================
+ Hits 159952 160044 +92
- Misses 39856 39887 +31
Partials 525 525 ☔ View full report in Codecov by Sentry. |
thanks @agossard ! will take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice, thanks @agossard !
I've just left some minor comments
Is the name extrapolate_flat
taken from anywhere?
assert_frame_equal(result, expected) | ||
# assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably by mistake. Good catch. I'll do the other clean ups (though what do you mean by missing full stop?)
Thanks. And let me know what you want to do with the name of the parameter.
extrapolate_flat | ||
If True, extrapolate the highest and lowest values of the expression in | ||
the regions below and above the highest/lowest by values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we include an example which uses extrapolate_flat
? users tend to learn better from examples than from descriptions
extrapolate_flat | ||
If True, extrapolate the highest and lowest values of the expression in | ||
the regions below and above the highest/lowest by values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing full stop
extrapolate_flat | ||
If True, extrapolate the highest and lowest values of the expression in | ||
the regions below and above the highest/lowest by values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
map_as_slice!(dispatch::interpolate_by, extrapolate_flat) | ||
//map_as_slice!(dispatch::interpolate_by, extrapolate_flat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commented out
I'm not at all wedded to it. And no, it didn't come from anywhere. I feel like "extrapolate" could be fine and maybe better. I just added the "flat" to try to be extra pedantic about how it works... i.e. it's not like it's going to extrapolate base on the relationship between the two series or anything... or the slope of the last two points or something... it's just going to be flat (in the y-axis) |
ok thanks going to think about this more and look more what other tools do (because a counterargument to adding this could be: just do |
Doesn’t that rely on the by column being sorted? The PR behavior (with True) is the default behavior of np.interp. https://numpy.org/doc/stable/reference/generated/numpy.interp.html They allow you to specify the values to use above and below the range, but default is to use the max and min. |
Ok, did a quick survey of a few more libraries. Some of them (Matlab, scipy) provide all sorts of fancy fitting functionality (cubic splines, etc, etc). If you’re doing that, then it is possible to “extrapolate” with this actual functional form. These libraries call that operation “extrapolate” and refer to the thing I (and numpy is doing) with something more like “fill value.” Based on the fact that we are implanting a simple linear interpolation process, and not offering curve fitting, I continue to think that providing this option and matching the numpy functionality makes sense. However, I don’t necessarily think the parameter should involve the word “extrapolate” anymore. |
ah you're right thanks, I'd forgotten that
ok let's bikeshed on the name a bit longer then 😄 |
One thing to consider is how strongly do we feel about maintaining the existing behavior, even as an option. If we want to make a breaking change, we can simple match the interface of numpy interp exactly. In other words, you can pass “left” and “right” (not that I love THOSE parameter names) and it will use those as the fill values above and below the range. If they are null, it will use fp[0] and fp[-1]. So there is no way to end up with your existing behavior in that scenario. I would lean this way personally. When I reach for interpolation, it’s generally because I have some null values and I want to end up with no null values. That can be accomplished with different levels of complexity and fanciness… but the goal is always “get rid of the missing values” so the existing polars behavior means I’m not done. If I don’t like linear interpolation with values above the range filled in with a scalar… well… then maybe I should reach for a more complicated science package outside of polars. “Simple linear interpolation in polars in the range, and then something custom above/below the range” doesn’t seem like all that common of a workflow. |
Adds "extrapolate_flat" as a parameter to interpolate_by. If True, we will extrapolate values outside the range of the min/max of by column with the associated values of the min/max by positions in the expression column. This seeks to match the default behavior of numpy.interp.
This parameter defaults to False, in order to preserve existing behavior. In a vacuum, I would consider having this parameter default to True.