Dispatch aggregate #1116
Hey @TheooJ, very nice effort! Here is a first pass of comments.
On a higher level:
- Since you use the `CheckInputDataFrame` class in `AggJoiner.fit_transform`, you should be able to remove the `_check_dataframes` method entirely. And since `_check_inputs` currently calls `_check_dataframes`, you have to place them in reverse order:

      self._main_check_input = CheckInputDataFrame()
      X = self._main_check_input.fit_transform(X)
      self._check_inputs(X)

  Additionally, the `check_inputs` method of the `AggTarget` could be simplified because `CheckInputDataFrame` does most of the checks.
- You should add a `get_feature_names_out` method that returns `self.all_outputs_` for both `AggJoiner` and `AggTarget`.
- I think we should allow `key`, `aux_key` and `main_key` to be selectors.
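To make the first two points concrete, here is a minimal sketch of the suggested call order and of `get_feature_names_out`. It is a toy stand-in, not the actual skrub code: `ToyCheckInput` and `ToyAggJoiner` are hypothetical names, and the aggregation itself is reduced to a plain pandas groupby. Only the order of the checks and the fact that `get_feature_names_out` returns `all_outputs_` come from the comments above.

```python
import pandas as pd


class ToyCheckInput:
    """Hypothetical stand-in for CheckInputDataFrame: validates the input type."""

    def fit_transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            raise TypeError(f"Expected a pandas DataFrame, got {type(X).__name__}")
        return X


class ToyAggJoiner:
    """Hypothetical, simplified joiner that only illustrates the call order."""

    def __init__(self, aux_table, key, cols, operation="mean"):
        self.aux_table = aux_table
        self.key = key
        self.cols = cols
        self.operation = operation

    def _check_inputs(self, X):
        # Only parameter checks remain here; the dataframe check already ran.
        if self.key not in X.columns:
            raise ValueError(f"Key {self.key!r} not found in the main table")

    def fit_transform(self, X, y=None):
        # Validate the dataframe first, then validate the parameters.
        self._main_check_input = ToyCheckInput()
        X = self._main_check_input.fit_transform(X)
        self._check_inputs(X)

        # Aggregate the auxiliary table per key and suffix the new columns.
        aggregated = (
            self.aux_table.groupby(self.key)[self.cols]
            .agg(self.operation)
            .add_suffix(f"_{self.operation}")
            .reset_index()
        )
        result = X.merge(aggregated, on=self.key, how="left")
        # Record the output column names for get_feature_names_out.
        self.all_outputs_ = list(result.columns)
        return result

    def get_feature_names_out(self):
        return self.all_outputs_


main = pd.DataFrame({"user": [1, 2]})
aux = pd.DataFrame({"user": [1, 1, 2], "rating": [4.0, 2.0, 5.0]})
joiner = ToyAggJoiner(aux, key="user", cols=["rating"])
out = joiner.fit_transform(main)
```

With this ordering, a non-dataframe input fails in the input check before `_check_inputs` ever looks at `X.columns`.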
Thanks a lot @Vincent-Maladiere! To answer your points:

I agree that the use cases for 3. are rarer, but it looks weird to support selectors for the `cols` argument only. I might have keys that I can identify using regexes, for instance when the name of the column changes for whatever reason. Not a strong requirement, but it shouldn't be too costly to add.
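As a rough illustration of the regex use case, the sketch below resolves a key column by pattern before any join happens. `resolve_key` is a hypothetical helper written for this example with plain `re` and pandas; it is not the skrub selectors API, and the column names are made up.

```python
import re

import pandas as pd


def resolve_key(table, pattern):
    """Return the single column of `table` whose name fully matches `pattern`."""
    matches = [col for col in table.columns if re.fullmatch(pattern, str(col))]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one column matching {pattern!r}, got {matches}"
        )
    return matches[0]


# The same key appears under two names, depending on the data source.
main_v1 = pd.DataFrame({"user_id": [1, 2], "age": [30, 40]})
main_v2 = pd.DataFrame({"userId": [1, 2], "age": [30, 40]})

pattern = r"user_?[iI]d"
key_v1 = resolve_key(main_v1, pattern)
key_v2 = resolve_key(main_v2, pattern)
```

Requiring exactly one match is a design choice of this sketch: an ambiguous or missing key raises early instead of silently joining on the wrong column.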
The goal of this PR is to dispatch `aggregate`, currently written in two files, by directly implementing it in `_agg_joiner.py`.

Following discussions with @Vincent-Maladiere and @jeromedockes, `AggJoiner` and `AggTarget` now require the `operations` parameter by default, and will try to apply all operations on all columns, as opposed to now, where columns are separated into categorical and numeric and only some operations are computed on each category.

I'm planning on doing follow-ups to completely remove the `_pandas.py`, `_polars.py` and `_namespace.py` files, and on refactoring `AggTarget` with cross-fitting.
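For reference, "all operations on all columns" could look roughly like the pandas-only sketch below. `aggregate_all` and the example table are hypothetical and only illustrate the behaviour described above, not the code in this PR. The operations are chosen so that they work on both numeric and string columns; an operation such as `"mean"` would presumably fail on a string column.

```python
import pandas as pd


def aggregate_all(table, key, cols, operations):
    """Apply every operation to every column in `cols`, grouped by `key`."""
    aggregated = table.groupby(key)[cols].agg(operations)
    # Passing a list of operations yields MultiIndex columns such as
    # ("rating", "max"); flatten them into "rating_max".
    aggregated.columns = [f"{col}_{op}" for col, op in aggregated.columns]
    return aggregated.reset_index()


aux = pd.DataFrame(
    {
        "user": [1, 1, 2],
        "rating": [4.0, 2.0, 5.0],
        "genre": ["a", "b", "a"],
    }
)
result = aggregate_all(aux, "user", ["rating", "genre"], ["count", "max"])
```

Here each of the two columns gets both operations, so `result` has one key column and four aggregated columns.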