Dispatch aggregate #1116

Open
wants to merge 59 commits into
base: main

TheooJ
Contributor

@TheooJ TheooJ commented Oct 17, 2024

The goal of this PR is to dispatch aggregate, currently written in two files, by directly implementing it in _agg_joiner.py.

Following discussions with @Vincent-Maladiere and @jeromedockes, AggJoiner and AggTarget now require the operations parameter by default, and will try to apply all operations on all columns. This differs from the current behavior, where columns are separated into categorical and numeric and only some operations are computed on each category.
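
As a rough illustration of "apply all operations on all columns", here is a minimal pandas sketch. It is not the PR's implementation: the helper name `aggregate_then_join`, the sample tables, and the `column_operation` output naming are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical data; table and column names are made up for illustration.
main = pd.DataFrame({"user_id": [1, 2]})
aux = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "rating": [4.0, 2.0, 5.0],
        "price": [10.0, 30.0, 20.0],
    }
)


def aggregate_then_join(main, aux, key, cols, operations):
    """Hypothetical helper: aggregate `aux` per key, then left-join onto `main`."""
    # Apply every operation to every requested column, one row per key value.
    agg = aux.groupby(key)[cols].agg(operations)
    # Flatten the (column, operation) MultiIndex into names like "rating_mean".
    agg.columns = [f"{col}_{op}" for col, op in agg.columns]
    # Left-join the aggregated features onto the main table.
    return main.merge(agg.reset_index(), on=key, how="left")


result = aggregate_then_join(
    main, aux, key="user_id", cols=["rating", "price"], operations=["mean", "max"]
)
print(result)
```

Only numeric columns are used here, so every operation succeeds; with mixed column types, some operation and column pairs would fail, which is presumably why the description says the transformers "try" to apply them.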

I’m planning follow-ups to completely remove the _pandas.py, _polars.py, and _namespace.py files, and to refactor AggTarget with cross-fitting.

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Hey @TheooJ, very nice effort! Here is a first pass of comments.

On a higher level:

  1. Since you use the CheckInputDataFrame class in AggJoiner.fit_transform, you should be able to remove the _check_dataframes method entirely. And since _check_inputs currently calls _check_dataframes, you have to swap the order of the calls so that X is validated first:

     self._main_check_input = CheckInputDataFrame()
     X = self._main_check_input.fit_transform(X)
     self._check_inputs(X)

     Additionally, AggTarget's check_inputs method could be simplified, because CheckInputDataFrame already does most of the checks.

  2. You should add a get_feature_names_out method that returns self.all_outputs_ for both AggJoiner and AggTarget.

  3. I think we should allow key, aux_key and main_key to be selectors.
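
To make points 1 and 2 concrete, here is a minimal sketch of the suggested call order and of a get_feature_names_out method. `ToyAggJoiner` is a hypothetical stand-in written for this example, not skrub's AggJoiner; the `isinstance` check merely takes the place of the CheckInputDataFrame step, and the `column_operation` output names are an assumption.

```python
import pandas as pd


class ToyAggJoiner:
    """Hypothetical stand-in for illustration; not skrub's AggJoiner."""

    def __init__(self, aux_table, key, cols, operations):
        self.aux_table = aux_table
        self.key = key
        self.cols = cols
        self.operations = operations

    def _check_inputs(self, X):
        # Called after validation, so X is known to be a dataframe here.
        if self.key not in X.columns:
            raise ValueError(f"key {self.key!r} not found in X")

    def fit_transform(self, X, y=None):
        # Stand-in for the CheckInputDataFrame step: validate X first...
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame")
        # ...then run the parameter checks that rely on a valid dataframe.
        self._check_inputs(X)
        agg = self.aux_table.groupby(self.key)[self.cols].agg(self.operations)
        agg.columns = [f"{col}_{op}" for col, op in agg.columns]
        result = X.merge(agg.reset_index(), on=self.key, how="left")
        # Record the output column names once, at fit time.
        self.all_outputs_ = list(result.columns)
        return result

    def get_feature_names_out(self, input_features=None):
        # Return a copy so callers cannot mutate the fitted attribute.
        return list(self.all_outputs_)


main = pd.DataFrame({"user_id": [1, 2]})
aux = pd.DataFrame({"user_id": [1, 1, 2], "rating": [4.0, 2.0, 5.0]})
joiner = ToyAggJoiner(aux, key="user_id", cols=["rating"], operations=["mean", "max"])
out = joiner.fit_transform(main)
names = joiner.get_feature_names_out()
print(names)
```

The `input_features=None` argument follows the scikit-learn convention for this method. scikit-learn transformers return an array of strings rather than a list; a list is used here only to keep the sketch short.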

Contributor Author

TheooJ commented Oct 22, 2024

Thanks a lot, @Vincent-Maladiere!

To answer your points:

  1. I can remove _check_dataframes and use CheckInputDataFrame; I just need to handle the case where aux_table="X" to keep the option of using the placeholder. I agree that AggTarget's check_inputs could be simplified. I plan on cleaning it up in another PR (it's also missing some tests).
  2. Sure! Should this be done here or in another PR? Should we open an issue so that we add this method to other transformers like Joiner?
  3. I'm not sure we want this. On the one hand, it might be useful for long-term compatibility once selectors are public, so that people can pass either a list of columns or selectors directly. On the other hand, while I think it makes sense to say "compute the mean on all numeric columns", you are not going to join dataframes on all numeric or all string columns with key.
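
One possible way to keep the aux_table="X" placeholder from point 1 is to resolve it before any dataframe validation. The helper below is a hypothetical sketch (the name `resolve_aux_table` and the error message are assumptions, not code from this PR); plain dicts stand in for dataframes so the example has no dependencies.

```python
def resolve_aux_table(aux_table, X):
    """Hypothetical helper: return the table that should be aggregated."""
    # Check the type before comparing: on a pandas dataframe, `== "X"`
    # would be an elementwise comparison rather than a single boolean.
    if isinstance(aux_table, str):
        if aux_table == "X":
            # The placeholder means "aggregate the main table itself".
            return X
        raise ValueError(f"aux_table must be a table or 'X', got {aux_table!r}")
    return aux_table


# Plain dicts stand in for dataframes in this sketch.
main = {"user_id": [1, 2]}
other = {"user_id": [1, 1, 2], "rating": [4.0, 2.0, 5.0]}

same_table = resolve_aux_table("X", main)
separate_table = resolve_aux_table(other, main)
```

The resolved table could then go through the same validation step as the main table.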

@Vincent-Maladiere
Member

Points 1. and 2. can be done in this PR; it's not a huge change.

I agree that the use cases for 3. are rarer, but it looks weird to support selectors for the cols argument only. For instance, I might have keys that I can identify using regexes when the name of the column changes for whatever reason. It's not a strong requirement, but it shouldn't be too costly to add.
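
The regex use case could look roughly like the sketch below. It uses the standard `re` module rather than skrub's selectors, and `resolve_key` is a hypothetical helper introduced only for this example.

```python
import re


def resolve_key(columns, pattern):
    """Hypothetical helper: return the single column name matching `pattern`."""
    matches = [col for col in columns if re.fullmatch(pattern, col)]
    # A join key must be unambiguous, so require exactly one match.
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one column matching {pattern!r}, got {matches}"
        )
    return matches[0]


# The same pattern finds the key even though its name differs between tables.
main_key = resolve_key(["user_id", "country"], r"user_?[Ii]d")
aux_key = resolve_key(["userId", "rating", "price"], r"user_?[Ii]d")
print(main_key, aux_key)
```

Requiring exactly one match addresses the earlier concern that nobody joins on "all numeric columns": a key selector that resolves to zero or several columns is rejected rather than silently used.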
