FEAT allow metadata to be transformed in a Pipeline #28901

Open · wants to merge 19 commits into main

Conversation

@adrinjalali (Member)

Initial proposal: #28440 (comment)
xref: #28440 (comment)

This adds transform_input as a constructor argument to Pipeline, as:

    transform_input : list of str, default=None
        This enables transforming some input arguments to ``fit`` (other than
        ``X``) with the steps of the pipeline up to the step which requires
        them. The requirement is defined via :ref:`metadata routing
        <metadata_routing>`.
        This can be used to pass a validation set through the pipeline for instance.

        See the example TBD for more details.

        You can only set this if metadata routing is enabled, which you
        can enable using ``sklearn.set_config(enable_metadata_routing=True)``.

It simply allows metadata to be transformed by the already fitted steps up to the step which needs that metadata.
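
To make this concrete, a minimal sketch of the intended usage, assuming the API in this PR (`ValConsumer` is a hypothetical estimator used only for illustration):

import numpy as np

from sklearn import set_config
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

set_config(enable_metadata_routing=True)


class ValConsumer(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator whose ``fit`` consumes a validation set."""

    def fit(self, X, y, X_val=None):
        # By the time we get here, X_val has been transformed by
        # StandardScaler, just like X.
        self.val_mean_ = X_val.mean()
        return self


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 0])
X_val = np.array([[2.0, 3.0]])

pipe = make_pipeline(
    StandardScaler(),
    ValConsumer().set_fit_request(X_val=True),
    # transform X_val with the fitted steps preceding ValConsumer
    transform_input=["X_val"],
)
pipe.fit(X, y, X_val=X_val)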

How does this look?

cc @lorentzenchr @ogrisel @amueller @betatim

github-actions bot commented Apr 26, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 24fa675.

@adrinjalali (Member Author)

So for simple cases where metadata is only used in fit and transform expects no metadata, we're fine. But things get a bit trickier when a step's transform method accepts the same metadata as that step's fit.

Specifically, in this test:

@pytest.mark.usefixtures("enable_slep006")
@pytest.mark.parametrize("method", ["fit", "fit_transform"])
def test_transform_input_pipeline(method):
    """Test that with transform_input, data is correctly transformed for each step."""

    def get_transformer(registry, sample_weight, metadata):
        """Get a transformer with requests set."""
        return (
            ConsumingTransformer(registry=registry)
            .set_fit_request(sample_weight=sample_weight, metadata=metadata)
            .set_transform_request(sample_weight=sample_weight, metadata=metadata)
        )

    def get_pipeline():
        """Get a pipeline and corresponding registries.

        The pipeline has 4 steps, with different request values set to test different
        cases. One is aliased.
        """
        registry_1, registry_2, registry_3, registry_4 = (
            _Registry(),
            _Registry(),
            _Registry(),
            _Registry(),
        )
        pipe = make_pipeline(
            get_transformer(registry_1, sample_weight=True, metadata=True),
            get_transformer(registry_2, sample_weight=False, metadata=False),
            get_transformer(registry_3, sample_weight=True, metadata=True),
            get_transformer(registry_4, sample_weight="other_weights", metadata=True),
            transform_input=["sample_weight"],
        )
        return pipe, registry_1, registry_2, registry_3, registry_4

    def check_metadata(registry, methods, **metadata):
        """Check that the right metadata was recorded for the given methods."""
        assert registry
        for estimator in registry:
            for method in methods:
                check_recorded_metadata(
                    estimator,
                    method=method,
                    **metadata,
                )

    X = np.array([[1, 2], [3, 4]])
    y = np.array([0, 1])
    sample_weight = np.array([[1, 2]])
    other_weights = np.array([[30, 40]])
    metadata = np.array([[100, 200]])

    pipe, registry_1, registry_2, registry_3, registry_4 = get_pipeline()
    pipe.fit(
        X,
        y,
        sample_weight=sample_weight,
        other_weights=other_weights,
        metadata=metadata,
    )

    check_metadata(
        registry_1, ["fit", "transform"], sample_weight=sample_weight, metadata=metadata
    )
    check_metadata(registry_2, ["fit", "transform"])
    check_metadata(
        registry_3,
        ["fit", "transform"],
        sample_weight=sample_weight + 2,
        metadata=metadata,
    )
    check_metadata(
        registry_4,
        method.split("_"),  # ["fit", "transform"] if "fit_transform", ["fit"] otherwise
        sample_weight=other_weights + 3,
        metadata=metadata,
    )

Step 3 receives transformed metadata in its transform method during fit of the pipeline, because all metadata listed in transform_input are transformed; but when step3.transform is called a second time, the metadata is not transformed (because I haven't implemented it in pipeline.transform yet).

The question is, what should be the expected behavior?

Do we want transform_input to only transform when calling fit of sub-estimators? That's a bit tricky, because all TransformerMixin estimators implement a fit_transform which accepts all metadata together, which means a given piece of metadata (if it has the same name in both methods) is either transformed or not transformed as a whole. (Wish we didn't have fit_transform in the first place, it's giving us so much headache.)

@adrinjalali (Member Author)

Actually, in TransformerMixin we have:

        if _routing_enabled():
            transform_params = self.get_metadata_routing().consumes(
                method="transform", params=fit_params.keys()
            )
            if transform_params:
                warnings.warn(
                    (
                        f"This object ({self.__class__.__name__}) has a `transform`"
                        " method which consumes metadata, but `fit_transform` does not"
                        " forward metadata to `transform`. Please implement a custom"
                        " `fit_transform` method to forward metadata to `transform` as"
                        " well. Alternatively, you can explicitly do"
                        " `set_transform_request`and set all values to `False` to"
                        " disable metadata routed to `transform`, if that's an option."
                    ),
                    UserWarning,
                )

and we never send anything to .transform. So in Pipeline we can also assume things are only transformed for fit, as far as scikit-learn is concerned.

However, third-party transformers can have their own fit_transform and route parameters to it, so things can become tricky, as the example in the previous comment shows.
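
To make the tricky case concrete, here is a sketch of such a third-party transformer (names are illustrative, not from any real library):

from sklearn.base import BaseEstimator, TransformerMixin


class ThirdPartyTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None, sample_weight=None):
        return self

    def transform(self, X, sample_weight=None):
        # transform consumes the same metadata as fit
        return X

    def fit_transform(self, X, y=None, sample_weight=None):
        # A custom fit_transform which forwards metadata to both methods.
        # Inside Pipeline.fit, sample_weight arrives here only once, so it is
        # either transformed or not transformed for *both* calls; it cannot be
        # transformed for fit but left untouched for transform.
        return self.fit(X, y, sample_weight=sample_weight).transform(
            X, sample_weight=sample_weight
        )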

@adrinjalali (Member Author) commented May 14, 2024

Another question is, do we want to have this syntactic sugar?

pipe = make_pipeline(
    StandardScaler(),
    HistGradientBoostingClassifier(..., early_stopping=True)
).fit(X, y, X_val, y_val)

The above code would:

  • early_stopping=True would change the default request values so that the user doesn't have to type .set_fit_request(X_val=True, y_val=True)
  • early_stopping=True sets something in the instance of the estimator which tells the pipeline that X_val is of the same nature as X, and therefore should be transformed

It wouldn't change what we have now implemented in Pipeline in this PR, but would make it easier for the user. Not sure if it's too magical for us though.

For that to happen, HGBC would need to have something like:

class HistGradientBoostingClassifier(...):
    ...

    def get_metadata_routing(self):
        routing = super().get_metadata_routing()
        if self.early_stopping:
            routing.fit.add(X_val=True, y_val=True)
        return routing

    def __sklearn_get_transforming_data__(self):
        return ["X_val"]

And Pipeline would look for info in __sklearn_get_transforming_data__ if it exists.
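
In other words, something along these lines inside Pipeline (a hypothetical helper, not part of this PR):

def _data_like_fit_params(step):
    # Hypothetical: ask the step which of its fit params are data-like
    # (of the same nature as X) and should therefore be transformed.
    if hasattr(step, "__sklearn_get_transforming_data__"):
        return step.__sklearn_get_transforming_data__()
    return []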

cc @glemaitre

It goes in the direction of having more default routing info, which @ogrisel really likes (ref #26179).

Note that this could come later separately as an enhancement to this PR.

@adrinjalali (Member Author)

Testing metadata routing in more complex situations (which has come up in this PR) requires some fixes, namely adding a parent or caller to how metadata is recorded in testing estimators; those fixes are now included in this PR.

@lorentzenchr (Member) left a comment

A partial review.
@adrinjalali Just that you see that at least someone cares.

them. Requirement is defined via :ref:`metadata routing <metadata_routing>`.
This can be used to pass a validation set through the pipeline for instance.

See the example TBD for more details.
Member

I'm very keen to see that example, maybe the HGBT early stopping case?

@adrinjalali (Member Author) commented Sep 2, 2024

I went and checked if I could use lightgbm, but there the validation set is passed as a list of tuples. No way to process that in a pipeline.

As for HGBT, it would look like this:

from sklearn import set_config
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

set_config(enable_metadata_routing=True)

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2,] = X[:2,] + 20

# Validation set chosen before looking at the data.
X_val, y_val = X[:50,], y[:50,]
X, y = X[50:,], y[50:,]

est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        HistGradientBoostingRegressor(
            early_stopping=True,
        ).set_fit_request(X_val=True, y_val=True),
        # telling pipeline to transform these inputs up to the step which is
        # requesting them.
        transform_input=["X_val", "y_val"],
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=X_val, y_val=y_val)
# this passes X_val, y_val to Pipeline, and Pipeline knows how to deal with
# them.

@@ -378,9 +392,66 @@ def _check_method_params(self, method, props, **kwargs):
fit_params_steps[step]["fit_predict"][param] = pval
return fit_params_steps

def _get_step_params(self, *, step_idx, step_params, all_params):
Member

I know naming is not easy. If it "transforms" the metadata, why not name it in that direction: _transform_metadata_for_step or similar.

Member Author

This doesn't always transform. This gets the metadata for the step, and transforms the ones which need to be transformed. In most cases there's nothing to transform here.

Member

_get_metadata_for_step then?

will be transformed.

`all_params` are the metadata passed by the user. Used to call `transform`
on the pipeline itself.
Member

Adding a Parameters section to the docstring might help to better understand this method.

Member Author

I think this now helps.

Comment on lines +1887 to +1899
check_metadata(
    registry_3,
    ["fit", "transform"],
    sample_weight=sample_weight + 2,
    metadata=metadata,
)
check_metadata(
    registry_4,
    method.split("_"),  # ["fit", "transform"] if "fit_transform", ["fit"] otherwise
    sample_weight=other_weights + 3,
    metadata=metadata,
)

Member Author

@lorentzenchr here the exact values of the transformed metadata are tested.

Member

Yes. But it would help me to have one simple test where everything is explicit.
This test is great and should stay.

@adrinjalali (Member Author)

I checked lightgbm and xgboost; they both take a list of tuples as validation sets, which makes me wonder why, and whether this proposal is enough!

@lorentzenchr (Member)

I checked lightgbm and xgboost; they both take a list of tuples as validation sets, which makes me wonder why, and whether this proposal is enough!

Let's ask them directly: @jameslamb @StrikerRUS @shiyu1994 @trivialfis @hcho3 your opinion would be much appreciated. We are trying to transform metadata on the way to the step of a pipeline where it is needed, e.g. validation data for early stopping in GBTs, see #28901 (comment) (StandardScaler is just for demonstration purposes).

@trivialfis commented Sep 3, 2024

Thank you for the ping. There's no particular reason for XGBoost; it was developed in the early days before the sklearn estimator guideline existed. If there's a new best practice, please share it, and we will be happy to make the change to comply with the new estimator guideline.

On the other hand, we sometimes use multiple evaluation datasets. For example, we might monitor the progress for both training and validation instead of only validation. Sometimes, tracking the training accuracy might not be necessary, and computational performance can be improved by evaluating only the validation dataset. There are other cases for using custom evaluation datasets as well. Leaving it as a user choice makes sense. Therefore, the option of using a variable sequence of datasets remains unchanged.
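
For reference, a sketch of the existing pattern in xgboost's sklearn API, where eval_set is a list of (X, y) tuples (lightgbm's eval_set is analogous); this assumes arrays X_train, y_train, X_val, y_val are already defined:

from xgboost import XGBRegressor

model = XGBRegressor(early_stopping_rounds=3)
model.fit(
    X_train,
    y_train,
    # monitor progress on both the training and the validation set; passing
    # only the validation set would be cheaper
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=False,
)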

@adrinjalali (Member Author)

If the API is est.fit(X, y, X_val=(X1, X2), y_val=(y1, y2)), then we could special-case tuples in our mechanism and transform every element of the X_val tuple, and the user code would look like:

from sklearn import set_config
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

set_config(enable_metadata_routing=True)

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2,] = X[:2,] + 20

# Validation set chosen before looking at the data.
X_val, y_val = X[:50,], y[:50,]
X, y = X[50:,], y[50:,]

est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        XGBRegressor(
            early_stopping_rounds=3,
        ).set_fit_request(X_val=True, y_val=True),
        # telling pipeline to transform these inputs up to the step which is
        # requesting them.
        transform_input=["X_val", "y_val"],
    ),
    param_grid={"xgbregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=(X, X_val), y_val=(y, y_val))
# this passes X_val, y_val to Pipeline, and Pipeline knows how to deal with
# them.

@StrikerRUS (Contributor)

Thanks a lot for the invitation to the discussion!

If there's a new best practice, please share it, and we will be happy to make the change to comply with the new estimator guideline.

The same for LightGBM. The interface being discussed here was developed a very long time ago, and I believe it was just a replication of XGBoost's interface at that moment, to not overcomplicate users' experience.

On the other hand, we sometimes use multiple evaluation datasets.

Indeed true! This feature of multiple validation sets should be preserved, I think. Linking some related discussions:

then we could have a special case for tuple in our mechanism to transform all in the X_val tuple

Will it play nice with some other data fit() arguments like sample_weight/eval_sample_weight and init_score/eval_init_score?

Comment on lines +1917 to +1951
def test_transform_tuple_input():
    """Test that if metadata is a tuple of arrays, both arrays are transformed."""

    class Estimator(ClassifierMixin, BaseEstimator):
        def fit(self, X, y, X_val=None, y_val=None):
            assert isinstance(X_val, tuple)
            assert isinstance(y_val, tuple)
            # Here we make sure that each X_val is transformed by the transformer
            assert_array_equal(X_val[0], np.array([[2, 3]]))
            assert_array_equal(y_val[0], np.array([0, 1]))
            assert_array_equal(X_val[1], np.array([[11, 12]]))
            assert_array_equal(y_val[1], np.array([1, 2]))
            return self

    class Transformer(TransformerMixin, BaseEstimator):
        def fit(self, X, y):
            return self

        def transform(self, X):
            return X + 1

    X = np.array([[1, 2]])
    y = np.array([0, 1])
    X_val0 = np.array([[1, 2]])
    y_val0 = np.array([0, 1])
    X_val1 = np.array([[10, 11]])
    y_val1 = np.array([1, 2])
    pipe = Pipeline(
        [
            ("transformer", Transformer()),
            ("estimator", Estimator().set_fit_request(X_val=True, y_val=True)),
        ],
        transform_input=["X_val"],
    )
    pipe.fit(X, y, X_val=(X_val0, X_val1), y_val=(y_val0, y_val1))
Member Author

@lorentzenchr @StrikerRUS @trivialfis the tuple pattern is now present and tested here.

It would be lovely if you could test this with a few cases.

Contributor

If no one else gets to it sooner, I'd be happy to try testing this (for both lightgbm and xgboost).

I could probably get to that some time in the next week... but I won't be offended if you say "thanks but that's too long to wait, we're going to merge this soon".

Member Author

@jameslamb yeah go for it. Thanks.

@lorentzenchr (Member) left a comment

LGTM
It needs:

  • example (why I did not yet approve)
  • more reviewers

# tuple. This is needed to support the pattern present in
# `lightgbm` and `xgboost` where users can pass multiple
# validation sets.
if isinstance(param_value, tuple):
Member

This is veeery deeply nested loop/if/else code. Could it be made less deep?
On the other hand, it seems quite readable.
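
One way the tuple special case could be factored out to reduce the nesting (a sketch, not the PR's actual code):

def _maybe_transform(value, transform):
    # Transform a single array, or each element of a tuple of arrays
    # (the lightgbm/xgboost multiple-validation-set pattern).
    if isinstance(value, tuple):
        return tuple(transform(v) for v in value)
    return transform(value)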

them. Requirement is defined via :ref:`metadata routing <metadata_routing>`.
This can be used to pass a validation set through the pipeline for instance.

See the example TBD for more details.
Member

Needs to be addressed.

@lorentzenchr lorentzenchr added this to the 1.6 milestone Sep 28, 2024