FEAT allow metadata to be transformed in a Pipeline #28901

Open · wants to merge 19 commits into main

Conversation

@adrinjalali (Member)

Initial proposal: #28440 (comment)
xref: #28440 (comment)

This adds transform_input as a constructor argument to Pipeline, as:

    transform_input : list of str, default=None
        This enables transforming some input arguments to ``fit`` (other than
        ``X``) with the steps of the pipeline up to the step which requires
        them. The requirement is defined via :ref:`metadata routing
        <metadata_routing>`.
        This can be used to pass a validation set through the pipeline for instance.

        See the example TBD for more details.

        You can only set this if metadata routing is enabled, which you
        can enable using ``sklearn.set_config(enable_metadata_routing=True)``.

It simply allows metadata to be transformed by the already fitted steps up to the step which needs that metadata.
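
To make this concrete, a minimal sketch of the intended usage, assuming the API in this PR (`ValConsumer` is a hypothetical estimator used only for illustration):

import numpy as np

from sklearn import set_config
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

set_config(enable_metadata_routing=True)


class ValConsumer(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator whose ``fit`` consumes a validation set."""

    def fit(self, X, y, X_val=None):
        # By the time we get here, X_val has been transformed by
        # StandardScaler, just like X.
        self.val_mean_ = X_val.mean()
        return self


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 0])
X_val = np.array([[2.0, 3.0]])

pipe = make_pipeline(
    StandardScaler(),
    ValConsumer().set_fit_request(X_val=True),
    # transform X_val with the fitted steps preceding ValConsumer
    transform_input=["X_val"],
)
pipe.fit(X, y, X_val=X_val)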

How does this look?

cc @lorentzenchr @ogrisel @amueller @betatim

github-actions bot commented Apr 26, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 24fa675.

@adrinjalali (Member Author)

So for simple cases where metadata is only used in fit and transform expects no metadata, we're fine. But things get a bit trickier when a step's transform method accepts the same metadata as that step's fit.

Specifically, in this test:

@pytest.mark.usefixtures("enable_slep006")
@pytest.mark.parametrize("method", ["fit", "fit_transform"])
def test_transform_input_pipeline(method):
    """Test that with transform_input, data is correctly transformed for each step."""

    def get_transformer(registry, sample_weight, metadata):
        """Get a transformer with requests set."""
        return (
            ConsumingTransformer(registry=registry)
            .set_fit_request(sample_weight=sample_weight, metadata=metadata)
            .set_transform_request(sample_weight=sample_weight, metadata=metadata)
        )

    def get_pipeline():
        """Get a pipeline and corresponding registries.

        The pipeline has 4 steps, with different request values set to test different
        cases. One is aliased.
        """
        registry_1, registry_2, registry_3, registry_4 = (
            _Registry(),
            _Registry(),
            _Registry(),
            _Registry(),
        )
        pipe = make_pipeline(
            get_transformer(registry_1, sample_weight=True, metadata=True),
            get_transformer(registry_2, sample_weight=False, metadata=False),
            get_transformer(registry_3, sample_weight=True, metadata=True),
            get_transformer(registry_4, sample_weight="other_weights", metadata=True),
            transform_input=["sample_weight"],
        )
        return pipe, registry_1, registry_2, registry_3, registry_4

    def check_metadata(registry, methods, **metadata):
        """Check that the right metadata was recorded for the given methods."""
        assert registry
        for estimator in registry:
            for method in methods:
                check_recorded_metadata(
                    estimator,
                    method=method,
                    **metadata,
                )

    X = np.array([[1, 2], [3, 4]])
    y = np.array([0, 1])
    sample_weight = np.array([[1, 2]])
    other_weights = np.array([[30, 40]])
    metadata = np.array([[100, 200]])

    pipe, registry_1, registry_2, registry_3, registry_4 = get_pipeline()
    pipe.fit(
        X,
        y,
        sample_weight=sample_weight,
        other_weights=other_weights,
        metadata=metadata,
    )

    check_metadata(
        registry_1, ["fit", "transform"], sample_weight=sample_weight, metadata=metadata
    )
    check_metadata(registry_2, ["fit", "transform"])
    check_metadata(
        registry_3,
        ["fit", "transform"],
        sample_weight=sample_weight + 2,
        metadata=metadata,
    )
    check_metadata(
        registry_4,
        method.split("_"),  # ["fit", "transform"] if "fit_transform", ["fit"] otherwise
        sample_weight=other_weights + 3,
        metadata=metadata,
    )

Step 3 receives transformed metadata in its transform method during fit of the pipeline, because all metadata listed in transform_input are transformed; but when step3.transform is called a second time, the metadata is not transformed (because I haven't implemented it in pipeline.transform yet).

The question is, what should be the expected behavior?

Do we want transform_input to only transform when calling fit of sub-estimators? That's a bit tricky, because all TransformerMixin estimators implement a fit_transform which accepts all metadata together, which means a given piece of metadata (if it has the same name in both methods) is either transformed or not transformed as a whole. (Wish we didn't have fit_transform in the first place, it's giving us so much headache.)

@adrinjalali (Member Author)

Actually, in TransformerMixin we have:

        if _routing_enabled():
            transform_params = self.get_metadata_routing().consumes(
                method="transform", params=fit_params.keys()
            )
            if transform_params:
                warnings.warn(
                    (
                        f"This object ({self.__class__.__name__}) has a `transform`"
                        " method which consumes metadata, but `fit_transform` does not"
                        " forward metadata to `transform`. Please implement a custom"
                        " `fit_transform` method to forward metadata to `transform` as"
                        " well. Alternatively, you can explicitly do"
                        " `set_transform_request`and set all values to `False` to"
                        " disable metadata routed to `transform`, if that's an option."
                    ),
                    UserWarning,
                )

and we never send anything to .transform. So in Pipeline we can also assume things are only transformed for fit, as far as scikit-learn is concerned.

However, third-party transformers can have their own fit_transform and route parameters to it, so things can become tricky, as the example in the previous comment shows.
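
To make the tricky case concrete, here is a sketch of such a third-party transformer (names are illustrative, not from any real library):

from sklearn.base import BaseEstimator, TransformerMixin


class ThirdPartyTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None, sample_weight=None):
        return self

    def transform(self, X, sample_weight=None):
        # transform consumes the same metadata as fit
        return X

    def fit_transform(self, X, y=None, sample_weight=None):
        # A custom fit_transform which forwards metadata to both methods.
        # Inside Pipeline.fit, sample_weight arrives here only once, so it is
        # either transformed or not transformed for *both* calls; it cannot be
        # transformed for fit but left untouched for transform.
        return self.fit(X, y, sample_weight=sample_weight).transform(
            X, sample_weight=sample_weight
        )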

@adrinjalali (Member Author) commented May 14, 2024

Another question is, do we want to have this syntactic sugar?

pipe = make_pipeline(
    StandardScaler(),
    HistGradientBoostingClassifier(..., early_stopping=True)
).fit(X, y, X_val, y_val)

The above code would:

  • early_stopping=True would change the default request values so that the user doesn't have to type .set_fit_request(X_val=True, y_val=True)
  • early_stopping=True sets something in the instance of the estimator which tells the pipeline that X_val is of the same nature as X, and therefore should be transformed

It wouldn't change what we have now implemented in Pipeline in this PR, but would make it easier for the user. Not sure if it's too magical for us though.

For that to happen, HGBC would need to have something like:

class HistGradientBoostingClassifier(...):
    ...

    def get_metadata_routing(self):
        routing = super().get_metadata_routing()
        if self.early_stopping:
            routing.fit.add(X_val=True, y_val=True)
        return routing

    def __sklearn_get_transforming_data__(self):
        return ["X_val"]

And Pipeline would look for info in __sklearn_get_transforming_data__ if it exists.
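
In other words, something along these lines inside Pipeline (a hypothetical helper, not part of this PR):

def _data_like_fit_params(step):
    # Hypothetical: ask the step which of its fit params are data-like
    # (of the same nature as X) and should therefore be transformed.
    if hasattr(step, "__sklearn_get_transforming_data__"):
        return step.__sklearn_get_transforming_data__()
    return []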

cc @glemaitre

It goes in the direction of having more default routing info, which @ogrisel really likes (ref #26179).

Note that this could come later separately as an enhancement to this PR.

@adrinjalali (Member Author)

Testing metadata routing in more complex situations (which has come up in this PR) requires some fixes, namely adding a parent or caller to how metadata is recorded in testing estimators; those fixes are now included in this PR.

@lorentzenchr (Member) left a comment

A partial review.
@adrinjalali Just that you see that at least someone cares.

them. Requirement is defined via :ref:`metadata routing <metadata_routing>`.
This can be used to pass a validation set through the pipeline for instance.

See the example TBD for more details.
Member

I'm very keen to see that example, maybe the HGBT early stopping case?

@adrinjalali (Member Author) commented Sep 2, 2024

I went and checked if I could use lightgbm, but there the validation set is passed as a list of tuples. No way to process that in a pipeline.

As for HGBT, it would look like this:

from sklearn import set_config
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

set_config(enable_metadata_routing=True)

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2,] = X[:2,] + 20

# Validation set chosen before looking at the data.
X_val, y_val = X[:50,], y[:50,]
X, y = X[50:,], y[50:,]

est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        HistGradientBoostingRegressor(
            early_stopping=True,
        ).set_fit_request(X_val=True, y_val=True),
        # telling pipeline to transform these inputs up to the step which is
        # requesting them.
        transform_input=["X_val", "y_val"],
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=X_val, y_val=y_val)
# this passes X_val, y_val to Pipeline, and Pipeline knows how to deal with
# them.

@@ -378,9 +392,66 @@ def _check_method_params(self, method, props, **kwargs):
fit_params_steps[step]["fit_predict"][param] = pval
return fit_params_steps

def _get_step_params(self, *, step_idx, step_params, all_params):
Member

I know naming is not easy. If it "transforms" the metadata, why not name it in that direction: _transform_metadata_for_step or similar.

Member Author

This doesn't always transform. This gets the metadata for the step, and transforms the ones which need to be transformed. In most cases there's nothing to transform here.

Member

_get_metadata_for_step then?

will be transformed.

`all_params` are the metadata passed by the user. Used to call `transform`
on the pipeline itself.
Member

Adding a Parameters section to the docstring might help to better understand this method.

Member Author

I think this now helps.

Comment on lines +1887 to +1899
check_metadata(
    registry_3,
    ["fit", "transform"],
    sample_weight=sample_weight + 2,
    metadata=metadata,
)
check_metadata(
    registry_4,
    method.split("_"),  # ["fit", "transform"] if "fit_transform", ["fit"] otherwise
    sample_weight=other_weights + 3,
    metadata=metadata,
)

Member Author

@lorentzenchr here the exact values of the transformed metadata are tested.

Member

Yes. But it would help me to have one simple test where everything is explicit.
This test is great and should stay.

@adrinjalali (Member Author)

I checked lightgbm and xgboost; they both take a list of tuples as validation sets, which makes me wonder why, and whether this proposal is enough!

@lorentzenchr (Member)

I checked lightgbm and xgboost; they both take a list of tuples as validation sets, which makes me wonder why, and whether this proposal is enough!

Let's ask them directly: @jameslamb @StrikerRUS @shiyu1994 @trivialfis @hcho3 your opinion would be much appreciated. We are trying to transform metadata on the way to the step of a pipeline where it is needed, e.g. validation data for early stopping in GBTs, see #28901 (comment) (StandardScaler is just for demonstration purposes).

@trivialfis commented Sep 3, 2024

Thank you for the ping. There's no particular reason for XGBoost; it was developed in the early days before the sklearn estimator guideline existed. If there's a new best practice, please share it, and we will be happy to make the change to comply with the new estimator guideline.

On the other hand, we sometimes use multiple evaluation datasets. For example, we might monitor the progress for both training and validation instead of only validation. Sometimes, tracking the training accuracy might not be necessary, and computational performance can be improved by evaluating only the validation dataset. There are other cases for using custom evaluation datasets as well. Leaving it as a user choice makes sense. Therefore, the option of using a variable sequence of datasets remains unchanged.
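
For reference, a sketch of the existing pattern in xgboost's sklearn API, where eval_set is a list of (X, y) tuples (lightgbm's eval_set is analogous); this assumes arrays X_train, y_train, X_val, y_val are already defined:

from xgboost import XGBRegressor

model = XGBRegressor(early_stopping_rounds=3)
model.fit(
    X_train,
    y_train,
    # monitor progress on both the training and the validation set; passing
    # only the validation set would be cheaper
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=False,
)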

@adrinjalali (Member Author)

If the API is est.fit(X, y, X_val=(X1, X2), y_val=(y1, y2)), then we could special-case tuples in our mechanism and transform every element of the X_val tuple, and the user code would look like:

from sklearn import set_config
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

set_config(enable_metadata_routing=True)

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2,] = X[:2,] + 20

# Validation set chosen before looking at the data.
X_val, y_val = X[:50,], y[:50,]
X, y = X[50:,], y[50:,]

est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        XGBRegressor(
            early_stopping_rounds=3,
        ).set_fit_request(X_val=True, y_val=True),
        # telling pipeline to transform these inputs up to the step which is
        # requesting them.
        transform_input=["X_val", "y_val"],
    ),
    param_grid={"xgbregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=(X, X_val), y_val=(y, y_val))
# this passes X_val, y_val to Pipeline, and Pipeline knows how to deal with
# them.

@StrikerRUS (Contributor)

Thanks a lot for the invitation to the discussion!

If there's a new best practice, please share it, and we will be happy to make the change to comply with the new estimator guideline.

The same for LightGBM. The interface being discussed here was developed a very long time ago, and I believe it was just a replication of XGBoost's interface at that moment, to not overcomplicate users' experience.

On the other hand, we sometimes use multiple evaluation datasets.

Indeed true! This feature of multiple validation sets should be preserved, I think. Linking some related discussions:

then we could have a special case for tuple in our mechanism to transform all in the X_val tuple

Will it play nice with some other data fit() arguments like sample_weight/eval_sample_weight and init_score/eval_init_score?

Comment on lines +1917 to +1951
def test_transform_tuple_input():
    """Test that if metadata is a tuple of arrays, both arrays are transformed."""

    class Estimator(ClassifierMixin, BaseEstimator):
        def fit(self, X, y, X_val=None, y_val=None):
            assert isinstance(X_val, tuple)
            assert isinstance(y_val, tuple)
            # Here we make sure that each X_val is transformed by the transformer
            assert_array_equal(X_val[0], np.array([[2, 3]]))
            assert_array_equal(y_val[0], np.array([0, 1]))
            assert_array_equal(X_val[1], np.array([[11, 12]]))
            assert_array_equal(y_val[1], np.array([1, 2]))
            return self

    class Transformer(TransformerMixin, BaseEstimator):
        def fit(self, X, y):
            return self

        def transform(self, X):
            return X + 1

    X = np.array([[1, 2]])
    y = np.array([0, 1])
    X_val0 = np.array([[1, 2]])
    y_val0 = np.array([0, 1])
    X_val1 = np.array([[10, 11]])
    y_val1 = np.array([1, 2])
    pipe = Pipeline(
        [
            ("transformer", Transformer()),
            ("estimator", Estimator().set_fit_request(X_val=True, y_val=True)),
        ],
        transform_input=["X_val"],
    )
    pipe.fit(X, y, X_val=(X_val0, X_val1), y_val=(y_val0, y_val1))
Member Author

@lorentzenchr @StrikerRUS @trivialfis the tuple pattern is now present and tested here.

It would be lovely if you could test this with a few cases.

Contributor

If no one else gets to it sooner, I'd be happy to try testing this (for both lightgbm and xgboost).

I could probably get to that some time in the next week... but I won't be offended if you say "thanks but that's too long to wait, we're going to merge this soon".

Member Author

@jameslamb yeah go for it. Thanks.

@lorentzenchr (Member) left a comment

LGTM
It needs:

  • example (why I did not yet approve)
  • more reviewers

# tuple. This is needed to support the pattern present in
# `lightgbm` and `xgboost` where users can pass multiple
# validation sets.
if isinstance(param_value, tuple):
Member

This is veeery deeply nested loop/if/else code. Could it be made less deep?
On the other hand, it seems quite readable.
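
One way the tuple special case could be factored out to reduce the nesting (a sketch, not the PR's actual code):

def _maybe_transform(value, transform):
    # Transform a single array, or each element of a tuple of arrays
    # (the lightgbm/xgboost multiple-validation-set pattern).
    if isinstance(value, tuple):
        return tuple(transform(v) for v in value)
    return transform(value)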

them. Requirement is defined via :ref:`metadata routing <metadata_routing>`.
This can be used to pass a validation set through the pipeline for instance.

See the example TBD for more details.
Member

Needs to be addressed.

@lorentzenchr lorentzenchr added this to the 1.6 milestone Sep 28, 2024