[AIR] Introduce better scoring API for BatchPredictor
#26451
Conversation
python/ray/train/batch_predictor.py (outdated)

        ScoringWrapper,
        compute=compute,
        batch_format="pandas",
        batch_size=batch_size,
        **ray_remote_args,
    )

    if original_col_ds:
        prediction_results = prediction_results.zip(original_col_ds)
I'm curious what guarantees Ray Datasets provides here. How do we ensure that the results returned from dropped_dataset and original_col_ds always match 1-1 across multiple executions, rather than assigning the wrong labels?
Hmm, not sure I understand you here. They don't necessarily need to match?
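To make the alignment question concrete, here is a plain-Python sketch (hypothetical, not actual Ray Datasets) of the positional assumption that zip-style pairing relies on: both derived sequences come from the same ordered source, so an order-preserving map keeps row i of one aligned with row i of the other.

```python
# Plain-Python sketch (not Ray Datasets): two projections of one
# ordered source stay row-aligned as long as each map preserves order.
source = [{"a": 1, "label": "x"}, {"a": 2, "label": "y"}, {"a": 3, "label": "z"}]

dropped = [{"a": row["a"]} for row in source]                # feature columns only
original_cols = [{"label": row["label"]} for row in source]  # kept columns

# zip-style pairing is correct only if both sides preserve source order
zipped = [{**d, **o} for d, o in zip(dropped, original_cols)]
```

If either side were re-partitioned or reordered between executions, this positional pairing would silently attach the wrong labels, which is the concern raised above.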
python/ray/train/batch_predictor.py (outdated)

    original_col_ds = None
    if keep_columns:
        original_col_ds = data.map_batches(
I assume the cost of .map_batches is O(data_size). Is there any API in Datasets that takes both feature_columns and keep_columns, so we get both values we need in one .map_batches pass?
For this I believe we need an API to output 2 datasets as the result of a map on an initial dataset.
There's no API to do this with Datasets right now.
    assert batch_predictor.predict(
        test_dataset, feature_columns=["a"]
    ).to_pandas().to_numpy().squeeze().tolist() == [4.0, 8.0, 12.0]
If the dummy predictor returns data * self.factor, why does [1, 2, 3] map with a factor of 4 here?
There's also a preprocessor, which multiplies by 2 again.
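A hedged reconstruction of the arithmetic (the names below are illustrative, not the actual test fixture): a preprocessor that doubles inputs, combined with a dummy predictor whose self.factor is 2, yields the combined multiplier of 4 seen in the assertion above.

```python
# Illustrative fixture: preprocessor doubles the inputs, then the
# dummy predictor scales by self.factor (assumed 2.0 here), so the
# end-to-end effect on [1, 2, 3] is multiplication by 4.
def preprocess(batch):
    return [v * 2 for v in batch]

class DummyPredictor:
    def __init__(self, factor: float):
        self.factor = factor

    def predict(self, batch):
        return [v * self.factor for v in batch]

out = DummyPredictor(factor=2.0).predict(preprocess([1, 2, 3]))
```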
Let's make sure to do this without adding any new map batches or zip. Just
do the column manipulation in the existing UDF.
Won’t combining into a single stage lower performance for the pipelined case?
Not as much as the extra data movement of adding more distributed ops.
python/ray/train/batch_predictor.py (outdated)

    @@ -106,6 +124,19 @@ def predict(
        ):
            predictor_kwargs["use_gpu"] = True

        if feature_columns:
            dropped_dataset = data.map_batches(
                lambda df: df[feature_columns], batch_size=batch_size
Concretely, move this line into the ScoringWrapper.
python/ray/train/batch_predictor.py (outdated)

        ScoringWrapper,
        compute=compute,
        batch_format="pandas",
        batch_size=batch_size,
        **ray_remote_args,
    )

    if original_col_ds:
        prediction_results = prediction_results.zip(original_col_ds)
Also remove this and move it into the ScoringWrapper.
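The suggested shape can be sketched as follows. This is a hedged illustration, not the actual batch_predictor.py code: the names make_scoring_udf and predict_fn are invented here, but the idea matches the review comments above, i.e. do the column manipulation inside the single scoring UDF rather than adding extra map_batches/zip stages.

```python
import pandas as pd

# Hypothetical sketch: feature selection and column retention both
# happen inside the one UDF that already runs per batch, so kept
# columns come from the same in-memory batch and rows stay aligned
# without a separate zip of two datasets.
def make_scoring_udf(predict_fn, feature_columns=None, keep_columns=None):
    def scoring_udf(batch: pd.DataFrame) -> pd.DataFrame:
        # select model inputs from this batch
        features = batch[feature_columns] if feature_columns else batch
        predictions = predict_fn(features)
        if keep_columns:
            # re-attach kept columns from the same batch (index-aligned)
            predictions[keep_columns] = batch[keep_columns]
        return predictions
    return scoring_udf
```

With this shape, everything runs in the one existing map_batches pass that invokes the ScoringWrapper, avoiding the extra distributed ops discussed above.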
Signed-off-by: Amog Kamsetty <[email protected]>
Updated @ericl, ptal!
Signed-off-by: Amog Kamsetty <[email protected]>
…h-predictor-api Signed-off-by: Amog Kamsetty <[email protected]>
…26451) Signed-off-by: Amog Kamsetty <[email protected]> As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets. Signed-off-by: Xiaowei Jiang <[email protected]>
…26451) Signed-off-by: Amog Kamsetty <[email protected]> As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets. Signed-off-by: Stefan van der Kleij <[email protected]>
As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets. See the updated docstring for a usage example.
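A toy stand-in sketching the new API shape (this is not the real ray.train BatchPredictor; the class, its dummy scaling "model", and the output column naming are all invented for illustration): feature_columns selects the model inputs, and keep_columns carries extra columns through to the output.

```python
import pandas as pd

# Toy stand-in for the new predict() signature, on plain DataFrames
# instead of Ray Datasets.
class ToyBatchPredictor:
    def __init__(self, factor: float):
        self.factor = factor

    def predict(self, data: pd.DataFrame, feature_columns=None, keep_columns=None):
        features = data[feature_columns] if feature_columns else data
        out = features * self.factor  # dummy "model": scale the inputs
        out.columns = [f"predictions_{c}" for c in out.columns]
        if keep_columns:
            out[keep_columns] = data[keep_columns]  # pass-through columns
        return out

test_dataset = pd.DataFrame({"a": [1, 2, 3], "label": ["x", "y", "z"]})
result = ToyBatchPredictor(factor=2.0).predict(
    test_dataset, feature_columns=["a"], keep_columns=["label"]
)
```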
Why are these changes needed?
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.