[AIR] Introduce better scoring API for BatchPredictor
#26451
Conversation
python/ray/train/batch_predictor.py (outdated)

        ScoringWrapper,
        compute=compute,
        batch_format="pandas",
        batch_size=batch_size,
        **ray_remote_args,
    )

    if original_col_ds:
        prediction_results = prediction_results.zip(original_col_ds)
I'm curious what guarantees Ray Datasets provides here. How do we ensure that the results returned from dropped_dataset and original_col_ds always match 1-1 across multiple executions, rather than assigning the wrong labels?
Hmm, not sure I understand you here. They don't necessarily need to match?
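To make the alignment question concrete, here is a plain-Python sketch (hypothetical, not actual Ray Datasets) of the positional assumption that zip-style pairing relies on: both derived sequences come from the same ordered source, so an order-preserving map keeps row i of one aligned with row i of the other.

```python
# Plain-Python sketch (not Ray Datasets): two projections of one
# ordered source stay row-aligned as long as each map preserves order.
source = [{"a": 1, "label": "x"}, {"a": 2, "label": "y"}, {"a": 3, "label": "z"}]

dropped = [{"a": row["a"]} for row in source]                # feature columns only
original_cols = [{"label": row["label"]} for row in source]  # kept columns

# zip-style pairing is correct only if both sides preserve source order
zipped = [{**d, **o} for d, o in zip(dropped, original_cols)]
```

If either side were re-partitioned or reordered between executions, this positional pairing would silently attach the wrong labels, which is the concern raised above.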
python/ray/train/batch_predictor.py (outdated)

    original_col_ds = None
    if keep_columns:
        original_col_ds = data.map_batches(
I assume the cost of .map_batches is O(data_size). Is there any API in Datasets that takes both feature_columns and keep_columns, so we get both values we need in one .map_batches pass?
For this I believe we need an API to output 2 datasets as the result of a map on an initial dataset.
There's no API to do this with Datasets right now.
    assert batch_predictor.predict(
        test_dataset, feature_columns=["a"]
    ).to_pandas().to_numpy().squeeze().tolist() == [4.0, 8.0, 12.0]
If the dummy predictor returns data * self.factor, why does [1, 2, 3] map with a factor of 4 here?
There's also a preprocessor, which multiplies by 2 again.
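A hedged reconstruction of the arithmetic (the names below are illustrative, not the actual test fixture): a preprocessor that doubles inputs, combined with a dummy predictor whose self.factor is 2, yields the combined multiplier of 4 seen in the assertion above.

```python
# Illustrative fixture: preprocessor doubles the inputs, then the
# dummy predictor scales by self.factor (assumed 2.0 here), so the
# end-to-end effect on [1, 2, 3] is multiplication by 4.
def preprocess(batch):
    return [v * 2 for v in batch]

class DummyPredictor:
    def __init__(self, factor: float):
        self.factor = factor

    def predict(self, batch):
        return [v * self.factor for v in batch]

out = DummyPredictor(factor=2.0).predict(preprocess([1, 2, 3]))
```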
Let's make sure to do this without adding any new map batches or zip. Just
do the column manipulation in the existing UDF.
Won’t combining into a single stage lower performance for the pipelined case?
Not as much as the extra data movement of adding more distributed ops.
python/ray/train/batch_predictor.py (outdated)

    @@ -106,6 +124,19 @@ def predict(
        ):
            predictor_kwargs["use_gpu"] = True

        if feature_columns:
            dropped_dataset = data.map_batches(
                lambda df: df[feature_columns], batch_size=batch_size
Concretely, move this line into the ScoringWrapper.
python/ray/train/batch_predictor.py (outdated)

        ScoringWrapper,
        compute=compute,
        batch_format="pandas",
        batch_size=batch_size,
        **ray_remote_args,
    )

    if original_col_ds:
        prediction_results = prediction_results.zip(original_col_ds)
Also remove this and move it into the ScoringWrapper.
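The suggested shape can be sketched as follows. This is a hedged illustration, not the actual batch_predictor.py code: the names make_scoring_udf and predict_fn are invented here, but the idea matches the review comments above, i.e. do the column manipulation inside the single scoring UDF rather than adding extra map_batches/zip stages.

```python
import pandas as pd

# Hypothetical sketch: feature selection and column retention both
# happen inside the one UDF that already runs per batch, so kept
# columns come from the same in-memory batch and rows stay aligned
# without a separate zip of two datasets.
def make_scoring_udf(predict_fn, feature_columns=None, keep_columns=None):
    def scoring_udf(batch: pd.DataFrame) -> pd.DataFrame:
        # select model inputs from this batch
        features = batch[feature_columns] if feature_columns else batch
        predictions = predict_fn(features)
        if keep_columns:
            # re-attach kept columns from the same batch (index-aligned)
            predictions[keep_columns] = batch[keep_columns]
        return predictions
    return scoring_udf
```

With this shape, everything runs in the one existing map_batches pass that invokes the ScoringWrapper, avoiding the extra distributed ops discussed above.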
Signed-off-by: Amog Kamsetty <[email protected]>
Updated @ericl, ptal!
Signed-off-by: Amog Kamsetty <[email protected]>
…h-predictor-api Signed-off-by: Amog Kamsetty <[email protected]>
…26451) Signed-off-by: Amog Kamsetty <[email protected]> As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets. Signed-off-by: Xiaowei Jiang <[email protected]>
…26451) Signed-off-by: Amog Kamsetty <[email protected]> As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets. Signed-off-by: Stefan van der Kleij <[email protected]>
As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets. See the updated docstring for a usage example.
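A toy stand-in sketching the new API shape (this is not the real ray.train BatchPredictor; the class, its dummy scaling "model", and the output column naming are all invented for illustration): feature_columns selects the model inputs, and keep_columns carries extra columns through to the output.

```python
import pandas as pd

# Toy stand-in for the new predict() signature, on plain DataFrames
# instead of Ray Datasets.
class ToyBatchPredictor:
    def __init__(self, factor: float):
        self.factor = factor

    def predict(self, data: pd.DataFrame, feature_columns=None, keep_columns=None):
        features = data[feature_columns] if feature_columns else data
        out = features * self.factor  # dummy "model": scale the inputs
        out.columns = [f"predictions_{c}" for c in out.columns]
        if keep_columns:
            out[keep_columns] = data[keep_columns]  # pass-through columns
        return out

test_dataset = pd.DataFrame({"a": [1, 2, 3], "label": ["x", "y", "z"]})
result = ToyBatchPredictor(factor=2.0).predict(
    test_dataset, feature_columns=["a"], keep_columns=["label"]
)
```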
Why are these changes needed?
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.