[AIR] Introduce better scoring API for BatchPredictor #26451

Merged
merged 10 commits into from
Jul 14, 2022

Conversation

amogkam
Contributor

@amogkam amogkam commented Jul 11, 2022

As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets.

See the updated docstring for usage example.

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

    ScoringWrapper,
    compute=compute,
    batch_format="pandas",
    batch_size=batch_size,
    **ray_remote_args,
)

if original_col_ds:
    prediction_results = prediction_results.zip(original_col_ds)
Member

@jiaodong jiaodong Jul 11, 2022

I'm curious what guarantees Ray Datasets provide here. How do we ensure that the results returned from dropped_dataset and original_col_ds always match 1-1 across multiple executions, rather than assigning the wrong labels?

Contributor Author

Hmm not sure I understand you here.

They don't necessarily need to match?
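
To make the alignment question concrete, here is a minimal plain-Python sketch of a positional zip. It is not the Ray Datasets API: `zip_rows`, `predictions`, and `kept` are hypothetical names, and the sketch only assumes that zipping pairs rows by position, so the kept columns line up with the predictions only if both sides preserve the original row order.

```python
def zip_rows(left, right):
    """Pair rows by position and merge their columns (hypothetical helper)."""
    if len(left) != len(right):
        raise ValueError("both sides must have the same number of rows")
    return [{**l, **r} for l, r in zip(left, right)]


# Toy input: one feature column "a" and one column we want to keep.
rows = [{"a": 1, "label": "x"}, {"a": 2, "label": "y"}, {"a": 3, "label": "z"}]

# Two independent passes over the same rows, mirroring the two map_batches calls.
predictions = [{"predictions": row["a"] * 2.0} for row in rows]
kept = [{"label": row["label"]} for row in rows]

# Same order on both sides: every prediction gets its own label.
aligned = zip_rows(predictions, kept)

# If one side were reordered, the zip would silently attach wrong labels.
misaligned = zip_rows(predictions, list(reversed(kept)))
```

In this sketch nothing detects the misaligned case, which is why the ordering guarantee of the underlying dataset operations matters.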


original_col_ds = None
if keep_columns:
    original_col_ds = data.map_batches(
Member

I assume the cost of .map_batches is O(data_size). Is there any API in Datasets that takes both feature_columns and keep_columns, so that we get both sets of values in a single .map_batches pass?

Contributor Author

For this I believe we need an API to output 2 datasets as the result of a map on an initial dataset.

There's no API to do this with Datasets right now.


assert batch_predictor.predict(
    test_dataset, feature_columns=["a"]
).to_pandas().to_numpy().squeeze().tolist() == [4.0, 8.0, 12.0]
Member

If the dummy predictor returns data * self.factor, why would [1, 2, 3] map to a factor of 4 here?

Contributor Author

There's also a preprocessor which multiplies by 2 again
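
The arithmetic can be checked with a small stand-in. This is a sketch, not the test's actual classes: `DummyPreprocessor` and `DummyPredictor` below are hypothetical, and the predictor factor of 2.0 is an assumption inferred from the phrase "multiplies by 2 again".

```python
class DummyPreprocessor:
    """Stand-in preprocessor that doubles every value."""

    def transform(self, values):
        return [v * 2 for v in values]


class DummyPredictor:
    """Stand-in predictor that returns data * factor after preprocessing."""

    def __init__(self, factor=2.0, preprocessor=None):
        self.factor = factor  # assumed value, not confirmed by the PR
        self.preprocessor = preprocessor

    def predict(self, values):
        if self.preprocessor is not None:
            values = self.preprocessor.transform(values)
        return [v * self.factor for v in values]


predictor = DummyPredictor(preprocessor=DummyPreprocessor())
result = predictor.predict([1, 2, 3])  # 2 (preprocessor) * 2.0 (predictor) = 4x
```

Under these assumptions the two doublings compose, so [1, 2, 3] becomes [4.0, 8.0, 12.0], matching the assertion in the test.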

@ericl
Contributor

ericl commented Jul 11, 2022 via email

@amogkam
Contributor Author

amogkam commented Jul 12, 2022 via email

@ericl
Contributor

ericl commented Jul 12, 2022 via email

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 12, 2022
@@ -106,6 +124,19 @@ def predict(
):
    predictor_kwargs["use_gpu"] = True

if feature_columns:
    dropped_dataset = data.map_batches(
        lambda df: df[feature_columns], batch_size=batch_size
Contributor

Concretely, move this line into the ScoringWrapper.

    ScoringWrapper,
    compute=compute,
    batch_format="pandas",
    batch_size=batch_size,
    **ray_remote_args,
)

if original_col_ds:
    prediction_results = prediction_results.zip(original_col_ds)
Contributor

Also remove and move into ScoringWrapper.
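
One way to read this suggestion is that column selection and column retention both happen inside the per-batch scoring function, so the data is traversed once and no zip is needed. The sketch below is an assumption about the shape of that change, not the merged implementation: `make_scoring_fn` and `double_a` are hypothetical, and batches are modeled as plain dicts of column lists rather than pandas DataFrames.

```python
def make_scoring_fn(predict_fn, feature_columns=None, keep_columns=None):
    """Build a per-batch function that selects features, predicts, and
    re-attaches the kept columns in a single pass (hypothetical helper)."""

    def score(batch):
        # Select only the feature columns if any were requested.
        if feature_columns:
            features = {col: batch[col] for col in feature_columns}
        else:
            features = dict(batch)
        result = dict(predict_fn(features))
        # Copy kept columns from the same batch, so rows cannot drift apart.
        for col in keep_columns or []:
            result[col] = batch[col]
        return result

    return score


def double_a(features):
    """Toy predictor: doubles column "a"."""
    return {"predictions": [v * 2.0 for v in features["a"]]}


score = make_scoring_fn(double_a, feature_columns=["a"], keep_columns=["label"])
out = score({"a": [1, 2, 3], "b": [9, 9, 9], "label": ["x", "y", "z"]})
```

Because the kept columns are read from the same batch that produced the predictions, this design sidesteps both the extra pass over the data and the row-alignment question raised earlier in the review.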

Signed-off-by: Amog Kamsetty <[email protected]>
@amogkam amogkam requested review from ericl and jiaodong July 13, 2022 22:07
@amogkam amogkam removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 13, 2022
@amogkam
Contributor Author

amogkam commented Jul 13, 2022

Updated @ericl, ptal!

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 14, 2022
@amogkam amogkam merged commit 6595bd6 into ray-project:master Jul 14, 2022
@amogkam amogkam deleted the batch-predictor-api branch July 14, 2022 18:26
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
…26451)

Signed-off-by: Amog Kamsetty <[email protected]>

As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets.

Signed-off-by: Xiaowei Jiang <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…26451)

Signed-off-by: Amog Kamsetty <[email protected]>

As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets.

Signed-off-by: Stefan van der Kleij <[email protected]>
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
5 participants