[AIR][Predictor] Enable numpy based predictor #28917

Merged
merged 82 commits into from
Nov 16, 2022
dd9922f
initial commit
jiaodong Sep 30, 2022
18423ac
predictor.py adding numpy and use enum type rather than string
jiaodong Sep 30, 2022
f66c438
fix tests related to keep columns
jiaodong Sep 30, 2022
2053c9b
add arrow format for comp
jiaodong Sep 30, 2022
4183d28
move model outputs to cpu and convert nump
jiaodong Sep 30, 2022
43f067b
add batch_format arg
jiaodong Sep 30, 2022
c2ea0ea
default format to pandas if batch_format is missing, and remove pyarr…
jiaodong Sep 30, 2022
1a109b3
single column selection + output handling
jiaodong Oct 1, 2022
ef0d919
fix torch image batch prediction example
jiaodong Oct 2, 2022
e98ade8
fix training and inference preprocessor consistency + incremental lea…
jiaodong Oct 2, 2022
f9d00e4
convert batch type to numpy to adapt DL with pandas preprocessor, thu…
jiaodong Oct 2, 2022
72432ce
fix torch image example preprocessor
jiaodong Oct 3, 2022
e501d76
Merge branch 'master' into dl_predictor_np
jiaodong Oct 3, 2022
1704036
fix notebook
jiaodong Oct 3, 2022
9234923
batch data type
jiaodong Oct 3, 2022
f56eacd
Merge branch 'master' into dl_predictor_np
jiaodong Oct 6, 2022
0bbc159
address comments, remove predict with batch_format and delegate to pr…
jiaodong Oct 6, 2022
194a21a
Merge branch 'master' into dl_predictor_np
jiaodong Oct 6, 2022
7d5bba6
Merge branch 'master' of https://github.com/ray-project/ray into dl_p…
jiaodong Oct 11, 2022
2ce30a1
wip
jiaodong Oct 12, 2022
f05049f
Merge branch 'master' into dl_predictor_np
jiaodong Oct 21, 2022
f992a0c
Merge branch 'dl_predictor_np' of github.com:jiaodong/ray into dl_pre…
jiaodong Oct 24, 2022
6606555
torch_predictor passed after reemoving batch_format from BatchPredictor
jiaodong Oct 24, 2022
16c9f13
fix tensorflow tests
jiaodong Oct 24, 2022
5748ca1
fix batch predictor
jiaodong Oct 25, 2022
bfcbcdf
add fallback pandas path to base predictor if _predict_numpy does not…
jiaodong Oct 25, 2022
1f5ba7a
fix some tests
jiaodong Oct 25, 2022
c70e15c
fix torch image example
jiaodong Oct 25, 2022
03dbbbc
Merge branch 'dl_predictor_np' of https://github.com/jiaodong/ray int…
jiaodong Oct 25, 2022
a3e047f
fix xgboost docstring
jiaodong Oct 25, 2022
1d5d9cd
try fix serve air_integrations by handling pandas with raw ndarray
jiaodong Oct 26, 2022
5f76d30
fix notebooks and add pandas format path to predictor.py
jiaodong Oct 26, 2022
1ec5438
Merge branch 'master' into dl_predictor_np
jiaodong Oct 26, 2022
e5ba13a
fix last notebook
jiaodong Oct 26, 2022
035e673
enhance air_integration tests
jiaodong Oct 26, 2022
7b1dfbd
refactor test_predictor tests to cover all pandas + numpy preprocesso…
jiaodong Oct 27, 2022
4aad0c5
update all batch_predictor level test combinations, remove ScoreWrapp…
jiaodong Oct 27, 2022
d3d192c
fix docs and docstring
jiaodong Oct 27, 2022
59006f6
fix test_batch_predictor test to work with python cli
jiaodong Oct 27, 2022
4a85371
add keep_col tests with preserved single column prediction for numpy …
jiaodong Oct 28, 2022
a97b3df
fix tests for keep column and single column output
jiaodong Oct 28, 2022
e4ea455
Merge branch 'master' into dl_predictor_np
jiaodong Oct 28, 2022
d0ed91d
fix notebook
jiaodong Oct 28, 2022
f0790a8
Merge branch 'master' into dl_predictor_np
jiaodong Nov 3, 2022
07c8fd8
Update python/ray/serve/air_integrations.py
jiaodong Nov 3, 2022
3140673
Update python/ray/serve/air_integrations.py
jiaodong Nov 3, 2022
4dbb7de
Update python/ray/train/tests/test_predictor.py
jiaodong Nov 3, 2022
f682d5b
Update python/ray/train/batch_predictor.py
jiaodong Nov 3, 2022
7c5fe89
Update python/ray/train/_internal/dl_predictor.py
jiaodong Nov 3, 2022
3019220
Update python/ray/train/_internal/dl_predictor.py
jiaodong Nov 3, 2022
6decdfd
Merge branch 'dl_predictor_np' of https://github.com/jiaodong/ray int…
jiaodong Nov 3, 2022
d8e729b
address comments and make predictor more self contained
jiaodong Nov 3, 2022
0efda24
fix tests
jiaodong Nov 4, 2022
74412c2
revert not needed notebook changes
jiaodong Nov 4, 2022
6f920af
change BlockFormat
jiaodong Nov 4, 2022
8cdbe82
update dataset_format
jiaodong Nov 4, 2022
10b0972
Merge branch 'master' into dl_predictor_np
jiaodong Nov 7, 2022
c90030e
simply preferred batch format at Predictor class level
jiaodong Nov 8, 2022
58fa01e
Merge branch 'master' into dl_predictor_np
jiaodong Nov 8, 2022
1f9ba1b
Merge branch 'master' into dl_predictor_np
jiaodong Nov 8, 2022
71231c8
fix test_air_integration
jiaodong Nov 8, 2022
aa9d3d0
fix lint
jiaodong Nov 8, 2022
291ffc7
address comments
jiaodong Nov 10, 2022
e4d3af4
Merge branch 'master' into dl_predictor_np
jiaodong Nov 10, 2022
a79c4cf
remove casting flags
jiaodong Nov 10, 2022
e66ef51
address comment regarding preprocessor format and predictor format de…
jiaodong Nov 11, 2022
4bee5b5
Merge branch 'dl_predictor_np' of https://github.com/jiaodong/ray int…
jiaodong Nov 11, 2022
b5cd7f7
Merge branch 'master' into dl_predictor_np
jiaodong Nov 11, 2022
f912662
tensor cast column nit
jiaodong Nov 11, 2022
26e9320
fix torch predictor doc test
jiaodong Nov 11, 2022
1e291bb
support chain of preprocessors
jiaodong Nov 11, 2022
c14ca57
fix torch predictor doctest
jiaodong Nov 11, 2022
5c8cef1
nit
jiaodong Nov 11, 2022
329cce2
fix tests
jiaodong Nov 11, 2022
1e7a6e6
fix output
jiaodong Nov 11, 2022
3a8e7ae
fix testouput docstring
jiaodong Nov 12, 2022
3df9446
predictor changes
jiaodong Nov 15, 2022
551c7ac
address comments
jiaodong Nov 15, 2022
dbc3bd8
make determine_transform_to_use private
jiaodong Nov 15, 2022
1ee1ede
Apply suggestions from code review
jiaodong Nov 15, 2022
461eb1f
lint
jiaodong Nov 15, 2022
80546a4
fix test
jiaodong Nov 15, 2022
5 changes: 2 additions & 3 deletions doc/source/train/doc_code/xgboost_train_predict.py
@@ -22,12 +22,11 @@
 # __train_predict_end__

 # __batch_predict_start__
+import pandas as pd
 from ray.train.batch_predictor import BatchPredictor

 batch_predictor = BatchPredictor.from_checkpoint(result.checkpoint, XGBoostPredictor)
-predict_dataset = ray.data.from_items(
-    [{"x": x} for x in np.expand_dims(np.arange(32), 1)]
-)
+predict_dataset = ray.data.from_pandas(pd.DataFrame({"x": np.arange(32)}))
 predictions = batch_predictor.predict(
     data=predict_dataset,
     batch_size=8,
21 changes: 9 additions & 12 deletions python/ray/train/batch_predictor.py
@@ -188,7 +188,7 @@ def predict(
             self._determine_preprocessor_batch_format(data)
         )
         # This is the [Y] in case of separated GPU stage prediction
-        separated_stage_prediction_batch_format: BatchFormat = (
+        predict_stage_batch_format: BatchFormat = (
             self._predictor_cls._batch_format_to_use()
         )
         ctx = DatasetContext.get_current()
@@ -220,10 +220,8 @@ def _select_columns_from_input_batch(
                 f"Column name(s) {select_columns} should not be provided "
                 "for prediction input data type of ``numpy.ndarray``"
             )
-        if isinstance(batch_data, dict):
+        elif isinstance(batch_data, dict):
             return {k: v for k, v in batch_data.items() if k in select_columns}
-        elif isinstance(batch_data, np.ndarray):
-            return batch_data
         elif isinstance(batch_data, pd.DataFrame):
             # Select a subset of the pandas columns.
             return batch_data[select_columns]
jiaodong marked this conversation as resolved.
@@ -287,13 +285,15 @@ def __call__(self, input_batch: DataBatchType) -> DataBatchType:
             # missing _dataset_format()

Contributor:
This is no longer true with your addition, we should try unifying these paths in a follow-up PR.

             batch_fn = preprocessor._transform_batch
             data = data.map_batches(
-                batch_fn, batch_format=separated_stage_prediction_batch_format
+                batch_fn, batch_format=predict_stage_batch_format
             )

         prediction_results = data.map_batches(
             ScoringWrapper,
             compute=compute,
-            batch_format=preprocessor_batch_format,
+            batch_format=preprocessor_batch_format
+            if self.get_preprocessor()
jiaodong marked this conversation as resolved.
+            else predict_stage_batch_format,
             batch_size=batch_size,
             **ray_remote_args,
         )
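
As a rough illustration of the conditional introduced in this hunk, the sketch below isolates the choice of `batch_format` for the scoring stage: the preprocessor's format when a preprocessor exists, otherwise the predictor's. The `BatchFormat` enum here is a local stand-in, and `scoring_batch_format` is a hypothetical helper that does not exist in Ray. The diff itself tests the truthiness of `self.get_preprocessor()`, whereas this sketch compares against `None`.

```python
from enum import Enum
from typing import Any, Optional


class BatchFormat(str, Enum):
    # Local stand-in for the BatchFormat enum referenced in the diff.
    PANDAS = "pandas"
    NUMPY = "numpy"


def scoring_batch_format(
    preprocessor: Optional[Any],
    preprocessor_batch_format: BatchFormat,
    predict_stage_batch_format: BatchFormat,
) -> BatchFormat:
    """Pick the batch format handed to the scoring stage (hypothetical helper)."""
    if preprocessor is not None:
        # A preprocessor exists, so the scoring stage receives batches
        # in the format the preprocessor works with.
        return preprocessor_batch_format
    # No preprocessor: fall back to the format the predictor prefers.
    return predict_stage_batch_format
```

With no preprocessor, a predictor that prefers numpy batches therefore receives numpy batches, regardless of the format that would have been chosen for a preprocessor.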
@@ -421,15 +421,12 @@ def _determine_preprocessor_batch_format(
         dataset_block_format = ds.dataset_format()
         if dataset_block_format == BlockFormat.SIMPLE:
             # Naive case that we cast to pandas for compatibility.
+            # TODO: Revisit
             return BatchFormat.PANDAS
jiaodong marked this conversation as resolved.

         if not preprocessor:
jiaodong marked this conversation as resolved.
-            # No preprocessor, just use the dataset format.
-            return (
-                BatchFormat.NUMPY
-                if dataset_block_format == BlockFormat.ARROW
-                else BatchFormat.PANDAS
-            )
+            # No preprocessor, just use the predictor format.
+            return self._predictor_cls._batch_format_to_use()

Contributor:
this function should never be called in the first place if preprocessor is None. Don't think we need this if clause

Contributor:
@amogkam Hmm seems like this is still called if preprocessor is None?

self._determine_preprocessor_batch_format(data)

Contributor:
yeah, but it doesn't need to be. But this is a minor point, so looks good to merge.

         elif hasattr(preprocessor, "preprocessors"):
             # For Chain preprocessor, we picked the first one as entry point.
             # TODO (jiaodong): We should revisit if our Chain preprocessor is
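
The branch order visible in this hunk can be sketched in isolation as follows. This is a simplified stand-in under stated assumptions, not the code in `batch_predictor.py`: the two enums are local copies, `determine_preprocessor_batch_format` is a free function taking its inputs as plain arguments, and the `batch_format` attribute on a preprocessor is a hypothetical name invented for the example, since the real per-preprocessor lookup is not shown in the diff.

```python
from enum import Enum
from types import SimpleNamespace
from typing import Any, Optional


class BlockFormat(str, Enum):
    # Local stand-in; only SIMPLE and ARROW appear in the diff.
    SIMPLE = "simple"
    ARROW = "arrow"


class BatchFormat(str, Enum):
    # Local stand-in for the batch formats named in the diff.
    PANDAS = "pandas"
    NUMPY = "numpy"


def determine_preprocessor_batch_format(
    dataset_block_format: BlockFormat,
    preprocessor: Optional[Any],
    predictor_batch_format: BatchFormat,
) -> BatchFormat:
    # Simple blocks are cast to pandas for compatibility.
    if dataset_block_format == BlockFormat.SIMPLE:
        return BatchFormat.PANDAS
    # No preprocessor: defer to the predictor's format (the new behavior).
    if preprocessor is None:
        return predictor_batch_format
    # Chain-like preprocessor: use its first member as the entry point.
    if hasattr(preprocessor, "preprocessors"):
        preprocessor = preprocessor.preprocessors[0]
    # Hypothetical attribute standing in for the real lookup.
    return preprocessor.batch_format


# Stand-in preprocessors built from plain namespaces.
numpy_pp = SimpleNamespace(batch_format=BatchFormat.NUMPY)
chain = SimpleNamespace(preprocessors=[numpy_pp])

print(determine_preprocessor_batch_format(BlockFormat.ARROW, None, BatchFormat.NUMPY))
print(determine_preprocessor_batch_format(BlockFormat.ARROW, chain, BatchFormat.PANDAS))
```

The reviewers' point is that the `preprocessor is None` branch could be avoided by never calling the function without a preprocessor. The sketch keeps the branch because the merged code still calls it in that case.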
4 changes: 2 additions & 2 deletions python/ray/train/tests/test_predictor.py
@@ -144,7 +144,7 @@ def test_predict_numpy_with_numpy_data():
         pd.DataFrame({TENSOR_COLUMN_NAME: [2, 4, 6]}),
     )

-    # Test predcit with both Numpy and Pandas preprocessor available
+    # Test predict with both Numpy and Pandas preprocessor available
     checkpoint = Checkpoint.from_dict(
         {"factor": 2.0, PREPROCESSOR_KEY: DummyWithNumpyPreprocessor()}
     )
@@ -183,7 +183,7 @@ def test_predict_pandas_with_numpy_data():
         pd.DataFrame({TENSOR_COLUMN_NAME: [2, 4, 6]}),
     )

-    # Test predcit with both Numpy and Pandas preprocessor available
+    # Test predict with both Numpy and Pandas preprocessor available
     checkpoint = Checkpoint.from_dict(
         {"factor": 2.0, PREPROCESSOR_KEY: DummyWithNumpyPreprocessor()}
     )