[AIR][Predictor] Enable numpy based predictor #28917
Conversation
…ow batch type from all predictors
…s fix pytorch_tabular_starter
This PR is good for initial review, with pending fixes on one example notebook.
=== edit ===
Some release tests are flaky due to marginal e2e latency assertions, but this PR doesn't touch them (predictor only).
Failed release test
Thanks @jiaodong, overall lgtm! Left some minor comments
python/ray/data/preprocessor.py
Outdated
@@ -183,7 +184,7 @@ def _fit(self, dataset: Dataset) -> "Preprocessor":
         """Sub-classes should override this instead of fit()."""
         raise NotImplementedError()

-    def _determine_transform_to_use(self, data_format: str) -> str:
+    def determine_transform_to_use(self, data_format: BlockFormat) -> BatchFormat:
can we keep this as private still? it should not be a public facing api to users
    elif output_df.dtypes[col] == np.dtype(object) and all(
        isinstance(v, np.ndarray) for v in output_df[col]
    ):
        output_df.loc[:, col] = [v.tolist() for v in output_df[col]]
if numpy arrays are not JSON serializable, is this also a problem if a dict of ndarrays is returned?
this function is currently only called for pandas output.
Serve already knows how to handle it https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/serve/air_integrations.py?L129
This is a very small edge case where we return a DataFrame that happens to have an ndarray in it due to fallback casting. With your ongoing PR we should be able to remove this function and the casting completely.
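For context, a minimal standalone sketch (not code from this PR; the DataFrame contents are made up) of why the cast above is needed: numpy arrays inside an object-dtype column are not JSON serializable, while plain lists are.

```python
import json

import numpy as np
import pandas as pd

# Object-dtype column whose values are numpy arrays, as produced by the fallback casting.
df = pd.DataFrame({"embedding": [np.array([1.0, 2.0]), np.array([3.0, 4.0])]})

try:
    json.dumps(df.to_dict(orient="records"))
except TypeError as err:
    print(f"ndarray values break JSON serialization: {err}")

# Casting each ndarray to a plain Python list, as the snippet above does, fixes it.
df["embedding"] = [v.tolist() for v in df["embedding"]]
print(json.dumps(df.to_dict(orient="records")))
```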
    - Predictor implementation (pandas vs numpy)
    """
    # Got to inline this rather than using @pytest.mark.parametrize to avoid
    # unknown object owner error when running test with python cli.
Following up on this - we use parametrize in test_batch_mapper. Does that not work here?
        else BatchFormat.PANDAS
    )
    # No preprocessor, just use the predictor format.
    return self._predictor_cls._batch_format_to_use()
This function should never be called in the first place if preprocessor is None. Don't think we need this if clause.
@amogkam Hmm, seems like this is still called if preprocessor is None?
ray/python/ray/train/batch_predictor.py
Line 188 in dbc3bd8
self._determine_preprocessor_batch_format(data)
yeah, but it doesn't need to be. But this is a minor point, so looks good to merge.
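For illustration only, a hedged sketch of the behavior discussed here (simplified, with a hypothetical helper; the real logic lives in batch_predictor.py): when no preprocessor is attached, fall back to the predictor's own batch format instead of running the preprocessor-format resolution.

```python
from enum import Enum
from typing import Optional


class BatchFormat(str, Enum):
    PANDAS = "pandas"
    NUMPY = "numpy"


def resolve_batch_format(
    predictor_format: BatchFormat, preprocessor_format: Optional[BatchFormat]
) -> BatchFormat:
    """Hypothetical helper: pick the batch format for the prediction map_batches call."""
    if preprocessor_format is None:
        # No preprocessor attached: only the predictor's preference matters, so the
        # preprocessor-specific resolution never needs to run.
        return predictor_format
    return preprocessor_format


print(resolve_batch_format(BatchFormat.NUMPY, None))                # BatchFormat.NUMPY
print(resolve_batch_format(BatchFormat.NUMPY, BatchFormat.PANDAS))  # BatchFormat.PANDAS
```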
Thanks, lgtm! Can we address the remaining comments before merging?
LGTM, only nits and suggestions for follow-ups, so I think we can merge!
    """
    # We need schema to properly validate, so synchronously
    # fetch it if necessary.
    schema = self.schema(fetch_if_missing=True)
With the pipeline peeking implemented above, this triggering execution should be fine, i.e. we shouldn't hit the double-execution issue. 👍
    """
    from ray.data.extensions import TensorDtype

    for col in output_df.columns:
        # TensorArray requires special handling to numpy array.
I'm assuming that we're leaving this relatively alone in this PR? Just double-checking, what was the decision?
@@ -222,13 +281,19 @@ def __call__(self, batch):
    # Set the in-predictor preprocessing to a no-op when using a separate
    # GPU stage. Otherwise, the preprocessing will be applied twice.
    override_prep = BatchMapper(lambda x: x)
    # preprocessor.transform will break for DatasetPipeline due to
    # missing _dataset_format()
This is no longer true with your addition, we should try unifying these paths in a follow-up PR.
    raise NotImplementedError(
        "None of `_predict_pandas` or `_predict_numpy` are "
        f"implemented for input data batch format `{batch_format}`."
    )
Ah I see that the check is here, nice. This happens upstream of any Predictor._batch_format_to_use() calls, right?
This one is a bit more downstream though, only hit upon seeing data on predict, so I've added your suggestion above to surface the issue earlier.
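To make the "surface the issue earlier" idea concrete, a rough sketch (hypothetical names, not the actual Ray code) of validating at construction time that a predictor overrides at least one of the two predict hooks, instead of raising only once a batch arrives:

```python
class BasePredictor:
    """Hypothetical base class with two optional predict hooks."""

    def _predict_pandas(self, batch):
        raise NotImplementedError

    def _predict_numpy(self, batch):
        raise NotImplementedError

    @classmethod
    def validate_implements_predict(cls) -> None:
        # Fail fast (e.g. in BatchPredictor's constructor) rather than on the
        # first data batch: require at least one hook to be overridden.
        has_pandas = cls._predict_pandas is not BasePredictor._predict_pandas
        has_numpy = cls._predict_numpy is not BasePredictor._predict_numpy
        if not (has_pandas or has_numpy):
            raise NotImplementedError(
                f"{cls.__name__} must implement at least one of "
                "`_predict_pandas` or `_predict_numpy`."
            )


class MyNumpyPredictor(BasePredictor):
    def _predict_numpy(self, batch):
        return {"predictions": batch["x"] * 2}


MyNumpyPredictor.validate_implements_predict()  # passes

try:
    BasePredictor.validate_implements_predict()
except NotImplementedError as err:
    print(err)
```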
    - Predictor implementation (pandas vs numpy)
    """
    # Got to inline this rather than using @pytest.mark.parametrize to avoid
    # unknown object owner error when running test with python cli.
I fixed this issue for the test_batch_mapper tests by ensuring that the fixtures use the ray_start_regular_shared fixture for their Datasets execution; otherwise the fixtures could create the Datasets on a separate Ray cluster from the one the eventual tests run on. If you have this test use the ray_start_regular_shared fixture, and turn these test cases into fixtures depending on the ray_start_regular_shared fixture, it should work:
ray/python/ray/data/tests/conftest.py
Line 292 in c749ad3
def ds_pandas_single_column_format(ray_start_regular_shared):
Can happen in a follow-up PR if you'd like.
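A rough sketch of the suggested shape (illustrative only; the dataset contents are made up and ray_start_regular_shared is assumed to come from Ray's test conftest), where the dataset-producing fixture depends on ray_start_regular_shared so it runs on the same Ray cluster as the test:

```python
import pandas as pd
import pytest

import ray


@pytest.fixture
def ds_pandas_single_column_format(ray_start_regular_shared):
    # Depending on ray_start_regular_shared ensures the Dataset is created on the
    # same shared Ray cluster that the test itself uses.
    yield ray.data.from_pandas(pd.DataFrame({"x": [1, 2, 3]}))


def test_predict_pandas(ray_start_regular_shared, ds_pandas_single_column_format):
    assert ds_pandas_single_column_format.count() == 3
```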
doc changes
Co-authored-by: Clark Zinzow <[email protected]> Signed-off-by: Jiao <[email protected]>
rllib get started is flaky; it also fails on master.
Co-authored-by: Clark Zinzow <[email protected]> Co-authored-by: Amog Kamsetty <[email protected]> Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
Add a numpy-first path for DL predictors such as TensorFlow and PyTorch.
Notable changes:
- Introduce BatchFormat and use it across our codebase instead of raw string values
- predictor.py now chooses the implementation to call based on the input batch data type, the same as preprocessors (see the sketch below)
- predictor.py and batch_predictor.py return the same data type as the input batch / block format
- test_predictor.py removes mocks and tests against all numpy + pandas combinations of {data batch, preprocessor, predictor}
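As a hedged sketch of the dispatch described above (simplified; not the exact code in predictor.py, and the doubling model is made up), a predictor picks _predict_numpy vs _predict_pandas based on the incoming batch type and returns the same type it received:

```python
from typing import Dict, Union

import numpy as np
import pandas as pd

DataBatchType = Union[pd.DataFrame, Dict[str, np.ndarray]]


class SketchPredictor:
    """Illustrative only: dispatch on the batch type and preserve it on output."""

    def predict(self, batch: DataBatchType) -> DataBatchType:
        if isinstance(batch, pd.DataFrame):
            return self._predict_pandas(batch)
        # Numpy path: a dict of column name -> ndarray.
        return self._predict_numpy(batch)

    def _predict_pandas(self, batch: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame({"predictions": batch["x"] * 2})

    def _predict_numpy(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        return {"predictions": batch["x"] * 2}


predictor = SketchPredictor()
print(predictor.predict(pd.DataFrame({"x": [1, 2, 3]})))  # returns a pandas DataFrame
print(predictor.predict({"x": np.array([1, 2, 3])}))      # returns a dict of ndarrays
```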
Batch prediction results
TL;DR -- faster, better memory footprint, no GPU memory leak or OOM.
Setup:
Image 1: Pandas narrow-waist prediction, +0.6 GB accumulated GPU memory usage each batch
Image 2: Pandas narrow-waist prediction, extra 3.03 GB GPU memory required to dump final output from batchnorm, which leads to OOM
Image 3: Numpy narrow-waist prediction, constant memory usage; the run finishes.
Related issue number
Related #28346
Closes #28525, #28627, #29003
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.