[AIR][GPU Batch Prediction] Add basic support for GPU batch prediction #26251
Conversation
output = self.model(model_input)

def untensorize(torch_tensor):
-    numpy_array = torch_tensor.cpu().detach().numpy()
+    numpy_array = torch_tensor.detach().cpu().numpy()
This is a minor nit improvement: by detaching first, we don't construct an autograd edge for the CPU copy and then detach it. The end result is the same.
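A minimal illustration of the ordering difference (the tensor here is a CPU stand-in for the predictor's model output, which in the real code lives on the GPU):

```python
import torch

# Stand-in for a model output that is still attached to the autograd graph.
t = torch.ones(2, requires_grad=True) * 2  # non-leaf tensor with a grad_fn

# Old order: the device copy is made while the tensor is still attached to
# the autograd graph, so autograd records an edge for the copy before it is
# detached (this matters when the source tensor actually lives on a GPU).
a = t.cpu().detach()

# New order: detach first, so the copy starts from a tensor with no autograd
# history and no edge is ever constructed.
b = t.detach().cpu()

assert torch.equal(a, b)  # same values either way, as the comment notes
```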
self.use_gpu = use_gpu
if use_gpu:
    # Ensure input tensor and model live on GPU for GPU inference
    self.model.to(torch.device("cuda"))
Model init and config are moved to the constructor rather than being redone on every predict() call.
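A condensed sketch of that pattern (class and method shapes are illustrative, not the exact Ray source):

```python
import torch
import torch.nn as nn


class SimpleTorchPredictor:
    """Sketch only: one-time GPU setup in the constructor."""

    def __init__(self, model: nn.Module, use_gpu: bool = False):
        self.model = model
        self.model.eval()
        self.use_gpu = use_gpu
        if use_gpu:
            # One-time placement, instead of moving the model on every call.
            self.model.to(torch.device("cuda"))

    def predict(self, batch: torch.Tensor) -> torch.Tensor:
        if self.use_gpu:
            # Input must be co-located with the model for GPU inference.
            batch = batch.to(torch.device("cuda"))
        with torch.no_grad():
            return self.model(batch)
```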
if use_gpu:
    with tf.device("GPU:0"):
        self.model = self.model_definition()
else:
    self.model = self.model_definition()
Model init and config are moved to the constructor rather than being redone on every predict() call.
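The TensorFlow side follows the same shape; a simplified sketch (hypothetical class, assuming model_definition is a zero-argument callable returning a tf.keras.Model):

```python
import tensorflow as tf


class SimpleTFPredictor:
    """Sketch only: build the Keras model once, pinned to GPU:0 when requested."""

    def __init__(self, model_definition, use_gpu: bool = False):
        self.use_gpu = use_gpu
        if use_gpu:
            # Variables created under this scope are placed on the first GPU.
            with tf.device("GPU:0"):
                self.model = model_definition()
        else:
            self.model = model_definition()

    def predict(self, batch):
        if self.use_gpu:
            with tf.device("GPU:0"):
                return self.model(batch, training=False)
        return self.model(batch, training=False)
```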
My main comment is around edge-case coverage: if a GPU is available and `use_gpu=False`, warn the user.
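A hedged sketch of the kind of check being asked for (warning text and placement are illustrative):

```python
import warnings

import torch


def warn_if_gpu_unused(use_gpu: bool) -> None:
    # Edge case from the review: a CUDA device is visible but the user left
    # use_gpu=False, so inference would silently run on CPU.
    if not use_gpu and torch.cuda.is_available():
        warnings.warn(
            "A CUDA device is available but use_gpu=False; "
            "pass use_gpu=True to run inference on the GPU."
        )
```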
Thanks @jiaodong! Overall LGTM, but let's make sure to follow up on automatically setting `use_gpu` with `PredictorDeployment`!
.buildkite/pipeline.gpu.large.yml
@@ -39,6 +39,8 @@
   - DATA_PROCESSING_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 ./ci/env/install-dependencies.sh
   - pip install -Ur ./python/requirements_ml_docker.txt
   - bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=gpu python/ray/air/...
+  - pip install -U horovod
Ah, instead of manually installing here you can set the `INSTALL_HOROVOD=1` env var for `install-dependencies.sh`.
Oh nice, let me change it.
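For reference, one plausible shape of the updated pipeline step after that change (a sketch; the final hunk isn't shown in this thread):

```yaml
# Sketch: drop the manual `pip install -U horovod` step and let
# install-dependencies.sh handle it via the env var instead.
- DATA_PROCESSING_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 INSTALL_HOROVOD=1 ./ci/env/install-dependencies.sh
- pip install -Ur ./python/requirements_ml_docker.txt
- bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=gpu python/ray/air/...
```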
"//python/ray/air:__pkg__", | ||
"//python/ray/air:__subpackages__", | ||
"//python/ray/train:__pkg__", | ||
"//python/ray/train:__subpackages__", |
For my own understanding, what does this do?
I noticed that Ray Train tests are actually using modules from the AIR folder, such as `test_horovod`. So I set up both the ML and AIR Bazel targets to be visible to each other, so tests can directly import and use them if needed.
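For context, this is standard Bazel target visibility; a minimal sketch of how a package might use the list above (target name and srcs are hypothetical):

```python
# BUILD file sketch: the visibility list lets both the air and train
# packages (and all of their subpackages) depend on targets defined here.
py_library(
    name = "air_test_utils",
    srcs = ["tests/conftest.py"],
    visibility = [
        "//python/ray/air:__pkg__",
        "//python/ray/air:__subpackages__",
        "//python/ray/train:__pkg__",
        "//python/ray/train:__subpackages__",
    ],
)
```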
@jiaodong actually looks like the lint failure is related. This PR removes …
Why are these changes needed?
This PR adds GPU support for the PyTorch and TensorFlow predictors, as well as automatically setting the `use_gpu` flag in `BatchPredictor`.

Notable changes:

- Added a `use_gpu` flag in the constructors of `TorchPredictor` and `TensorflowPredictor` (note this is slightly different from our latest design doc, which puts the flag at the `predict()` call)
- Added a `use_gpu` flag to `SklearnPredictor` so its interface is compatible with `BatchPredictor`
- Code to move both the model weights and the input tensor to the default visible GPU at index 0 if the flag is set
- Parametrized existing predictor tests to use GPU for both CPU & GPU coverage
- Changed BUILD CI tests with an added `gpu` tag (I'm not 100% sure if that's the right way though)

Follow-ups:

#26249 is created in case our host has multiple GPU devices. It's a bit out of scope for this PR, but for GPU batch inference ideally we should be able to evenly use all GPU devices on a host whose CPU & DRAM are busy with pre-fetching + data movement to the GPU. We might approximate this by scheduling the same number of Predictor instances on the host, but that's worth verifying once benchmarks are set up.
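As a rough sketch of that follow-up idea (not part of this PR), each predictor instance could pick its device round-robin from a worker index; the function and names here are illustrative:

```python
import torch


def pick_device(worker_rank: int) -> torch.device:
    """Sketch for the multi-GPU follow-up (#26249): spread predictor
    instances evenly over all visible GPUs instead of always using cuda:0."""
    n = torch.cuda.device_count()
    if n == 0:
        return torch.device("cpu")
    return torch.device(f"cuda:{worker_rank % n}")


# e.g. four predictor actors on a 2-GPU host -> cuda:0, cuda:1, cuda:0, cuda:1
devices = [pick_device(rank) for rank in range(4)]
```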
Related issue number
Checks

- I've run `scripts/format.sh` to lint the changes in this PR.