
[AIR][GPU Batch Prediction] Add basic support for GPU batch prediction #26251

Merged — 25 commits merged into ray-project:master on Jul 11, 2022

Conversation

@jiaodong (Member) commented Jul 1, 2022

Why are these changes needed?

This PR adds GPU support for the PyTorch and TensorFlow predictors, as well as automatically setting the use_gpu flag in BatchPredictor.

Notable changes:

  • Added a use_gpu flag to the constructors of TorchPredictor and TensorflowPredictor (note this differs slightly from our latest design doc, which puts the flag on the predict() call)
  • Added a use_gpu flag to SklearnPredictor so its interface stays compatible with BatchPredictor
  • Added code to move both model weights and input tensors to the default visible GPU (index 0) when the flag is set; a minimal sketch follows this list
  • Parametrized the existing predictor tests to also run on GPU, for both CPU and GPU coverage
  • Changed the BUILD CI tests to add a gpu tag (I'm not 100% sure that's the right way to do it, though)
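To make the constructor-level flag concrete, here is a minimal, hypothetical sketch; the class name, signature, and batching are simplified assumptions, not the actual TorchPredictor code:

```python
import numpy as np
import torch
import torch.nn as nn


class SimpleTorchPredictor:
    """Illustrative stand-in for a constructor-level use_gpu flag."""

    def __init__(self, model: nn.Module, use_gpu: bool = False):
        self.model = model
        self.use_gpu = use_gpu
        if use_gpu:
            # Move the model weights to the default visible GPU (index 0).
            self.model.to(torch.device("cuda"))
        self.model.eval()

    def predict(self, batch: np.ndarray) -> np.ndarray:
        tensor = torch.as_tensor(batch, dtype=torch.float32)
        if self.use_gpu:
            # Inputs must live on the same device as the weights.
            tensor = tensor.to(torch.device("cuda"))
        with torch.no_grad():
            output = self.model(tensor)
        # Detach, then copy back to host memory before converting to numpy.
        return output.detach().cpu().numpy()


# Usage sketch (requires a CUDA device when use_gpu=True):
# predictor = SimpleTorchPredictor(nn.Linear(4, 1), use_gpu=True)
# preds = predictor.predict(np.random.rand(8, 4).astype(np.float32))
```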

Follow ups:

#26249 is created in case our host has multiple GPU devices. It's a bit out of scope for this PR, but for GPU batch inference we should ideally be able to use all GPU devices on the host evenly while the CPU and DRAM are busy with pre-fetching and data movement to the GPU. We might approximate this by scheduling the same number of Predictor instances on the host, but that's worth verifying once benchmarks are set up.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jiaodong jiaodong requested review from amogkam and ericl July 2, 2022 01:29
@jiaodong jiaodong added the air label Jul 2, 2022
@jiaodong jiaodong added this to the Ray AIR milestone Jul 2, 2022
output = self.model(model_input)

def untensorize(torch_tensor):
-    numpy_array = torch_tensor.cpu().detach().numpy()
+    numpy_array = torch_tensor.detach().cpu().numpy()
@jiaodong (Member Author):

This is a minor nit improvement so we don't construct an autograd edge on the CPU copy and then detach it; the end result is the same.
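As a standalone illustration (not code from the PR), both orderings yield the same values; the difference is only whether the intermediate copy briefly carries an autograd edge:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = x * 2  # y carries an autograd edge back to x

# Previous order: .cpu() first makes a copy that still tracks gradients,
# which .detach() then cuts from the graph.
a = y.cpu().detach().numpy()

# New order: .detach() cuts the graph edge first, so the host copy is made
# outside autograd. The resulting values are identical.
b = y.detach().cpu().numpy()

assert (a == b).all()
```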

self.use_gpu = use_gpu
if use_gpu:
    # Ensure input tensor and model live on GPU for GPU inference
    self.model.to(torch.device("cuda"))
@jiaodong (Member Author):

Model init and config are moved to the constructor rather than being redone on every predict() call.

    with tf.device("GPU:0"):
        self.model = self.model_definition()
else:
    self.model = self.model_definition()
@jiaodong (Member Author):

Model init and config are moved to the constructor rather than being redone on every predict() call.
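A minimal, hypothetical sketch of constructor-level device placement on the TensorFlow side (names and structure are simplified assumptions, not the actual TensorflowPredictor code):

```python
import tensorflow as tf


class SimpleTFPredictor:
    """Illustrative stand-in: build the model once, in the constructor."""

    def __init__(self, model_definition, use_gpu: bool = False):
        self.model_definition = model_definition
        self.use_gpu = use_gpu
        if use_gpu:
            # Place the model's variables on the first visible GPU.
            with tf.device("GPU:0"):
                self.model = self.model_definition()
        else:
            self.model = self.model_definition()

    def predict(self, batch):
        # The model already exists, so predict() only runs the forward pass.
        return self.model(batch).numpy()


# Usage sketch:
# predictor = SimpleTFPredictor(
#     lambda: tf.keras.Sequential([tf.keras.layers.Dense(1)]),
#     use_gpu=False,
# )
# preds = predictor.predict(tf.random.uniform((8, 4)))
```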

python/ray/train/sklearn/sklearn_predictor.py — outdated review thread (resolved)
@jiaodong jiaodong marked this pull request as ready for review July 2, 2022 17:38
@jiaodong jiaodong added the tests-ok label Jul 3, 2022
@ericl (Contributor) left a comment:

Main comment is around edge case coverage: if a GPU is available and use_gpu=False, warn the user.
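A hypothetical sketch of the kind of check being suggested (the helper name and message are illustrative, not the PR's code):

```python
import warnings

import torch


def _maybe_warn_unused_gpu(use_gpu: bool) -> None:
    # If a CUDA device is visible but GPU inference was not requested,
    # let the user know predictions will run on the CPU.
    if not use_gpu and torch.cuda.is_available():
        warnings.warn(
            "A GPU is available but use_gpu=False; predictions will run on CPU."
        )
```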

@ericl ericl added the @author-action-required label and removed the tests-ok label Jul 5, 2022
@jiaodong jiaodong added the tests-ok label and removed the @author-action-required label Jul 5, 2022
@jiaodong jiaodong removed the tests-ok label Jul 7, 2022
@amogkam (Contributor) left a comment:

Thanks @jiaodong! Overall lgtm, but let's make sure to follow up on automatically setting use_gpu with PredictorDeployment!

@@ -39,6 +39,8 @@
- DATA_PROCESSING_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 ./ci/env/install-dependencies.sh
- pip install -Ur ./python/requirements_ml_docker.txt
- bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=gpu python/ray/air/...
- pip install -U horovod
Contributor:

Ah, instead of manually installing it here, you can set the INSTALL_HOROVOD=1 env var for install-dependencies.sh.

@jiaodong (Member Author):

oh nice let me change it

"//python/ray/air:__pkg__",
"//python/ray/air:__subpackages__",
"//python/ray/train:__pkg__",
"//python/ray/train:__subpackages__",
Contributor:

For my own understanding, what does this do?

@jiaodong (Member Author):

I noticed that Ray Train tests are actually using modules from the air folder, such as test_horovod. So I set up both the ML and AIR Bazel targets to be visible to each other, so tests can directly import and use them if needed.

@amogkam (Contributor) commented Jul 11, 2022

@jiaodong actually it looks like the lint failure is related. This PR removes test_huggingface_predictor from the BUILD file, so it's no longer being tested.

@jiaodong jiaodong added the tests-ok label Jul 11, 2022
@amogkam amogkam merged commit d95dc2f into ray-project:master Jul 11, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Labels
tests-ok — The tagger certifies test failures are unrelated and assumes personal liability.
3 participants