[air] Add horovod trainer. #23437
Conversation
LGTM! I think we should add an example and also show how to use it with the predictors.
```python
    scaling_config=scaling_config,
)
trainer.fit()
```
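For context, a minimal sketch of the call site being discussed, assuming the HorovodTrainer API this PR adds (the import path, the config keys, and the dict form of scaling_config are assumptions based on the sibling TorchTrainer):

```python
# Sketch only: import path and config values are assumptions.
from ray.train.horovod import HorovodTrainer

def train_loop_per_worker(config):
    # hvd.init(), model/optimizer setup, and the epoch loop go here.
    ...

trainer = HorovodTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_epochs": 10},
    scaling_config={"num_workers": 2},  # dict form assumed here
)
result = trainer.fit()
```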
Could we also e2e test with TorchPredictor, like what we have in test_torch_trainer?
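A hedged sketch of what such an e2e check could look like, assuming the TorchPredictor flow exercised in test_torch_trainer (the import path may differ in the version this PR targets):

```python
# Sketch: assumes a fitted trainer and the TorchPredictor checkpoint API.
import numpy as np
from ray.train.torch import TorchPredictor

result = trainer.fit()
predictor = TorchPredictor.from_checkpoint(result.checkpoint)
predictions = predictor.predict(np.array([[1.0], [2.0]], dtype=np.float32))
```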
I find it very hard to use the predictor interface for this image classification problem.
For the sake of verifying the training process, I just use the native PyTorch DataLoader and Tensor machinery (and not the predictor).
I may need to add a simple linear training setup if we want to cover the predictor part in an e2e fashion for Horovod.
Although in terms of test coverage, I think the corresponding test_pytorch/tensorflow_trainer already has it covered.
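In case it helps, a minimal sketch of the native-PyTorch verification being described; `model` and `test_dataset` are placeholders for the trained model and test data, not names from this PR:

```python
# Sketch: plain PyTorch evaluation instead of the predictor interface.
import torch
from torch.utils.data import DataLoader

test_loader = DataLoader(test_dataset, batch_size=64)  # test_dataset is a placeholder
model.eval()
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        pred = model(data).argmax(dim=1)
        correct += (pred == target).sum().item()
accuracy = correct / len(test_loader.dataset)
```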
Left a few comments, but overall looks good to me!
```python
num_epochs = config.get("num_epochs", 10)
log_interval = config.get("log_interval", 10)
use_cuda = config.get("use_cuda", False)
save_model_as_dict = config.get("save_model_as_dict", False)
```
Seems like this is always going to be False?
Hmm, I have a test where save_model_as_dict is True, so that we can test that path as well. So it should be taking effect.
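For context, this is the pattern being discussed: the default comes from config.get above, and a test can flip it through the trainer's loop config. A sketch, with the trainer construction details assumed rather than taken from the diff:

```python
# Sketch: overriding the flag via train_loop_config, as the test presumably does.
trainer = HorovodTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"save_model_as_dict": True},  # exercises the state-dict path
    scaling_config={"num_workers": 2},
)
trainer.fit()
```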
Ah, got it.
```
.. code-block:: python

    class Net(nn.Module):
```
Can we use a simpler example for the docstring? 🙂
I think we can just copy the one from TorchTrainer and add in the hvd.init(), hvd.DistributedOptimizer, etc. lines.
Yes, good point.
I updated it with a simple linear example instead.
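For reference, a sketch of what such a linear example might look like, i.e. a TorchTrainer-style training loop with the Horovod-specific lines added; the exact details of the docstring example are assumptions:

```python
# Sketch: simple linear regression with the Horovod lines added.
import horovod.torch as hvd
import torch
import torch.nn as nn

def train_loop_per_worker(config):
    hvd.init()  # Horovod-specific: set up the Horovod context
    model = nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # Horovod-specific: wrap the optimizer and sync initial weights from rank 0
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    loss_fn = nn.MSELoss()
    x = torch.randn(32, 1)
    y = 2 * x + 1
    for _ in range(config.get("num_epochs", 3)):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```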
LGTM! Just a few minor comments.
LGTM - I think we can do some extra work down the line to clean up the example, but it doesn't affect the implementation of the Trainer.
Why are these changes needed?
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.