
[Train] Torch data transfer automatic conversion #20333

Merged
merged 43 commits into from
Nov 15, 2021

Conversation

@amogkam (Contributor) commented Nov 13, 2021

When saving a model in a checkpoint or reporting a model, the user currently has to manually extract the underlying module from the DDP wrapper and move it to CPU so that it can be properly deserialized on the driver.

This PR adds functionality to do this automatically, so the user does not have to add that logic to their training script.

TODO: Possibly use this logic not just for checkpoints, but for return values as well.
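For illustration, a minimal sketch of the manual conversion a user currently has to perform before checkpointing. This is not Ray's actual helper; the function name and structure here are hypothetical:

```python
import torch
from torch.nn.parallel import DistributedDataParallel

def unwrap_for_checkpoint(model: torch.nn.Module) -> torch.nn.Module:
    # Hypothetical sketch of the logic this PR automates.
    # Unwrap the DDP container, if present, so the checkpoint holds the
    # plain module rather than the distributed wrapper.
    if isinstance(model, DistributedDataParallel):
        model = model.module
    # Move parameters to CPU so the checkpoint can be deserialized on a
    # driver process that may not have a GPU.
    return model.cpu()
```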


Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Resolved review threads on:
  • python/ray/train/session.py
  • python/ray/train/trainer.py
  • python/ray/train/backend.py
  • python/ray/train/tests/test_gpu.py
  • python/ray/train/tests/test_trainer.py
  • python/ray/train/torch.py
@amogkam amogkam changed the title [Train] Torch save_checkpoint automatic conversion [Train] Torch data transfer automatic conversion Nov 14, 2021
@matthewdeng (Contributor) left a comment:

LGTM!

@@ -66,13 +122,14 @@ class BackendExecutor:
    def __init__(
        self,
        backend_config: BackendConfig,
        backend: Backend,
A reviewer (Contributor) commented:

nit: Undo this change since it doesn't make sense to have conflicting BackendConfig and Backend.

@amogkam (Contributor, Author) replied:

Ok, I made Backend a singleton, so it is fine to instantiate any of the backends any number of times, and we no longer need to pass around a single instance.
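For context, one common way to implement this pattern is a per-subclass singleton via `__new__`. This is an illustrative sketch under that assumption, not Ray's actual implementation; the class names mirror the discussion but the code is hypothetical:

```python
class Backend:
    # Cache of one instance per concrete subclass, so calling the
    # constructor repeatedly always returns the same object.
    _instances = {}

    def __new__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__new__(cls)
        return cls._instances[cls]


class TorchBackend(Backend):
    """Illustrative subclass; instantiating it any number of times is cheap."""
```

With this structure, any call site can construct the backend it needs without threading a shared instance through the API.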

Resolved review threads on:
  • python/ray/train/trainer.py
  • python/ray/train/tests/test_session.py