
[Train] Torch data transfer automatic conversion #20333

Merged
merged 43 commits into from
Nov 15, 2021

Conversation

@amogkam (Contributor) commented Nov 13, 2021

When saving a model in a checkpoint or reporting a model, the user currently has to manually extract the underlying module from the DDP wrapper and move it to CPU so that it can be properly deserialized on the driver.

This PR adds functionality to do this automatically, so the user does not have to add that logic to their training script.

TODO: Possibly use this logic not just for checkpoints, but for return values as well.
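For illustration, a minimal sketch of the manual conversion a user currently has to perform before checkpointing. This is not Ray's actual helper; the function name and structure here are hypothetical:

```python
import torch
from torch.nn.parallel import DistributedDataParallel

def unwrap_for_checkpoint(model: torch.nn.Module) -> torch.nn.Module:
    # Hypothetical sketch of the logic this PR automates.
    # Unwrap the DDP container, if present, so the checkpoint holds the
    # plain module rather than the distributed wrapper.
    if isinstance(model, DistributedDataParallel):
        model = model.module
    # Move parameters to CPU so the checkpoint can be deserialized on a
    # driver process that may not have a GPU.
    return model.cpu()
```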


Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Resolved review threads on:
  • python/ray/train/session.py
  • python/ray/train/trainer.py
  • python/ray/train/backend.py
  • python/ray/train/tests/test_gpu.py
  • python/ray/train/tests/test_trainer.py
  • python/ray/train/torch.py
@amogkam amogkam changed the title [Train] Torch save_checkpoint automatic conversion [Train] Torch data transfer automatic conversion Nov 14, 2021
@matthewdeng (Contributor) left a comment:

LGTM!

@@ -66,13 +122,14 @@ class BackendExecutor:
    def __init__(
        self,
        backend_config: BackendConfig,
        backend: Backend,
A reviewer (Contributor) commented:

nit: Undo this change since it doesn't make sense to have conflicting BackendConfig and Backend.

@amogkam (Contributor, Author) replied:

Ok, I made Backend a singleton, so it is fine to instantiate any of the backends any number of times, and we no longer need to pass around a single instance.
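For context, one common way to implement this pattern is a per-subclass singleton via `__new__`. This is an illustrative sketch under that assumption, not Ray's actual implementation; the class names mirror the discussion but the code is hypothetical:

```python
class Backend:
    # Cache of one instance per concrete subclass, so calling the
    # constructor repeatedly always returns the same object.
    _instances = {}

    def __new__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__new__(cls)
        return cls._instances[cls]


class TorchBackend(Backend):
    """Illustrative subclass; instantiating it any number of times is cheap."""
```

With this structure, any call site can construct the backend it needs without threading a shared instance through the API.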

Resolved review threads on:
  • python/ray/train/trainer.py
  • python/ray/train/tests/test_session.py