[Train] Strip "module." from state dict #30705

Merged

Conversation

@Yard1 (Member) commented Nov 28, 2022

Signed-off-by: Antoni Baum [email protected]

Why are these changes needed?

This PR adds logic to automatically strip the "module." prefix from a user-saved state dict in TorchCheckpoint. That prefix is present whenever a user obtains the state dict directly from a DistributedDataParallel module. We already unwrap the underlying module when a user saves the model object itself, so this merely makes the logic consistent.

This PR also edits our examples to remove instances where this stripping was done in the example itself. That pattern caused issues when train.torch.prepare_model was used with num_workers=1 (e.g. on Google Colab), because the model was not wrapped in DistributedDataParallel and therefore had no .module attribute.
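A minimal sketch of the kind of prefix stripping described above (illustrative only; the helper name below is hypothetical and not the exact code added to TorchCheckpoint):

```python
from typing import Dict

import torch


def _strip_module_prefix(
    state_dict: Dict[str, torch.Tensor], prefix: str = "module."
) -> Dict[str, torch.Tensor]:
    """Return a copy of ``state_dict`` with a leading ``prefix`` removed from each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# A state dict taken directly from a DistributedDataParallel wrapper has keys
# such as "module.weight" and "module.bias"; after stripping, the keys match
# the bare nn.Module again, so bare_model.load_state_dict(...) succeeds.
```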

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -111,6 +119,8 @@ def train_func():
assert predictions.count() == 3


# We can't really test for prepare_model here as we can't detect what the user
# has saved without loading (and thus triggering the exception anyway)

Contributor:

for my understanding, can you elaborate on why prepare_model causes this test to fail?

Member Author (@Yard1):

prepare_model will wrap the model in DDP. If the user doesn't manually unwrap it before saving, DDP will throw an exception when the saved object is loaded later.
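For context, a minimal sketch of the saving pattern being discussed (an assumed shape of a Ray Train training function, not the test's actual code):

```python
import torch.nn as nn
from ray.train.torch import prepare_model


def train_func():
    model = nn.Linear(4, 1)
    # With more than one worker, prepare_model wraps the model in
    # DistributedDataParallel; with a single worker it may leave it unwrapped.
    model = prepare_model(model)

    # The unwrap-before-saving pattern the examples used to contain. It breaks
    # when the model was never wrapped (no .module attribute), which is the
    # failure mode the automatic "module." stripping now sidesteps.
    if isinstance(model, nn.parallel.DistributedDataParallel):
        state_dict = model.module.state_dict()
    else:
        state_dict = model.state_dict()
```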

Contributor:

Sorry, I guess what I mean is: why is it not going through the _encode_dict path?

Member Author (@Yard1, Dec 7, 2022):

If a checkpoint is created from a directory, we aren't really able to detect what's actually in the files without deserializing them first (which would not only add overhead but also trigger the error anyway), and we can't apply _encode_dict to already serialized data.
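A hedged illustration of the dict-versus-directory distinction (Ray AIR-era Checkpoint API; the file name and the use of torch.save here are assumptions, not the test's code):

```python
import os
import tempfile

import torch
import torch.nn as nn
from ray.air.checkpoint import Checkpoint

model = nn.Linear(4, 1)

# Dict checkpoint: the contents are in-memory Python objects, so logic such as
# _encode_dict (or the new "module." stripping) can inspect and rewrite the
# state dict before anything is serialized.
dict_ckpt = Checkpoint.from_dict({"model": model.state_dict()})

# Directory checkpoint: the contents are opaque serialized files. Finding out
# what is inside (a bare state dict, a "module."-prefixed one, or a pickled
# DDP module) would require torch.load-ing the files first, which adds
# overhead and, in the DDP-module case, triggers the very error discussed above.
tmpdir = tempfile.mkdtemp()
torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
dir_ckpt = Checkpoint.from_directory(tmpdir)
```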

Contributor:

then for this dir checkpoint, why does it get deserialized in the first place?

Member Author:

Well, we don't have a native way of supporting torch models from files (as mentioned by the TODO in this test). Therefore, the test implements its own predictor. Using dir checkpoints with torch is not what we want users to do right now, but the purpose of this test is to make sure that it works regardless.

Member Author:

We can add prepare_model here but we'd have to unwrap the model before saving anyway, meaning we wouldn't really test anything extra here.

Additional review threads on python/ray/air/_internal/torch_utils.py (resolved).
@Yard1 Yard1 requested a review from amogkam December 7, 2022 22:49
Signed-off-by: Antoni Baum <[email protected]>
@amogkam amogkam merged commit 03acade into ray-project:master Dec 12, 2022
@Yard1 Yard1 deleted the train_checkpoint_strip_module_prefix branch December 12, 2022 21:52
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
Signed-off-by: Antoni Baum <[email protected]>
krfricke pushed a commit that referenced this pull request Jan 24, 2023
The regression was introduced by #30705.

Also added some documentation to TorchTrainer so users know there is quite a bit of magic happening :)

Tested manually in a workspace.
Follow-up PR to add stricter assertions to the test.

Signed-off-by: xwjiang2010 <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: tmynn <[email protected]>