[AIR] Add `TorchCheckpoint.from_state_dict` #27970

bveeramani · 2022-08-17T23:17:02Z

Signed-off-by: Balaji Veeramani [email protected]

Why are these changes needed?

PyTorch recommends saving state dictionaries instead of modules, but we don't support any way to do this.
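
For context, here is a minimal sketch of the state-dictionary workflow that PyTorch recommends, written in plain PyTorch rather than with the API added in this PR. The in-memory buffer and the variable names are illustrative assumptions, not code from the change itself.

```python
import io

import torch

# Illustrative model; any torch.nn.Module works the same way.
model = torch.nn.Linear(1, 1)

# Save only the state dictionary (parameter tensors), not the pickled module.
buffer = io.BytesIO()  # stand-in for a file on disk
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Loading requires constructing the architecture first, then filling it in.
restored = torch.nn.Linear(1, 1)
restored.load_state_dict(torch.load(buffer))
```

The catch is that the consumer has to supply the model definition when loading, which is why the test in this PR passes a fresh `torch.nn.Linear(1, 1)` to `get_model`.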

Related issue number

Closes #28158

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

amogkam · 2022-08-18T22:12:25Z

python/ray/train/torch/torch_checkpoint.py

+        *,
+        preprocessor: Optional["Preprocessor"] = None,
+    ) -> "TorchCheckpoint":
+        """Create a :class:`~ray.air.checkpoint.Checkpoint` that stores a model state


First line of docstring should be 1 line please!

I'm not sure if we should do that.

If the docstring summary is on one line, then the line is longer than 88 characters. This normally means we should shorten our summary, but the only reason the summary is long is the :class:`~ray.air.checkpoint.Checkpoint` Sphinx directive.

So, we either break that line limit convention or the one-line docstring summary convention. Given that the goal of the one-line summary convention is to have short summaries, I think it's fine to wrap summaries when we use Sphinx directives.

What do you think?

Also, I don't think there's any existing precedent in our code base. PyTorch wraps their summaries. For example, https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Bilinear

The formatting guidelines we use follow PEP 257:

Multi-line docstrings consist of a summary line just like a one-line docstring, followed by a blank line, followed by a more elaborate description. The summary line may be used by automatic indexing tools; it is important that it fits on one line and is separated from the rest of the docstring by a blank line.

https://peps.python.org/pep-0257/#multi-line-docstrings

I agree though, for long Sphinx directives, there doesn't seem to be an ideal solution.

cc @richardliaw @maxpumperla for thoughts
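
To make the trade-off in this thread concrete, here is a rough sketch using only the standard library. It is not code from this PR: `from_state_dict_wrapped` is a hypothetical stand-in, the word "dictionary." completing the summary is an assumption (the diff above is cut off after "state"), and the 8-space indentation is assumed from the method being defined inside a class.

```python
import inspect

MAX_LINE_LENGTH = 88  # line limit mentioned in this thread
DOCSTRING_PREFIX = " " * 8 + '"""'  # assumed indentation of a method docstring


def from_state_dict_wrapped():
    # Hypothetical stand-in; the second summary line is an assumed completion.
    """Create a :class:`~ray.air.checkpoint.Checkpoint` that stores a model state
    dictionary.

    Longer description goes here.
    """


def first_line_summary(obj):
    """Return the summary as a tool that reads only the first line would."""
    return inspect.getdoc(obj).splitlines()[0]


def first_paragraph_summary(obj):
    """Return everything up to the first blank line, joined onto one line."""
    paragraph = inspect.getdoc(obj).split("\n\n", 1)[0]
    return " ".join(paragraph.split())


# Length of the source line if the whole summary were kept on one line.
one_line_length = len(DOCSTRING_PREFIX + first_paragraph_summary(from_state_dict_wrapped))
exceeds_limit = one_line_length > MAX_LINE_LENGTH
```

Under these assumptions, a tool that reads only the first line gets a summary ending in "state", while a tool that reads up to the first blank line gets the full sentence. Keeping the sentence on one line pushes the source line past the limit, which is the conflict described above.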

amogkam · 2022-08-18T22:12:50Z

python/ray/train/torch/torch_checkpoint.py

-        """
-        checkpoint = cls.from_dict({PREPROCESSOR_KEY: preprocessor, MODEL_KEY: model})
-        return checkpoint
+        """  # noqa: E501


is this noqa necessary?

I think so. AFAIK there's no way to wrap

`Saving and Loading Models <https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict>`_.

in a way that the line is less than 88 characters.
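
A quick way to check this claim, sketched with `textwrap` from the standard library. The 8-space indentation is an assumption about where the link sits in the docstring. The sketch only shows that whitespace-based wrapping cannot help, since the URL token contains no spaces; it says nothing about other reStructuredText link syntaxes.

```python
import textwrap

MAX_LINE_LENGTH = 88
INDENT = " " * 8  # assumed docstring indentation inside a classmethod

LINK = (
    "`Saving and Loading Models "
    "<https://pytorch.org/tutorials/beginner/saving_loading_models.html"
    "#what-is-a-state-dict>`_."
)

# Wrap on whitespace only: never split inside the URL or at its hyphens.
wrapped = textwrap.wrap(
    LINK,
    width=MAX_LINE_LENGTH,
    initial_indent=INDENT,
    subsequent_indent=INDENT,
    break_long_words=False,
    break_on_hyphens=False,
)

longest = max(len(line) for line in wrapped)
still_too_long = longest > MAX_LINE_LENGTH
```

The URL ends up on a line of its own and that line is still over the limit, so some form of E501 suppression looks unavoidable if the link stays inline.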

amogkam · 2022-08-18T22:13:02Z

python/ray/train/tests/test_torch_checkpoint.py

+    checkpoint = TorchCheckpoint.from_state_dict(expected_state_dict)
+    actual_state_dict = checkpoint.get_model(torch.nn.Linear(1, 1)).state_dict()
+    assert actual_state_dict == expected_state_dict
+


should we add a test for from_model as well?

@justinvyu is on it. See #27056

Add TorchCheckpoint.from_state_dict

a4472e2

bveeramani assigned matthewdeng and amogkam Aug 17, 2022

bveeramani mentioned this pull request Aug 17, 2022

[AIR] Replace TorchCheckpoint.from_model with TorchCheckpoint.from_state_dict #27971

Closed

amogkam approved these changes Aug 18, 2022

bveeramani added 2 commits August 18, 2022 12:33

Address review comments

45edb90

Fix stuff

818462b

bveeramani mentioned this pull request Aug 18, 2022

[AIR] Raise ValueError if TorchCheckpoint can't serialize model #27998

Closed

9 tasks

amogkam reviewed Aug 18, 2022

bveeramani mentioned this pull request Aug 18, 2022

[AIR] [Docs] Update "Training a Torch classifier" #28002

Merged

8 tasks

bveeramani added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) Aug 18, 2022

bveeramani self-assigned this Aug 30, 2022

amogkam merged commit dad98dc into ray-project:master Aug 30, 2022

bveeramani deleted the bveeramani/torch-checkpoint-state-dict branch August 30, 2022 22:04
