[AIR] Raise `ValueError` if `TorchCheckpoint` can't serialize model #27998

bveeramani · 2022-08-18T21:02:25Z

Depends on:

Why are these changes needed?

If a model class is defined in the top-level environment (`__main__`, for example a script or a notebook cell), then Torch can't serialize the model. This PR adds a clearer error message.

Before:
If you're using more than one worker:

RuntimeError: Some workers returned results while others didn't. Make sure that session.report() (legacy API:train.report() and train.save_checkpoint()) are called the same number of times on all workers.

If you're using one worker:

_pickle.PicklingError: Can't pickle <class '__main__.Identity'>: attribute lookup Identity on __main__ failed

After:

ValueError: TorchCheckpoint can't serialize model of type Identity because Identity is defined in the top-level environment. To work around this error, call TorchCheckpoint.from_state_dict instead of TorchCheckpoint.from_model. Alternatively, move the definition of Identity to a different module.
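To illustrate the kind of early check that could produce a message like the one above, here is a minimal sketch using only the standard library. `raise_if_defined_in_main` and the plain `Identity` class are hypothetical stand-ins and are not taken from this PR's diff; the PR itself may detect the condition differently.

```python
def raise_if_defined_in_main(obj):
    """Hypothetical helper: fail early with an actionable message.

    Assumption: a class whose __module__ is "__main__" cannot be looked up
    again by a process that did not run the same script or notebook.
    """
    cls = type(obj)
    if cls.__module__ == "__main__":
        raise ValueError(
            f"Can't serialize object of type {cls.__name__} because "
            f"{cls.__name__} is defined in the top-level environment. "
            "Serialize plain data such as a state dict instead, or move "
            f"the definition of {cls.__name__} to a different module."
        )


class Identity:
    # Simulate a class defined in a script or notebook cell.
    __module__ = "__main__"

    def __call__(self, x):
        return x


if __name__ == "__main__":
    try:
        raise_if_defined_in_main(Identity())
    except ValueError as error:
        print(f"ValueError: {error}")
```

The check runs before any pickling is attempted, so the message does not depend on how many workers are involved.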

Related issue number

Closes #27922

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

bveeramani · 2022-08-18T21:13:06Z

python/ray/train/tests/test_torch_checkpoint.py

+def test_from_model_value_error():
+    class StubModel(torch.nn.Module):
+        __module__ = "__main__"
+
+        def forward(self, x):
+            return x
+
+    model = StubModel()
+    with pytest.raises(ValueError):
+        TorchCheckpoint.from_model(model)


This test makes assumptions about the implementation of from_model, but I wasn't sure how else to test it.

python/ray/train/torch/torch_checkpoint.py

richardliaw · 2022-08-19T08:30:59Z

This particular catch-warn mechanism seems very niche.

Can we figure out how to get the _pickle.PicklingError: Can't pickle <class '__main__.Identity'>: attribute lookup Identity on __main__ failed error to raise even on multiple ~~machines~~ workers?

bveeramani · 2022-08-19T18:24:25Z

> This particular catch-warn mechanism seems very niche.
>
> Can we figure out how to get the _pickle.PicklingError: Can't pickle <class '__main__.Identity'>: attribute lookup Identity on __main__ failed error to raise even on multiple machines?

What error did you get when you ran it on multiple machines?

richardliaw · 2022-08-27T09:06:29Z

Sorry typo -- meant workers.

bveeramani · 2022-08-30T01:16:20Z

> This particular catch-warn mechanism seems very niche.

@richardliaw can you elaborate on this? The try-catch mechanism works with any number of workers, and it works with both Jupyter notebooks and Python programs.

Also, the error _pickle.PicklingError: Can't pickle <class '__main__.Identity'>: attribute lookup Identity on __main__ failed isn't actionable. It isn't clear what the user needs to do.

matthewdeng · 2022-08-30T17:35:36Z

@bveeramani I think Richard's point is that we should start with the generalizable problem of surfacing the correct error, rather than:

RuntimeError: Some workers returned results while others didn't. Make sure that session.report() (legacy API:train.report() and train.save_checkpoint()) are called the same number of times on all workers.

For this particular error, it is niche but I'm wondering if this should instead be handled as part of a try/catch.
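As a rough illustration of "surfacing the correct error", the sketch below runs several workers and re-raises a worker's own exception instead of reporting a mismatch in results. It uses `concurrent.futures` from the standard library, not Ray; `run_workers`, `good_worker`, and `bad_worker` are hypothetical names, and this is not how Ray Train is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workers(worker_fns):
    """Hypothetical driver: run each worker and surface the first real failure."""
    with ThreadPoolExecutor(max_workers=len(worker_fns)) as pool:
        futures = [pool.submit(fn) for fn in worker_fns]
        # Wait for every worker to finish and collect its exception (or None),
        # so that a missing result cannot mask the root cause.
        errors = [future.exception() for future in futures]
    for error in errors:
        if error is not None:
            # Re-raise the worker's own exception rather than a generic
            # "some workers returned results while others didn't" error.
            raise error
    return [future.result() for future in futures]


def good_worker():
    return "checkpoint"


def bad_worker():
    raise TypeError("simulated serialization failure on this worker")


if __name__ == "__main__":
    print(run_workers([good_worker, good_worker]))
    try:
        run_workers([good_worker, bad_worker])
    except TypeError as error:
        print(f"Surfaced worker error: {error}")
```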

bveeramani · 2022-08-30T18:31:04Z

> @bveeramani I think Richard's point is that we should start with the generalizable problem of surfacing the correct error and not
> RuntimeError: Some workers returned results while others didn't. Make sure that session.report() (legacy API:train.report() and train.save_checkpoint()) are called the same number of times on all workers.

I agree we should fix the general problem. That being said, I still think we should raise a ValueError. The "correct error" is confusing and unactionable.

> For this particular error, it is niche but I'm wondering if this should instead be handled as part of a try/catch.

How could we handle this as part of a try/catch?

matthewdeng · 2022-08-30T18:49:21Z

If we can fix the general problem, then we have access to the original error. We then catch this particular error and raise the ValueError.
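A minimal sketch of that catch-and-re-raise idea, assuming the original pickling error is available to the caller. It uses only the standard `pickle` module; `serialize_or_raise` and `Unimportable` are hypothetical names, and the message text is illustrative rather than copied from the PR.

```python
import pickle


def serialize_or_raise(obj):
    """Hypothetical wrapper: translate pickling failures into a ValueError."""
    try:
        return pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError) as exc:
        # Depending on the Python version and the kind of object, a failed
        # class lookup surfaces as PicklingError or AttributeError.
        name = type(obj).__name__
        raise ValueError(
            f"Can't serialize object of type {name} because pickle could "
            f"not look up {name} by module and name. Serialize plain data "
            f"instead, or move the definition of {name} to an importable "
            "module."
        ) from exc


class Unimportable:
    # Point the class at a module that does not exist so that pickle's
    # lookup of the class fails deterministically in this example.
    __module__ = "hypothetical_missing_module"


if __name__ == "__main__":
    try:
        serialize_or_raise(Unimportable())
    except ValueError as error:
        print(f"ValueError: {error}")
```

Chaining with `from exc` keeps the original pickling error in the traceback, so the actionable message does not hide the root cause.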

stale · 2022-09-30T18:17:18Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

bveeramani · 2022-10-03T21:10:19Z

Closing for now. Can re-open when #27922 is fixed

bveeramani added 7 commits August 17, 2022 16:15

Add TorchCheckpoint.from_state_dict

a4472e2

Address review comments

45edb90

Merge branch 'bveeramani/torch-checkpoint-state-dict' into bveeramani…

565a811

…/pickle-bug

Add ValueError

4323571

Fix stuff

818462b

Merge branch 'bveeramani/torch-checkpoint-state-dict' into bveeramani…

f0422ea

…/pickle-bug

Update test_torch_checkpoint.py

8600fe4

bveeramani assigned amogkam Aug 18, 2022

amogkam reviewed Aug 18, 2022

python/ray/train/torch/torch_checkpoint.py (resolved)

bveeramani mentioned this pull request Aug 18, 2022

[AIR] [Docs] Update "Training a Torch classifier" #28002

Merged

8 tasks

Update torch_checkpoint.py

504bc0f

bveeramani self-assigned this Aug 30, 2022

amogkam assigned amogkam and unassigned amogkam Aug 30, 2022

bveeramani added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 30, 2022

bveeramani removed their assignment Sep 23, 2022

stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 30, 2022

bveeramani closed this Oct 3, 2022

stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 3, 2022
