
[AIR] Skip checkpoint cast if checkpoint is same type #28935

Merged

Conversation

@bveeramani (Member) commented Sep 30, 2022:

Signed-off-by: Balaji Veeramani [email protected]

Why are these changes needed?

Checkpoint attributes are reset when a checkpoint is passed to TensorflowPredictor.from_checkpoint.

TensorflowPredictor.from_checkpoint calls TensorflowCheckpoint.from_checkpoint, which creates a new checkpoint with default object attributes -- even if the passed-in checkpoint is already a TensorflowCheckpoint.
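
For illustration, a minimal self-contained sketch of the failure mode using simplified stand-in classes (not Ray's actual implementation; all names are illustrative):

class Checkpoint:
    def __init__(self, data_dict=None):
        self._data_dict = data_dict

    @classmethod
    def from_checkpoint(cls, other):
        # Pre-PR behavior: always re-construct, even when `other` is already a `cls`.
        return cls(data_dict=other._data_dict)

class TensorflowCheckpoint(Checkpoint):
    def __init__(self, data_dict=None):
        super().__init__(data_dict)
        self._flavor = "default"  # stand-in for subclass-specific state

ckpt = TensorflowCheckpoint({"weights": [1, 2, 3]})
ckpt._flavor = "saved_model"  # non-default state set after construction
recast = TensorflowCheckpoint.from_checkpoint(ckpt)
print(recast._flavor)  # "default" -- the non-default attribute was silently reset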

These changes are needed to merge #28474.

Related issue number

See #28474 and #26777.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@bveeramani changed the title [AIR] Skip checkpoint cast if checkpoint is correct type → [AIR] Skip checkpoint cast if checkpoint is same type on Sep 30, 2022
@@ -467,6 +467,9 @@ def from_checkpoint(cls, other: "Checkpoint") -> "Checkpoint":
>>> model = checkpoint.get_model() # doctest: +SKIP
Linear(in_features=1, out_features=1, bias=True)
"""
if type(other) is cls:
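
The inline diff is truncated in this capture; based on the PR title, the added branch presumably just returns the checkpoint unchanged, along the lines of:

if type(other) is cls:
    return other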
A member commented on the diff:

This behavior makes sense to me, but I need another pair of eyes with full checkpoint context. @krfricke and @xwjiang2010?

@krfricke (Contributor) commented Oct 3, 2022:

Main question here is whether we should be less strict by using isinstance(other, cls), which would also return True for subclasses.

@matthewdeng (Contributor) commented:

Can you explain why it's created with default attributes? Is that not a bug?

@bveeramani (Member, Author) commented Sep 30, 2022:

> Can you explain why it's created with default attributes? Is that not a bug?

It's sort of a bug; more precisely, the behavior of Checkpoint.from_checkpoint is poorly defined.

TensorflowPredictor.from_checkpoint casts the input checkpoint to a TensorflowCheckpoint by calling TensorflowCheckpoint.from_checkpoint. This is necessary to avoid the errors described in #28134.

checkpoint = TensorflowCheckpoint.from_checkpoint(checkpoint)
model_weights = checkpoint.get_model_weights()
preprocessor = checkpoint.get_preprocessor()

Attributes are restored at the from_dict / from_directory layer of abstraction. Because TensorflowCheckpoint.from_checkpoint creates a new checkpoint with the constructor as opposed to from_dict / from_directory, the attributes aren't restored.

from typing import Type, TypeVar

T = TypeVar("T", bound=Checkpoint)

def from_checkpoint(cls: Type[T], other: Checkpoint) -> T:
    # Re-constructs via the constructor, so subclass attributes fall back to defaults.
    return cls(
        local_path=other._local_path,
        data_dict=other._data_dict,
        uri=other._uri,
        obj_ref=other._obj_ref,
    )

An alternative implementation would be to call from_* methods in from_checkpoint.

def from_checkpoint(cls: Type[T], other: Checkpoint) -> T:
    if other._data_dict:
        return cls.from_dict(other._data_dict)
    ...

The problem with this approach is that we'd also restore the checkpoint type.

>>> checkpoint: TorchCheckpoint = ...
>>> new_checkpoint = Checkpoint.from_checkpoint(checkpoint)
>>> type(new_checkpoint)  # Arguably should be `Checkpoint`
<class 'TorchCheckpoint'>

In any case, from_checkpoint is a short-term hack. We should handle it soon, either by removing it or by making it an implementation detail.

@matthewdeng (Contributor) commented:

Ah, thanks for the explanation.

> In any case, from_checkpoint is a short-term hack. We should handle it soon, either by removing it or by making it an implementation detail.

Can this be documented as a GitHub issue and/or in the code as a TODO? It would be great to have this context captured so we (a) know how to design a better long-term fix, and (b) can reasonably evaluate short-term fixes (e.g., this PR).

@krfricke (Contributor) left a comment:

LGTM, quick question before merge


@amogkam (Contributor) left a comment:

Actually thinking about it some more @bveeramani, do we even need to use from_checkpoint in predictors?

The predictors could use the passed-in checkpoints directly without needing to call from_checkpoint.

@bveeramani (Member, Author) commented:

> Main question here is whether we should be less strict by using isinstance(other, cls), which would also return True for subclasses.

@krfricke the decision was somewhat arbitrary, but the motivation is to make cls.from_checkpoint always return a cls.

Currently, if you do

>>> torch_checkpoint = TorchCheckpoint.from_state_dict(...)
>>> checkpoint = Checkpoint.from_checkpoint(torch_checkpoint)

you'd get back a Checkpoint

>>> type(checkpoint)
<class 'Checkpoint'>

but if we did isinstance(other, cls), you'd call Checkpoint.from_checkpoint but get a TorchCheckpoint back.

>>> type(checkpoint)
<class 'TorchCheckpoint'>
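
To make the distinction concrete, a standalone illustration with plain stand-in classes (independent of Ray):

class Checkpoint:
    pass

class TorchCheckpoint(Checkpoint):
    pass

ckpt = TorchCheckpoint()
print(type(ckpt) is Checkpoint)       # False: exact-type check rejects subclasses
print(type(ckpt) is TorchCheckpoint)  # True
print(isinstance(ckpt, Checkpoint))   # True: isinstance accepts subclasses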

@amogkam (Contributor) left a comment:

> Actually thinking about it some more @bveeramani, do we even need to use from_checkpoint in predictors?

Seems like this is still necessary until #28910 is fixed.

This PR looks good as a short-term fix.

@bveeramani (Member, Author) commented:

> Actually thinking about it some more @bveeramani, do we even need to use from_checkpoint in predictors?
>
> The predictors could use the passed-in checkpoints directly without needing to call from_checkpoint.

If we want to support TorchPredictor.from_checkpoint(checkpoint: Checkpoint) (as opposed to TorchPredictor.from_checkpoint(checkpoint: TorchCheckpoint)), then we need to call Checkpoint.from_checkpoint in predictors.
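
A minimal stand-in sketch of that call pattern (simplified classes, not Ray's actual code; names are illustrative):

class Checkpoint:
    def __init__(self, data=None):
        self.data = data or {}

    @classmethod
    def from_checkpoint(cls, other):
        # With this PR's change: skip the cast when the type already matches.
        return other if type(other) is cls else cls(other.data)

class TorchCheckpoint(Checkpoint):
    def get_model(self):
        return self.data.get("model")

class TorchPredictor:
    def __init__(self, model):
        self.model = model

    @classmethod
    def from_checkpoint(cls, checkpoint):
        # Cast so Torch-specific accessors exist even for a plain Checkpoint.
        checkpoint = TorchCheckpoint.from_checkpoint(checkpoint)
        return cls(checkpoint.get_model())

# Both a plain Checkpoint and a TorchCheckpoint are accepted:
print(TorchPredictor.from_checkpoint(Checkpoint({"model": "net"})).model)       # net
print(TorchPredictor.from_checkpoint(TorchCheckpoint({"model": "net"})).model)  # net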

But yeah, in any case, we should try to unblock #28474.

@krfricke merged commit e705f03 into ray-project:master on Oct 3, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022