Fix resuming from checkpoint when using RayFSDPStrategy #43594

angerrp · 2024-03-01T11:30:42Z

Why are these changes needed?

Restoring from a checkpoint when using FSDP is currently broken: the state_dict keys for each layer get truncated, so torch cannot match the weights to the layer names when loading. The current implementation always assumes that the layer keys in the state dict are prefixed with "_forward_module." and slices each key by the length of that prefix, whether or not the prefix is actually present.

The underlying reason why we remove the "_forward_module." prefix is unclear to me, but we should check that a key actually has the prefix before removing it. This PR implements that check, which fixes loading checkpoints when using RayFSDPStrategy.
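
A minimal sketch of the conditional stripping described above, not the exact code in this PR. The helper name `strip_forward_module_prefix` is hypothetical, and a plain dict stands in for the real state dict:

```python
PREFIX = "_forward_module."


def strip_forward_module_prefix(state_dict):
    """Return a copy of state_dict with PREFIX removed only from keys that start with it."""
    return {
        (key[len(PREFIX):] if key.startswith(PREFIX) else key): value
        for key, value in state_dict.items()
    }


# Prefixed keys are trimmed; keys without the prefix pass through unchanged.
prefixed = {"_forward_module.model.layer.weight": 1}
unprefixed = {"model.layer.weight": 1}
print(strip_forward_module_prefix(prefixed))
print(strip_forward_module_prefix(unprefixed))
```

Both calls should yield the same key, `model.layer.weight`, so a checkpoint loads regardless of whether the prefix was written.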

The following is an example of the wrong state_dict keys for a checkpoint:

# Keys in checkpoint["state_dict"]
der.embed_tokens.weight
der.embed_positions.weight
der.final_layer_norm.weight
...

Correct keys:

model.model.decoder.embed_tokens.weight
model.model.decoder.embed_positions.weight
model.model.decoder.final_layer_norm.weight
...
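
The truncated keys are consistent with unconditionally slicing off len("_forward_module.") == 16 characters. A short reproduction using only the key names from this example (the variable names are illustrative):

```python
PREFIX = "_forward_module."

correct_keys = [
    "model.model.decoder.embed_tokens.weight",
    "model.model.decoder.embed_positions.weight",
    "model.model.decoder.final_layer_norm.weight",
]

# Unconditional slicing drops the first 16 characters even when the prefix is absent.
sliced_keys = [key[len(PREFIX):] for key in correct_keys]
print(sliced_keys)
# ['der.embed_tokens.weight', 'der.embed_positions.weight', 'der.final_layer_norm.weight']
```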

Signed-off-by: Paul Angerer <[email protected]>

woshiyyya · 2024-03-04T20:13:39Z

Hi @dabauxi, the reason we did this is that Lightning previously had a _LightningModuleWrapperBase, which added the extra "_forward_module." prefix to the state_dict keys. Users needed to trim it manually to load the checkpoint correctly. See Lightning-AI/pytorch-lightning#16526.

In recent versions, Lightning trims the prefix internally, so there is no need to do it ourselves.

Thanks for the fix!

woshiyyya

LGTM

galyna-anyscale · 2024-03-04T22:14:26Z

@matthewdeng Please review this PR so it can be merged.

matthewdeng

LGTM. It seems this branching logic is needed to account for different versions of Lightning, which may or may not add the prefix.

angerrp added 2 commits March 1, 2024 12:32

Fix resuming from checkpoint due to slicing in all cases

2a37469

Signed-off-by: Paul Angerer <[email protected]>

Formatting

4bea70d

Signed-off-by: Paul Angerer <[email protected]>

angerrp force-pushed the fix-checkpoint-state-dict branch from 4d4bffd to 4bea70d March 1, 2024 11:32

anyscalesam requested review from woshiyyya and matthewdeng March 1, 2024 18:30

woshiyyya approved these changes Mar 4, 2024

matthewdeng approved these changes Mar 4, 2024

matthewdeng merged commit b5aee36 into ray-project:master Mar 4, 2024
9 checks passed

matthewdeng mentioned this pull request Mar 6, 2024

[Train] state_dict Key Truncation in RayFSDPStrategy leads to Unexpected key(s) Error #43744

Closed
