[Train] LightningTrainer enable checkpoint full dict with FSDP strategy #34967

woshiyyya · 2023-05-02T18:47:19Z

Why are these changes needed?

In PyTorch Lightning 2.0, when using the FSDP strategy, the checkpoint does not include the model state dictionary. This PR adds model unsharding logic, which gathers all model shards from the workers into rank 0 CPU memory.
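
For illustration, the unsharding step described above can be sketched with PyTorch's public FSDP state-dict API. This is a minimal sketch, not the code in this PR: the helper name `gather_full_state_dict` is hypothetical, and the non-FSDP fallback branch is an assumption added so the function can also be tried on a single process.

```python
from typing import Dict

import torch
from torch import nn
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)


def gather_full_state_dict(model: nn.Module) -> Dict[str, torch.Tensor]:
    """Hypothetical helper: return an unsharded state dict held in CPU memory."""
    if not any(isinstance(m, FSDP) for m in model.modules()):
        # Assumption for this sketch: nothing is sharded, so the ordinary
        # state dict is already complete and only needs to be moved to CPU.
        return {k: v.detach().cpu() for k, v in model.state_dict().items()}

    # offload_to_cpu=True keeps the gathered tensors off the GPU;
    # rank0_only=True materializes the full dict on rank 0 only.
    config = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, config):
        return model.state_dict()
```

With an FSDP-wrapped model, every rank has to call the helper because gathering the shards is a collective operation. Only rank 0 receives the populated dictionary.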

This is a prerequisite PR for the FSDP LLM fine-tuning example: #34990

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

gjoliver

A quick question.

python/ray/train/lightning/_lightning_utils.py

woshiyyya added 4 commits May 2, 2023 11:40

wip

dda06e9

Signed-off-by: woshiyyya <[email protected]>

fix lint

ec3130f

Signed-off-by: woshiyyya <[email protected]>

add error message

5beeaad

Signed-off-by: woshiyyya <[email protected]>

Merge remote-tracking branch 'upstream/master' into train/lightning_f…

449d8a5

…sdp_checkpoint_fulldict

woshiyyya marked this pull request as ready for review May 5, 2023 18:35

woshiyyya requested a review from gjoliver May 5, 2023 18:35

woshiyyya assigned gjoliver May 5, 2023

gjoliver reviewed May 5, 2023

python/ray/train/lightning/_lightning_utils.py

woshiyyya added 2 commits May 5, 2023 13:08

fix ci

02422ab

Signed-off-by: woshiyyya <[email protected]>

fix ci

c2649cd

Signed-off-by: woshiyyya <[email protected]>

gjoliver approved these changes May 5, 2023

gjoliver merged commit 3be4491 into ray-project:master May 5, 2023

architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023

[Train] LightningTrainer enable checkpoint full dict with FSDP strate…

01d9a79

…gy (ray-project#34967) Signed-off-by: woshiyyya <[email protected]>

ddelange added a commit to ddelange/autogluon that referenced this pull request Jul 3, 2023

Bump ray version

a4fabd7

pytorch_lightning 2.0 support added in 2.5 ray-project/ray#34967 ray_lightning deprecated ray-project/ray#36400

woshiyyya mentioned this pull request Apr 8, 2024

[Train] Disable gathering the full state dict in RayFSDPStrategy for lightning>2.1 #44569

Merged

8 tasks
