Improve universal checkpoint #5289

tohtana · 2024-03-17T08:50:23Z

This PR makes the following improvements to universal checkpointing.

Restoring step

A universal checkpoint saves the training step count taken from the engine. In #5263, we changed the loader to always write this count into the optimizer's per-parameter state (`optimizer_state['state'][param]['step']`) and into each param group. However, this approach does not restore the optimizer's state and param groups precisely, because optimizers handle `step` differently.

Torch's Adam doesn't create `step` in a param group and only uses `optimizer_state['state'][param]['step']`. Apex's fused Adam only uses `step` in a param group. DeepSpeed's fused Adam creates `step` in a param group but never updates it; it only uses `optimizer_state['state'][param]['step']`.
Consequently, always writing the step count to both locations leads to discrepancies between the restored and original states of the optimizer and param groups.

This PR modifies the restoration process so that the step number in the optimizer's state and param groups matches the original setup, aligning the restored and original optimizer states and param groups.
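The idea can be sketched with plain dictionaries laid out like an optimizer `state_dict()`. This is a minimal illustration, not the code in this PR: `restore_steps` and the sample dictionaries are hypothetical, and it assumes the saved checkpoint records `step` only where the original optimizer kept it.

```python
import copy


def restore_steps(target, saved):
    """Copy step counters and param-group fields from `saved` into a copy of `target`.

    `step` is written only where `saved` actually has it, so the restored
    layout mirrors the original optimizer. Hypothetical helper for illustration.
    """
    restored = copy.deepcopy(target)
    # Per-parameter state: restore `step` only if the original stored it there.
    for param_id, saved_state in saved["state"].items():
        if "step" in saved_state:
            restored["state"].setdefault(param_id, {})["step"] = saved_state["step"]
    # Param groups: restore every field except the parameter index list.
    for group, saved_group in zip(restored["param_groups"], saved["param_groups"]):
        for key, value in saved_group.items():
            if key != "params":
                group[key] = value
    return restored


# Layout resembling Torch's Adam: `step` lives only in the per-parameter state.
torch_like = {"state": {0: {"step": 100}},
              "param_groups": [{"lr": 1e-3, "params": [0]}]}
# Layout resembling Apex's fused Adam: `step` lives only in the param group.
apex_like = {"state": {0: {}},
             "param_groups": [{"lr": 1e-3, "step": 100, "params": [0]}]}

fresh = {"state": {0: {}}, "param_groups": [{"lr": 1e-3, "params": [0]}]}
restored_torch = restore_steps(fresh, torch_like)
restored_apex = restore_steps(fresh, apex_like)
```

In this sketch, the Torch-like restore ends up with `step` only in the per-parameter state, and the Apex-like restore ends up with `step` only in the param group, which is the alignment described above.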

Unit tests of DP size scaling

This PR also adds unit tests to verify universal checkpointing. They run training with DP, save a checkpoint, and convert it into a universal checkpoint. Then they load the checkpoint with a different DP size and validate that the parameters and the all-gathered (ZeRO 1/2) optimizer states match.
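The comparison across DP sizes can be pictured with a toy model in which a flat optimizer state is padded, split across ranks, and gathered again. This is only a sketch under assumptions: `shard` and `gather` are hypothetical stand-ins for partitioning and all-gather, not functions from DeepSpeed or from the tests in this PR.

```python
def shard(flat, dp_size, pad_value=0.0):
    """Split a flat state into `dp_size` equal shards, padding the tail."""
    per_rank = -(-len(flat) // dp_size)  # ceiling division
    padded = list(flat) + [pad_value] * (per_rank * dp_size - len(flat))
    return [padded[r * per_rank:(r + 1) * per_rank] for r in range(dp_size)]


def gather(shards, numel):
    """Concatenate shards from all ranks and drop the padding."""
    merged = [x for s in shards for x in s]
    return merged[:numel]


# A flat optimizer state (for example, Adam's exp_avg) with 5 elements.
exp_avg = [0.1, 0.2, 0.3, 0.4, 0.5]

# State as seen with the original DP size and with a different DP size.
before = gather(shard(exp_avg, 2), len(exp_avg))
after = gather(shard(exp_avg, 4), len(exp_avg))
```

Padding differs between DP sizes (one padded element for 2 ranks, three for 4 ranks here), so it has to be removed before the gathered states are compared.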

Fix bug of loading with load_optimizer_states=False

Previously, the loader did not load parameters from a universal checkpoint when load_optimizer_states=False. c8c0498 fixes this issue.
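The intended behavior after the fix can be modeled as follows. This toy loader is an assumption made for illustration: `load_universal` and the checkpoint dictionary are hypothetical and do not reflect DeepSpeed's internal loader.

```python
def load_universal(checkpoint, load_optimizer_states=True):
    """Toy loader: parameters are always restored, optimizer states only on request."""
    loaded = {"params": dict(checkpoint["params"])}
    if load_optimizer_states:
        loaded["optimizer"] = dict(checkpoint["optimizer"])
    return loaded


checkpoint = {"params": {"w": [1.0, 2.0]}, "optimizer": {"step": 100}}
params_only = load_universal(checkpoint, load_optimizer_states=False)
full = load_universal(checkpoint)
```

The point of the sketch is that parameter loading does not depend on the optimizer flag.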

deepspeed/checkpoint/ds_to_universal.py

deepspeed/checkpoint/universal_checkpoint.py

tohtana added 11 commits March 14, 2024 16:05

run conversion script

7684ae6

add no-parallel path

1ada96d

improve restoring step from universal checkpoint

fe61652

sort keys when comparing state dicts

2b15116

add unit test for universal checkpoint

575665b

fix loading with load_optimizer_states=False

c8c0498

add dp scaling test

8d2dbaa

remove pad for comparison

3547b4e

refactor test conditions

2b6f694

Merge branch 'master' into tohtana/unit_test_univ_cp

099133c

fix for torch adam

ccfba1a

tohtana marked this pull request as ready for review March 18, 2024 04:14

tohtana requested review from tjruwase, mrwyattii and loadams as code owners March 18, 2024 04:14

tjruwase reviewed Mar 18, 2024

deepspeed/checkpoint/ds_to_universal.py (outdated, resolved)

simplify argument

24484ad

tjruwase approved these changes Mar 18, 2024

tohtana added 3 commits March 19, 2024 22:00

fix for optimizer that doesn't have step in optimizer states

13effa1

add api to load global state to BF16 optimizer for compatibility

101e90c

restore all fields in param group

76aca75

tjruwase reviewed Mar 20, 2024

deepspeed/checkpoint/universal_checkpoint.py (outdated, resolved)

Merge branch 'master' into tohtana/unit_test_univ_cp

c4b2aaa

tohtana added this pull request to the merge queue Mar 27, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 27, 2024

tohtana added this pull request to the merge queue Mar 27, 2024

loadams removed this pull request from the merge queue due to a manual request Mar 27, 2024

tohtana added 3 commits March 28, 2024 01:19

Merge branch 'master' into tohtana/unit_test_univ_cp

140d704

move loading function to ZeROOptimizer

9909bd3

refactor to avoid circular import

5114233

tohtana added 2 commits March 28, 2024 01:58

fix format

301e79e

fix method calls

802397d

tohtana enabled auto-merge March 28, 2024 08:08

tohtana added this pull request to the merge queue Mar 28, 2024

Merged via the queue into microsoft:master with commit c56a4b9 Mar 28, 2024
14 checks passed
