Only half of parameters are saved when PP is applied #474

Open
dmammfl opened this issue Jul 22, 2024 · 7 comments
Labels
bug Something isn't working

Comments

dmammfl commented Jul 22, 2024

I'm currently training the Llama-3-8B model on 2 GPUs with pipeline parallelism (PP) only.
However, when I save a checkpoint on each rank, only half of that checkpoint is saved (layer 1 is saved, layer 2 is not, layer 3 is saved, layer 4 is not, ..., layer 15 is saved).

I think dcp.save only works well with DTensor, not plain Tensor. I need your insight on this. Thanks a lot!
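
For reference, the save path on each rank looks roughly like the following. This is a minimal sketch, not the actual training-script code; the toy module and checkpoint path are placeholders.

```python
# Minimal sketch of the per-rank save, assuming torch.distributed.checkpoint (DCP).
# The toy module stands in for one pipeline stage; the checkpoint path is illustrative.
import torch.distributed.checkpoint as dcp
from torch import nn

stage_module = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))  # stand-in for one PP stage

# With PP only, the state_dict holds plain torch.Tensor values, no DTensor.
state_dict = stage_module.state_dict()

# Each PP rank calls this with its own stage's state_dict.
dcp.save(state_dict, checkpoint_id="outputs/step-100")
```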

wanchaol (Contributor) commented:

Maybe we should add state_dict hooks to PP to emit DTensor on PP's submesh so that DCP works with PP alone? @fegin @H-Huang @wconstab
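
A rough sketch of what that could look like is below. The hook wiring, mesh setup, and especially the placement are assumptions for illustration, not an actual implementation; choosing the right placement for parameters that exist on only one PP rank is part of the open design question.

```python
# Sketch only: wrap each plain tensor in a PP stage's state_dict as a DTensor on the
# PP submesh, so DCP sees distributed metadata even when PP is the only parallelism.
import torch
from torch.distributed._tensor import DTensor, Replicate

def make_pp_state_dict_hook(pp_mesh):
    def hook(module, state_dict, prefix, local_metadata):
        for key, value in state_dict.items():
            if isinstance(value, torch.Tensor) and not isinstance(value, DTensor):
                # Placeholder placement: what the correct placement is for tensors
                # that live on only one PP rank is exactly the design question here.
                state_dict[key] = DTensor.from_local(value, pp_mesh, [Replicate()])
        return state_dict
    return hook

# Hypothetical wiring on each pipeline stage module:
# pp_mesh = torch.distributed.device_mesh.init_device_mesh("cuda", (2,), mesh_dim_names=("pp",))
# stage_module._register_state_dict_hook(make_pp_state_dict_hook(pp_mesh))
```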

tianyu-l added the bug label Jul 26, 2024

wconstab (Contributor) commented Aug 5, 2024

Hmm, we shouldn't really need DTensor to solve the problem of layer 0 being saved and layer 1 not being saved. The FQNs should be preserved and should not conflict, so we should be able to save both. From the pattern, I assume this is using virtual pipeline stages, so that layers 0, 2, 4, ... are on GPU 0, and only GPU 0 is correctly saving things?

In the 3D case with PP, we expect that GPU 0 would save DTensors that capture any TP/DP replication/sharding. However, we do not rely on DTensor to distinguish layer 0 from layer 1.

fegin (Contributor) commented Aug 5, 2024

dcp.save() works with both DTensor and plain Tensor. Rank 0 determines what each rank saves. If tensors are not duplicated (i.e., their FQNs are different), all the tensors will be saved.
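
A quick way to check whether that deduplication is what is kicking in here is to compare FQNs across ranks. The helper below is a diagnostic sketch (assuming the default process group is initialized), not part of DCP itself.

```python
# Diagnostic sketch: gather every rank's state_dict keys and report the FQNs that
# appear on all ranks. A non-empty overlap means the DCP coordinator will treat
# those entries as duplicates and save only one copy of each.
import torch.distributed as dist

def find_cross_rank_fqn_overlap(state_dict):
    local_keys = sorted(state_dict.keys())
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_keys)
    overlap = set(gathered[0])
    for keys in gathered[1:]:
        overlap &= set(keys)
    return overlap

# Hypothetical usage with each rank's pipeline-stage module:
# overlap = find_cross_rank_fqn_overlap(stage_module.state_dict())
# if dist.get_rank() == 0:
#     print(f"{len(overlap)} FQNs appear on every rank")
```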

dmammfl (Author) commented Aug 6, 2024

I tested this case and figured out several points:

  1. When only PP is applied with degree 2 and the model is about 15 GB, dcp.save should save 2 checkpoints of about 7.5 GB each, but each checkpoint is only about 3.7 GB.

  2. When only PP is applied, model.state_dict() on each rank contains exactly that rank's shard of the model params (rank 0 has the layer 0~15 params, rank 1 has the layer 16~31 params).

  3. In the case of rank 1, although it holds the layer 16~31 params, its key names are "model.layer.0.self_attn....", "model.layer.1.self_attn....", ..., "model.layer.15.self_attn....", exactly the same as rank 0's layer key names (except embed_token, lm_head, etc.).

  4. When I renamed the layer keys to "PP0_model.layer.0.self_attn...." (and did the same for rank 1: "PP1_model.layer.0.self_attn...."), all of the state_dicts were saved properly, at around 7.5 GB each (sketched at the end of this comment).

I think there is a key conflict in the _save_state_dict() method, so _save_state_dict() inside dcp.save() misbehaves.
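
For completeness, the renaming workaround from point 4 looks roughly like this. It is a sketch only: the helper name and the "PP{rank}_" prefix just mirror my experiment, and it is not a proper fix, since it changes the checkpoint key space.

```python
# Sketch of the key-prefixing workaround: disambiguate per-stage keys so the DCP
# coordinator no longer treats rank 1's tensors as duplicates of rank 0's.
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

def prefix_with_pp_rank(state_dict, pp_rank):
    return {f"PP{pp_rank}_{key}": value for key, value in state_dict.items()}

# Hypothetical usage with each rank's pipeline-stage module:
# state_dict = prefix_with_pp_rank(stage_module.state_dict(), dist.get_rank())
# dcp.save(state_dict, checkpoint_id="outputs/step-100")
```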

fegin (Contributor) commented Aug 6, 2024

Point 1 looks suspicious, and just as you mentioned, there are key conflicts. We have tested the non-virtual pipeline and there were no key conflicts. Any insight about this, @wconstab, @H-Huang?

wconstab (Contributor) commented Aug 6, 2024

Could you share the exact repro command so we can debug?

dmammfl (Author) commented Aug 8, 2024

I ran "run_llama_train.sh" with "pipeline_parallel_degree" set to 2 and all other parallel degrees set to 1.
