Only half of parameters are saved when PP is applied #474

Open
dmammfl opened this issue Jul 22, 2024 · 7 comments
Labels
bug Something isn't working

Comments

dmammfl commented Jul 22, 2024

I'm currently training the Llama-3-8B model on 2 GPUs with pipeline parallelism (PP) only.
However, when I save a checkpoint on each rank, only half of that checkpoint is saved (layer 1 is saved, layer 2 is not, layer 3 is saved, layer 4 is not, ..., layer 15 is saved).

I think dcp.save only works well with DTensor, not plain Tensor. I need your insight on this. Thanks a lot!
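
For reference, the save path on each rank looks roughly like the following. This is a minimal sketch, not the actual training-script code; the toy module and checkpoint path are placeholders.

```python
# Minimal sketch of the per-rank save, assuming torch.distributed.checkpoint (DCP).
# The toy module stands in for one pipeline stage; the checkpoint path is illustrative.
import torch.distributed.checkpoint as dcp
from torch import nn

stage_module = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))  # stand-in for one PP stage

# With PP only, the state_dict holds plain torch.Tensor values, no DTensor.
state_dict = stage_module.state_dict()

# Each PP rank calls this with its own stage's state_dict.
dcp.save(state_dict, checkpoint_id="outputs/step-100")
```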

wanchaol (Contributor) commented:

Maybe we should add state_dict hooks to PP to emit DTensor on PP's submesh so that DCP works with PP alone? @fegin @H-Huang @wconstab
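
A rough sketch of what that could look like is below. The hook wiring, mesh setup, and especially the placement are assumptions for illustration, not an actual implementation; choosing the right placement for parameters that exist on only one PP rank is part of the open design question.

```python
# Sketch only: wrap each plain tensor in a PP stage's state_dict as a DTensor on the
# PP submesh, so DCP sees distributed metadata even when PP is the only parallelism.
import torch
from torch.distributed._tensor import DTensor, Replicate

def make_pp_state_dict_hook(pp_mesh):
    def hook(module, state_dict, prefix, local_metadata):
        for key, value in state_dict.items():
            if isinstance(value, torch.Tensor) and not isinstance(value, DTensor):
                # Placeholder placement: what the correct placement is for tensors
                # that live on only one PP rank is exactly the design question here.
                state_dict[key] = DTensor.from_local(value, pp_mesh, [Replicate()])
        return state_dict
    return hook

# Hypothetical wiring on each pipeline stage module:
# pp_mesh = torch.distributed.device_mesh.init_device_mesh("cuda", (2,), mesh_dim_names=("pp",))
# stage_module._register_state_dict_hook(make_pp_state_dict_hook(pp_mesh))
```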

tianyu-l added the bug label Jul 26, 2024

wconstab (Contributor) commented Aug 5, 2024

Hmm, we shouldn't really need DTensor to solve the problem of layer 0 being saved and layer 1 not being saved. The FQNs should be preserved and should not conflict, so we should be able to save both. From the pattern, I assume this is using virtual pipeline stages, so that layers 0, 2, 4, ... are on GPU 0, and only GPU 0 is correctly saving things?

In the 3D case with PP, we expect that GPU 0 would save DTensors that capture any TP/DP replication/sharding. However, we do not rely on DTensor to distinguish layer 0 from layer 1.

fegin (Contributor) commented Aug 5, 2024

dcp.save() works with both DTensor and plain Tensor. Rank 0 determines what each rank saves. If tensors are not duplicated (i.e., their FQNs are different), all the tensors will be saved.
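
A quick way to check whether that deduplication is what is kicking in here is to compare FQNs across ranks. The helper below is a diagnostic sketch (assuming the default process group is initialized), not part of DCP itself.

```python
# Diagnostic sketch: gather every rank's state_dict keys and report the FQNs that
# appear on all ranks. A non-empty overlap means the DCP coordinator will treat
# those entries as duplicates and save only one copy of each.
import torch.distributed as dist

def find_cross_rank_fqn_overlap(state_dict):
    local_keys = sorted(state_dict.keys())
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_keys)
    overlap = set(gathered[0])
    for keys in gathered[1:]:
        overlap &= set(keys)
    return overlap

# Hypothetical usage with each rank's pipeline-stage module:
# overlap = find_cross_rank_fqn_overlap(stage_module.state_dict())
# if dist.get_rank() == 0:
#     print(f"{len(overlap)} FQNs appear on every rank")
```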

dmammfl (Author) commented Aug 6, 2024

I tested this case and figured out several points:

  1. When only PP is applied with degree 2 and the model is about 15 GB, dcp.save should save 2 checkpoints of about 7.5 GB each, but each checkpoint is only about 3.7 GB.

  2. When only PP is applied, model.state_dict() on each rank contains exactly that rank's shard of the model params (rank 0 has the layer 0~15 params, rank 1 has the layer 16~31 params).

  3. In the case of rank 1, although it holds the layer 16~31 params, its key names are "model.layer.0.self_attn....", "model.layer.1.self_attn....", ..., "model.layer.15.self_attn....", exactly the same as rank 0's layer key names (except embed_token, lm_head, etc.).

  4. When I renamed the layer keys to "PP0_model.layer.0.self_attn...." (and did the same for rank 1: "PP1_model.layer.0.self_attn...."), all of the state_dicts were saved properly, at around 7.5 GB each (sketched at the end of this comment).

I think there is a key conflict in the _save_state_dict() method, so _save_state_dict() inside dcp.save() misbehaves.
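
For completeness, the renaming workaround from point 4 looks roughly like this. It is a sketch only: the helper name and the "PP{rank}_" prefix just mirror my experiment, and it is not a proper fix, since it changes the checkpoint key space.

```python
# Sketch of the key-prefixing workaround: disambiguate per-stage keys so the DCP
# coordinator no longer treats rank 1's tensors as duplicates of rank 0's.
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

def prefix_with_pp_rank(state_dict, pp_rank):
    return {f"PP{pp_rank}_{key}": value for key, value in state_dict.items()}

# Hypothetical usage with each rank's pipeline-stage module:
# state_dict = prefix_with_pp_rank(stage_module.state_dict(), dist.get_rank())
# dcp.save(state_dict, checkpoint_id="outputs/step-100")
```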

fegin (Contributor) commented Aug 6, 2024

Point 1 looks suspicious, and just as you mentioned, there are key conflicts. We have tested the non-virtual pipeline and there were no key conflicts. Any insight about this, @wconstab, @H-Huang?

wconstab (Contributor) commented Aug 6, 2024

Could you share the exact repro command so we can debug?

dmammfl (Author) commented Aug 8, 2024

I ran "run_llama_train.sh" with "pipeline_parallel_degree" set to 2 and all other parallel degrees set to 1.
