
use initial_checkpoint_dir for continue-pretraining but can't load model correctly #1729

Closed
wodelt opened this issue Sep 18, 2024 · 6 comments · Fixed by #1735
Labels
documentation (Improvements or additions to documentation), enhancement (New feature or request), question (Further information is requested)

Comments

@wodelt

wodelt commented Sep 18, 2024

I want to continue pretraining my custom model on another dataset, so I only changed initial_checkpoint_dir in training.yaml to the latest-run checkpoint directory path, but it seems the model can't be loaded correctly:

[rank0]: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel:
[rank0]: Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.proj.weight", "transformer.h.0.norm_2.weight", "transformer.h.0.mlp.fc_1.weight", "transformer.h.0.mlp.fc_2.weight", "transformer.h.0.mlp.proj.weight", "transformer.h.1.norm_1.weight", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.proj.weight", "transformer.h.1.norm_2.weight", "transformer.h.1.mlp.fc_1.weight", "transformer.h.1.mlp.fc_2.weight", "transformer.h.1.mlp.proj.weight", "transformer.h.2.norm_1.weight", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.proj.weight", "transformer.h.2.norm_2.weight", "transformer.h.2.mlp.fc_1.weight", "transformer.h.2.mlp.fc_2.weight", "transformer.h.2.mlp.proj.weight", "transformer.h.3.norm_1.weight", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.proj.weight", "transformer.h.3.norm_2.weight", "transformer.h.3.mlp.fc_1.weight", "transformer.h.3.mlp.fc_2.weight", "transformer.h.3.mlp.proj.weight", "transformer.h.4.norm_1.weight", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.proj.weight", "transformer.h.4.norm_2.weight", "transformer.h.4.mlp.fc_1.weight", "transformer.h.4.mlp.fc_2.weight", "transformer.h.4.mlp.proj.weight", "transformer.h.5.norm_1.weight", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.proj.weight", "transformer.h.5.norm_2.weight", "transformer.h.5.mlp.fc_1.weight", "transformer.h.5.mlp.fc_2.weight", "transformer.h.5.mlp.proj.weight", "transformer.ln_f.weight".
[rank0]: Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".

I don't understand the error because I didn't change the model_config.
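For reference, the key mismatch above suggests the file holds the full training state rather than a bare model state_dict; loading it directly shows this (a sketch, with the path standing in for the latest-run checkpoint directory):

import torch

# Placeholder path: the lit_model.pth inside the latest-run checkpoint directory.
ckpt = torch.load("path/to/latest-run/lit_model.pth", map_location="cpu")

# Top-level keys are the training state, matching the "unexpected" keys above.
print(list(ckpt.keys()))  # ['model', 'optimizer', 'train_dataloader', 'iter_num', 'step_count']

# The weights themselves (the "missing" keys) sit one level down, under "model".
print(list(ckpt["model"].keys())[:3])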

@wodelt added the question label Sep 18, 2024
@fdalvi

fdalvi commented Sep 19, 2024

I've faced similar issues. I usually convert my models to HF format for other parts of my pipeline, and converting back from HF to LitGPT resolves this error.

Alternatively, https://github.com/Lightning-AI/litgpt/blob/main/litgpt/scripts/convert_pretrained_checkpoint.py also seems to be meant for this purpose. Perhaps you can try that while the maintainers reply with a more concrete solution!

@rasbt
Collaborator

rasbt commented Sep 19, 2024

Download some data

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

Download tokenizer

litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

Pretrain model

litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 1_000_000 \
  --out_dir out/custom-model

Continue pretraining the model

litgpt pretrain pythia-160m \
   --initial_checkpoint_dir out/custom-model/final \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir new_checkpoint \
   --data TextFiles \
   --data.train_data_path "custom_texts/"

results in

RuntimeError: Error(s) in loading state_dict for GPT:
        Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", ..., "transformer.h.5.mlp.proj.bias", "transformer.ln_f.weight", "transformer.ln_f.bias".
        Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".

The specific issue is that the pretraining run saves things like the iter_num etc. So, if you are continuing pretraining from an existing pretraining checkpoint (which is a bit different from a pretrained checkpoint downloaded from the hub), you need to provide the --resume option:

litgpt pretrain pythia-160m \
   --resume "auto" \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir out/custom-model-2 \
   --data TextFiles \
   --data.train_data_path "custom_texts/"

There may be other ways to do it with a conversion, as mentioned above.
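For intuition, such a conversion essentially unwraps the nested training state into a plain weights file that --initial_checkpoint_dir can consume. A rough sketch of that idea (not the actual litgpt implementation; the paths are just examples):

from pathlib import Path
import torch

src = Path("out/custom-model/final/lit_model.pth")           # training-state checkpoint (example path)
dst = Path("out/custom-model-weights-only/lit_model.pth")    # weights-only file for --initial_checkpoint_dir
dst.parent.mkdir(parents=True, exist_ok=True)

ckpt = torch.load(src, map_location="cpu")
state_dict = ckpt["model"] if "model" in ckpt else ckpt      # drop optimizer/dataloader/iter_num state
torch.save(state_dict, dst)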

@rasbt added the documentation and enhancement labels Sep 19, 2024
@fdalvi

fdalvi commented Sep 21, 2024

Thanks @rasbt! If someone is continuing with a different dataset (like the OP), would --resume still work? Wouldn't it try to load the train_dataloader state, number of steps, etc. from the previous dataset/run and cause issues?

@wodelt
Author

wodelt commented Sep 22, 2024

> Thanks @rasbt! If someone is continuing with a different dataset (like the OP), would --resume still work? Wouldn't it try to load the train_dataloader state, number of steps, etc. from the previous dataset/run and cause issues?

I've tried it, and it works with --resume "auto". The previous train_dataloader step count is inherited and continues to be counted. But here are some things to be aware of:

1. If you want to continue pretraining on a different dataset, you need to set --resume "auto" and make sure your out_dir doesn't change.

2. If you want to change out_dir, --resume "auto" can't load your previous checkpoint, because the new out_dir doesn't contain any checkpoint yet. And if you set --resume '/llama_tinystory2_en/step-00050000' manually, it causes issues:

[rank1]: ValueError: The path '/llama_tinystory2_en/step-00050000' does not point to a valid checkpoint. Make sure the path points to either a directory with FSDP checkpoint shards, or a single file with a full checkpoint.

You need to point --resume at the lit_model.pth path (/llama_tinystory2_en/step-00050000/lit_model.pth) to achieve the same effect as in point 1.

@fdalvi

fdalvi commented Sep 22, 2024

Thanks for trying @wodelt. I am still a bit concerned from my read of the code:

litgpt/litgpt/pretrain.py

Lines 216 to 233 in ef886a7

train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
train_dataloader, val_dataloader = fabric.setup_dataloaders(train_dataloader, val_dataloader)

if initial_checkpoint_dir:
    fabric.load_raw(initial_checkpoint_dir / "lit_model.pth", model)

state = {
    "model": model,
    "optimizer": optimizer,
    "train_dataloader": train_dataloader,
    "iter_num": 0,
    "step_count": 0,
}

resume = find_resume_path(resume, out_dir)
if resume:
    fabric.print(f"Resuming training from {resume}")
    fabric.load(resume, state)

In Line 216, a train_dataloader is initialized from the new paths in the config. However, Line 233 then loads something from the checkpoint into state -> train_dataloader. As you have seen, the iteration number is definitely loaded from the older dataloader. What I am unsure about is whether the old paths from the old dataset are also loaded into the "new" train_dataloader, effectively nulling out Line 216.
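For what it's worth, one way to check what actually ends up in the saved state is to load a step checkpoint and inspect it directly (a sketch; the path is just an example of a saved step from a run):

import torch

ckpt = torch.load("out/custom-model/step-00050000/lit_model.pth", map_location="cpu")

print(ckpt["iter_num"], ckpt["step_count"])  # progress counters saved with the run
print(ckpt["train_dataloader"])              # whatever state the dataloader saved; check whether dataset paths appear here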

I'm hoping @rasbt has better insight into this saving/loading.

@rasbt
Collaborator

rasbt commented Sep 23, 2024

Oh sorry, yes, you are right: if you want to train it on a different dataset, then you would not use the --resume option. Instead, you'd need to convert the checkpoint using the litgpt convert_pretrained_checkpoint utility. Let me provide a cleaned-up workflow below (I will also add this to the docs shortly):

Download some data

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

Download tokenizer

litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

Pretrain model

litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 1_000_000 \
  --out_dir out/custom-model

Continue pretraining the model in case the training was interrupted (using --resume):

litgpt pretrain pythia-160m \
   --resume "auto" \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir out/custom-model-2 \
   --data TextFiles \
   --data.train_data_path "custom_texts/"

Continue pretraining the model on a different dataset (requires model conversion step):

litgpt convert_pretrained_checkpoint out/custom-model/final/ out/custom-model-converted
scp -r  custom_texts/ custom_new_texts/
litgpt pretrain pythia-160m \
   --initial_checkpoint_dir out/custom-model-converted \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir new_checkpoint \
   --data TextFiles \
   --data.train_data_path "custom_new_texts/"
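As a quick sanity check (a sketch, assuming the single-file checkpoint layout from the run above), the converted directory should now contain a plain weights-only lit_model.pth:

import torch

state_dict = torch.load("out/custom-model-converted/lit_model.pth", map_location="cpu")
print(sorted(state_dict)[:5])  # expect parameter names like 'lm_head.weight', not 'model'/'optimizer'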
