use initial_checkpoint_dir for continue-pretraining but can't load model correctly #1729
Comments
I've faced similar issues. Usually I convert my models to HF format for some other parts of my pipeline, and converting back from HF to LitGPT resolves this error. Alternatively, see the reproduction and workaround below.

Download some data:

```bash
mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt
```

Download tokenizer:

```bash
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True
```

Pretrain model:

```bash
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 1_000_000 \
  --out_dir out/custom-model
```

Continue pretraining the model:

```bash
litgpt pretrain pythia-160m \
  --initial_checkpoint_dir out/custom-model/final \
  --tokenizer_dir EleutherAI/pythia-160m \
  --out_dir new_checkpoint \
  --data TextFiles \
  --data.train_data_path "custom_texts/"
```

This results in the error reported in the original post below.
The specific issue is that pretraining saves the full training state (model weights together with the optimizer state, dataloader state, and iteration counters), not a plain weights file that `--initial_checkpoint_dir` can load. Instead, you can resume from the saved training state:

```bash
litgpt pretrain pythia-160m \
  --resume "auto" \
  --tokenizer_dir EleutherAI/pythia-160m \
  --out_dir out/custom-model-2 \
  --data TextFiles \
  --data.train_data_path "custom_texts/"
```

There may be other ways to do it with a conversion like mentioned above.
Thanks @rasbt, if someone is continuing with a different dataset etc. (like OP), would `--resume` still be the right approach?
I've tried it and it works with the `--resume` option.

1. If you want to continue pretraining on a different dataset, you need to set the new data path.
2. If you want to resume from a specific checkpoint, pointing `--resume` at the step directory gives

   ```
   [rank1]: ValueError: The path '/llama_tinystory2_en/step-00050000' does not point to a valid checkpoint. Make sure the path points to either a directory with FSDP checkpoint shards, or a single file with a full checkpoint.
   ```

   so you need to set the lit_model.pth path (the file inside that directory) instead.
Thanks for trying @wodelt. I am still a bit concerned from my read of the code (lines 216 to 233 in ef886a7), in particular what gets saved around line 216. I'm hoping @rasbt has better insight into this saving/loading.
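
For reference, litgpt's pretraining loop uses Lightning Fabric, and checkpoints saved via Fabric's `save()` serialize whatever state dict they are given. The toy example below is my own minimal sketch of that pattern (not the litgpt source); it shows why the resulting file has training-state keys at the top level rather than raw parameter names.

```python
# Toy sketch of the Fabric.save() pattern: saving a dict of training state yields
# a checkpoint whose top-level keys are "model", "optimizer", etc.
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu", devices=1)
model = fabric.setup(torch.nn.Linear(4, 4))
optimizer = torch.optim.AdamW(model.parameters())

state = {"model": model, "optimizer": optimizer, "iter_num": 0, "step_count": 0}
fabric.save("toy-checkpoint/lit_model.pth", state)

print(list(torch.load("toy-checkpoint/lit_model.pth", map_location="cpu").keys()))
# -> ['model', 'optimizer', 'iter_num', 'step_count']
```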
Oh sorry, yes you are right. If you want to train it on a different dataset, then you would not use the `--resume` option; you'd convert the checkpoint and pass it via `--initial_checkpoint_dir` instead. The full workflow:

Download some data:

```bash
mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt
```

Download tokenizer:

```bash
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True
```

Pretrain model:

```bash
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 1_000_000 \
  --out_dir out/custom-model
```

Continue pretraining the model in case the training was interrupted (using `--resume`):

```bash
litgpt pretrain pythia-160m \
  --resume "auto" \
  --tokenizer_dir EleutherAI/pythia-160m \
  --out_dir out/custom-model-2 \
  --data TextFiles \
  --data.train_data_path "custom_texts/"
```

Continue pretraining the model on a different dataset (requires a model conversion step):

```bash
litgpt convert_pretrained_checkpoint out/custom-model/final/ out/custom-model-converted

scp -r custom_texts/ custom_new_texts/

litgpt pretrain pythia-160m \
  --initial_checkpoint_dir out/custom-model-converted \
  --tokenizer_dir EleutherAI/pythia-160m \
  --out_dir new_checkpoint \
  --data TextFiles \
  --data.train_data_path "custom_new_texts/"
```
I want to continue pretraining my custom model on another dataset, so I only changed `initial_checkpoint_dir` in training.yaml to the latest run's checkpoint directory path, but it seems the model can't be loaded correctly:
```
[rank0]: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel:
[rank0]: Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.proj.weight", "transformer.h.0.norm_2.weight", "transformer.h.0.mlp.fc_1.weight", "transformer.h.0.mlp.fc_2.weight", "transformer.h.0.mlp.proj.weight", "transformer.h.1.norm_1.weight", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.proj.weight", "transformer.h.1.norm_2.weight", "transformer.h.1.mlp.fc_1.weight", "transformer.h.1.mlp.fc_2.weight", "transformer.h.1.mlp.proj.weight", "transformer.h.2.norm_1.weight", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.proj.weight", "transformer.h.2.norm_2.weight", "transformer.h.2.mlp.fc_1.weight", "transformer.h.2.mlp.fc_2.weight", "transformer.h.2.mlp.proj.weight", "transformer.h.3.norm_1.weight", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.proj.weight", "transformer.h.3.norm_2.weight", "transformer.h.3.mlp.fc_1.weight", "transformer.h.3.mlp.fc_2.weight", "transformer.h.3.mlp.proj.weight", "transformer.h.4.norm_1.weight", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.proj.weight", "transformer.h.4.norm_2.weight", "transformer.h.4.mlp.fc_1.weight", "transformer.h.4.mlp.fc_2.weight", "transformer.h.4.mlp.proj.weight", "transformer.h.5.norm_1.weight", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.proj.weight", "transformer.h.5.norm_2.weight", "transformer.h.5.mlp.fc_1.weight", "transformer.h.5.mlp.fc_2.weight", "transformer.h.5.mlp.proj.weight", "transformer.ln_f.weight".
[rank0]: Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".
```

I don't understand the error because I didn't change the model_config.