
use initial_checkpoint_dir for continue-pretraining but can't load model correctly #1729

Closed
wodelt opened this issue Sep 18, 2024 · 6 comments · Fixed by #1735
Labels
documentation (Improvements or additions to documentation), enhancement (New feature or request), question (Further information is requested)

Comments

@wodelt

wodelt commented Sep 18, 2024

I want to continue pretraining my custom model on another dataset, so I only changed initial_checkpoint_dir in training.yaml to the latest-run checkpoint directory path, but it seems the model can't be loaded correctly:

[rank0]: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel:
[rank0]: Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.proj.weight", "transformer.h.0.norm_2.weight", "transformer.h.0.mlp.fc_1.weight", "transformer.h.0.mlp.fc_2.weight", "transformer.h.0.mlp.proj.weight", "transformer.h.1.norm_1.weight", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.proj.weight", "transformer.h.1.norm_2.weight", "transformer.h.1.mlp.fc_1.weight", "transformer.h.1.mlp.fc_2.weight", "transformer.h.1.mlp.proj.weight", "transformer.h.2.norm_1.weight", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.proj.weight", "transformer.h.2.norm_2.weight", "transformer.h.2.mlp.fc_1.weight", "transformer.h.2.mlp.fc_2.weight", "transformer.h.2.mlp.proj.weight", "transformer.h.3.norm_1.weight", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.proj.weight", "transformer.h.3.norm_2.weight", "transformer.h.3.mlp.fc_1.weight", "transformer.h.3.mlp.fc_2.weight", "transformer.h.3.mlp.proj.weight", "transformer.h.4.norm_1.weight", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.proj.weight", "transformer.h.4.norm_2.weight", "transformer.h.4.mlp.fc_1.weight", "transformer.h.4.mlp.fc_2.weight", "transformer.h.4.mlp.proj.weight", "transformer.h.5.norm_1.weight", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.proj.weight", "transformer.h.5.norm_2.weight", "transformer.h.5.mlp.fc_1.weight", "transformer.h.5.mlp.fc_2.weight", "transformer.h.5.mlp.proj.weight", "transformer.ln_f.weight".
[rank0]: Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".

I don't understand the error because I didn't change the model_config.
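For reference, the key mismatch above suggests the file holds the full training state rather than a bare model state_dict; loading it directly shows this (a sketch, with the path standing in for the latest-run checkpoint directory):

import torch

# Placeholder path: the lit_model.pth inside the latest-run checkpoint directory.
ckpt = torch.load("path/to/latest-run/lit_model.pth", map_location="cpu")

# Top-level keys are the training state, matching the "unexpected" keys above.
print(list(ckpt.keys()))  # ['model', 'optimizer', 'train_dataloader', 'iter_num', 'step_count']

# The weights themselves (the "missing" keys) sit one level down, under "model".
print(list(ckpt["model"].keys())[:3])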

@wodelt added the question label Sep 18, 2024
@fdalvi

fdalvi commented Sep 19, 2024

I've faced similar issues. I usually convert my models to HF format for other parts of my pipeline, and converting back from HF to LitGPT resolves this error.

Alternatively, https://github.com/Lightning-AI/litgpt/blob/main/litgpt/scripts/convert_pretrained_checkpoint.py also seems to be meant for this purpose. Perhaps you can try that while the maintainers reply with a more concrete solution!

@rasbt
Collaborator

rasbt commented Sep 19, 2024

Download some data

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

Download tokenizer

litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

Pretrain model

litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 1_000_000 \
  --out_dir out/custom-model

Continue pretraining the model

litgpt pretrain pythia-160m \
   --initial_checkpoint_dir out/custom-model/final \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir new_checkpoint \
   --data TextFiles \
   --data.train_data_path "custom_texts/"

results in

RuntimeError: Error(s) in loading state_dict for GPT:
        Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", ..., "transformer.h.5.mlp.proj.bias", "transformer.ln_f.weight", "transformer.ln_f.bias".
        Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".

The specific issue is that the pretraining run saves things like the iter_num etc. So, if you are continuing pretraining from an existing pretraining checkpoint (which is a bit different from a pretrained checkpoint downloaded from the hub), you need to provide the --resume option:

litgpt pretrain pythia-160m \
   --resume "auto" \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir out/custom-model-2 \
   --data TextFiles \
   --data.train_data_path "custom_texts/"

There may be other ways to do it with a conversion, as mentioned above.
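For intuition, such a conversion essentially unwraps the nested training state into a plain weights file that --initial_checkpoint_dir can consume. A rough sketch of that idea (not the actual litgpt implementation; the paths are just examples):

from pathlib import Path
import torch

src = Path("out/custom-model/final/lit_model.pth")           # training-state checkpoint (example path)
dst = Path("out/custom-model-weights-only/lit_model.pth")    # weights-only file for --initial_checkpoint_dir
dst.parent.mkdir(parents=True, exist_ok=True)

ckpt = torch.load(src, map_location="cpu")
state_dict = ckpt["model"] if "model" in ckpt else ckpt      # drop optimizer/dataloader/iter_num state
torch.save(state_dict, dst)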

@rasbt added the documentation and enhancement labels Sep 19, 2024
@fdalvi

fdalvi commented Sep 21, 2024

Thanks @rasbt! If someone is continuing with a different dataset (like the OP), would --resume still work? Wouldn't it try to load the train_dataloader state, number of steps, etc. from the previous dataset/run and cause issues?

@wodelt
Author

wodelt commented Sep 22, 2024

> Thanks @rasbt! If someone is continuing with a different dataset (like the OP), would --resume still work? Wouldn't it try to load the train_dataloader state, number of steps, etc. from the previous dataset/run and cause issues?

I've tried it, and it works with --resume "auto". The previous train_dataloader step count is inherited and continues to be counted. But here are some things to be aware of:

1. If you want to continue pretraining on a different dataset, you need to set --resume "auto" and make sure your out_dir doesn't change.

2. If you want to change out_dir, --resume "auto" can't load your previous checkpoint, because the new out_dir doesn't contain any checkpoint yet. And if you set --resume '/llama_tinystory2_en/step-00050000' manually, it causes issues:

[rank1]: ValueError: The path '/llama_tinystory2_en/step-00050000' does not point to a valid checkpoint. Make sure the path points to either a directory with FSDP checkpoint shards, or a single file with a full checkpoint.

You need to point --resume at the lit_model.pth path (/llama_tinystory2_en/step-00050000/lit_model.pth) to achieve the same effect as in point 1.

@fdalvi

fdalvi commented Sep 22, 2024

Thanks for trying @wodelt. I am still a bit concerned from my read of the code:

litgpt/litgpt/pretrain.py

Lines 216 to 233 in ef886a7

train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
train_dataloader, val_dataloader = fabric.setup_dataloaders(train_dataloader, val_dataloader)

if initial_checkpoint_dir:
    fabric.load_raw(initial_checkpoint_dir / "lit_model.pth", model)

state = {
    "model": model,
    "optimizer": optimizer,
    "train_dataloader": train_dataloader,
    "iter_num": 0,
    "step_count": 0,
}

resume = find_resume_path(resume, out_dir)
if resume:
    fabric.print(f"Resuming training from {resume}")
    fabric.load(resume, state)

In Line 216, a train_dataloader is initialized from the new paths in the config. However, Line 233 then loads something from the checkpoint into state -> train_dataloader. As you have seen, the iteration number is definitely loaded from the older dataloader. What I am unsure about is whether the old paths from the old dataset are also loaded into the "new" train_dataloader, effectively nulling out Line 216.
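For what it's worth, one way to check what actually ends up in the saved state is to load a step checkpoint and inspect it directly (a sketch; the path is just an example of a saved step from a run):

import torch

ckpt = torch.load("out/custom-model/step-00050000/lit_model.pth", map_location="cpu")

print(ckpt["iter_num"], ckpt["step_count"])  # progress counters saved with the run
print(ckpt["train_dataloader"])              # whatever state the dataloader saved; check whether dataset paths appear here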

I'm hoping @rasbt has better insight into this saving/loading.

@rasbt
Collaborator

rasbt commented Sep 23, 2024

Oh sorry, yes, you are right: if you want to train it on a different dataset, then you would not use the --resume option. Instead, you'd need to convert the checkpoint using the litgpt convert_pretrained_checkpoint utility. Let me provide a cleaned-up workflow below (I will also add this to the docs shortly):

Download some data

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

Download tokenizer

litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

Pretrain model

litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 1_000_000 \
  --out_dir out/custom-model

Continue pretraining the model in case the training was interrupted (using --resume):

litgpt pretrain pythia-160m \
   --resume "auto" \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir out/custom-model-2 \
   --data TextFiles \
   --data.train_data_path "custom_texts/"

Continue pretraining the model on a different dataset (requires model conversion step):

litgpt convert_pretrained_checkpoint out/custom-model/final/ out/custom-model-converted
scp -r  custom_texts/ custom_new_texts/
litgpt pretrain pythia-160m \
   --initial_checkpoint_dir out/custom-model-converted \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir new_checkpoint \
   --data TextFiles \
   --data.train_data_path "custom_new_texts/"
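As a quick sanity check (a sketch, assuming the single-file checkpoint layout from the run above), the converted directory should now contain a plain weights-only lit_model.pth:

import torch

state_dict = torch.load("out/custom-model-converted/lit_model.pth", map_location="cpu")
print(sorted(state_dict)[:5])  # expect parameter names like 'lm_head.weight', not 'model'/'optimizer'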
