
error occur in the resize_embedding #32196

Closed · 2 of 4 tasks
Gaiejj opened this issue Jul 24, 2024 · 8 comments

@Gaiejj
Gaiejj commented Jul 24, 2024

System Info

  • transformers version: 4.43.1
  • Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.9
  • Huggingface_hub version: 0.24.1
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H800

Who can help?

@ArthurZucker When using DeepSpeed ZeRO-3 to train the llama2-7b-hf model, I encountered an error during the resize_embedding step that I couldn't resolve. The llama2-7b-hf tokenizer lacks a pad_token, so I specified a default value for it, which requires resizing the embedding. This resize works correctly in transformers 4.41.2 but fails in 4.43.0.
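
For context, "specifying a default value" means the usual add-a-pad-token-then-resize pattern; here is a minimal sketch (the token string and model path are placeholders, mirroring the full reproduction below):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('PATHTO/Llama-2-7b-hf')
model = AutoModelForCausalLM.from_pretrained('PATHTO/Llama-2-7b-hf')

# Llama-2 ships without a pad token; adding one grows the vocabulary from
# 32000 to 32001 entries, so the embedding matrix must be resized to match.
num_added = tokenizer.add_special_tokens({'pad_token': '<pad>'})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))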

I identified the following two anomalies:

  1. Abnormal tensor shape
        params = [embeddings.weight]
        # embeddings.weight.size(0) is 32001 here
        context = (
            deepspeed.zero.GatheredParameters(params, modifier_rank=0)
            if is_deepspeed_zero3_enabled()
            else contextlib.nullcontext()
        )
        with context:
            for param in params:
                if param is None:
                    continue
                assert param.size(0) == new_num_embeddings, f'{param.size(0)}, {new_num_embeddings}'
                # bug here, param size is 32000 while new_num_embeddings is 32001, in 4.43.0 transformers
                param_data = param.data
                param_mean = param_data[:-num_new_embeddings].mean(dim=0, keepdim=True)
                param_data[-num_new_embeddings:] = param_mean
  2. Abnormal ds_id (see the diagnostic sketch after this list)
        params = [embeddings.weight]
        print(hasattr(embeddings.weight, 'ds_id'))
        # True for transformers 4.43.0, False for transformers 4.41.2
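
Both anomalies appear to be related: in 4.43.x the embedding weight carries DeepSpeed's ds_id attribute (i.e. it is handled as a ZeRO-3 partitioned parameter), and the row count seen inside deepspeed.zero.GatheredParameters (32000) no longer matches the resized vocabulary (32001). Here is a minimal diagnostic sketch, assuming a ZeRO-3 initialized model as in the reproduction below (illustrative only):

import deepspeed
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

def report_embedding_shape(model, expected_rows):
    # Inspect the input embedding weight outside and inside a ZeRO-3 gather.
    weight = model.get_input_embeddings().weight
    print('ZeRO-3 partitioned:', hasattr(weight, 'ds_id'))
    print('shape outside gather:', tuple(weight.shape))  # a placeholder shape under ZeRO-3

    if is_deepspeed_zero3_enabled():
        # modifier_rank=None gathers the full parameter read-only on every rank.
        with deepspeed.zero.GatheredParameters([weight], modifier_rank=None):
            print('shape inside gather:', tuple(weight.shape), '| expected rows:', expected_rows)
    else:
        print('expected rows:', expected_rows)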

I've spent a lot of time pinpointing this issue, but I genuinely don't know how to resolve it. Any help would be incredibly valuable, and I'm very grateful for your time.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. The Python file:
import contextlib  # used by init_new_embeddings below
import json

import torch
import deepspeed

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from transformers.integrations.deepspeed import (
    HfDeepSpeedConfig,
    is_deepspeed_zero3_enabled,
)


DEFAULT_BOS_TOKEN: str = '<s>'
DEFAULT_EOS_TOKEN: str = '</s>'
DEFAULT_PAD_TOKEN: str = '<pad>'
DEFAULT_UNK_TOKEN: str = '<unk>'

model_name_or_path = 'PATHTO/Llama-2-7b-hf'
ds_cfgs_path = 'PATH'

deepspeed.init_distributed()

with open(ds_cfgs_path) as f:
    ds_cfgs = json.load(f)
    ds_cfgs['bf16']['enabled'] = True

dstchf = HfDeepSpeedConfig(ds_cfgs)  # keep this object alive so from_pretrained initializes weights with ZeRO-3 zero.Init

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    model_max_length=2048,
    padding_side='right',
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
)

# Reference: https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py
def resize_tokenizer_embedding(tokenizer, model) -> None:
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    def init_new_embeddings(
        embeddings,
        new_num_embeddings: int,
        num_new_embeddings: int,
    ) -> None:
        if embeddings is None:
            return

        params = [embeddings.weight]
        print(hasattr(embeddings.weight, 'ds_id'))
        # True for transformers 4.43.1, False for transformers 4.41.2
        exit()  # debug stop after the ds_id check; remove to reach the shape assert below
        context = (
            deepspeed.zero.GatheredParameters(params, modifier_rank=0)
            if is_deepspeed_zero3_enabled()
            else contextlib.nullcontext()
        )
        with context:
            for param in params:
                if param is None:
                    continue
                assert param.size(0) == new_num_embeddings, f'{param.size(0)}, {new_num_embeddings}'
                # bug here, param size is 32000 while new_num_embeddings is 32001
                param_data = param.data
                param_mean = param_data[:-num_new_embeddings].mean(dim=0, keepdim=True)
                param_data[-num_new_embeddings:] = param_mean

    special_tokens_dict = {}
    if tokenizer.pad_token is None:
        special_tokens_dict['pad_token'] = DEFAULT_PAD_TOKEN
    if tokenizer.eos_token is None:
        special_tokens_dict['eos_token'] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens_dict['bos_token'] = DEFAULT_BOS_TOKEN
    if tokenizer.unk_token is None:
        special_tokens_dict['unk_token'] = DEFAULT_UNK_TOKEN

    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    new_num_embeddings = len(tokenizer)

    model.config.bos_token_id = tokenizer.bos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id

    if num_new_tokens > 0:
        hf_device_map = getattr(model, 'hf_device_map', {})
        devices = {
            torch.device(device)
            for device in hf_device_map.values()
            if device not in {'cpu', 'disk'}
        }
        is_model_parallel = len(devices) > 1

        if not is_model_parallel:
            model.resize_token_embeddings(new_num_embeddings)

            init_new_embeddings(
                model.get_input_embeddings(),
                new_num_embeddings=new_num_embeddings,
                num_new_embeddings=num_new_tokens,
            )
            init_new_embeddings(
                model.get_output_embeddings(),
                new_num_embeddings=new_num_embeddings,
                num_new_embeddings=num_new_tokens,
            )
            
resize_tokenizer_embedding(tokenizer=tokenizer, model=model)
  2. The DeepSpeed launch command:
deepspeed \
  --master_port 12345 \
  --module debug.py
  3. The ds cfgs:
{
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": null,
  "steps_per_print": 10,
  "zero_optimization": {
      "stage": 3,
      "offload_param": {
          "device": "none"
      },
      "offload_optimizer": {
          "device": "none"
      },
      "param_persistence_threshold": 1e4,
      "max_live_parameters": 1e8,
      "prefetch_bucket_size": 3e7,
      "memory_efficient_linear": false,
      "gather_16bit_weights_on_model_save": true
  },
  "gradient_clipping": 1.0,
  "prescale_gradients": false,
  "wall_clock_breakdown": false,
  "hybrid_engine": {
      "enabled": false,
      "max_out_tokens": 512,
      "inference_tp_size": 1,
      "release_inference_cache": false,
      "pin_parameters": true,
      "tp_gather_partition_size": 8
  },
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": false
  }
}

Expected behavior

The embedding resizes correctly, as it does in transformers 4.41.2. Thanks!
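
Concretely, "resizes correctly" means that after resize_token_embeddings the gathered weights have len(tokenizer) rows. A minimal check, assuming the model, tokenizer and ZeRO-3 setup from the reproduction above (illustrative only):

# Assumes `model`, `tokenizer` and the ZeRO-3 init from the reproduction script.
import deepspeed

for module in (model.get_input_embeddings(), model.get_output_embeddings()):
    weight = module.weight
    # Read-only gather of the full parameter on every rank.
    with deepspeed.zero.GatheredParameters([weight], modifier_rank=None):
        assert weight.size(0) == len(tokenizer), (weight.size(0), len(tokenizer))  # expect 32001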

Gaiejj added the bug label on Jul 24, 2024
Gaiejj changed the title from "erroe occur in the resize_embedding" to "error occur in the resize_embedding" on Jul 24, 2024
@ArthurZucker
Collaborator

Hey! I think #32192 should have fixed it!

@seokhyunan

It seems the issue is still not fixed. You can check the progress in #32192.

@Gaiejj
Author

Gaiejj commented Jul 25, 2024

Thank you very much for your prompt response and continuous follow-up. I will closely monitor the latest updates. Thanks again for your hard work! ❤️

@seokhyunan

This issue is resolved by #32214! Thanks to @zucchini-nlp.

@ArthurZucker
Collaborator

On my way to do a patch then! Thanks all for reporting this quickly, and thanks @zucchini-nlp for your quick fixes!

@Gaiejj
Author

Gaiejj commented Jul 26, 2024

Congratulations ❤️! We have successfully run full-parameter PPO fine-tuning on Llama 3.1. Thanks again to @ArthurZucker, @iamseokhyun, and @zucchini-nlp for their super quick effort and follow-up!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

Closing as completed!
