Help needed in testing out TinyLlama #1154

Closed
jzhang38 opened this issue Sep 28, 2023 · 2 comments · Fixed by #1162

@jzhang38

The TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3T tokens, which should make it an ideal draft model for speculative inference.

https://github.com/jzhang38/TinyLlama
https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b

I encountered this error when I tried to use TinyLlama-1.1B-intermediate-step-240k-503b as the draft model:

/root/miniconda3/lib/python3.10/site-packages/torch/__init__.py:635: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:450.)
  _C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision (if it doesn't exist)...
Loading 'model/chaoscodes/TinyLlama-1.1B-intermediate-step-240k-503b' model weights from the cache...
Loading weight file tok_embeddings_weight
Loading weight file layers_0_attention_norm_weight
Loading weight file layers_0_attention_wq_weight
Loading weight file layers_0_attention_wk_weight
load attention data error 1048576, 8388608, 1, /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision/layers_0_attention_wk_weight
python: /tmp/pip-install-ijvow1hh/flexflow_0192abbf2b1a40128377649dca2ea9f0/inference/file_loader.cc:252: void load_attention_weights_v2(DT*, int, int, size_t, size_t, std::string, std::string, size_t, int) [with DT = __half; size_t = long unsigned int; std::string = std::__cxx11::basic_string<char>]: Assertion `false && "data size mismatch"' failed.
Aborted (core dumped)
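
For what it's worth, the two sizes printed in the error line look exactly like a query-head / KV-head mix-up. Here is a back-of-the-envelope check, assuming TinyLlama's published config (hidden_size 2048, 32 query heads, 4 KV heads) and half-precision weights; this is my own arithmetic, not taken from the FlexFlow code:

hidden_size, n_q_heads, n_kv_heads = 2048, 32, 4
head_dim = hidden_size // n_q_heads      # 64
bytes_per_param = 2                      # half precision

# K projection actually stored on disk (GQA: only the 4 KV heads).
wk_gqa_bytes = hidden_size * n_kv_heads * head_dim * bytes_per_param
print(wk_gqa_bytes)  # 1048576 -- the first number in the error

# K projection size if the loader assumes n_kv_heads == n_q_heads (plain MHA).
wk_mha_bytes = hidden_size * n_q_heads * head_dim * bytes_per_param
print(wk_mha_bytes)  # 8388608 -- the second number in the error

So the weight file seems to hold the GQA-sized K projection, while the loader appears to expect a full MHA-sized one.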

The code I use:

import flexflow.serve as ff

ff.init(
    num_gpus=4,
    memory_per_gpu=23000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1,
)

# Specify the LLM
llm = ff.LLM("model/Llama-2-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms=[]
ssm = ff.SSM("model/TinyLlama-1.1B-intermediate-step-240k-503b")
ssms.append(ssm)


# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)

# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config, ssms=ssms)

result = llm.generate("Here are some travel tips for Tokyo:\n")

I believe this is probably not an issue with the TinyLlama weights. More than likely there is a bug in how FlexFlow handles the GQA weights / RoPE.

The reason I claim this is that the TinyLlama weights work fine with HuggingFace and llama.cpp.

Previously I spotted a bug in llama.cpp: ggerganov/llama.cpp#3364. Basically, a bug existed when converting the HF weights to llama.cpp weights (from GPT-NeoX-style RoPE to GPT-J-style). Nobody spotted it because previous GQA models like Llama-2-70B have num_heads = kv_heads ** 2, while TinyLlama has num_heads = 32 and kv_heads = 4.
The same bug existed in repos such as llama.cpp (now fixed), llama2.c, and llama2.mojo.
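
For reference, that conversion regroups each head's rows of the Q/K projections so that GPT-NeoX-style (half-split) RoPE weights become GPT-J-style (interleaved) ones, and the bug was permuting wk with the number of query heads instead of the number of KV heads. A rough sketch of the idea (not the exact llama.cpp converter code):

import numpy as np

def permute_rope(w, n_head):
    # Regroup each head's rows: GPT-NeoX-style (half-split) RoPE layout
    # -> GPT-J-style (interleaved) layout. Sketch of the llama.cpp idea.
    return (w.reshape(n_head, 2, w.shape[0] // n_head // 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

hidden, n_head, n_head_kv = 2048, 32, 4     # TinyLlama config
head_dim = hidden // n_head                 # 64

wq = np.random.randn(n_head * head_dim, hidden)     # (2048, 2048)
wk = np.random.randn(n_head_kv * head_dim, hidden)  # (256, 2048) under GQA

wq = permute_rope(wq, n_head)
wk = permute_rope(wk, n_head_kv)  # the bug class: passing n_head here breaks GQA models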

I am wondering whether something similar may be happening here.

Right now I am wondering whether this line in FlexFlow is correct:

n_q_heads = n_kv_heads = self.hf_config.num_attention_heads

(The above is just a hypothesis on my part and may not be correct. My point is that it would be nice if someone could make FlexFlow work with TinyLlama :) ).
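
If that hypothesis holds, the change might be as small as reading the KV head count separately from the HF config. Purely a guess on my part, written against the quoted line and not taken from any actual patch:

n_q_heads = self.hf_config.num_attention_heads
# HF's LlamaConfig exposes num_key_value_heads for GQA checkpoints;
# fall back to num_attention_heads for plain MHA models.
n_kv_heads = getattr(
    self.hf_config, "num_key_value_heads", self.hf_config.num_attention_heads
)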

@goliaro
Collaborator

goliaro commented Sep 30, 2023

@jzhang38 Let me give it a try!

@goliaro
Collaborator

goliaro commented Oct 1, 2023

@jzhang38 thanks for bringing this up. PR #1162 should fix the issue :)
