
Inference throwing: TypeError: forward() got an unexpected keyword argument 'position_ids' #6

Closed
Qubitium opened this issue Apr 9, 2023 · 7 comments

Comments


Qubitium commented Apr 9, 2023

Env:
Ubuntu 22.04
PyTorch 2.1 nightly, CUDA 11.8
transformers [head]
peft [head]

Reproduction steps:

  • pip install transformers from git HEAD
  • Check out the GPTQ-for-LLaMa cuda branch
  • Generate a 4-bit quantized model with --act-order --sequential
  • Convert the 4-bit model from the last step to a Triton model using the GPTQ-triton convert script
  • Copy quant.py and custom_autotune.py from GPTQ-triton into the GPTQ-for-LLaMa cuda branch
  • Start the inference code using Gradio, with minor changes

Result:

  • The tokenizer loads
  • The quantized model loads
  • model.generate() throws the following error on input

Is the quantized code not compatible with transformers [head]? Or am I doing something wrong?

    gen_output = model.generate(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'position_ids'

Generation code

    # Excerpt from the Gradio inference script; `inputs`, `DEV`, `model`, and the
    # sampling parameters come from the surrounding code.
    import torch
    from transformers import GenerationConfig

    input_ids = inputs["input_ids"].to(DEV)

    generation_config = GenerationConfig(
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
        length_penalty=length_penalty,
    )

    with torch.no_grad():
        gen_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
        )
@fpgaminer
Owner

Good catch, thank you. I'll fix things up for the latest transformers head. In the meantime, you could try transformers @ commit a92e0ad2e20ef4ce28410b5e05c5d63a5a304e65
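
For context, the traceback means the patched attention module's forward() predates the position_ids keyword that newer transformers versions now pass into each decoder layer's self_attn call. A minimal sketch of a signature that tolerates it is below; the class name is hypothetical and the parameter list is only an approximation of the transformers interface at the time, not the actual repository code:

    # Illustrative only -- not the actual GPTQ-triton code. Newer transformers
    # versions call self_attn(..., position_ids=...), so a patched attention
    # module has to accept that keyword (or swallow unknown kwargs).
    import torch.nn as nn

    class QuantLlamaAttentionSketch(nn.Module):  # hypothetical name
        def forward(
            self,
            hidden_states,
            attention_mask=None,
            position_ids=None,    # the keyword the old signature was missing
            past_key_value=None,
            output_attentions=False,
            use_cache=False,
            **kwargs,             # tolerate future keywords transformers may add
        ):
            # ... the quantized attention computation would go here ...
            raise NotImplementedError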


Qubitium commented Apr 10, 2023

@fpgaminer Your latest Triton updates are really fast. Can't believe it. GPTQ-for-LLaMa ported your new code over, and it finally made the triton branch not only usable but the fastest in all my real-world tests.

Btw, I'm not sure if it's transformers-related or Triton-related, but beam search doesn't appear to work. I expected a slowdown as num_beams goes up, but I get the same tokens/s back, which doesn't make much sense. Does Triton need to implement beam search, or should that be handled by the higher-level transformers API? I'm trying to isolate why beams aren't functioning. Thanks.


fpgaminer commented Apr 10, 2023

Does Triton need to implement beam search, or should that be handled by the higher-level transformers API?

That should be handled in the transformers library.
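
For reference, beam search is enabled entirely through the generation API, so the quantized kernels don't need to know about it. A minimal sketch, reusing the variable names from the generation snippet above:

    # Beam search is selected purely at the transformers API level:
    with torch.no_grad():
        gen_output = model.generate(
            input_ids=input_ids,
            num_beams=4,           # any value > 1 switches generate() to beam search
            do_sample=False,       # plain (non-sampled) beam search
            max_new_tokens=max_new_tokens,
            return_dict_in_generate=True,
        )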

@fpgaminer
Owner

FYI, there's a 10% performance regression in the latest transformers library (commit 7dcd870 and onward). I've opened an issue over there for it. I'm going to hold off on updating my code for now and simply recommend sticking to pre-7dcd870 commits of transformers.
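
If anyone wants to verify the regression locally, a rough tokens-per-second measurement run against each transformers commit is enough to see the difference. A quick sketch, again reusing the model and input_ids from the snippet above:

    import time
    import torch

    # Rough throughput check: generate a fixed number of tokens and divide by
    # wall-clock time; run the same script against each transformers commit.
    with torch.no_grad():
        torch.cuda.synchronize()
        start = time.time()
        out = model.generate(input_ids=input_ids, do_sample=False, max_new_tokens=128)
        torch.cuda.synchronize()
        elapsed = time.time() - start

    new_tokens = out.shape[-1] - input_ids.shape[-1]
    print(f"{new_tokens / elapsed:.1f} tokens/s")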

@fpgaminer
Owner

Update:

As of the latest GPTQ-triton commit (3daf413), transformers HEAD is supported again. I'm working upstream to fix the performance regression in transformers.

@Qubitium
Author

Currently quantizing 30B 4-bit using the repo's new quantize script and will do some testing later. Will post findings here.

@Qubitium
Author

@fpgaminer Confirmed the transformers [head] compat issue is fixed with a quantized 30B 4-bit model using your repo's quantize script. However, I found a beam search issue at #11
