
GPTQ Env vars: catch correct type of error #596

Merged Jul 12, 2023 (5 commits)
Conversation

@ssmi153 (Contributor) commented Jul 12, 2023

What does this PR do?

When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/except block is catching the wrong type of exception. This PR fixes that.

@Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct.
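For readers following along, here is a minimal sketch of the shape of the fix. This is not the exact diff: the helper name, the `.item()` calls, and the environment-variable names are assumptions pieced together from the discussion below; the one point the PR actually establishes is that a missing tensor surfaces as `RuntimeError`, so that is the exception the fallback must catch.

```python
import os

def _load_gptq_params(weights):
    """Hypothetical helper: read GPTQ metadata from the checkpoint,
    falling back to environment variables when the tensors are absent."""
    try:
        bits = weights.get_tensor("gptq_bits").item()
        groupsize = weights.get_tensor("gptq_groupsize").item()
    except RuntimeError:
        # get_filename() raises RuntimeError (not KeyError) for a missing
        # tensor, so this is the exception type the fallback must catch.
        bits = int(os.environ["GPTQ_BITS"])            # assumed env-var name
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])  # assumed env-var name
    return bits, groupsize
```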

@Narsil (Collaborator) commented Jul 12, 2023

Can you provide an example of a model where this error is triggered instead? I'm very surprised that this error could be raised.

@ssmi153 (Contributor, Author) commented Jul 12, 2023

This is the line the exception is being thrown from: https://github.com/huggingface/text-generation-inference/blob/f2f0289fb99c7caab0c3749fdf211e4d5ab2938b/server/text_generation_server/utils/weights.py#L49C22-L49C22

Following the code through, this is what I've pieced together (sketched below):

1. We call bits = self.get_tensor("gptq_bits").
2. get_tensor() calls self.get_filename(tensor_name).
3. get_filename() calls self.routing.get() for this tensor, which returns None because the tensor doesn't exist.
4. That leads get_filename() to raise RuntimeError(f"weight {tensor_name} does not exist"), so we end up with a RuntimeError.
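For reference, a condensed paraphrase of the relevant weights.py code at the linked commit (bodies abbreviated; only the control flow that matters here is shown):

```python
class Weights:
    def __init__(self, routing):
        # routing maps tensor names to the safetensors file containing them
        self.routing = routing

    def get_filename(self, tensor_name: str):
        filename = self.routing.get(tensor_name, None)  # None if absent
        if filename is None:
            # This is the RuntimeError that bubbles up to the caller,
            # so `except KeyError` would never fire here.
            raise RuntimeError(f"weight {tensor_name} does not exist")
        return str(filename), tensor_name

    def get_tensor(self, tensor_name: str):
        filename, tensor_name = self.get_filename(tensor_name)
        # ... open the safetensors file and return the named tensor ...
```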

Happy to be proven wrong here!

... I'll find a model in a sec that triggers this and put the link here ...

@Narsil (Collaborator) commented Jul 12, 2023

100%!

Catching and reraising at its best...

@Narsil (Collaborator) commented Jul 12, 2023

Merging, tests are red because you don't have access to our secrets.

@Narsil merged commit 3628559 into huggingface:main Jul 12, 2023
2 of 5 checks passed
@Narsil (Collaborator) commented Jul 12, 2023

@olivier FYI

@ssmi153 (Contributor, Author) commented Jul 12, 2023

Quick update: I've got a model in a private repo, quantized with GPTQ-for-Llama, that triggers this reliably: https://huggingface.co/ssmi153/student-feedback-llama-30b-guanaco-2023-07-03-final-gptq. I can DM you a HuggingFace token to give you access (probably shouldn't put that out in public though!).

I've been trying to find other public GPTQ model files that do it too, but for all of TheBloke's conversions I run into another issue:

```
2023-07-12T17:57:26.137143Z ERROR text_generation_launcher (shard-manager, rank 0): Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 175, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 142, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 215, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 65, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 371, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 310, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 311, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 246, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 121, in __init__
    self.query_key_value = TensorParallelColumnLinear.load_multi(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 251, in load_multi
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 125, in get_multi_weights_col
    w = [self.get_tensor(f"{p}.g_idx") for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 125, in <listcomp>
    w = [self.get_tensor(f"{p}.g_idx") for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 62, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight model.layers.0.self_attn.q_proj.g_idx does not exist
```

This is from this model: https://huggingface.co/TheBloke/orca_mini_v2_7B-GPTQ

It's also a conversion using GPTQ-for-Llama, so it should be broadly compatible. Here's the quantization config:

```json
{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "sym": true,
  "true_sequential": true
}
```

Is this incompatibility due to desc_act = false, or something else?
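For what it's worth, one possible reading of the traceback: the checkpoint simply has no g_idx tensors, while get_multi_weights_col loads {p}.g_idx unconditionally. Below is a hedged sketch of a tolerant load, not TGI code: the function name and the packing arithmetic are assumptions. The one solid fact it leans on is that without act-order (desc_act = false), row i of a GPTQ layer belongs to group i // group_size, so a missing g_idx can be synthesized.

```python
import torch

def load_g_idx(weights, prefix: str, qweight: torch.Tensor,
               bits: int, groupsize: int) -> torch.Tensor:
    """Hypothetical sketch: load g_idx if present, otherwise synthesize
    the sequential grouping that desc_act=false checkpoints imply."""
    try:
        return weights.get_tensor(f"{prefix}.g_idx")
    except RuntimeError:
        # qweight packs 32 // bits rows per int32 row (assumed layout),
        # so recover the unpacked input-feature count first.
        infeatures = qweight.shape[0] * (32 // bits)
        # Without act-order, row i belongs to group i // groupsize.
        return torch.arange(infeatures, dtype=torch.int32) // groupsize
```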

@Narsil (Collaborator) commented Jul 13, 2023

Annnnd this is why I don't particularly enjoy maintaining external models...

I'm not sure I have the bandwidth to really investigate an escape hatch.

@ssmi153 (Contributor, Author) commented Jul 14, 2023

@Narsil, I worked out how to get at least some of the quantised versions of TheBloke's conversions working, which is good news. Take a look at #601, where I've added more details.
