
Error loading Llama-2-70b gptq weights from local directory #728

Closed
2 of 4 tasks
hmcp22 opened this issue Jul 28, 2023 · 10 comments

Comments

hmcp22 commented Jul 28, 2023

System Info

Docker deployment version 0.9.4
Hardware: AWS g5.12xlarge

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Running via docker-compose with the following compose file:

version: "3.5"
services:
  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:0.9.4
    container_name: text-generation-inference
    entrypoint: text-generation-launcher
    restart: always
    stdin_open: true 
    tty: true 
    env_file:
      - tgi.env
    shm_size: '1gb'
    ports:
      - 8080:80
    volumes:
      - type: bind
        source: /home/ubuntu/efs/llm_downloads
        target: /llm_downloads
    deploy:
      resources:
        reservations:
          devices: 
            - driver: nvidia
              count: all
              # device_ids: ['0', '3']
              capabilities: [gpu]
networks:
  default:
    driver: bridge

and the following env variables in the tgi.env file:

MODEL_ID=/llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True
QUANTIZE=gptq
GPTQ_BITS=4
GPTQ_GROUPSIZE=128
SHARDED=true
NUM_SHARD=4
MAX_CONCURRENT_REQUESTS=128
MAX_BEST_OF=5
MAX_STOP_SEQUENCES=4 
MAX_INPUT_LENGTH=4000 
MAX_TOTAL_TOKENS=8192
WAITING_SERVED_RATIO=1.2 
MAX_BATCH_TOTAL_TOKENS=16000 
MAX_WAITING_TOKENS=20

MAX_BATCH_PREFILL_TOKENS=4096

HUGGINGFACE_HUB_CACHE=/llm_downloads/tgi_hf_cache

This gives the following error:

2023-07-28T13:20:04.621775Z  INFO text_generation_launcher: Args { model_id: "/llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True", revision: None, validation_workers: 2, sharded: Some(true), num_shard: Some(4), quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 5, max_stop_sequences: 4, max_input_length: 4000, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: Some(16000), max_waiting_tokens: 20, hostname: "fdd9e32f6611", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/llm_downloads/tgi_hf_cache"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-07-28T13:20:04.621814Z  INFO text_generation_launcher: Sharding model on 4 processes
2023-07-28T13:20:04.621894Z  INFO download: text_generation_launcher: Starting download process.
2023-07-28T13:20:06.114282Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-28T13:20:06.423909Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-28T13:20:06.424273Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-07-28T13:20:06.424931Z  INFO shard-manager: text_generation_launcher: Starting shard rank=3
2023-07-28T13:20:06.424404Z  INFO shard-manager: text_generation_launcher: Starting shard rank=2
2023-07-28T13:20:06.424931Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-07-28T13:20:11.906010Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 216, in _get_gptq_params
    bits = self.gptq_bits
AttributeError: 'Weights' object has no attribute 'gptq_bits'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 67, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 456, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 395, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 331, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 204, in __init__
    self.query_key_value = _load_gqa(config, prefix, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 154, in _load_gqa
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 133, in get_multi_weights_col
    bits, groupsize = self._get_gptq_params()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 219, in _get_gptq_params
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 212, in _get_gptq_params
    bits = self.get_tensor("gptq_bits").item()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 65, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 52, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight gptq_bits does not exist

(The identical "Error when initializing model" traceback is then logged three more times, once for each of the remaining shards.)

2023-07-28T13:20:12.431682Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 216, in _get_gptq_params
    bits = self.gptq_bits

AttributeError: 'Weights' object has no attribute 'gptq_bits'


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 67, in __init__
    model = FlashLlamaForCausalLM(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 456, in __init__
    self.model = FlashLlamaModel(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in __init__
    [

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 395, in <listcomp>
    FlashLlamaLayer(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 331, in __init__
    self.self_attn = FlashLlamaAttention(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 204, in __init__
    self.query_key_value = _load_gqa(config, prefix, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 154, in _load_gqa
    weight = weights.get_multi_weights_col(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 133, in get_multi_weights_col
    bits, groupsize = self._get_gptq_params()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 219, in _get_gptq_params
    raise e

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 212, in _get_gptq_params
    bits = self.get_tensor("gptq_bits").item()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 65, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 52, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight gptq_bits does not exist
 rank=3
2023-07-28T13:20:12.530175Z ERROR text_generation_launcher: Shard 3 failed to start
2023-07-28T13:20:12.530205Z  INFO text_generation_launcher: Shutting down shards
2023-07-28T13:20:12.555460Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1
2023-07-28T13:20:12.555658Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=2
2023-07-28T13:20:12.614893Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: ShardCannotStart

Expected behavior

Expect the model to load correctly.

I did a little digging into where the error happens: it occurs when the GPTQ config settings are loaded in the _get_gptq_params method in server/text_generation_server/utils/weights.py. I'm not entirely sure why these settings aren't picked up from the local directory, since the quantize_config.json file does exist there. As a workaround, I modified _get_gptq_params to fall back to reading the settings from environment variables when it errors (see below), as was the case before this last release. After rebuilding the image, the model loads successfully:

    def _get_gptq_params(self) -> Tuple[int, int]:
        try:
            # First, try the scalar tensors stored inside the weights.
            bits = self.get_tensor("gptq_bits").item()
            groupsize = self.get_tensor("gptq_groupsize").item()
        except (SafetensorError, RuntimeError) as e:
            try:
                # Next, fall back to the attributes set by _set_gptq_params.
                bits = self.gptq_bits
                groupsize = self.gptq_groupsize
            except Exception:
                try:
                    # Finally, fall back to the environment variables,
                    # as earlier releases did.
                    import os

                    bits = int(os.getenv("GPTQ_BITS"))
                    groupsize = int(os.getenv("GPTQ_GROUPSIZE"))
                except Exception:
                    raise e

        return bits, groupsize
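
As a quick sanity check (a minimal sketch, run on the host; the path is the bind-mount source from the compose file above), the file is readable and carries the expected keys:

import json
from pathlib import Path

# Local GPTQ checkout, bind-mounted into the container as /llm_downloads/...
model_dir = Path(
    "/home/ubuntu/efs/llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True"
)

# AutoGPTQ-style checkpoints keep the quantization settings next to the weights;
# "bits" and "group_size" are the keys used by the quantize_config.json format.
cfg = json.loads((model_dir / "quantize_config.json").read_text())
print(cfg["bits"], cfg["group_size"])  # expected: 4 128
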
hmcp22 changed the title from "Error loading Llama-2-70b gptq weigths from local directory" to "Error loading Llama-2-70b gptq weights from local directory" on Jul 28, 2023

yadamonk commented Jul 28, 2023

@hmcp22 Does the model work as expected with your fix? I was told that the 70b GPTQ model can't be sharded on 4 GPUs and somewhere else I read that it produces gibberish on 4 A10Gs.


hmcp22 commented Jul 28, 2023

Yep, working as expected and getting coherent outputs


152334H commented Jul 30, 2023

You can just replace _get_gptq_params() with:

def _get_gptq_params(self) -> Tuple[int, int]:
    return self.gptq_bits, self.gptq_groupsize

because these attributes are set in advance by _set_gptq_params. Then fix _set_gptq_params to read from the local directory:

    def _set_gptq_params(self, model_id):
        # quantize_config.json may sit in a local directory or on the Hub.
        p = Path(model_id) / "quantize_config.json"
        if p.exists():
            data = json.loads(p.read_text())
        else:
            filename = hf_hub_download(model_id, filename="quantize_config.json")
            with open(filename, "r") as f:
                data = json.load(f)
        self.gptq_bits = data["bits"]
        self.gptq_groupsize = data["group_size"]

because hf_hub_download doesn't work with local model IDs.
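
A quick demonstration of that limitation (a sketch, assuming a recent huggingface_hub where invalid repo IDs raise HFValidationError):

from huggingface_hub import hf_hub_download
from huggingface_hub.utils import HFValidationError

# A Hub repo ID resolves normally.
print(hf_hub_download("TheBloke/Llama-2-70B-chat-GPTQ", filename="quantize_config.json"))

# A local path is not a valid repo ID, so validation fails before any network call.
try:
    hf_hub_download(
        "/llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True",
        filename="quantize_config.json",
    )
except HFValidationError as err:
    print("rejected:", err)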


Narsil commented Jul 31, 2023

This should have fixed it:
#738

Can you confirm? (--pull latest if you're using docker).


hmcp22 commented Aug 1, 2023

> This should have fixed it: #738
>
> Can you confirm? (--pull latest if you're using docker).

Can confirm this is working with the latest docker image now


zhaohb commented Aug 2, 2023

@Narsil @hmcp22 I tested the latest image, but I get the same error.


Narsil commented Aug 3, 2023

Does your local model have quantize_config.json in the directory?

Currently TGI expects either:

  • gptq_bits tensors within the model weights, or
  • a quantize_config.json file (see the sketch below).
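
For checkpoints that ship neither, a minimal sketch (not an official TGI tool; it assumes a single-file model.safetensors, filename hypothetical) of writing the scalar tensors TGI looks for:

import torch
from safetensors.torch import load_file, save_file

# Hypothetical single-shard checkpoint; a sharded model would need each file patched.
path = "model.safetensors"
tensors = load_file(path)

# _get_gptq_params tries these scalar tensors first (see the traceback above).
tensors["gptq_bits"] = torch.tensor(4)
tensors["gptq_groupsize"] = torch.tensor(128)
save_file(tensors, path)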


zhaohb commented Aug 3, 2023

@Narsil Yes, this is my issue: #766


Narsil commented Aug 3, 2023

OK, I will close this and we can move the discussion over to #766.

Narsil closed this as completed on Aug 3, 2023.

yunll commented Oct 8, 2023

> Does your local model have quantize_config.json in the directory?
>
> Currently TGI expects either:
>
>   • gptq_bits tensors within the model weights, or
>   • a quantize_config.json file.

How do I get quantize_config.json?
