[FEAT]: EETQ quantizer support (#30262)

* [FEAT]: EETQ quantizer support
* Update quantization.md
* Update docs/source/en/main_classes/quantization.md
  Co-authored-by: Marc Sun <[email protected]>
* Update docs/source/en/quantization.md
  Co-authored-by: Marc Sun <[email protected]>
* Update docs/source/en/quantization.md
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/integrations/__init__.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/integrations/__init__.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/integrations/eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/integrations/eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/integrations/eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* Update tests/quantization/eetq_integration/test_eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/quantizers/auto.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/quantizers/auto.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/quantizers/auto.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/quantizers/quantizer_eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* Update tests/quantization/eetq_integration/test_eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* Update src/transformers/quantizers/quantizer_eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* Update tests/quantization/eetq_integration/test_eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* Update tests/quantization/eetq_integration/test_eetq.py
  Co-authored-by: Marc Sun <[email protected]>
* [FEAT]: EETQ quantizer support
* [FEAT]: EETQ quantizer support
* remove whitespaces
* update quantization.md
* style
* Update docs/source/en/quantization.md
  Co-authored-by: Younes Belkada <[email protected]>
* add copyright
* Update quantization.md
* Update docs/source/en/quantization.md
  Co-authored-by: amyeroberts <[email protected]>
* Update docs/source/en/quantization.md
  Co-authored-by: amyeroberts <[email protected]>
* Address the comments by amyeroberts
* style

---------

Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
huggingface · May 14, 2024 · 8ac632d
1 parent 951bc7b
commit 8ac632d

Showing 14 changed files with 570 additions and 2 deletions.
diff --git a/docker/transformers-quantization-latest-gpu/Dockerfile b/docker/transformers-quantization-latest-gpu/Dockerfile
@@ -52,6 +52,9 @@ RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoA
 # Add quanto for quantization testing
 RUN python3 -m pip install --no-cache-dir quanto
 
+# Add eetq for quantization testing
+RUN python3 -m pip install git+https://github.com/NetEase-FuXi/EETQ.git
+
 # When installing in editable mode, `transformers` is not recognized as a package.
 # this line must be added in order for python to be aware of transformers.
 RUN cd transformers && python3 setup.py develop
diff --git a/docs/source/en/main_classes/quantization.md b/docs/source/en/main_classes/quantization.md
@@ -38,6 +38,9 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
 
 [[autodoc]] AwqConfig
 
+## EetqConfig
+[[autodoc]] EetqConfig
+
 ## GPTQConfig
 
 [[autodoc]] GPTQConfig

diff --git a/docs/source/en/quantization.md b/docs/source/en/quantization.md
@@ -642,6 +642,37 @@ double_quant_config = BitsAndBytesConfig(
 model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
 ```
 
+## EETQ
+The [EETQ](https://github.com/NetEase-FuXi/EETQ) library supports int8 per-channel weight-only quantization for NVIDIA GPUs. Its high-performance GEMM and GEMV kernels come from FasterTransformer and TensorRT-LLM. It requires no calibration dataset, and you do not need to pre-quantize your model. Moreover, the accuracy degradation is negligible owing to the per-channel quantization.
+
+Make sure you have eetq installed from the [release page](https://github.com/NetEase-FuXi/EETQ/releases):
+```
+pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl
+```
+or build it from the source code at https://github.com/NetEase-FuXi/EETQ. EETQ requires a GPU with CUDA compute capability >= 7.0 and <= 8.9.
+```
+git clone https://github.com/NetEase-FuXi/EETQ.git
+cd EETQ/
+git submodule update --init --recursive
+pip install .
+```
+
+An unquantized model can be quantized on the fly with `from_pretrained`.
+```py
+from transformers import AutoModelForCausalLM, EetqConfig
+path = "/path/to/model"
+quantization_config = EetqConfig("int8")
+model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)
+```
+
+A quantized model can be saved with `save_pretrained` and loaded again with `from_pretrained`.
+
+```py
+quant_path = "/path/to/save/quantized/model"
+model.save_pretrained(quant_path)
+model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
+```
+
 ## Optimum
 
 The [Optimum](https://huggingface.co/docs/optimum/index) library supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. Consider using Optimum for quantization if you're using specific and optimized hardware like Intel CPUs, Furiosa NPUs or a model accelerator like ONNX Runtime.
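
Review note on the EETQ section added in this hunk: the docs say the method is "int8 per-channel weight-only" and needs no calibration data. The sketch below illustrates that idea in plain Python, assuming symmetric quantization with one scale per output channel (row). It is not EETQ's implementation (the real work happens in CUDA kernels), and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_per_channel(weight: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Map the largest magnitude in the row to 127; guard against all-zero rows.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float weights from int8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Two output channels with very different magnitudes.
weight = [[0.5, -1.0, 0.25], [10.0, 5.0, -2.5]]
q, scales = quantize_per_channel(weight)
restored = dequantize_per_channel(q, scales)
max_error = max(
    abs(a - b) for row_a, row_b in zip(weight, restored) for a, b in zip(row_a, row_b)
)
```

Because each row has its own scale, the small-magnitude first row is not squeezed into a few integer levels by the large values in the second row. Only the weights themselves are needed to compute the scales, which is why no calibration dataset is involved in this scheme.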

diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
@@ -1126,7 +1126,14 @@
         "is_vision_available",
         "logging",
     ],
-    "utils.quantization_config": ["AqlmConfig", "AwqConfig", "BitsAndBytesConfig", "GPTQConfig", "QuantoConfig"],
+    "utils.quantization_config": [
+        "AqlmConfig",
+        "AwqConfig",
+        "BitsAndBytesConfig",
+        "EetqConfig",
+        "GPTQConfig",
+        "QuantoConfig",
+    ],
 }
 
 # sentencepiece-backed objects
@@ -6071,7 +6078,14 @@
     )
 
     # bitsandbytes config
-    from .utils.quantization_config import AqlmConfig, AwqConfig, BitsAndBytesConfig, GPTQConfig, QuantoConfig
+    from .utils.quantization_config import (
+        AqlmConfig,
+        AwqConfig,
+        BitsAndBytesConfig,
+        EetqConfig,
+        GPTQConfig,
+        QuantoConfig,
+    )
 
     try:
         if not is_sentencepiece_available():

diff --git a/src/transformers/integrations/__init__.py b/src/transformers/integrations/__init__.py
@@ -42,6 +42,7 @@
         "set_hf_deepspeed_config",
         "unset_hf_deepspeed_config",
     ],
+    "eetq": ["replace_with_eetq_linear"],
     "integration_utils": [
         "INTEGRATION_TO_CALLBACK",
         "AzureMLCallback",
@@ -111,6 +112,7 @@
         set_hf_deepspeed_config,
         unset_hf_deepspeed_config,
     )
+    from .eetq import replace_with_eetq_linear
     from .integration_utils import (
         INTEGRATION_TO_CALLBACK,
         AzureMLCallback,

diff --git a/src/transformers/integrations/eetq.py b/src/transformers/integrations/eetq.py
@@ -0,0 +1,121 @@
+# coding=utf-8
+# Copyright 2024 NetEase, Inc. and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from ..utils import is_accelerate_available, is_eetq_available, logging
+
+
+if is_eetq_available():
+    import eetq
+    import torch.nn as nn
+
+if is_accelerate_available():
+    from accelerate import init_empty_weights
+
+logger = logging.get_logger(__name__)
+
+
+def _replace_with_eetq_linear(
+    model,
+    modules_to_not_convert=None,
+    current_key_name=None,
+    quantization_config=None,
+    has_been_replaced=False,
+    pre_quantized=False,
+):
+    """
+    Private method that wraps the recursion for module replacement.
+
+    Returns the converted model and a boolean that indicates if the conversion has been successful or not.
+    """
+    if current_key_name is None:
+        current_key_name = []
+
+    for name, module in model.named_children():
+        current_key_name.append(name)
+
+        if (isinstance(module, nn.Linear)) and name not in modules_to_not_convert:
+            # Check if the current key is not in the `modules_to_not_convert`
+            current_key_name_str = ".".join(current_key_name)
+            if not any(
+                (key + "." in current_key_name_str) or (key == current_key_name_str) for key in modules_to_not_convert
+            ):
+                with init_empty_weights():
+                    in_features = module.in_features
+                    out_features = module.out_features
+                    model._modules[name] = eetq.EetqLinear(
+                        in_features, out_features, module.bias is not None, module.weight.device
+                    )
+                    if pre_quantized:
+                        model._modules[name].register_scale(module.weight.device)
+                    has_been_replaced = True
+
+                    # Force requires grad to False to avoid unexpected errors
+                    model._modules[name].requires_grad_(False)
+        if len(list(module.children())) > 0:
+            _, has_been_replaced = _replace_with_eetq_linear(
+                module,
+                modules_to_not_convert,
+                current_key_name,
+                quantization_config,
+                has_been_replaced=has_been_replaced,
+                pre_quantized=pre_quantized,
+            )
+        # Remove the last key for recursion
+        current_key_name.pop(-1)
+    return model, has_been_replaced
+
+
+def replace_with_eetq_linear(
+    model, modules_to_not_convert=None, current_key_name=None, quantization_config=None, pre_quantized=False
+):
+    """
+    A helper function to replace all `torch.nn.Linear` modules by `eetq.EetqLinear` modules from the `eetq`
+    library. This will enable running your models using high performance int8 weight-only gemm kernels from
+    FasterTransformer and TensorRT-LLM. Make sure `eetq`, compiled for the CUDA version of your hardware,
+    is installed before running this function. EETQ can be installed from source:
+    'https://github.com/NetEase-FuXi/EETQ'
+
+    The function will be run recursively and replace all `torch.nn.Linear` modules except for the `lm_head` that should
+    be kept as a `torch.nn.Linear` module. The replacement is done under `init_empty_weights` context manager so no
+    CPU/GPU memory is required to run this function. Each weight will be quantized along the channel.
+
+    Parameters:
+        model (`torch.nn.Module`):
+            Input model or `torch.nn.Module` as the function is run recursively.
+        modules_to_not_convert (`List[`str`]`, *optional*, defaults to `["lm_head"]`):
+            Names of the modules to not convert in `EetqLinear`. In practice we keep the `lm_head` in full precision
+            for numerical stability reasons.
+        current_key_name (`List[`str`]`, *optional*):
+            An array to track the current key of the recursion. This is used to check whether the current key (part of
+            it) is not in the list of modules to not convert (for instance, modules that are offloaded to `cpu` or
+            `disk`).
+    """
+
+    modules_to_not_convert = ["lm_head"] if modules_to_not_convert is None else modules_to_not_convert
+
+    if quantization_config.modules_to_not_convert is not None:
+        modules_to_not_convert.extend(quantization_config.modules_to_not_convert)
+    modules_to_not_convert = list(set(modules_to_not_convert))
+    model, has_been_replaced = _replace_with_eetq_linear(
+        model, modules_to_not_convert, current_key_name, quantization_config, pre_quantized=pre_quantized
+    )
+
+    if not has_been_replaced:
+        logger.warning(
+            "You are loading your model using eetq but no linear modules were found in your model."
+            " Please double check your model architecture, or submit an issue on github if you think this is"
+            " a bug."
+        )
+
+    return model
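
Review note on `src/transformers/integrations/eetq.py`: the recursion above walks `named_children()`, tracks a dotted key path, and swaps every linear layer that is not on the skip list. A minimal, dependency-free sketch of that traversal follows. `Module`, `Linear`, `QuantLinear`, and `replace_linear` are toy stand-ins invented for illustration; they are not `torch.nn` or `eetq` classes.

```python
class Module:
    """Toy container that holds named child modules."""

    def __init__(self, **children):
        self.children = dict(children)


class Linear(Module):
    """Stand-in for a full-precision linear layer."""


class QuantLinear(Module):
    """Stand-in for the quantized replacement layer."""


def replace_linear(model, skip, prefix=()):
    """Replace every Linear not covered by `skip`; return True if any was replaced."""
    replaced = False
    for name, child in list(model.children.items()):
        key = ".".join(prefix + (name,))
        # Same matching rule as the diff: bare name, exact dotted key, or key prefix.
        skipped = name in skip or any(s == key or (s + ".") in key for s in skip)
        if isinstance(child, Linear) and not skipped:
            model.children[name] = QuantLinear()
            replaced = True
        elif child.children:
            # Recurse first so the subtree is always visited, then merge the flag.
            replaced = replace_linear(child, skip, prefix + (name,)) or replaced
    return replaced


model = Module(
    decoder=Module(attn=Module(q_proj=Linear(), k_proj=Linear()), mlp=Linear()),
    lm_head=Linear(),
)
found = replace_linear(model, skip=["lm_head"])
```

After the call, `lm_head` is still a `Linear` while the layers under `decoder` are `QuantLinear`. The returned flag plays the role of `has_been_replaced`, which the public wrapper uses to decide whether to log a warning.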
diff --git a/src/transformers/quantizers/auto.py b/src/transformers/quantizers/auto.py
@@ -19,6 +19,7 @@
     AqlmConfig,
     AwqConfig,
     BitsAndBytesConfig,
+    EetqConfig,
     GPTQConfig,
     QuantizationConfigMixin,
     QuantizationMethod,
@@ -28,6 +29,7 @@
 from .quantizer_awq import AwqQuantizer
 from .quantizer_bnb_4bit import Bnb4BitHfQuantizer
 from .quantizer_bnb_8bit import Bnb8BitHfQuantizer
+from .quantizer_eetq import EetqHfQuantizer
 from .quantizer_gptq import GptqHfQuantizer
 from .quantizer_quanto import QuantoHfQuantizer
 
@@ -39,12 +41,14 @@
     "gptq": GptqHfQuantizer,
     "aqlm": AqlmHfQuantizer,
     "quanto": QuantoHfQuantizer,
+    "eetq": EetqHfQuantizer,
 }
 
 AUTO_QUANTIZATION_CONFIG_MAPPING = {
     "awq": AwqConfig,
     "bitsandbytes_4bit": BitsAndBytesConfig,
     "bitsandbytes_8bit": BitsAndBytesConfig,
+    "eetq": EetqConfig,
     "gptq": GPTQConfig,
     "aqlm": AqlmConfig,
     "quanto": QuantoConfig,
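
Review note on `src/transformers/quantizers/auto.py`: the two dictionaries key both the config class and the quantizer class on the same method string, so registering `"eetq"` in each is what lets a serialized config be routed to the right quantizer. The sketch below shows that registry pattern in isolation. All names in it (`ToyEetqConfig`, `ToyEetqQuantizer`, `config_from_dict`, `quantizer_from_config`) are hypothetical, and the real dispatch code may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class ToyEetqConfig:
    """Hypothetical config; `quant_method` is the registry key."""

    weights: str = "int8"
    quant_method: str = "eetq"


class ToyEetqQuantizer:
    """Hypothetical quantizer that just keeps a reference to its config."""

    def __init__(self, config):
        self.config = config


# One entry per method in each mapping, keyed by the same string.
CONFIG_MAPPING = {"eetq": ToyEetqConfig}
QUANTIZER_MAPPING = {"eetq": ToyEetqQuantizer}


def config_from_dict(config_dict):
    """Pick the config class from the `quant_method` field of a plain dict."""
    method = config_dict.get("quant_method")
    if method not in CONFIG_MAPPING:
        raise ValueError(
            f"Unknown quantization method {method!r}; available: {sorted(CONFIG_MAPPING)}"
        )
    kwargs = {k: v for k, v in config_dict.items() if k != "quant_method"}
    return CONFIG_MAPPING[method](**kwargs)


def quantizer_from_config(config):
    """Pick the quantizer class that matches the config's method."""
    return QUANTIZER_MAPPING[config.quant_method](config)


config = config_from_dict({"quant_method": "eetq", "weights": "int8"})
quantizer = quantizer_from_config(config)
```

Keeping the method string inside the stored config is the design choice that lets a saved quantized checkpoint be reloaded without passing a quantization config again, as the docs example in this commit does.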