Add activation quantization support to per-channel quantized linear layers #105

Merged
merged 6 commits into AI-Hypercomputer:main from lsiyuan/act-quant on Jun 12, 2024

Conversation

@lsy323 (Collaborator) commented on May 25, 2024

Activation quantization is only supported with per-channel quantized models.

Enable activation quantization with per-channel quantization by passing the flag --quantize_activation=True.

The activation is quantized to int8, and then an int8 x int8 matmul is performed. We need to call lax.dot_general because torch matmul ops do not let us control the output dtype (it defaults to int8, which easily overflows). We use int32 as the accumulation dtype to avoid overflow.
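
For illustration, here is a minimal, hedged sketch of that int8 x int8 matmul with int32 accumulation using lax.dot_general and preferred_element_type. The function name, tensor shapes (activation [batch, in_features], weight [out_features, in_features]), and scale handling are assumptions for this example, not the PR's exact kernel:

import jax.numpy as jnp
from jax import lax

def int8_matmul_int32_acc(x_int8, w_int8, x_scale, w_scale):
  # x_int8: [batch, in_features] int8 activations, x_scale: [batch, 1] float
  # w_int8: [out_features, in_features] int8 weights, w_scale: [out_features] float
  # Contract the last dim of x with the last dim of w and accumulate in int32
  # so the int8 products cannot overflow.
  acc = lax.dot_general(
      x_int8,
      w_int8,
      dimension_numbers=(((1,), (1,)), ((), ())),
      preferred_element_type=jnp.int32,
  )  # [batch, out_features], int32
  # Dequantize: apply the per-example activation scale and the per-channel weight scale.
  return acc.astype(jnp.float32) * x_scale * w_scale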

Correctness is verified in unit tests and with the llama/gemma models. We currently get the same performance on the 7B int8 per-channel model at BS=32; an in-depth investigation is needed to understand the performance impact.

@lsy323 marked this pull request as draft May 25, 2024 02:47
@lsy323 marked this pull request as ready for review May 25, 2024 05:04
else:
out = torch.mul(F.linear(inputs, self.weight), self.weight_scaler)
result = torchjax.call_jax(

Collaborator commented:

Is it a bit confusing that when quantize_activation is not enabled, inputs and self.weight are torch tensors, but when it is enabled they are JAX arrays? At least we need more detailed comments here.

@lsy323 (Collaborator, author) commented on Jun 11, 2024:

Here we have to call JAX because we need to do dot(int8, int8) -> int32. This semantic cannot be represented in torch right now: in torch, the inferred output dtype of two int8 operands is int8, which causes the dot result to overflow. dot_general in JAX supports specifying the output dtype, hence we use it here.

Let me add a comment to make it clear
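
As a hedged illustration of the dtype point above (example only, not the PR's code): JAX lets the caller request a wider accumulation dtype for the dot, which is the property torch matmul does not currently expose.

import jax.numpy as jnp
from jax import lax

a = jnp.full((4, 8), 100, dtype=jnp.int8)
b = jnp.full((8, 16), 100, dtype=jnp.int8)

# Requesting int32 output lets the int8 x int8 dot accumulate safely; without it
# the result dtype follows the int8 operands and the values overflow.
out = lax.dot_general(
    a, b, dimension_numbers=(((1,), (0,)), ((), ())),
    preferred_element_type=jnp.int32)
assert out.dtype == jnp.int32  # 100 * 100 * 8 = 80000, which fits in int32 but not int8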

use_dot_general=False,
block_size=128,
n_bit=8,
quant_config=QuantizationConfig(),
):
super().__init__()
self.in_features = in_features
self.out_features = out_features

# Use dot_general instead of einsum.
# Using dot_general is slow for now.

Collaborator commented:

Known torch xla2 issue? Is there a bug tracker for this?

@lsy323 (Collaborator, author) commented:

This should be an XLA issue, I think; using dot_general and einsum should have the same semantics.
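
To make the "same semantics" point concrete, a small illustrative example (not from the PR): the einsum contraction and the equivalent dot_general return identical values, so any speed difference comes from how XLA lowers and fuses the two forms.

import jax.numpy as jnp
from jax import lax

x = jnp.arange(6.0).reshape(2, 3)
w = jnp.arange(12.0).reshape(3, 4)

# The same contraction expressed two ways; the results match element for element.
via_einsum = jnp.einsum("bi,io->bo", x, w)
via_dot_general = lax.dot_general(x, w, dimension_numbers=(((1,), (0,)), ((), ())))
assert jnp.allclose(via_einsum, via_dot_general)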

self.zero_point,
)
), "Blockwise quantized linear doesn't support zero_point in dot_general or einsum flattened implementation."
blockwise_matmul_kernel = (

Collaborator commented:

Nit: maybe the following is a little simpler:
blockwise_matmul_kernel = (
    blockwise_jax_kernel_dot_general
    if self.use_dot_general
    else blockwise_jax_kernel_einsum_flatten
    if self.flatten
    else blockwise_jax_kernel
)

@lsy323 (Collaborator, author) commented:

Thanks, this is cleaner; let me update.

return out


def blockwise_jax_kernel_dot_general(inputs, weight, weight_scaler, zero_point):

Collaborator commented:

Since torch xla2 has fixed the torch einsum lowering, do we still need this?

@lsy323 (Collaborator, author) commented:

No, we don't need to call JAX for it, thanks for the heads up. Since I'm just moving the existing kernel implementations into this new file here, I will switch to torch in a follow-up PR.
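
For context, a hedged sketch of what a torch-only blockwise kernel of this kind could look like once the einsum lowering is relied on. The layout here (weight [n_blocks, block_size, out_features], weight_scaler [n_blocks, out_features], symmetric quantization with no zero_point) is assumed for illustration and may not match the PR's actual shapes:

import torch

def blockwise_matmul_torch(inputs, weight, weight_scaler):
  # inputs:        [batch, in_features] float, with in_features = n_blocks * block_size
  # weight:        [n_blocks, block_size, out_features] int8
  # weight_scaler: [n_blocks, out_features] float, one scale per block and output channel
  n_blocks, block_size, _ = weight.shape
  x = inputs.reshape(inputs.shape[0], n_blocks, block_size)
  # Per-block partial matmuls, then rescale each block and sum over blocks.
  partial = torch.einsum("bnc,nco->bno", x, weight.to(inputs.dtype))
  return torch.einsum("bno,no->bo", partial, weight_scaler)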

)
w_dq = dequantize_tensor(w_q, scale, zp)
self._load_quantized_weights(w_q, scale, zp)

def forward(self, inputs):
if not self.run_fake_quantize:
if self.is_symmetric:
return torch.mul(F.linear(inputs, self.weight), self.weight_scaler)
if self.quantize_activation:

Collaborator commented:

Can we move this code to else?

@lsy323 (Collaborator, author) commented:

No, we cannot move this code into the else branch. It is an extra step for activation quantization, not an alternative path.



def blockwise_jax_kernel_einsum_flatten(
inputs, weight, weight_scaler, zero_point

Collaborator commented:

Do you handle the case where zero_point is not None?

Commits:
- add debug print to debug
- remove print, add bias to asym quant tests
- lint
@lsy323 merged commit 8a125b6 into AI-Hypercomputer:main on Jun 12, 2024
4 checks passed
@lsy323 deleted the lsiyuan/act-quant branch on June 12, 2024 22:29