
Refactor int4 and int8 weight only quantization to use quantize #301

Merged
merged 7 commits into pytorch:main from jerryzh168:int4-wo on Jun 4, 2024

Conversation

@jerryzh168 (Contributor) commented Jun 1, 2024

Summary:
This is similar to #294 but applied to int4 weight only quantization

Test Plan:

unit perf test:
python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int4_wo_quant_perf
elapsed time: 0.2166275215148926, ref elapsed time: 0.2191881561279297
elapsed time: 0.2376406478881836, ref elapsed time: 0.22721023559570314
elapsed time: 0.21919679641723633, ref elapsed time: 0.2154969596862793

integration perf test:

reference: elapsed_time: 2.5900126953125 milliseconds
after refactor: elapsed_time: 2.56680078125 milliseconds
diff: no diff

TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

Before:
After:
generated code diff:

Reviewers:

Subscribers:

Tasks:

Tags:

Refactor int8 weight only quant to use `quantize` (#299)

Summary:
Similar to #294, we replaced the implementation
of int8 weight only quant to use the newly added quantize function, as part of
the unification effort for affine quantization.

Test Plan:

unit perf test:
python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int8_wo_quant_perf
elapsed time: 0.23909856796264647, ref elapsed time: 0.25150911331176756
elapsed time: 0.24894208908081056, ref elapsed time: 0.2570047950744629
elapsed time: 0.21607391357421876, ref elapsed time: 0.22809568405151368

integration test:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

Reference: elapsed_time: 1.355208740234375 milliseconds
After refactor: elapsed_time: 1.32778857421875 milliseconds

code diff (gist): gist.github.com/jerryzh168/921a722cf20d476c8fc5888482e722dc
code diff (meta-only paste): internalfb.com/phabricator/paste/view/P1387333845

Replace implementation for int8 dynamic quantization with call to `quantize`

Summary:
Previously we added `quantize` as a general API (pytorch#256) for
the Affine Quantized tensor subclass, and for tensor subclass based dtype conversion in general.

The plan is to use this to replace existing quant APIs including int4 weight only, int8 weight only, int8 dynamic quant
and 8da4w (for executorch).

In this PR we start replacing the implementation of the int8 dynamic quant API with the `quantize` API and the affine quantized tensor
subclass. We'll make sure the performance does not regress for the ViT model.

Test Plan:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

reference: elapsed_time:  1.4821058654785155  milliseconds
after refactor: elapsed_time:  1.4804757690429688  milliseconds

generated code diff: https://gist.github.com/jerryzh168/90c71107a5aaaa5d8dd2170c573e076d

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
Similar to pytorch#294, we replaced the implementation
of int8 weight only quant to use the newly added `quantize` function, as part of
the unification effort for affine quantization.

Test Plan:
1. unit perf test:
python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int8_wo_quant_perf

elapsed time: 0.23909856796264647, ref elapsed time: 0.25150911331176756
elapsed time: 0.24894208908081056, ref elapsed time: 0.2570047950744629
elapsed time: 0.21607391357421876, ref elapsed time: 0.22809568405151368

2. integration test:

TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

Reference: elapsed_time:  1.355208740234375  milliseconds
After refactor: elapsed_time:  1.32778857421875  milliseconds

code diff (gist): https://gist.github.com/jerryzh168/921a722cf20d476c8fc5888482e722dc
code diff (meta-only paste): https://www.internalfb.com/phabricator/paste/view/P1387333845

Reviewers:

Subscribers:

Tasks:

Tags:
Replace implementation for int8 dynamic quantization with call to `quantize`

Summary:
Previously we added `quantize` as a general API (pytorch#256) for
the Affine Quantized tensor subclass, and for tensor subclass based dtype conversion in general.

The plan is to use this to replace existing quant APIs including int4 weight only, int8 weight only, int8 dynamic quant
and 8da4w (for executorch).

In this PR we start replacing the implementation of the int8 dynamic quant API with the `quantize` API and the affine quantized tensor
subclass. We'll make sure the performance does not regress for the ViT model.

Test Plan:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

reference: elapsed_time:  1.4821058654785155  milliseconds
after refactor: elapsed_time:  1.4804757690429688  milliseconds

generated code diff: https://gist.github.com/jerryzh168/90c71107a5aaaa5d8dd2170c573e076d

Reviewers:

Subscribers:

Tasks:

Tags:
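For orientation, here is a minimal, hedged sketch of how the unified `quantize` entry point described in the commit messages above is invoked in place of the per-dtype helpers. The call shape mirrors `quantize(model, get_apply_int4wo_quant(**kwargs), filter_fn)` from this PR's diff further down; the toy model, the `groupsize` value, and the Linear-only filter predicate are illustrative assumptions, not taken from the PR.

```python
# Sketch only: assumes the quantize(model, apply_fn, filter_fn) call shown in
# this PR's diff; groupsize=128 and the filter predicate are illustrative.
import torch
import torch.nn as nn

def linear_only(module: nn.Module, fqn: str) -> bool:
    # quantize only nn.Linear modules, identified by module type
    return isinstance(module, nn.Linear)

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

# old style (per-dtype helper being replaced):
#   change_linear_weights_to_int4_woqtensors(model, **kwargs)
# new style (unified API backed by AffineQuantizedTensor):
#   quantize(model, get_apply_int4wo_quant(groupsize=128), linear_only)
```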

pytorch-bot bot commented Jun 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/301

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 74ecb09 with merge base 729fa4d:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Jun 1, 2024
@jerryzh168 jerryzh168 changed the title Replace implementation for int8 dynamic quantization with call to `qu… Refactor int4 weight only quantization with call to quantize Jun 1, 2024
@jerryzh168 (Contributor Author)

this is rebased on the int8-wo PR (#299), so this PR will need to be updated after the int8-wo PR lands

@jerryzh168 jerryzh168 changed the title Refactor int4 weight only quantization with call to quantize Refactor int4 weight only quantization to use quantize Jun 1, 2024
@@ -930,6 +930,7 @@ def _test_lin_weight_subclass_impl(
)

@parameterized.expand(COMMON_DEVICE_DTYPE)
@unittest.skipIf(TORCH_VERSION_AFTER_2_4, "skip because there is some bug in inductor codegen")
Member

Is there an issue or a short description of both bugs we can add? Otherwise it will be hard to remember when to remove the skipIf.

Contributor Author (@jerryzh168, Jun 1, 2024)

it's just an inductor C++ compilation bug I think; I'm planning to open a PR after this. I have opened one for the other skip here: #300

test/quantization/test_quant_api.py (outdated comment, resolved)
test/quantization/test_quant_api.py (outdated comment, resolved)
return layout_cls
return decorator

def get_aqt_layout_cls(extended_layout: str) -> Callable:
def get_aqt_layout_cls_ctr(extended_layout: str) -> Callable:
Member

what does ctr stand for?

Contributor Author

this means constructor, since we are returning `class.from_plain` now

Member

This needs a comment; I don't believe `ctr` is a common abbreviation for constructor.
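To make the naming concrete, here is a minimal sketch based on the reply above: the `_ctr` suffix means the lookup now returns a constructor callable (`cls.from_plain`) rather than the layout class itself. The registry name and decorator body below are illustrative assumptions, not copied from the PR.

```python
# Illustrative sketch only: registry name and decorator are assumptions;
# the point is that get_aqt_layout_cls_ctr returns cls.from_plain (a constructor).
from typing import Callable, Dict, Type

_AQT_LAYOUT_REGISTRY: Dict[str, Type] = {}  # hypothetical registry

def register_aqt_layout_cls(extended_layout: str) -> Callable:
    def decorator(layout_cls):
        _AQT_LAYOUT_REGISTRY[extended_layout] = layout_cls
        return layout_cls
    return decorator

def get_aqt_layout_cls_ctr(extended_layout: str) -> Callable:
    # "ctr" = constructor: hand back the from_plain classmethod,
    # so callers invoke ctr(int_data, scale, zero_point) directly
    return _AQT_LAYOUT_REGISTRY[extended_layout].from_plain
```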

# int_data = int_data.view(shape)
# changed = self.from_plain(int_data, scale, zero)
# return changed
# TODO: changing shape is no-op for int4 packed weight right now
Member

Could you share some more detail on this? I'm quite curious.

Contributor Author

yeah, I'm confirming with @HDCharles right now; I think this is pretty weird, see the comments at L575 of aqt.py for more details


@classmethod
def from_plain(cls, int_data, scale, zero_point):
# TODO: expose the arg
Member

why not just do it now

Contributor Author

this one needs a bit more discussion with the PyTorch core team

if extended_layout == "tensor_core_tiled":
from torchao.quantization.utils import find_multiple
orig_out_features, orig_in_features = input_float.shape
in_features = find_multiple(orig_in_features, 1024)
Member

where do the constants for 1024 and 8 come from?

Contributor Author

this is specific to tinygemm kernels I think, copied from old code:

in_features = find_multiple(orig_in_features, 1024)
out_features = find_multiple(orig_out_features, 8)
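As a hedged illustration of those constants (an assumption about why they exist, based on the reply above, not a statement from the tinygemm source): the tensor-core-tiled int4 path expects padded shapes, so the float weight is rounded up before packing. `find_multiple` below is a simplified single-divisor version, and the example shapes are made up.

```python
# Sketch only: simplified find_multiple plus hypothetical padding of a weight so
# in_features is a multiple of 1024 and out_features a multiple of 8, as the
# tinygemm-oriented constants in the diff suggest.
import torch
import torch.nn.functional as F

def find_multiple(n: int, k: int) -> int:
    # round n up to the nearest multiple of k
    return n if n % k == 0 else n + k - (n % k)

orig_out_features, orig_in_features = 768, 768        # e.g. a ViT-B linear layer
in_features = find_multiple(orig_in_features, 1024)   # -> 1024
out_features = find_multiple(orig_out_features, 8)    # -> 768

w = torch.randn(orig_out_features, orig_in_features)
# pad the last dim up to in_features, then the rows up to out_features
w_padded = F.pad(w, (0, in_features - orig_in_features,
                     0, out_features - orig_out_features))
print(w_padded.shape)  # torch.Size([768, 1024])
```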

torchao/dtypes/aqt.py (outdated comment, resolved)
torchao.apply_dynamic_quant(model)
from torch._inductor import config as inductorconfig
inductorconfig.force_fuse_int_mm_with_mul = True
# int8 act, int8 weight dynamic quantization
Member

should we delete the code here instead of commenting it out?

Contributor Author

sure, this is just so people can easily try out different APIs, but we can also just ask people to copy-paste from the README
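For context, a short sketch of the kind of toggling the commented-out lines enable in the tutorial. The toy model is illustrative, and only API names that appear in this PR (`apply_dynamic_quant`, `change_linear_weights_to_int4_woqtensors`, `quantize` with `get_apply_int4wo_quant`) are referenced; this is not the tutorial file itself.

```python
# Sketch: keeping the alternative calls as comments lets a reader flip between
# quantization flavors in one place.
import torch
import torch.nn as nn
import torchao
from torch._inductor import config as inductorconfig

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).eval()

# int8 act, int8 weight dynamic quantization (the path kept active in the diff)
torchao.apply_dynamic_quant(model)
inductorconfig.force_fuse_int_mm_with_mul = True

# alternatives a reader could swap in instead (names from this PR; kwargs omitted):
# change_linear_weights_to_int4_woqtensors(model)
# quantize(model, get_apply_int4wo_quant(...), filter_fn)
```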

# groupwise int4 quantization
groupsize = weight_qtensor.block_size[-1]
if not _from_flinear:
weight_qtensor = weight_qtensor.t()
Member

n00b q: why does this require a transpose?

Contributor Author (@jerryzh168, Jun 1, 2024)

this is to align the dimensions for block_size so that we can get groupsize from the block_size argument, see L662, and it is also related to L575. Right now _quantized_linear does not have a well-defined accepted weight shape; we need to fix that.
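A small illustration of the relationship being described; the shapes and groupsize value are assumptions made up for the example, and only the use of `block_size[-1]` as the groupsize comes from the diff.

```python
# Sketch: for groupwise int4 quantization over a weight of shape
# (out_features, in_features), block_size is expected to look like
# (1, groupsize), so groupsize can be read off the last dimension.
import torch

out_features, in_features, groupsize = 8, 128, 32
weight = torch.randn(out_features, in_features)
block_size = (1, groupsize)                    # one scale/zero_point per 1 x 32 block
assert block_size[-1] == groupsize             # matches block_size[-1] in the diff
num_groups_per_row = in_features // groupsize  # 4 groups per output row
```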

@@ -507,7 +571,13 @@ def __torch_dispatch__(cls, func, types, args, kwargs):
f"AffineQuantizedTensor dispatch: attempting to run {func}, this is not supported"
)

def _quantized_linear_op(input_tensor, weight_qtensor, bias):
def _quantized_linear_op(input_tensor, weight_qtensor, bias, _from_flinear=True):
# TODO: the old tensor subclass can use the single implementation for both F.linear dispatch
Contributor Author

@msaroufim see this comment for more details

return layout_cls
return decorator

def get_aqt_layout_cls(extended_layout: str) -> Callable:
def get_aqt_layout_cls_ctr(extended_layout: str) -> Callable:
Member

This needs a comment; I don't believe `ctr` is a common abbreviation for constructor.

filter_fn,
)
if TORCH_VERSION_AFTER_2_4:
quantize(model, get_apply_int4wo_quant(**kwargs), filter_fn)
Contributor

Blind kwargs make it impossible to document the behavior. I understand that change_linear_weights_to_int4_woqtensors has this as well. Seems like something that could be worth fixing.

Contributor Author

yeah sure
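A hedged sketch of the change the reviewer is asking for; the parameter names (`groupsize`, `inner_k_tiles`) and their defaults are illustrative assumptions, not taken from this PR.

```python
# Sketch only: spell out the arguments instead of forwarding **kwargs blindly,
# so the behavior is documentable; names and defaults here are hypothetical.
from typing import Callable
import torch

def get_apply_int4wo_quant(groupsize: int = 128, inner_k_tiles: int = 8) -> Callable:
    def apply_int4wo_quant(weight: torch.Tensor):
        # construct and return the int4 AffineQuantizedTensor for this weight
        ...
    return apply_int4wo_quant

# explicit call site instead of get_apply_int4wo_quant(**kwargs):
# quantize(model, get_apply_int4wo_quant(groupsize=128, inner_k_tiles=8), filter_fn)
```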

@@ -55,3 +58,10 @@ def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
)
measurement = t0.blocked_autorange()
return measurement.mean * 1e6


def find_multiple(n: int, *args: Tuple[int]) -> int:
Contributor

why?

Contributor Author

we now use this in both torchao/dtypes and torchao/quantization and would have to do import tricks to avoid a circular dependency
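For illustration, a hedged sketch of the helper and the import structure the move enables. The rounding logic is a reconstruction from the `find_multiple(n, *args)` signature in the diff, and the target module paths in the comments are assumptions.

```python
# Sketch: a shared helper that imports nothing from torchao/dtypes or
# torchao/quantization, so both packages can depend on it without a cycle.
import math
from typing import Tuple

def find_multiple(n: int, *args: Tuple[int]) -> int:
    # round n up to the nearest common multiple of the given divisors
    k = math.lcm(*args)
    return n if n % k == 0 else n + k - (n % k)

# then, hypothetically:
#   torchao/dtypes/aqt.py:             from torchao.utils import find_multiple
#   torchao/quantization/quant_api.py: from torchao.utils import find_multiple

assert find_multiple(768, 1024) == 1024
assert find_multiple(768, 8) == 768
```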

Summary:
This is similar to pytorch#294 but applied to int4 weight only quantization

Test Plan:

unit perf test:
python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int4_wo_quant_perf
elapsed time: 0.2166275215148926, ref elapsed time: 0.2191881561279297
elapsed time: 0.2376406478881836, ref elapsed time: 0.22721023559570314
elapsed time: 0.21919679641723633, ref elapsed time: 0.2154969596862793

integration perf test:

reference: elapsed_time:  2.5900126953125  milliseconds
after refactor: elapsed_time:  2.56680078125  milliseconds
diff: no diff

TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

Before:
After:
generated code diff:

Reviewers:

Subscribers:

Tasks:

Tags:
@jerryzh168 jerryzh168 changed the title Refactor int4 weight only quantization to use quantize Refactor int4 and int8 weight only quantization to use quantize Jun 4, 2024
@jerryzh168 jerryzh168 merged commit 338d87c into pytorch:main Jun 4, 2024
12 of 13 checks passed
@jerryzh168 jerryzh168 deleted the int4-wo branch June 4, 2024 17:35
@cpuhrsch (Contributor) commented Jun 4, 2024

Please don't merge PRs when CI is red and we can't get signal for incremental changes. Fix main CI first, then merge.

@jerryzh168 (Contributor Author)

Please don't merge PRs when CI is red and we can't get signal for incremental changes. Fix main CI first, then merge.

makes sense, sorry about this, will do next time

dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
Refactor int4 and int8 weight only quantization to use quantize (pytorch#301)

* Replace implementation for int8 dynamic quantization with call to `quantize`

Summary:
Previously we added `quantize` as a general API (pytorch#256) for
the Affine Quantized tensor subclass, and for tensor subclass based dtype conversion in general.

The plan is to use this to replace existing quant APIs including int4 weight only, int8 weight only, int8 dynamic quant
and 8da4w (for executorch).

In this PR we start replacing the implementation of the int8 dynamic quant API with the `quantize` API and the affine quantized tensor
subclass. We'll make sure the performance does not regress for the ViT model.

Test Plan:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

reference: elapsed_time:  1.4821058654785155  milliseconds
after refactor: elapsed_time:  1.4804757690429688  milliseconds

generated code diff: https://gist.github.com/jerryzh168/90c71107a5aaaa5d8dd2170c573e076d

Reviewers:

Subscribers:

Tasks:

Tags:

* Refactor int8 weight only quant to use `quantize`

Summary:
Similar to pytorch#294, we replaced the implementation
of int8 weight only quant to use the newly added `quantize` function, as part of
the unification effort for affine quantization.

Test Plan:
1. unit perf test:
python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int8_wo_quant_perf

elapsed time: 0.23909856796264647, ref elapsed time: 0.25150911331176756
elapsed time: 0.24894208908081056, ref elapsed time: 0.2570047950744629
elapsed time: 0.21607391357421876, ref elapsed time: 0.22809568405151368

2. integration test:

TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

Reference: elapsed_time:  1.355208740234375  milliseconds
After refactor: elapsed_time:  1.32778857421875  milliseconds

code diff (gist): https://gist.github.com/jerryzh168/921a722cf20d476c8fc5888482e722dc
code diff (meta-only paste): https://www.internalfb.com/phabricator/paste/view/P1387333845

Reviewers:

Subscribers:

Tasks:

Tags:

* Replace implementation for int8 dynamic quantization with call to `quantize`

Summary:
Previously we added `quantize` as a general API (pytorch#256) for
the Affine Quantized tensor subclass, and for tensor subclass based dtype conversion in general.

The plan is to use this to replace existing quant APIs including int4 weight only, int8 weight only, int8 dynamic quant
and 8da4w (for executorch).

In this PR we start replacing the implementation of the int8 dynamic quant API with the `quantize` API and the affine quantized tensor
subclass. We'll make sure the performance does not regress for the ViT model.

Test Plan:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

reference: elapsed_time:  1.4821058654785155  milliseconds
after refactor: elapsed_time:  1.4804757690429688  milliseconds

generated code diff: https://gist.github.com/jerryzh168/90c71107a5aaaa5d8dd2170c573e076d

Reviewers:

Subscribers:

Tasks:

Tags:

* Refactor int4 weight only quantization with call to `quantize`

Summary:
This is similar to pytorch#294 but applied to int4 weight only quantization

Test Plan:

unit perf test:
python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int4_wo_quant_perf
elapsed time: 0.2166275215148926, ref elapsed time: 0.2191881561279297
elapsed time: 0.2376406478881836, ref elapsed time: 0.22721023559570314
elapsed time: 0.21919679641723633, ref elapsed time: 0.2154969596862793

integration perf test:

reference: elapsed_time:  2.5900126953125  milliseconds
after refactor: elapsed_time:  2.56680078125  milliseconds
diff: no diff

TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

Before:
After:
generated code diff:

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: Mark Saroufim <[email protected]>