
Fix TinyGemmQBitsTensor move #246

Merged (5 commits into main on Jul 18, 2024)
Conversation

@dacorvo (Collaborator) commented on Jul 18, 2024

What does this PR do?

This fixes a few issues on CUDA devices with arch <= sm80: a TinyGemmQBitsTensor could be moved to the CUDA device without going through QBitsTensor.create.
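For background, here is a minimal sketch of the pattern involved (the class and helper names are hypothetical and this is not the library's actual implementation): a packed quantized tensor subclass intercepts `aten._to_copy` and `aten.detach` in `__torch_dispatch__`, so a device move is rebuilt through its `create` factory instead of being copied field by field.

```python
import torch


class QuantizedStub(torch.Tensor):
    """Hypothetical stand-in for a packed quantized tensor such as
    TinyGemmQBitsTensor (NOT the library's implementation)."""

    @staticmethod
    def __new__(cls, data, requires_grad=False):
        return torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=data.dtype, device=data.device,
            requires_grad=requires_grad,
        )

    def __init__(self, data, requires_grad=False):
        self._data = data

    @classmethod
    def create(cls, data, device):
        # Factory: a real implementation would check the CUDA arch here and
        # (re)pack the inner tensors for the target device.
        return cls(data.to(device))

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.ops.aten.detach.default:
            (src,) = args
            return cls(src._data.detach())
        if func is torch.ops.aten._to_copy.default:
            src = args[0]
            device = kwargs.get("device", src._data.device)
            # Route the device move through the factory instead of blindly
            # copying the inner tensors.
            return cls.create(src._data, device)
        raise NotImplementedError(f"{func} is not supported by QuantizedStub")


# Usage: q = QuantizedStub(torch.randn(4, 4)); q.to("cuda") goes through create.
```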

A small change is also included to remove a dual dispatch for AWQ gemm. This slightly improves decode latency for LLMs (but has little impact on end-to-end latency, which is still dominated by prefill latency).
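To make the prefill/decode split concrete, here is a rough, hypothetical timing harness (not part of the PR; the model call convention is assumed to be the usual HuggingFace causal-LM interface and the model is assumed to live on a CUDA device): prefill is the single forward pass over the whole prompt, decode is the per-token loop that follows.

```python
import time

import torch


@torch.no_grad()
def measure_prefill_and_decode(model, input_ids, max_new_tokens=128):
    # Prefill: one forward pass over the full prompt, building the KV cache.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model(input_ids, use_cache=True)
    torch.cuda.synchronize()
    prefill_s = time.perf_counter() - start

    # Decode: one token per forward pass, reusing the cache (greedy sampling).
    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(dim=-1)
    start = time.perf_counter()
    for _ in range(max_new_tokens - 1):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
    torch.cuda.synchronize()
    decode_s = time.perf_counter() - start
    return prefill_s, decode_s
```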

- This will allow detaching a TinyGemmQBitsTensor, which has different inner tensors (scale and shift are combined).
- Since QBitsTensor ops are now all compatible with TinyGemmQBitsTensor, the specific dispatch can be removed.
- The QBitsTensor.create factory method checks for the CUDA version, but unit tests that bypass that method must perform the check themselves (see the sketch after this list).
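As an illustration of that last point, here is a hedged sketch of the kind of guard such a test could use (the helper, marker name, and exact threshold are assumptions rather than the repository's actual test code; only `torch.cuda.get_device_capability` and `pytest.mark.skipif` are standard APIs):

```python
import pytest
import torch


# Hypothetical guard: QBitsTensor.create normally rejects unsupported setups,
# so tests that build a TinyGemmQBitsTensor directly must perform an
# equivalent check themselves. The (8, 0) threshold is only an example taken
# from the "sm80" mention in the PR description.
def cuda_capability_at_least(major: int, minor: int = 0) -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (major, minor)


requires_recent_cuda = pytest.mark.skipif(
    not cuda_capability_at_least(8, 0),
    reason="test bypasses QBitsTensor.create and must check the CUDA arch itself",
)


@requires_recent_cuda
def test_tinygemm_qbits_tensor_move():
    # Placeholder body: the real test would construct a TinyGemmQBitsTensor
    # directly and move it between devices.
    ...
```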
@dacorvo requested a review from fxmarty on Jul 18, 2024 at 09:42
@dacorvo merged commit 9241b96 into main on Jul 18, 2024
12 checks passed
@dacorvo deleted the fix_tinygemm_move branch on Jul 18, 2024 at 11:17