F16 x S8/S4 GEMM #911

There is a reordering of the columns so that we can load contiguous 128 byte cache lines. The GEMM has a K tile size of 64 and the B matrix is int8 and column major. The K-tile means we would normally be reading 64 bytes from a cache line during each load. As a result, I put [64 elements from col1, 64 elements from col2] in 1 cache line to improve the memory access pattern since we will now use all the data in the 128 cache line when it is fetched.

The "interleaving", by this I mean permutation of the rows, was done so that the int8 data ends up in the pattern we need from HMMA after we do LDSM. If we did not interleave the weights this way, we would have to have some communication among the threads to ensure each thread had the "correct" data from the weight matrix before dequantizing and issuing HMMA.

Lastly, there is some permutation within the registers - for int8 is not not needed but it is done so int8 and int4 have the same layout. It is needed for int4 to reduce shift instructions during the conversions. There are also some other transforms done to accelerate the int to float type conversion as @hwu36 pointed out.

The layout is "new" and is different from existing layouts in CUTLASS / cuBLAS since it needed to accomplish different goals.

Hope this clears some things up!

MARD1NO Apr 20, 2023

Thanks! It really helps me a lot to understand this layout.

hwu36 · 2023-04-13T13:35:10Z

hwu36
Apr 13, 2023
Maintainer Author

I think reordering is used to improve the performance of type conversion https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions/interleaved_numeric_conversion.h#L54

@byshiue

0 replies

masahi · 2023-06-02T19:24:43Z

masahi
Jun 2, 2023

Is this kernel supposed to work on CUDA 12.1? I'm getting strange results where all outputs are zero, even though I confirmed that the kernel is running and inputs are sane.

On the container nvcr.io/nvidia/pytorch:22.09-py3 with CUDA 11.8 that FT supports, I verified that it works correctly.

12 replies

rhenry-nv Jun 2, 2023

you can check compile commands with verbose make to be sure

masahi Jun 2, 2023

Apparently __CUDA_ARCH__ is not defined. Probably something is off in my cmake.

masahi Jun 2, 2023

Updated my cmake masahi/cutlass_fpA_intB_gemm@bb60c79 closely following the one in FT and now I'm getting correct results from FT. Thank you for your help!

masahi Jun 2, 2023

I've also got vicuna 7b from https://github.com/mlc-ai/mlc-llm running with the FT kernel and I'm getting good-looking tokens generated. Currently I'm not fusing bias or activation but it will be done next week.

I'll continue cleaning up and adding more features to https://github.com/masahi/cutlass_fpA_intB_gemm (support residual fusion, more quantization scheme etc). More models like RedPajama will be supported as well.

rhenry-nv Jun 2, 2023

Sounds great! Feel free to reach out if you have any more questions.

masahi · 2023-06-09T00:55:55Z

masahi
Jun 9, 2023

Just found that the FT kernel always uses fp32 accumulation https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions/gemm/kernel/default_fpA_intB_traits.h#L110

Is there a danger in enabling fp16 accum, at least optionally? mlc-llm uses fp16 accum exclusively for vicuna / llama, for example. But for Redpajama it selectively uses fp32 accum.

5 replies

rhenry-nv Jun 9, 2023

No danger in having fp16 accumulators [unless your application overflows]. You can add a trait for FP16 accumulators if it helps your application.

masahi Jun 9, 2023

I tried fp16 accum but the perf didn't change at all. Does that sound weird? I can take a look deeper next week.

The workload is a decoder network (vicuna 7B). So mostly GEMV. Maybe things are so memory bound that the accum dtype in compute doesn't matter?

hwu36 Jun 9, 2023
Maintainer Author

accumulation type does not impact mma throughput.

masahi Jun 9, 2023

oh? Then I've been confused about something very basic...

When people talk about FP16 tensorcore peak flops with fp32 accumulation like below, do they also imply that the output type is fp32? And ElementOutput in cutlass does impact mma throughput?

hwu36 Jun 10, 2023
Maintainer Author

let me rephrase. accumulation type has nothing to do with throughput on a100. on other cards such as geforce cards, they may do have 2x difference.

sorry, i use a100 every day. when you say ampere, it means a100 to me. :)

masahi · 2023-06-14T23:44:04Z

masahi
Jun 14, 2023

I've extracted the FT kernel into https://github.com/tlc-pack/cutlass_fpA_intB_gemm to make it easier to integrate into third-party projects. And I've already made the first improvement over the original implementation: Add support for residual block fusion tlc-pack/cutlass_fpA_intB_gemm#1.

Things are not documented at all and there is no test either, but nonetheless I hope it would be useful for others as well.

1 reply

hwu36 Jun 15, 2023
Maintainer Author

feel free to upstream it to cutlass as an example just like what you did for wgrad and transposed conv.

masahi · 2023-06-15T19:46:35Z

masahi
Jun 15, 2023

It's being integrated into TVM by apache/tvm#15111

0 replies

manishucsd · 2023-08-08T23:09:40Z

manishucsd
Aug 8, 2023

Hi, All, I am working on making changes to upstream mixed-input support into upstream NVIDIA/CUTLASS. Please review some drawings below on how I am planning to choreograph the mainloop with mixed input datatype. It is slightly different from approach discussed here.

Notably, we would like to maintain canonical layout in the global memory for TN mixed input (F16 * S8) GEMM and use __shfl_sync() after conversion to obtain the data in desired layout for mma.sync.16816.f32.f16.f16.f32. I plan to use the conversion tricks discussed here :).

Consider a warp-level test that I am currently fleshing out using the figure in mind. Let me know if you see an issue in the above approach?

TEST(SM80_warp_gemm_tensor_op_mixed_input_crosswise_f16_i8, 64x64x64_64x64x64_16x8x16) {
  using Shape = cutlass::gemm::GemmShape<64, 64, 64>;
  using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>;
  using ElementA = cutlass::half_t;
  using ElementB = int8_t;
  using ElementC = float;
  using LayoutA = cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<
      cutlass::sizeof_bits<ElementA>::value, 64>;
  using LayoutB = cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<
      cutlass::sizeof_bits<ElementB>::value, 64>;

  using MmaTensorOp = typename cutlass::gemm::warp::DefaultMmaTensorOp<
      Shape, InstructionShape, ElementA, LayoutA, ElementB, LayoutB, ElementC,
      cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAddMixedInput>::Type;

  test::gemm::warp::TransformTestbed<MmaTensorOp,
                            cutlass::gemm::GemmShape<64, 64, 64> >()
      .run(cutlass::Distribution::Identity, cutlass::Distribution::Sequential);
}

cc: @IonThruster , @hwu36 , @thakkarV , @kerrmudgeon

11 replies

rhenry-nv Aug 10, 2023

Let us see how much of the performance we will lose because of loading partial cachelines. Do you have an estimate?

I saw ~10% on some problem sizes where the int8 B matrix was much larger than the activations. This is what motivated the reordering of B to allow the use of 128B cache lines.

Data is already in canonical format, we will need epilogue of the previous operation to write it out in the interleaved layout, if we want to use that for our models.

Does that mean you will want to use the I2F trick for unsigned int8s?

Another performance question, can I run the current F16 * S8 stand alone and measure performance on different GEMM shapes? If yes, can you provide the steps to do it?

@masahi 's fork might be the easiest way to use those GEMMs as a stand-alone at the moment.

manishucsd Aug 12, 2023

I see the following performance with Threadblock::kK = 128 and 64 for s32 <= s8 * s8 + s32 on A100.

./tools/profiler/cutlass_profiler --m=3456 --n=4096 --k=2048 --output=s8_gemm.csv
=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i16832gemm_s8_128x128_64x5_tn_align16

            Math: 450431 GFLOP/s

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i16832gemm_s8_128x128_128x4_tn_align16

            Math: 395882 GFLOP/s

I saw ~10% on some problem sizes where the int8 B matrix was much larger than the activations.

Do you have the problem shapes where you saw that perf difference with 128 vs. 64 elements in tile-K?

@masahi do you have a the changes/pointers ready on profiling the NVIDIA/FasterTransformer version of F32 <= F16 * S8 + F32?

FYI, I am working on an implementation for mixed input types on canonical layouts with shfls to get the mma layout. I will keep you posted, if I could make shfl implementation to work functionally, and then we can look into its performance.

masahi Aug 12, 2023

do you have a the changes/pointers ready on profiling the NVIDIA/FasterTransformer version of F32 <= F16 * S8 + F32?

I don't have such code other than the unit tests I added to TVM, which are unnecessarily complex for your purpose.

rhenry-nv Aug 17, 2023

Do you have the problem shapes where you saw that perf difference with 128 vs. 64 elements in tile-K?

I think it was M <= 40, N=4096, K=1024, Groups=32. That benchmark was on grouped GEMM, and for very small M which was the initial target for the core MMA. I retrofitted it to regular GEMM later.

dongjiyingdjy Aug 30, 2023

Hi，these weight only quantized GEMM helps a lot in LLMs. However, the problem size of GEMM is larger. The most commonly used shape is M <=64, N=4096, K=16384.
Have you tried K tile size more than 64, for example, 128 or 256? Could a bigger size of K tile improve memory bandwidth for large K?

tridao · 2023-08-17T20:28:41Z

tridao
Aug 17, 2023

My understanding is that the I2F conversion is done after loading from smem to rmem, right before calling mma.sync. Is that right? And the loading from smem to rmem is still using ldmatrix?

Did you guys ever try doing I2F conversion right after loading from gmem, before writing to smem? That might make the code simpler, but idk how much it affects performance. Even though that would increase the amount of smem read/write, maybe this kind of gemm is bottlenecked by either gmem bandwidth or fp16 mma anyway, and not smem read/write?

2 replies

manishucsd Aug 17, 2023

My understanding is that the I2F conversion is done after loading from smem to rmem, right before calling mma.sync. Is that right? And the loading from smem to rmem is still using ldmatrix?

yes and yes

Did you guys ever try doing I2F conversion right after loading from gmem, before writing to smem?

Gmem to Smem is done using cp.async which doesn't bring registers in between. cp.async reads from Gmem and writes data directly into Smem.

rhenry-nv Aug 17, 2023

+1 to above. There is support for what @tridao described in the Mmapipeline version of those kernels. You can control apply a transform either after LDG or after ldmatrix. However, we don't use that main loop on Ampere.

If I remember correctly, Volta does the transform after LDG while Turing does it after ldmatrix.

alexsamardzic · 2023-09-09T12:48:20Z

alexsamardzic
Sep 9, 2023

Any specific plans to actually have any of this upstream in CUTLASS soon? I was not aware of anything above, and as mentioned here, in the meantime I was working on utilizing it in PyTorch, starting from CUTLASS extensions from the FasterTransformer project (mine is f16xs8 only at the moment, but at least CUTLASS extensions from FasterTransformer are updated for CUTLASS 3.x in my version). Of course, it would be much better to actually have this functionality in CUTLASS.

1 reply

manishucsd Sep 9, 2023

Your timing couldn't be any better for this request. I am on it and very close to starting a PR to NVIDIA/CUTLASS.

masahi · 2023-10-18T09:28:21Z

masahi
Oct 18, 2023

Hi @rhenry-nv, I'm wondering if it is possible to run the FT int4/8 GEMM kernel on multiple GPUs.

The way https://github.com/mlc-ai/mlc-llm does multi-gpu for non-FT paths is to shard the quantized weight along row or column dimension, do GEMM on each device, and do NCCL AllReduce to gather the results. This scheme doesn't seem to work for the FT kernel due to the need for weight preprocessing, involving elements permute and transpose. Any thought?

4 replies

MARD1NO Oct 18, 2023

I think you can preprocess weight + split weight first, then you can run on multiple gpus.

masahi Oct 18, 2023

The result is incorrect if I do that.

rhenry-nv Oct 18, 2023

Yes it it possible. You must split first then preprocess. The contiguous dimension of the matrix must be a multiple of 64 after splitting.

masahi Oct 18, 2023

hmm ok. I thought the transposition of the weight would pose a problem for such independent preprocessing of sharded weights, but I'll try.

MARD1NO · 2023-10-18T13:39:12Z

MARD1NO
Oct 18, 2023

oh sry，I made wrong. the correct order is split weight first, then do preprocess on each device's weight. 

…

---Original--- From: ***@***.***> Date: Wed, Oct 18, 2023 21:36 PM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [NVIDIA/cutlass] F16 x S8/S4 GEMM (Discussion #911) The result is incorrect if I do that. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

masahi Oct 18, 2023

I thought about that, but preprocessing involves transpose of the entire matrix. Split and transpose would end up with a different matrix.

fishelegs · 2023-11-06T08:21:24Z

fishelegs
Nov 6, 2023

Will cutlass support group quantization for S4, in F16 x S4?
Since in LLM 4bit PTQ will lose some ppl, 4 bit group quantization will improve ppl a lot.

2 replies

manishucsd Nov 6, 2023

Hi @fishelegs ,

Thanks for bringing this to our attention. We are looking into and going to look look more closely into it in the coming weeks. Stay tuned.

Can you please expand your acronyms here?
PTQ: Post Training Quantization
PPL?

fishelegs Nov 7, 2023

1）PPL stands for perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models.
(ref:https://huggingface.co/docs/transformers/perplexity)

2）I find that TensorRT-LLM seems to have the 4bit group-wise weight-only quantization.
(https://github.com/NVIDIA/TensorRT-LLM/blob/d8ebeee2f6fcb219e6efc541ccc914765799fa3a/examples/llama/README.md?plain=1#L104)
In Section "Groupwise quantization (AWQ/GPTQ)"
I gonna have a try.

MARD1NO · 2024-02-28T03:03:26Z

MARD1NO
Feb 28, 2024

@rhenry-nv Hi rawn henry，I found in cutlass official's mixed input gemm using ColMajor weight layout, and use FragementShuffler to get the mma layout's weight. Is there exist performance gap between ColMajor and InterleavedLayout? I think InterleavedLayout may be get better performance, but it is hard to expand to other device and dtype. Using Manish's mixed input gemm, it is easier to expand like fp8 matmul int4.

1 reply

manishucsd Feb 28, 2024

FP: FP8 should not need a new implementation FragmentShuffler, but for FP8 which is on Hopper you can use ld.shared instead of ldmatrix with similar performance avoiding the need for FragmentShuffler. These should be in CUTLASS 3.4 and @rhenry-nv can confirm.

S4: S4 will need a new implementation of FragmentShuffler and @alexsamardzic was working on it.

I agree using canonical ColMajor layout for quantized weights makes it easier to exploit a heterogenous compute offering as Interleaved layout are device-specific.

sleepwalker2017 · 2024-05-25T02:51:50Z

sleepwalker2017
May 25, 2024

FastTransformer has kernels written in CUTLASS to support fp16 x int8/int4 GEMM.

source code:

https://github.com/NVIDIA/FasterTransformer/tree/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions

instantiation:

https://github.com/NVIDIA/FasterTransformer/tree/main/src/fastertransformer/kernels/cutlass_kernels/fpA_intB_gemm

paper:

https://arxiv.org/abs/2211.10017

GTC'23 talk:

https://register.nvidia.com/flow/nvidia/gtcspring2023/attendeeportal/page/sessioncatalog/session/1666226207768001N4Fe

Hi what is the name of the GTC talk? this link is too old and is lost.

Could you provide the name of this talk? Thank you so much !

1 reply

sleepwalker2017 May 25, 2024

Is it this talk? https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51226/

F16 x S8/S4 GEMM #911

hwu36 Apr 10, 2023 Maintainer

Replies: 14 comments · 43 replies

hwu36 Apr 13, 2023 Maintainer Author

hwu36 Jun 9, 2023 Maintainer Author

hwu36 Jun 10, 2023 Maintainer Author

hwu36 Jun 15, 2023 Maintainer Author

hwu36
Apr 10, 2023
Maintainer

Replies: 14 comments 43 replies

hwu36
Apr 13, 2023
Maintainer Author

hwu36 Jun 9, 2023
Maintainer Author

hwu36 Jun 10, 2023
Maintainer Author

hwu36 Jun 15, 2023
Maintainer Author