Optimized fused MoE Kernel #2913

pcmoritz · 2024-02-19T00:50:11Z

This PR is based on @WoosukKwon's excellent work porting the TensorRT MoE kernels in https://github.com/vllm-project/vllm/tree/cutlass-moe

It is based on the observation that the TensorRT MoE kernels perform very well in the small batch size regime, whereas the fused MoE kernel performs much better in the large batch size regime. I have been trying to optimize the Triton kernels for small batch sizes too, but unfortunately Triton doesn't seem to have great support for matrix multiplications that involve skinny matrices (e.g. tl.dot only supports dimensions >= 16). Therefore, this PR uses the TensorRT kernel in the small batch size regime and the fused MoE kernel in the large batch size regime. A single unified kernel for all regimes would be much preferable, so if anybody knows how to make that happen, I'd love to know.
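The regime split described above can be sketched as a simple dispatch on the number of tokens in the batch. This is only an illustration: `make_moe_dispatcher`, the two kernel callables, and the `threshold` parameter are hypothetical stand-ins, not the names or the cutoff used in this PR.

```python
from typing import Any, Callable


def make_moe_dispatcher(
    small_batch_kernel: Callable[..., Any],
    large_batch_kernel: Callable[..., Any],
    threshold: int,
) -> Callable[..., Any]:
    """Return a function that routes a MoE call to one of two kernels.

    Hypothetical helper: `threshold` is an assumed, tunable cutoff on the
    number of tokens; the actual cutoff is not stated in this description.
    """

    def dispatch(num_tokens: int, *args: Any, **kwargs: Any) -> Any:
        # Small batches go to the first kernel (e.g. the TensorRT-based one),
        # larger batches go to the second (e.g. the fused Triton kernel).
        if num_tokens <= threshold:
            return small_batch_kernel(*args, **kwargs)
        return large_batch_kernel(*args, **kwargs)

    return dispatch
```

The cutoff would have to be found empirically per GPU and tensor-parallel setup, which is part of why a single kernel covering both regimes would be preferable.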

This PR also incorporates some of @cadedaniel's work on autotuning the fused MoE kernel.

The benchmarks are as follows (all on H100 with TP2, using 1000 input and 50 output tokens):

This PR with the tuning configs below:

qps = 1 => 16.9 ms ITL (0.85s end-to-end completion time per request)
qps = 2 => 19.0 ms ITL (0.95s end-to-end completion time per request)
qps = 4 => 32.7 ms ITL (1.63s end-to-end completion time per request)
qps = 6 => 43.4 ms ITL (2.16s end-to-end completion time per request)

Current main branch (untuned fused MoE kernel):

qps = 1 => 23.3 ms ITL (1.17s end-to-end completion time per request)
qps = 2 => 25.4 ms ITL (1.27s end-to-end completion time per request)
qps = 4 => 43.0 ms ITL (2.15s end-to-end completion time per request)
qps = 6 => 60.8 ms ITL (3.04s end-to-end completion time per request)

Only using the TensorRT MoE kernels:

qps = 1 => 18.1 ms ITL (0.90s end-to-end completion time per request)
qps = 2 => 23.8 ms ITL (1.19s end-to-end completion time per request)
qps = 4 => 48.1 ms ITL (2.36s end-to-end completion time per request)
qps = 6 => 90.8 ms ITL (4.54s end-to-end completion time per request)
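To make the comparison easier to read, here is a small sketch that computes the relative ITL reduction of this PR against the two baselines. The numbers are copied from the lists above; the dictionary layout and the `itl_reduction` helper are just illustrative names.

```python
# ITL in milliseconds, keyed by qps, copied from the benchmark lists above.
ITL_MS = {
    "this_pr": {1: 16.9, 2: 19.0, 4: 32.7, 6: 43.4},
    "main": {1: 23.3, 2: 25.4, 4: 43.0, 6: 60.8},
    "tensorrt_only": {1: 18.1, 2: 23.8, 4: 48.1, 6: 90.8},
}


def itl_reduction(baseline: str, candidate: str = "this_pr") -> dict[int, float]:
    """Fractional ITL reduction of `candidate` relative to `baseline`, per qps."""
    base, cand = ITL_MS[baseline], ITL_MS[candidate]
    return {qps: (base[qps] - cand[qps]) / base[qps] for qps in base}


if __name__ == "__main__":
    for baseline in ("main", "tensorrt_only"):
        for qps, frac in itl_reduction(baseline).items():
            print(f"vs {baseline:13s} qps={qps}: {frac:6.1%} lower ITL")
```

On these numbers, ITL is roughly 24-29% lower than on main at every load level, while the gap to the TensorRT-only setup grows from about 7% at qps = 1 to about 52% at qps = 6.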

You can run the autotuned kernel by setting

export VLLM_MIXTRAL_FUSE_MOE_CONFIG=/path/to/fused_moe_h100_tp2_config.json

where fused_moe_h100_tp2_config.json contains the following:

{
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 64, "num_warps": 4, "num_stages": 4},
    "128": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 32, "num_warps": 4, "num_stages": 4},
    "256": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 32, "num_warps": 8, "num_stages": 4},
    "512": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 64, "num_warps": 8, "num_stages": 4},
    "1024": {"BLOCK_SIZE_M": 256, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 64, "num_warps": 8, "num_stages": 4},
    "2048": {"BLOCK_SIZE_M": 256, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 64, "num_warps": 8, "num_stages": 4},
    "4096": {"BLOCK_SIZE_M": 256, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4}
}
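For illustration, a config file of this shape could be consumed roughly as follows. This is a sketch under assumptions, not the code in this PR: it assumes the top-level keys are batch sizes (numbers of tokens) and that the entry with the nearest key is used; `load_moe_configs` and `pick_config` are hypothetical helper names.

```python
import json
import os

# Environment variable name taken from the text above.
ENV_VAR = "VLLM_MIXTRAL_FUSE_MOE_CONFIG"


def load_moe_configs(path: str) -> dict[int, dict[str, int]]:
    """Load the JSON file and convert its string keys to integers."""
    with open(path) as f:
        raw = json.load(f)
    return {int(size): params for size, params in raw.items()}


def pick_config(configs: dict[int, dict[str, int]], num_tokens: int) -> dict[str, int]:
    """Return the entry whose key is closest to `num_tokens`.

    Assumption: nearest-key selection; the PR may use a different rule.
    """
    nearest = min(configs, key=lambda size: abs(size - num_tokens))
    return configs[nearest]


if __name__ == "__main__":
    path = os.environ.get(ENV_VAR)
    if path:
        configs = load_moe_configs(path)
        # Example lookup for a batch of 100 tokens.
        print(pick_config(configs, num_tokens=100))
```

With the file above, a lookup for 100 tokens would select the "128" entry under this nearest-key assumption.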

WoosukKwon · 2024-02-20T23:06:15Z

Hi @pcmoritz, thanks for the amazing PR! Is this PR ready for review, or is anything blocking it?

pcmoritz · 2024-02-20T23:15:42Z

I think we should merge your kernel https://github.com/vllm-project/vllm/tree/cutlass-moe as a separate PR and then we can merge this one. If you open the PR for the TensorRT kernels, I'm happy to review it! The thing I'm currently unsure about is whether we should have two different kernels for the two regimes; that seems very unfortunate to me.

I'll look a little more into whether we can get more out of the Triton kernel in the low batch size regime and will keep you updated. Let's come to a conclusion before the end of this week and execute on it :)

Also I'm curious about your thoughts on this (stitching together two kernels).

pcmoritz · 2024-02-22T06:52:12Z

Closed in favor of #2979

WoosukKwon and others added 30 commits January 31, 2024 10:12

Add CUTLASS as a submodule

ad66935

Port CUTLASS extensions

396e537

Port MoE kernels

0cd9436

Move moe_kernels

cb4524c

Port MoE GEMM

c191207

Port CUTLASS kernels

cfa4554

Remove MoE gemm

90ccdfa

Merge branch 'main' into cutlass-moe

3e90c1a

Remove unused CUTLASS kernels

77a5c8d

Minor

f1583de

Add topk_softmax kernels

de7a749

Remove unnecessary headers

e5c62e8

Add MoE namespace

e127d9b

Minor

c3096a0

Add permute_kernels

9a561cc

Remove unused

ba07256

Move

def2ccd

Move

72256cc

Remove

e86fd06

Add MoE MLP

612f961

Add cudaUtils

0bf8fb9

Fix headers

c09179d

Enable BF16

2ab65df

Err msg

c74fc79

Add unpermute_and_reduce

6320de4

Add renormalize

9b57e39

Add FusedMoE

55fae45

Merge branch 'main' into cutlass-moe

d355702

Minor fix

fb9c524

Merge branch 'main' into optimized-fused-moe

6b20148

pcmoritz added 6 commits February 18, 2024 14:07

Use autotuned config

bd52cbb

fix

855d98a

add config

437edcf

fix

e13bc81

update

4b94875

update

5bd4256

zhuohan123 added kernel moe labels Feb 19, 2024

cadedaniel mentioned this pull request Feb 20, 2024

[WIP] Speculative decoding using a draft model #2188

Closed

pcmoritz mentioned this pull request Feb 22, 2024

Optimized fused MoE Kernel, take 2 #2979

Merged

pcmoritz closed this Feb 22, 2024

furlat mentioned this pull request Mar 1, 2024

make some tests and choose an openai api compatible local llm server Neural-Dragon-AI/Cynde#7

Open
