ggml : group all experts in a single ggml_mul_mat_id #6505

Merged: 22 commits into master on Apr 18, 2024

Conversation

@slaren (Collaborator) commented Apr 5, 2024

Should significantly improve the performance of MoE models with CUDA. The rearrangement of the rows in the CUDA backend has also been improved with custom kernels instead of memcpys; that accounts for about 50% of the speedup here.

| GPU | Model | Test | t/s master | t/s sl/moe-rework-2 | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| RTX 3090 Ti | mixtral Q3_K_S | pp512 | 387.58 | 1226.11 | 3.16 |
| RTX 3090 Ti | mixtral Q3_K_S | tg128 | 43.07 | 50.40 | 1.17 |
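
For context, the row rearrangement mentioned above can be done with a single gather kernel instead of one memcpy per row. The following is a minimal illustrative sketch; the names and layout are assumptions, not the actual ggml-cuda kernels.

```cuda
// Illustrative sketch only: gather the token rows routed to each expert into a
// contiguous buffer with one kernel launch instead of a memcpy per row.
// row_ids must already be ordered so that rows belonging to the same expert
// end up adjacent in dst.
__global__ void gather_rows_f32(
        const float * __restrict__ src,     // [n_tokens, n_cols] input activations
        float       * __restrict__ dst,     // [n_assigned, n_cols] rows grouped by expert
        const int   * __restrict__ row_ids, // row_ids[i] = source row for dst row i
        const int n_assigned, const int n_cols) {
    const int row = blockIdx.x;
    if (row >= n_assigned) {
        return;
    }
    const float * s = src + (size_t) row_ids[row] * n_cols;
    float       * d = dst + (size_t) row          * n_cols;
    for (int col = threadIdx.x; col < n_cols; col += blockDim.x) {
        d[col] = s[col];
    }
}

// launch with one block per gathered row, e.g.:
// gather_rows_f32<<<n_assigned, 256, 0, stream>>>(src, dst, row_ids, n_assigned, n_cols);
```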


@askmyteapot:

Benchmarked this on a Ryzen 5800X (64 GB DDR4 @ 3733 MT/s, CL16) and a Tesla P40 24 GB, with 28/33 layers offloaded. The model is Bagel Mistery Tour 8x7b (Mixtral).

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
| model                          |       size |     params | backend    | ngl | test       | moe-rework-2 t/s | master       t/s | speedup |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: | ---------------: | ------: |
| llama 7B IQ4_XS - 4.25 bpw     |  23.42 GiB |    46.70 B | CUDA       |  28 | pp 4096    |    106.83 ± 0.07 |     75.64 ± 0.08 |  1.412x |
| llama 7B IQ4_XS - 4.25 bpw     |  23.42 GiB |    46.70 B | CUDA       |  28 | tg 128     |     13.34 ± 0.01 |     12.70 ± 0.00 |  1.050x |

@Dampfinchen commented Apr 6, 2024

Alright, wow. PP went down from 11.85 ms/t to 4.95 ms/t (with partial offloading: Mixtral with 5 layers offloaded on a 2060). Simply incredible, but I'm not surprised anymore, as slaren always delivers. llama.cpp's MoE implementation is now extremely robust.

@phymbert mentioned this pull request on Apr 6, 2024

@askmyteapot:

I also noticed that this PR uses a significantly smaller CUDA compute buffer (about 50% less) compared to master, which allowed offloading an extra layer at low context.

PR:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.44 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/33 layers to GPU
llm_load_tensors:        CPU buffer size =  2711.96 MiB
llm_load_tensors:      CUDA0 buffer size = 22465.31 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    16.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   240.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   395.13 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1574
llama_new_context_with_model: graph splits = 28

Master (build = 2620 (d4f220a))

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.44 MiB
llm_load_tensors: offloading 29 repeating layers to GPU
llm_load_tensors: offloaded 29/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3425.96 MiB
llm_load_tensors:      CUDA0 buffer size = 21716.47 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    24.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   232.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   787.13 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1638
llama_new_context_with_model: graph splits = 41

@askmyteapot:

Just tested with 8k context... not as much of a saving.
PR

llm_load_tensors: offloaded 28/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4139.96 MiB
llm_load_tensors:      CUDA0 buffer size = 20967.62 MiB
......
llama_new_context_with_model: n_ctx      = 8288
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   129.50 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   906.50 MiB
llama_new_context_with_model: KV self size  = 1036.00 MiB, K (f16):  518.00 MiB, V (f16):  518.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.24 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   595.69 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.19 MiB

Master

llm_load_tensors: offloaded 28/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4139.96 MiB
llm_load_tensors:      CUDA0 buffer size = 20967.62 MiB
.......
llama_new_context_with_model: n_ctx      = 8288
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   129.50 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   906.50 MiB
llama_new_context_with_model: KV self size  = 1036.00 MiB, K (f16):  518.00 MiB, V (f16):  518.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.24 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   783.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.19 MiB

@askmyteapot commented Apr 7, 2024

ftype = IQ4_XS - 4.25 bpw
params = 46.70 B
size = 23.57 GiB (4.33 BPW)
23/33 layers to GPU
[screenshot attached]

@slaren (Collaborator, Author) commented Apr 7, 2024

@JohannesGaessler I had already tested most of these, but on my 3090 I didn't see a meaningful improvement. Anyway I have pushed my changes that I think already cover all of that.

In the long term, the goal is to use a grouped GEMM with CUTLASS without requiring a synchronization. I think that this will also allow removing the row rearrangement entirely, which has a significant cost.
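
To illustrate what a grouped GEMM buys here, a conceptual sketch follows. This is not the CUTLASS API, and launch_grouped_gemm is a hypothetical entry point: each expert becomes one GEMM problem, and all problems are submitted in a single launch, so the per-expert row counts never need to be read back on the host.

```cpp
// Conceptual sketch, not the CUTLASS API: one problem descriptor per expert,
// all processed in a single grouped-GEMM launch (no host synchronization to
// learn per-expert row counts before launching).
struct expert_gemm_problem {
    int m;          // number of tokens routed to this expert
    int n, k;       // output width / reduction depth (same for every expert)
    const void * a; // expert weight matrix
    const void * b; // gathered activations for this expert
    void       * c; // output rows for this expert
};

// launch_grouped_gemm(problems, n_expert, stream); // hypothetical entry point
```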

@JohannesGaessler (Collaborator):

Some quick performance comparisons from me:

| GPU | Model | Test | t/s master | t/s sl/moe-rework-2 | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| RTX 3090 | Mixtral 8x7b Q3_K_S | pp512 | 399.25 | 1173.12 | 2.94 |
| RTX 3090 | Mixtral 8x7b Q3_K_S | tg128 | 53.47 | 55.80 | 1.04 |
| P40 | Mixtral 8x7b Q3_K_S | pp512 | 187.27 | 250.23 | 1.34 |
| P40 | Mixtral 8x7b Q3_K_S | tg128 | 23.50 | 23.91 | 1.02 |

> I had already tested most of these, but on my 3090 I didn't see a meaningful improvement. Anyway I have pushed my changes that I think already cover all of that.

I am measuring a performance difference from the changes:

| GPU | Model | Test | t/s ea2b795 | t/s sl/moe-rework-2 | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| RTX 3090 | Mixtral 8x7b Q3_K_S | pp512 | 1112.34 | 1173.12 | 1.05 |
| RTX 3090 | Mixtral 8x7b Q3_K_S | tg128 | 55.45 | 55.80 | 1.01 |
| P40 | Mixtral 8x7b Q3_K_S | pp512 | 246.34 | 250.23 | 1.02 |
| P40 | Mixtral 8x7b Q3_K_S | tg128 | 23.82 | 23.91 | 1.00 |

> In the long term, the goal is to use a grouped GEMM with CUTLASS without requiring a synchronization. I think that this will also allow removing the row rearrangement entirely, which has a significant cost.

That would definitely help. When you look into this, I think it would also make sense to check whether it is possible to set the input/output type to FP32 to avoid the conversion of some tensors. (With the input I suspect that it's probably not possible though.)

@askmyteapot:

Definite speedup on the P40 with a larger IQ4_XS quant and partial offloading.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
| model                          |       size |     params | backend    | ngl | test       | updatedPR    t/s | moe-rework-2 t/s | speedup |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: | ---------------: | ------: |
| llama 7B IQ4_XS - 4.25 bpw     |  23.42 GiB |    46.70 B | CUDA       |  28 | pp 4096    |    115.45 ± 0.08 |    106.83 ± 0.07 |  1.080x |
| llama 7B IQ4_XS - 4.25 bpw     |  23.42 GiB |    46.70 B | CUDA       |  28 | tg 128     |     14.52 ± 0.01 |     13.34 ± 0.01 |  1.088x |

@slaren (Collaborator, Author) commented Apr 8, 2024

@ggerganov I really do not want to have to modify 21 functions in exactly the same way again; I would rather spend some time refactoring. Did you find any reason that would prevent making the Metal kernel_mul_mv_id kernels a template?

@ggerganov (Owner):

> Did you find any reason that would prevent making the Metal kernel_mul_mv_id kernels a template?

No reason at all - I simply wasn't able to fit this into templates. Thanks for doing it - it was very ugly before.
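
As an aside, the kind of templating being discussed can be sketched in plain C++ (the real kernels are Metal Shading Language and far more involved; the dequantization formulas below are simplified placeholders, not the actual quant formats):

```cpp
// Illustrative pattern only: one templated routine instead of many
// near-identical per-type copies, instantiated once per quantization type.
#include <cstdio>

struct dequant_q4 { static float apply(unsigned char q) { return (float)(q & 0x0F) - 8.0f; } };
struct dequant_q8 { static float apply(unsigned char q) { return (float)(signed char)q; } };

template <typename dequant>
float dot_row(const unsigned char * qrow, const float * x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += dequant::apply(qrow[i]) * x[i]; // dequantize, then multiply-accumulate
    }
    return acc;
}

int main() {
    const unsigned char q[4] = {1, 2, 3, 255};
    const float x[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    printf("q4-style dot: %f\n", dot_row<dequant_q4>(q, x, 4));
    printf("q8-style dot: %f\n", dot_row<dequant_q8>(q, x, 4));
    return 0;
}
```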

phymbert added a commit referencing this pull request on Apr 12, 2024 ("…nstead of silu."):

"Do not pass too much time on this function as it will be replaced in #6505"
@LostRuins (Collaborator):

Hey there, just wondering if there's any reason this isn't ready to be merged yet. I have heard a couple of reports that it's really beneficial for Mixtral PP speed for some people who have tried it.

@slaren (Collaborator, Author) commented Apr 16, 2024

It's still missing a Metal implementation. It should be good for CPU and CUDA already.

@slaren marked this pull request as ready for review on April 17, 2024, 17:13
@slaren requested a review from ggerganov on April 17, 2024, 17:13

@ggerganov (Owner):

Let's rebase on master and I will continue the review tomorrow.

@ggerganov (Owner) left a review comment:

Very nice! M2 Ultra results (-ub 256 is optimal):

./scripts/compare-commits.sh master sl/moe-rework-2 -m models/mixtral-8x7b-32k-fast/ggml-model-f16.gguf -ub 256 -p 1,2,4,8,16,32,64,128,256,512

| CPU | Model | Test | t/s master | t/s sl/moe-rework-2 | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| M2 Ultra | llama 8x7B F16 | pp1 | 22.28 | 23.15 | 1.04 |
| M2 Ultra | llama 8x7B F16 | pp2 | 21.18 | 22.11 | 1.04 |
| M2 Ultra | llama 8x7B F16 | pp4 | 26.67 | 27.40 | 1.03 |
| M2 Ultra | llama 8x7B F16 | pp8 | 30.73 | 44.21 | 1.44 |
| M2 Ultra | llama 8x7B F16 | pp16 | 50.73 | 79.88 | 1.57 |
| M2 Ultra | llama 8x7B F16 | pp32 | 90.07 | 154.87 | 1.72 |
| M2 Ultra | llama 8x7B F16 | pp64 | 155.10 | 263.48 | 1.70 |
| M2 Ultra | llama 8x7B F16 | pp128 | 256.59 | 357.97 | 1.40 |
| M2 Ultra | llama 8x7B F16 | pp256 | 319.72 | 370.37 | 1.16 |
| M2 Ultra | llama 8x7B F16 | pp512 | 319.97 | 370.87 | 1.16 |
| M2 Ultra | llama 8x7B F16 | tg128 | 22.38 | 23.12 | 1.03 |

@slaren (Collaborator, Author) commented Apr 18, 2024

@NeoZhangJianyu @airMeng This change will break mul_mat_id in SYCL again. Sorry for the inconvenience; the change to the interface was necessary to improve performance.

@slaren (Collaborator, Author) commented Apr 18, 2024

@ggerganov Do you know why the ggml-ci cuda-v100 failed? The log ends during a quantize. Was it a timeout? There are more mul_mat_id tests in test-backend-ops that could increase the runtime.

@ggerganov (Owner):

Yes, it exceeded 30 min. On master we were at ~27 min

We can either increase to 40 min or maybe not run ctest in Debug?

@NeoZhangJianyu (Collaborator):

> @NeoZhangJianyu @airMeng This change will break mul_mat_id in SYCL again. Sorry for the inconvenience; the change to the interface was necessary to improve performance.

Got it! I will study it and fix it later.
I had hoped to take a rest after fixing the mul_mat_id() UT last weekend. :)

Thanks for the reminder!

@airMeng (Collaborator) commented Apr 18, 2024

> @NeoZhangJianyu @airMeng This change will break mul_mat_id in SYCL again. Sorry for the inconvenience; the change to the interface was necessary to improve performance.

@slaren can we have a workaround, like macros in llama.cpp or a fallback to CPU, to maintain SYCL capabilities? Then SYCL will not block your merging, and we can have more time for the SYCL kernels (I have just been assigned a JIRA ticket about MoE, so maybe I can reuse that effort).

@slaren (Collaborator, Author) commented Apr 18, 2024

We could disable offloading of MoE models when using SYCL by setting n_gpu_layers to 0 in llm_load_tensors. That should at least avoid crashes with SYCL, but the result would be the same as running llama.cpp with -ngl 0.
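
A hypothetical sketch of that workaround (the exact check is not part of this PR; the function and parameter names are illustrative):

```cpp
// Hypothetical sketch only: force MoE models to stay on the CPU when the
// SYCL backend is active, until its mul_mat_id implementation is updated.
static int adjust_n_gpu_layers(int n_gpu_layers, int n_expert, bool using_sycl) {
    if (using_sycl && n_expert > 0) {
        return 0; // equivalent to running with -ngl 0
    }
    return n_gpu_layers;
}
```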

@slaren (Collaborator, Author) commented Apr 18, 2024

> Yes, it exceeded 30 min. On master we were at ~27 min
>
> We can either increase to 40 min or maybe not run ctest in Debug?

I think that the problem is that there are too many types. We can run the full tests only for a few types, and a basic test only for the rest.

@ggerganov (Owner):

> I think that the problem is that there are too many types. We can run the full tests only for a few types, and a basic test only for the rest.

Yes, for now should I bump the timeout to 40 min and figure out a test reduction later on master?

@slaren merged commit 0d56246 into master on Apr 18, 2024 (50 of 60 checks passed)
@slaren deleted the sl/moe-rework-2 branch on April 18, 2024, 13:18

@slaren (Collaborator, Author) commented Apr 18, 2024

> Yes, for now should I bump the timeout to 40 min and figure out a test reduction later on master?

I think this is good enough for now. There are full tests with a few types to verify the logic, and then a simple test with the other types to check if they work at all.
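
An illustrative sketch of that split (not the actual test-backend-ops code; the type names and the chosen subset are placeholders):

```cpp
// Illustrative sketch only: run the full mul_mat_id test matrix for a few
// representative types, and a single basic smoke test for the rest.
#include <string>
#include <vector>

struct type_case {
    std::string name;
    bool full_coverage; // true: full shape/expert matrix, false: basic smoke test
};

static std::vector<type_case> make_mul_mat_id_cases() {
    return {
        {"f32",  true},  {"f16",  true},  {"q4_0", true},  // verify the logic in depth
        {"q4_1", false}, {"q5_0", false}, {"q8_0", false}, // only check that they run
    };
}
```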
