
CUDA: Enable K-shift operation for -ctk q8_0 (limited) #9571

Open · wants to merge 1 commit into master
Conversation

Nekotekina commented Sep 20, 2024

This is a reworked #5653.
Some CUDA code was adapted from 3d92acf.
The original PR had an explosive GPU memory requirement; I'm not sure whether that is a bug or the intended logic of the allocator. I worked around it by reusing the same tensor, which seems to work well for me.

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) label Sep 20, 2024
slaren (Collaborator) commented Sep 21, 2024

I cannot reproduce the memory usage issue you reported; in my tests the allocator correctly reuses the memory of the previous tensors automatically. The `ggml_backend_sched_set_tensor_backend` call is necessary, but the other changes should be reverted to keep the code simple.

It would also be very desirable to implement this in the CPU backend, and to add a test for the new copy op in test-backend-ops.

For cases where the input and output of the copy are contiguous (as they are here), this could also be implemented using the existing dequantize functions in the CUDA backend, which would allow it to work with any format, likely with better performance.
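To illustrate the contiguous-copy idea, here is a minimal CPU reference sketch of dequantizing a q8_0 buffer to F32 — the same transformation the CUDA "to_fp32" dequantize kernels perform. The struct and function names here are hypothetical (the real ggml `block_q8_0` stores the scale as fp16; a plain float is used for simplicity); the block layout of 32 int8 quants sharing one scale matches the q8_0 format.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK8_0 = 32; // values per q8_0 block

// Simplified stand-in for ggml's block_q8_0 (scale is fp16 in the real format).
struct block_q8_0_ref {
    float  d;           // per-block scale
    int8_t qs[QK8_0];   // quantized values
};

// Contiguous q8_0 -> f32 copy: each output value is scale * quant.
// This is the whole job of a dequantizing CPY when src and dst are contiguous.
void dequantize_row_q8_0_ref(const block_q8_0_ref * src, float * dst, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
        dst[i] = src[i / QK8_0].d * src[i / QK8_0].qs[i % QK8_0];
    }
}
```

Because the operation is a pure element-wise map over a flat index, routing `GGML_OP_CPY` through the existing per-format dequantize functions covers every quant type with one code path.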

## SPLIT #0: CUDA0 # 1 inputs: [K_shift (   0K)]
node #  1 (       CPY):              K_f32-0 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-0 (  32K) [CUDA0         ]
node #  2 (      ROPE):       K_f32-0 (view) (  32K) [CUDA0         ]:              K_f32-0 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node #  3 (       CPY):          K_shifted-0 (   8K) [CUDA0         ]:       K_f32-0 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node #  5 (       CPY):              K_f32-1 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-1 (  32K) [CUDA0         ]
node #  6 (      ROPE):       K_f32-1 (view) (  32K) [CUDA0         ]:              K_f32-1 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node #  7 (       CPY):          K_shifted-1 (   8K) [CUDA0         ]:       K_f32-1 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node #  9 (       CPY):              K_f32-2 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-2 (  32K) [CUDA0         ]
node # 10 (      ROPE):       K_f32-2 (view) (  32K) [CUDA0         ]:              K_f32-2 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 11 (       CPY):          K_shifted-2 (   8K) [CUDA0         ]:       K_f32-2 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 13 (       CPY):              K_f32-3 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-3 (  32K) [CUDA0         ]
node # 14 (      ROPE):       K_f32-3 (view) (  32K) [CUDA0         ]:              K_f32-3 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 15 (       CPY):          K_shifted-3 (   8K) [CUDA0         ]:       K_f32-3 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 17 (       CPY):              K_f32-4 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-4 (  32K) [CUDA0         ]
node # 18 (      ROPE):       K_f32-4 (view) (  32K) [CUDA0         ]:              K_f32-4 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 19 (       CPY):          K_shifted-4 (   8K) [CUDA0         ]:       K_f32-4 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 21 (       CPY):              K_f32-5 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-5 (  32K) [CUDA0         ]
node # 22 (      ROPE):       K_f32-5 (view) (  32K) [CUDA0         ]:              K_f32-5 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 23 (       CPY):          K_shifted-5 (   8K) [CUDA0         ]:       K_f32-5 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 25 (       CPY):              K_f32-6 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-6 (  32K) [CUDA0         ]
node # 26 (      ROPE):       K_f32-6 (view) (  32K) [CUDA0         ]:              K_f32-6 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 27 (       CPY):          K_shifted-6 (   8K) [CUDA0         ]:       K_f32-6 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 29 (       CPY):              K_f32-7 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-7 (  32K) [CUDA0         ]
node # 30 (      ROPE):       K_f32-7 (view) (  32K) [CUDA0         ]:              K_f32-7 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 31 (       CPY):          K_shifted-7 (   8K) [CUDA0         ]:       K_f32-7 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 33 (       CPY):              K_f32-8 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-8 (  32K) [CUDA0         ]
node # 34 (      ROPE):       K_f32-8 (view) (  32K) [CUDA0         ]:              K_f32-8 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 35 (       CPY):          K_shifted-8 (   8K) [CUDA0         ]:       K_f32-8 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 37 (       CPY):              K_f32-9 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-9 (  32K) [CUDA0         ]
node # 38 (      ROPE):       K_f32-9 (view) (  32K) [CUDA0         ]:              K_f32-9 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 39 (       CPY):          K_shifted-9 (   8K) [CUDA0         ]:       K_f32-9 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 41 (       CPY):             K_f32-10 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-10 (  32K) [CUDA0         ]
node # 42 (      ROPE):      K_f32-10 (view) (  32K) [CUDA0         ]:             K_f32-10 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 43 (       CPY):         K_shifted-10 (   8K) [CUDA0         ]:      K_f32-10 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 45 (       CPY):             K_f32-11 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-11 (  32K) [CUDA0         ]
node # 46 (      ROPE):      K_f32-11 (view) (  32K) [CUDA0         ]:             K_f32-11 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 47 (       CPY):         K_shifted-11 (   8K) [CUDA0         ]:      K_f32-11 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 49 (       CPY):             K_f32-12 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-12 (  32K) [CUDA0         ]
node # 50 (      ROPE):      K_f32-12 (view) (  32K) [CUDA0         ]:             K_f32-12 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 51 (       CPY):         K_shifted-12 (   8K) [CUDA0         ]:      K_f32-12 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 53 (       CPY):             K_f32-13 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-13 (  32K) [CUDA0         ]
node # 54 (      ROPE):      K_f32-13 (view) (  32K) [CUDA0         ]:             K_f32-13 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 55 (       CPY):         K_shifted-13 (   8K) [CUDA0         ]:      K_f32-13 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 57 (       CPY):             K_f32-14 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-14 (  32K) [CUDA0         ]
node # 58 (      ROPE):      K_f32-14 (view) (  32K) [CUDA0         ]:             K_f32-14 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 59 (       CPY):         K_shifted-14 (   8K) [CUDA0         ]:      K_f32-14 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 61 (       CPY):             K_f32-15 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-15 (  32K) [CUDA0         ]
node # 62 (      ROPE):      K_f32-15 (view) (  32K) [CUDA0         ]:             K_f32-15 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 63 (       CPY):         K_shifted-15 (   8K) [CUDA0         ]:      K_f32-15 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 65 (       CPY):             K_f32-16 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-16 (  32K) [CUDA0         ]
node # 66 (      ROPE):      K_f32-16 (view) (  32K) [CUDA0         ]:             K_f32-16 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 67 (       CPY):         K_shifted-16 (   8K) [CUDA0         ]:      K_f32-16 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]

## SPLIT #1: CUDA1 # 1 inputs: [K_shift (   0K)]
node # 69 (       CPY):             K_f32-17 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-17 (  32K) [CUDA1         ]
node # 70 (      ROPE):      K_f32-17 (view) (  32K) [CUDA1         ]:             K_f32-17 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 71 (       CPY):         K_shifted-17 (   8K) [CUDA1         ]:      K_f32-17 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 73 (       CPY):             K_f32-18 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-18 (  32K) [CUDA1         ]
node # 74 (      ROPE):      K_f32-18 (view) (  32K) [CUDA1         ]:             K_f32-18 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 75 (       CPY):         K_shifted-18 (   8K) [CUDA1         ]:      K_f32-18 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 77 (       CPY):             K_f32-19 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-19 (  32K) [CUDA1         ]
node # 78 (      ROPE):      K_f32-19 (view) (  32K) [CUDA1         ]:             K_f32-19 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 79 (       CPY):         K_shifted-19 (   8K) [CUDA1         ]:      K_f32-19 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 81 (       CPY):             K_f32-20 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-20 (  32K) [CUDA1         ]
node # 82 (      ROPE):      K_f32-20 (view) (  32K) [CUDA1         ]:             K_f32-20 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 83 (       CPY):         K_shifted-20 (   8K) [CUDA1         ]:      K_f32-20 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 85 (       CPY):             K_f32-21 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-21 (  32K) [CUDA1         ]
node # 86 (      ROPE):      K_f32-21 (view) (  32K) [CUDA1         ]:             K_f32-21 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 87 (       CPY):         K_shifted-21 (   8K) [CUDA1         ]:      K_f32-21 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
max_size = 0.00 MB: tensors: K_shift [0-80] (0.00 MB)
max_size = 0.00 MB: tensors: CUDA0#K_shift#0 [0-80] (0.00 MB)
max_size = 0.03 MB: tensors: CUDA0#K_shift#0 [0-80] (0.00 MB) K_f32-0 [80-8080] (0.03 MB)
max_size = 0.00 MB: tensors: CUDA1#K_shift#0 [0-80] (0.00 MB)
max_size = 0.03 MB: tensors: CUDA1#K_shift#0 [0-80] (0.00 MB) K_f32-17 [80-8080] (0.03 MB)

llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
Nekotekina (Author) commented

Right, I simplified it, thanks. But where are the dequantize functions you are talking about? Do you mean `ggml_get_to_fp32_cuda` from convert.cu?

Nekotekina (Author) commented

I also forgot to ask: is there a reason to use F32 over F16? The performance of the q8_0 K-shift in this PR doesn't seem great; I wonder if using F16 could improve it.

slaren (Collaborator) commented Sep 22, 2024

Yes, I mean the functions from convert.cu; it should be straightforward to use them for `GGML_OP_CPY` when both src0 and src1 are contiguous. Using F16 should also work, and at least it would reduce the buffer size, which is always good, but I wouldn't expect a big performance difference. I think the quantization kernels could be optimized to use more threads (one thread per value instead of one per block), which should improve performance.
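The "one thread per value" layout can be sketched as two passes written here as plain C++ loops over a flat index (a CUDA version would map the index to `blockIdx`/`threadIdx`). The function name and two-pass split are assumptions for illustration, not the actual ggml kernels: the per-block scale d = amax/127 is computed once per block, after which every value is quantized independently, so all 32 lanes of a block can proceed in parallel once d is known.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32; // values per q8_0 block

// Hypothetical per-value q8_0 quantization reference.
// d has n/QK8_0 entries (one scale per block), q has n entries.
void quantize_q8_0_per_value(const float * x, float * d, int8_t * q, int64_t n) {
    // pass 1: per-block scale (one "thread" per block)
    for (int64_t b = 0; b < n / QK8_0; ++b) {
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            amax = std::max(amax, std::fabs(x[b*QK8_0 + j]));
        }
        d[b] = amax / 127.0f;
    }
    // pass 2: per-value quantization (one "thread" per value) —
    // each value only needs its own input and its block's scale
    for (int64_t i = 0; i < n; ++i) {
        const float di = d[i / QK8_0];
        q[i] = di == 0.0f ? 0 : (int8_t) std::lround(x[i] / di);
    }
}
```

In a kernel, pass 1 could keep one thread per block (or use a warp reduction for amax), while pass 2 launches one thread per value, which is where the suggested parallelism gain would come from.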
