
CUDA: Enable K-shift operation for -ctk q8_0 (limited) #9571

Open · wants to merge 1 commit into master
Conversation

Nekotekina commented Sep 20, 2024

This is a reworked #5653.
Some CUDA code was adapted from 3d92acf.
The original PR had an explosive GPU memory requirement; I'm not sure whether that is a bug or the intended logic of the allocator. I worked around it by reusing the same tensor, which seems to work well for me.

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) label Sep 20, 2024
slaren (Collaborator) commented Sep 21, 2024

I cannot reproduce the memory usage issue you reported; in my tests the allocator correctly reuses the memory of the previous tensors automatically. The `ggml_backend_sched_set_tensor_backend` call is necessary, but the other changes should be reverted to keep the code simple.

It would also be very desirable to implement this in the CPU backend, and to add a test for the new copy op in test-backend-ops.

For cases where the input and output of the copy are contiguous (as they are here), this could also be implemented using the existing dequantize functions in the CUDA backend, which would allow it to work with any format, likely with better performance.
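To illustrate the contiguous-copy idea, here is a minimal CPU reference sketch of dequantizing a q8_0 buffer to F32 — the same transformation the CUDA "to_fp32" dequantize kernels perform. The struct and function names here are hypothetical (the real ggml `block_q8_0` stores the scale as fp16; a plain float is used for simplicity); the block layout of 32 int8 quants sharing one scale matches the q8_0 format.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK8_0 = 32; // values per q8_0 block

// Simplified stand-in for ggml's block_q8_0 (scale is fp16 in the real format).
struct block_q8_0_ref {
    float  d;           // per-block scale
    int8_t qs[QK8_0];   // quantized values
};

// Contiguous q8_0 -> f32 copy: each output value is scale * quant.
// This is the whole job of a dequantizing CPY when src and dst are contiguous.
void dequantize_row_q8_0_ref(const block_q8_0_ref * src, float * dst, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
        dst[i] = src[i / QK8_0].d * src[i / QK8_0].qs[i % QK8_0];
    }
}
```

Because the operation is a pure element-wise map over a flat index, routing `GGML_OP_CPY` through the existing per-format dequantize functions covers every quant type with one code path.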

## SPLIT #0: CUDA0 # 1 inputs: [K_shift (   0K)]
node #  1 (       CPY):              K_f32-0 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-0 (  32K) [CUDA0         ]
node #  2 (      ROPE):       K_f32-0 (view) (  32K) [CUDA0         ]:              K_f32-0 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node #  3 (       CPY):          K_shifted-0 (   8K) [CUDA0         ]:       K_f32-0 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node #  5 (       CPY):              K_f32-1 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-1 (  32K) [CUDA0         ]
node #  6 (      ROPE):       K_f32-1 (view) (  32K) [CUDA0         ]:              K_f32-1 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node #  7 (       CPY):          K_shifted-1 (   8K) [CUDA0         ]:       K_f32-1 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node #  9 (       CPY):              K_f32-2 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-2 (  32K) [CUDA0         ]
node # 10 (      ROPE):       K_f32-2 (view) (  32K) [CUDA0         ]:              K_f32-2 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 11 (       CPY):          K_shifted-2 (   8K) [CUDA0         ]:       K_f32-2 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 13 (       CPY):              K_f32-3 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-3 (  32K) [CUDA0         ]
node # 14 (      ROPE):       K_f32-3 (view) (  32K) [CUDA0         ]:              K_f32-3 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 15 (       CPY):          K_shifted-3 (   8K) [CUDA0         ]:       K_f32-3 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 17 (       CPY):              K_f32-4 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-4 (  32K) [CUDA0         ]
node # 18 (      ROPE):       K_f32-4 (view) (  32K) [CUDA0         ]:              K_f32-4 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 19 (       CPY):          K_shifted-4 (   8K) [CUDA0         ]:       K_f32-4 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 21 (       CPY):              K_f32-5 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-5 (  32K) [CUDA0         ]
node # 22 (      ROPE):       K_f32-5 (view) (  32K) [CUDA0         ]:              K_f32-5 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 23 (       CPY):          K_shifted-5 (   8K) [CUDA0         ]:       K_f32-5 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 25 (       CPY):              K_f32-6 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-6 (  32K) [CUDA0         ]
node # 26 (      ROPE):       K_f32-6 (view) (  32K) [CUDA0         ]:              K_f32-6 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 27 (       CPY):          K_shifted-6 (   8K) [CUDA0         ]:       K_f32-6 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 29 (       CPY):              K_f32-7 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-7 (  32K) [CUDA0         ]
node # 30 (      ROPE):       K_f32-7 (view) (  32K) [CUDA0         ]:              K_f32-7 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 31 (       CPY):          K_shifted-7 (   8K) [CUDA0         ]:       K_f32-7 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 33 (       CPY):              K_f32-8 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-8 (  32K) [CUDA0         ]
node # 34 (      ROPE):       K_f32-8 (view) (  32K) [CUDA0         ]:              K_f32-8 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 35 (       CPY):          K_shifted-8 (   8K) [CUDA0         ]:       K_f32-8 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 37 (       CPY):              K_f32-9 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-9 (  32K) [CUDA0         ]
node # 38 (      ROPE):       K_f32-9 (view) (  32K) [CUDA0         ]:              K_f32-9 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 39 (       CPY):          K_shifted-9 (   8K) [CUDA0         ]:       K_f32-9 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 41 (       CPY):             K_f32-10 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-10 (  32K) [CUDA0         ]
node # 42 (      ROPE):      K_f32-10 (view) (  32K) [CUDA0         ]:             K_f32-10 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 43 (       CPY):         K_shifted-10 (   8K) [CUDA0         ]:      K_f32-10 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 45 (       CPY):             K_f32-11 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-11 (  32K) [CUDA0         ]
node # 46 (      ROPE):      K_f32-11 (view) (  32K) [CUDA0         ]:             K_f32-11 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 47 (       CPY):         K_shifted-11 (   8K) [CUDA0         ]:      K_f32-11 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 49 (       CPY):             K_f32-12 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-12 (  32K) [CUDA0         ]
node # 50 (      ROPE):      K_f32-12 (view) (  32K) [CUDA0         ]:             K_f32-12 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 51 (       CPY):         K_shifted-12 (   8K) [CUDA0         ]:      K_f32-12 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 53 (       CPY):             K_f32-13 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-13 (  32K) [CUDA0         ]
node # 54 (      ROPE):      K_f32-13 (view) (  32K) [CUDA0         ]:             K_f32-13 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 55 (       CPY):         K_shifted-13 (   8K) [CUDA0         ]:      K_f32-13 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 57 (       CPY):             K_f32-14 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-14 (  32K) [CUDA0         ]
node # 58 (      ROPE):      K_f32-14 (view) (  32K) [CUDA0         ]:             K_f32-14 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 59 (       CPY):         K_shifted-14 (   8K) [CUDA0         ]:      K_f32-14 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 61 (       CPY):             K_f32-15 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-15 (  32K) [CUDA0         ]
node # 62 (      ROPE):      K_f32-15 (view) (  32K) [CUDA0         ]:             K_f32-15 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 63 (       CPY):         K_shifted-15 (   8K) [CUDA0         ]:      K_f32-15 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 65 (       CPY):             K_f32-16 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-16 (  32K) [CUDA0         ]
node # 66 (      ROPE):      K_f32-16 (view) (  32K) [CUDA0         ]:             K_f32-16 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 67 (       CPY):         K_shifted-16 (   8K) [CUDA0         ]:      K_f32-16 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]

## SPLIT #1: CUDA1 # 1 inputs: [K_shift (   0K)]
node # 69 (       CPY):             K_f32-17 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-17 (  32K) [CUDA1         ]
node # 70 (      ROPE):      K_f32-17 (view) (  32K) [CUDA1         ]:             K_f32-17 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 71 (       CPY):         K_shifted-17 (   8K) [CUDA1         ]:      K_f32-17 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 73 (       CPY):             K_f32-18 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-18 (  32K) [CUDA1         ]
node # 74 (      ROPE):      K_f32-18 (view) (  32K) [CUDA1         ]:             K_f32-18 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 75 (       CPY):         K_shifted-18 (   8K) [CUDA1         ]:      K_f32-18 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 77 (       CPY):             K_f32-19 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-19 (  32K) [CUDA1         ]
node # 78 (      ROPE):      K_f32-19 (view) (  32K) [CUDA1         ]:             K_f32-19 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 79 (       CPY):         K_shifted-19 (   8K) [CUDA1         ]:      K_f32-19 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 81 (       CPY):             K_f32-20 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-20 (  32K) [CUDA1         ]
node # 82 (      ROPE):      K_f32-20 (view) (  32K) [CUDA1         ]:             K_f32-20 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 83 (       CPY):         K_shifted-20 (   8K) [CUDA1         ]:      K_f32-20 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 85 (       CPY):             K_f32-21 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-21 (  32K) [CUDA1         ]
node # 86 (      ROPE):      K_f32-21 (view) (  32K) [CUDA1         ]:             K_f32-21 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 87 (       CPY):         K_shifted-21 (   8K) [CUDA1         ]:      K_f32-21 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
max_size = 0.00 MB: tensors: K_shift [0-80] (0.00 MB)
max_size = 0.00 MB: tensors: CUDA0#K_shift#0 [0-80] (0.00 MB)
max_size = 0.03 MB: tensors: CUDA0#K_shift#0 [0-80] (0.00 MB) K_f32-0 [80-8080] (0.03 MB)
max_size = 0.00 MB: tensors: CUDA1#K_shift#0 [0-80] (0.00 MB)
max_size = 0.03 MB: tensors: CUDA1#K_shift#0 [0-80] (0.00 MB) K_f32-17 [80-8080] (0.03 MB)

llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
Nekotekina (Author) commented

Right, I simplified it, thanks. But where are the dequantize functions you are talking about? Do you mean `ggml_get_to_fp32_cuda` from convert.cu?

Nekotekina (Author) commented

I also forgot to ask: is there a reason to use F32 over F16? The performance of the q8_0 K-shift in this PR doesn't seem great; I wonder if using F16 could improve it.

slaren (Collaborator) commented Sep 22, 2024

Yes, I mean the functions from convert.cu; it should be straightforward to use them for `GGML_OP_CPY` when both src0 and src1 are contiguous. Using F16 should also work, and at least it would reduce the buffer size, which is always good, but I wouldn't expect a big performance difference. I think the quantization kernels could be optimized to use more threads (one thread per value instead of one per block), which should improve performance.
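The "one thread per value" layout can be sketched as two passes written here as plain C++ loops over a flat index (a CUDA version would map the index to `blockIdx`/`threadIdx`). The function name and two-pass split are assumptions for illustration, not the actual ggml kernels: the per-block scale d = amax/127 is computed once per block, after which every value is quantized independently, so all 32 lanes of a block can proceed in parallel once d is known.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32; // values per q8_0 block

// Hypothetical per-value q8_0 quantization reference.
// d has n/QK8_0 entries (one scale per block), q has n entries.
void quantize_q8_0_per_value(const float * x, float * d, int8_t * q, int64_t n) {
    // pass 1: per-block scale (one "thread" per block)
    for (int64_t b = 0; b < n / QK8_0; ++b) {
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            amax = std::max(amax, std::fabs(x[b*QK8_0 + j]));
        }
        d[b] = amax / 127.0f;
    }
    // pass 2: per-value quantization (one "thread" per value) —
    // each value only needs its own input and its block's scale
    for (int64_t i = 0; i < n; ++i) {
        const float di = d[i / QK8_0];
        q[i] = di == 0.0f ? 0 : (int8_t) std::lround(x[i] / di);
    }
}
```

In a kernel, pass 1 could keep one thread per block (or use a warp reduction for amax), while pass 2 launches one thread per value, which is where the suggested parallelism gain would come from.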
