
RWKV v6: RWKV_WKV op CUDA implementation #9454

Merged 3 commits into ggerganov:master on Sep 22, 2024
Conversation

MollySophia (Contributor) commented Sep 12, 2024

Added the RWKV_WKV CUDA implementation and a test case in test-backend-ops.cpp.
Also added the unary op exp for CUDA, so that the RWKV v6 graph is split into fewer parts when running on a GPU.
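The exp op itself is a one-liner on the device side. As a rough illustration, a generic elementwise exp kernel looks something like the sketch below (names are hypothetical; ggml's actual CUDA unary-op dispatch differs):

```cuda
// Illustrative elementwise exp kernel: one thread per element.
// Hypothetical names, not ggml's real symbols.
__global__ void unary_exp_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = expf(x[i]);
}
```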

The kernel is adapted from https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/cuda/rwkv6.cu, with added support for batched inference.
I'll add speed and other test results tomorrow.
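For context, the recurrence the WKV6 kernel evaluates per head of size N is, for each token t and output channel i: y_t[i] = Σ_j r_t[j] · (u[j] · k_t[j] · v_t[i] + S[j][i]), followed by the state update S[j][i] ← w_t[j] · S[j][i] + k_t[j] · v_t[i]. Below is a minimal CUDA sketch of that recurrence, assuming a [T, H, N] layout, one block per head, and one thread per output channel; the names and structure are illustrative, not the exact kernel in this PR:

```cuda
// Minimal sketch of the per-head WKV6 recurrence. One block per head,
// one thread per output channel i; each thread keeps one column of the
// N x N per-head state in registers. Assumes [T, H, N] layout for
// r/k/v/w and [H, N] for u -- illustrative only, not the PR's kernel.
// launch: wkv6_sketch<64><<<n_heads, 64>>>(T, r, k, v, w, u, y);
template <int N>
__global__ void wkv6_sketch(const int T,
                            const float * r, const float * k, const float * v,
                            const float * w, const float * u, float * y) {
    const int h = blockIdx.x;   // head index
    const int i = threadIdx.x;  // output channel within the head

    __shared__ float rs[N], ks[N], ws[N];
    float state[N] = {0.0f};    // state[j] == S[j][i] for this thread's i

    for (int t = 0; t < T; t++) {
        const int base = (t * gridDim.x + h) * N;

        __syncthreads();        // don't overwrite shared mem still in use
        rs[i] = r[base + i];
        ks[i] = k[base + i];
        ws[i] = w[base + i];    // per-token decay for this channel
        __syncthreads();

        const float vi = v[base + i];
        float acc = 0.0f;
        #pragma unroll
        for (int j = 0; j < N; j++) {
            const float kv = ks[j] * vi;
            acc     += rs[j] * (u[h * N + j] * kv + state[j]);
            state[j] = state[j] * ws[j] + kv;  // decay, then accumulate k*v
        }
        y[base + i] = acc;
    }
}
```

The batched version adds an n_seqs dimension on top of this, which is what the RWKV_WKV test cases below vary.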

Prompt:

RWKV (pronounced as RWaKuV) is an RNN with GPT-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable).\nRWKV is an Open Source, non profit group, under the linux foundation. Supported by our sponsors.\nSo it's combining the best of RNN and transformer - great performance, fast inference, fast training, saves VRAM, "infinite" ctxlen, and free sentence embedding. Moreover it's 100% attention-free.\n

Here's the speed comparison between the original and the PR version.
The tests were done on my somewhat unusual 12900HK ES + RTX 4090 PC, which is relatively CPU-bound. All tests use FP16 with all layers offloaded to the GPU; prompt length = 107, generation length = 1000.

| Parameter count | Prefill (before) | Prefill (after) | Decode (before) | Decode (after) |
|-----------------|------------------|-----------------|-----------------|----------------|
| World-1.6B      | 678.72 tps       | 1375.48 tps     | 57.92 tps       | 142.13 tps     |
| World-3B        | 481.70 tps       | 1262.74 tps     | 39.46 tps       | 86.00 tps      |
| World-7B        | 340.25 tps       | 1053.80 tps     | 26.38 tps       | 47.39 tps      |

Here's the perplexity comparison between the original and the PR version, tested on wikitext-2 using FP16 with all layers offloaded to the GPU.

| Parameter count | Perplexity (before) | Perplexity (after)  |
|-----------------|---------------------|---------------------|
| World-1.6B      | 10.8599 +/- 0.07657 | 10.8604 +/- 0.07657 |
| World-3B        | 9.3254 +/- 0.06322  | 9.3256 +/- 0.06322  |
| World-7B        | 7.9571 +/- 0.05213  | 7.9570 +/- 0.05213  |

test-backend-ops perf tests:

Backend 1/2 (CPU)
  Backend name: CPU
  RWKV_WKV(type=f32,head_count=32,head_size=64,n_seq_tokens=1,n_seqs=1):             7826 runs -    40.90 us/run -     1072 kB/run -   24.99 GB/s
  RWKV_WKV(type=f32,head_count=32,head_size=64,n_seq_tokens=32,n_seqs=1):            3629 runs -   511.39 us/run -     2312 kB/run -    4.31 GB/s
  RWKV_WKV(type=f32,head_count=32,head_size=64,n_seq_tokens=32,n_seqs=4):             910 runs -  2195.17 us/run -     9224 kB/run -    4.01 GB/s
  RWKV_WKV(type=f32,head_count=32,head_size=64,n_seq_tokens=128,n_seqs=4):            342 runs -  8184.47 us/run -    24584 kB/run -    2.86 GB/s
  Backend CPU: OK

Backend 2/2 (CUDA0)
  Backend name: CUDA0
  RWKV_WKV(type=f32,head_count=32,head_size=64,n_seq_tokens=1,n_seqs=1):             8192 runs -     6.86 us/run -     1072 kB/run -  149.05 GB/s
  RWKV_WKV(type=f32,head_count=32,head_size=64,n_seq_tokens=32,n_seqs=1):            8192 runs -    22.27 us/run -     2312 kB/run -   99.00 GB/s
  RWKV_WKV(type=f32,head_count=32,head_size=64,n_seq_tokens=32,n_seqs=4):            3638 runs -    19.77 us/run -     9224 kB/run -  444.87 GB/s
  RWKV_WKV(type=f32,head_count=32,head_size=64,n_seq_tokens=128,n_seqs=4):           1365 runs -    71.58 us/run -    24584 kB/run -  327.55 GB/s
  Backend CUDA0: OK
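(For reference: these numbers come from llama.cpp's `test-backend-ops` tool in perf mode; an invocation along the lines of `./test-backend-ops perf -o RWKV_WKV` should reproduce them, though the exact command isn't shown here.)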

TODO:

  • Add speed comparison
  • Add perplexity comparison

MollySophia marked this pull request as draft on September 12, 2024
github-actions bot added the testing (Everything test related) and Nvidia GPU (Issues specific to Nvidia GPUs) labels on Sep 12, 2024
uniartisan (Contributor) commented Sep 13, 2024

Hi, Molly.
I have some small suggestions. Could you rename the wkv kernel to wkv6? As far as I know, wkv7 will be released soon.
Maybe we can also move forward and implement FP16 calculations (perhaps in a follow-up PR).

MollySophia marked this pull request as draft on September 13, 2024
MollySophia (Contributor, Author) commented:

> Hi, Molly. I have some small suggestions. Can you rename the wkv kernel to wkv6? As far as I know, wkv7 will be released soon. Maybe we can move forward and implement fp16 calculations (maybe next PR)

Yes. However, I don't think it's that urgent; it can also be done after RWKV v7 is released, as part of the initial RWKV v7 support PR.

MollySophia (Contributor, Author) commented:

Hi! @ggerganov
In case you forgot about this PR :D

slaren merged commit 2a63caa into ggerganov:master on Sep 22, 2024
53 checks passed
MollySophia added a commit to MollySophia/llama.cpp that referenced this pull request Sep 22, 2024
slaren pushed a commit that referenced this pull request Sep 22, 2024
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* ggml: CUDA unary op EXP
* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <[email protected]>
Labels: Nvidia GPU (Issues specific to Nvidia GPUs), testing (Everything test related)