Crash on AMD graphics card on Windows #202
Comments
@tempstudio could you check if the issue remains with the latest release (v2.2.0)?
I see the same issue with 2.2.1:
INFO [ init] build info | tid="27560" timestamp=1725497899 build=3623 commit="436787f1"
(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 143)
llm_load_vocab: token to piece cache size = 0.1637 MB
Thank you for testing!
Could you try the new build by changing the LlamaLib version here? With this build it should skip the HIP build and use Vulkan instead 🤞
Apologies: it didn't crash this time, after I deleted things from StreamingAssets and reinstalled the package, but I'm pretty sure it's using the CPU: very slow speed and high CPU usage.
Server command: -m "C:/Users/.../AppData/Roaming/LLMUnity/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" -c 4096 -b 512 --log-disable -np 1 -ngl -1
(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 137)
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
...
llm_load_tensors: CPU buffer size = 4685.30 MiB
Giving the 1.1.0-dev a try now
The behavior is the same with 1.1.0-dev.
You are using num GPU layers = -1, which will not use the GPU. Could you try e.g. with 10?
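For illustration, a minimal sketch of the same server command as logged above, but with 10 layers offloaded; the executable name is not shown in the log, so .\server.exe is only a placeholder, and the model path is abbreviated as in the log:
```powershell
# Same arguments as the logged server command, but with -ngl 10 so ten layers
# are offloaded to the GPU instead of -ngl -1.
# ".\server.exe" is a placeholder for whatever binary LLMUnity launches.
.\server.exe -m "C:/Users/.../AppData/Roaming/LLMUnity/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" -c 4096 -b 512 --log-disable -np 1 -ngl 10
```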
I thought -1 would mean all / max?
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Tried it with flash attention OFF and it's the same:
Thanks a lot!
A couple of problems I encountered with 1.1.0-dev2:
Thanks a lot.
I'm going through some issues and I have an idea.
Could you try the
The good news is that it doesn't crash anymore.
INFO [ print_timings] prompt eval time = 192189.92 ms / 399 tokens ( 481.68 ms per token, 2.08 tokens per second) | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_prompt_processing=192189.92 n_prompt_tokens_processed=399 t_token=481.67899749373436 n_tokens_second=2.0760714193543555
I have updated to the latest drivers and also just restarted my system.
Yes! That works!
Performance is equally bad with 10/30 layers. 10 layers:
Is there any possibility of the performance issue being fixed in LlamaLib?
I really doubt it is a problem of LlamaLib because I use and extend code directly from llama.cpp and llamafile. This is an overview of the different libraries:
The source of the speed issue is most probably in the tinyBLAS implementation of llamafile.
There are reasons why I don't use llamafile anymore, although I love the project:
For these reasons I can't bring it back to the project. You could try the following to understand more about the issue using the latest llamafile. Check the timings for both cases:
Then we could find out which implementation is the culprit.
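For illustration, a hedged sketch of the two runs, reusing the binary name, model, and flags from the command later in this thread; the second invocation is an assumption about how to let llamafile pick its native GPU path:
```powershell
# Case 1: force llamafile's tinyBLAS backend (the flags used later in this thread).
.\llamafile-0.8.13.exe -m .\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 99 -p "to be or" --nocompile --tinyblas -c 2048

# Case 2 (assumption): drop --nocompile/--tinyblas so llamafile can try its native
# GPU path, then compare the llama_print_timings lines of both runs.
.\llamafile-0.8.13.exe -m .\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 99 -p "to be or" -c 2048
```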
I will give those a try. Can you build LlamaLib into a command-line standalone so that I can test that too, just in case there's something wonky going on with GPU resource sharing between the AI and Unity?
Here is the performance with tinyBLAS. I don't believe the CUDA run is needed, as I'm using an AMD system and it doesn't support CUDA. I will be very happy if I can get this type of performance inside Unity.
.\llamafile-0.8.13.exe -m .\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 99 -p "to be or" --nocompile --tinyblas -c 2048
llama_print_timings: load time = 2364.86 ms
More logs that might be helpful:
import_cuda_impl: initializing gpu module...
Another piece of info: during the execution, Task Manager shows the GPU usage at 1% instead of the 99% I see when using LlamaLib. This might be inaccurate.
FYI I got llama.cpp's Vulkan build to work (need to set GGML_VK_VISIBLE_DEVICES=0) and the timing is like this:
llama_perf_sampler_print: sampling time = 63.40 ms / 780 runs ( 0.08 ms per token, 12303.03 tokens per second)
So it's (potentially) faster to run Vulkan than HIP with tinyBLAS.
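For reference, a minimal PowerShell sketch of such a Vulkan run; the binary name llama-cli.exe and the model path are illustrative assumptions, not taken from the log:
```powershell
# Expose only the first Vulkan device so the duplicated GPU entry is ignored,
# then run a short prompt with full GPU offload.
$env:GGML_VK_VISIBLE_DEVICES = "0"
.\llama-cli.exe -m .\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 99 -p "to be or"
```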
Thanks for all the testing!
Could you also try the following to see if the build works at the same speed as tinyBLAS?
Could we maybe have a call to resolve this? It would be really helpful!
(1) Vulkan doesn't work because of this problem; it detects the same graphics card twice and then fails to load:
prints 0 - until the editor and the Unity Hub are restarted.
(2) The performance for the HIP server is as bad as it is in the editor:
(3) The Vulkan server works with the right env variable. The performance of the Vulkan server matches llama.cpp.
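One possible way (an assumption, not something verified in this thread) to make the variable survive the editor and Unity Hub restart is to persist it at the user level before relaunching:
```powershell
# Persist the variable at the user level; it applies to processes started afterwards,
# e.g. a relaunched Unity Hub and editor.
setx GGML_VK_VISIBLE_DEVICES 0
# The current shell still needs it set explicitly for anything launched from here.
$env:GGML_VK_VISIBLE_DEVICES = "0"
```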
@tempstudio I revisited the issue. Could you try the latest LLMUnity version and change this line:
Describe the bug
Crash with abort when trying to use an AMD graphics card in the editor.
Model is mistral-7b-instruct-v0.2.Q4_K_M.gguf
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
d3d12: upload buffer was full! Waited for COPY queue for 1.118 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.902 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.897 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.896 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.901 ms.
[Licensing::Client] Successfully resolved entitlement details
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4095.05 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..............................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.24 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
[1722650470] warming up the model with an empty run
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml-cuda.cu:13061
err
Asset Pipeline Refresh (id=5fe1348313ec9e4439edb8aa2e9d608c): Total: 0.010 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
Asset Pipeline Refresh (id=a398558039bd1ba4a8f2fc04f6154810): Total: 0.007 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
Steps to reproduce
No response
LLMUnity version
2.0.3
Operating System
Windows