ggml : become thread-safe #3960
Comments
Any updates?
FYI, I noticed that while recent llama.cpp has speed improvements, thread safety has gotten worse. Now when I run a TTS model and a GGUF model in 2 threads, everything crashes in the latest llama.cpp, when it did not use to. I get:
It worked perfectly under heavy use before.
I am a bit confused. I thought that @slaren solved this problem with 'llama : add pipeline parallelism support (#6017)'? Or do you mean something else here? @slaren said that once he was done with #6017 he would fix the backend to release all CUDA memory, which is a big problem for many of us. I am not impatient, I would just like to understand what is happening.
What I meant is that I will work on this after the pipeline parallelism is merged, which is what I am doing. It will still take a while to complete, as fixing this will require infrastructure changes in other parts of the code. Sorry for the confusion.
I understand. Thank you for caring about this issue and for working on it! I tried to solve it myself but could not.
#6170 should fix this issue in the CUDA backend.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Is this actually fixed?
There are some additional problems in #6909, maybe because of this fix?
@martindevans It should be fixed, please report any issues with thread safety. For example, using multiple llama contexts simultaneously, each with a different CUDA GPU on a different thread, should now be possible. CPU and Metal should also be thread-safe; other backends probably are not.
That's great to hear! I'll experiment with removing some of the locks we added into LLamaSharp and will report any bugs. Thanks.
What about same GPU? Why isn't that thread safe too?
It should also be thread-safe, but I don't expect that to be a very useful use case.
ref #499 (reply in thread)
We should be able to run inference on multiple graphs, backends and devices in parallel.
Currently, there are CUDA singletons that break this requirement and possibly there could be other problems.
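To illustrate the requirement, here is a minimal sketch of the general pattern, not ggml's actual code: each thread owns its backend state instead of sharing a process-wide singleton. The names `backend_state`, `eval_with_own_state` and `run_parallel` are hypothetical and exist only for this example; no real ggml or CUDA API is used.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for per-backend state (NOT a real ggml type).
// If this lived in a single global ("singleton"), concurrent threads would
// race on `device` and `scratch`.
struct backend_state {
    int device = -1;            // device this state is bound to
    std::vector<float> scratch; // scratch buffer reused between evaluations
};

// "Evaluates" n elements on the given device using caller-owned state.
// Returns the sum of the scratch buffer.
float eval_with_own_state(backend_state & st, int device, std::size_t n) {
    st.device = device;
    st.scratch.assign(n, static_cast<float>(device));
    float sum = 0.0f;
    for (float v : st.scratch) {
        sum += v;
    }
    // Holds because no other thread has access to `st`.
    assert(st.device == device);
    return sum;
}

// Runs one evaluation per device, each on its own thread with its own state.
std::vector<float> run_parallel(int n_devices, std::size_t n) {
    std::vector<backend_state> states(n_devices);
    std::vector<float> results(n_devices, 0.0f);
    std::vector<std::thread> workers;
    for (int d = 0; d < n_devices; ++d) {
        // Each thread touches only states[d] and results[d], so no locking is needed.
        workers.emplace_back([&states, &results, d, n] {
            results[d] = eval_with_own_state(states[d], d, n);
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return results;
}
```

Calling `run_parallel(4, 1000)` gives each of the four threads an independent state object. With a shared global state, the same code would need a lock around every evaluation, which serializes the work.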