use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler #1314
Tested with a 13B model.
How does the …
Things could be different if the CPU needed throttling because it reached 85 °C, but in my case the difference was 60 vs 70 °C.

with pause:
without pause:
Interesting, can you suggest an alternative for …?
I do not have arm64 to test with, but Linux … Memory can be marked as clobbered, so the compiler does not optimize anything away.
Tried a few variations of these, but not sure how to measure the effect:

```diff
--- a/ggml.c
+++ b/ggml.c
@@ -11123,7 +11123,21 @@ typedef int ggml_lock_t;
 #define ggml_lock_init(x)    UNUSED(x)
 #define ggml_lock_destroy(x) UNUSED(x)
+#ifdef __ARM_NEON
+static inline void ggml_lock_lock(void * x);
+inline void ggml_lock_lock(void * x) {
+    UNUSED(x);
+    for (int i = 0; i < 10; ++i) {
+        __asm __volatile("nop");
+    }
+    __asm __volatile(
+        " sev\n"
+        " wfe\n"
+    );
+}
+#else
 #define ggml_lock_lock(x)    UNUSED(x)
+#endif
 #define ggml_lock_unlock(x)  UNUSED(x)
 #define GGML_LOCK_INITIALIZER 0
```

Anyway, for x86, if other people can confirm that the performance is not degraded, we can merge the proposed `pause`.
It looks like the ARM version may be …
Maybe `yield` is more suitable for arm64. But I used `sensors` to check the temperature, and …
Ryzen 4600H in a Huawei MateBook 14 (6c/12t): no temperature differences, both hit 85 °C instantly.

- 12 threads
- 6 threads
- 3 threads
- 1 thread
Testing on an i9-9900K (12 threads, cuBLAS enabled):

PR:
Master:

For me, this has no downsides: it uses 20% less power and is as fast or faster.
Are you sure you wouldn't get the same results just by reducing your number of threads to your logical core count?
With `-t 6`: 70 ms per token. With `-t 14` (autoselected): 92 ms per token. My system has only two memory channels.
This seems like something the operating system / standard library should handle for us, though. C11 has introduced …
There have been a few attempts to do that, but this seems to be the first that successfully reduces power usage without affecting performance.
Since nobody has reported degraded performance with this PR, I think it should be merged. But I am concerned that if #1305 is merged at the same time and one of the two PRs causes performance issues for some people, it may be harder to find which one is the source of the problem. So it may be better to delay merging this until a few days after #1305.
I tried this myself and didn't notice any performance loss. Over a couple of runs it seemed, if anything, slightly faster, but not by any significant amount (within the margin of error/variance I was seeing).

I'm thinking at least some of the pushback we're seeing about #1305 is from people who download models already quantized rather than quantizing them themselves, so the breaking change will mean they can't upgrade until they can find newer models. If this does save energy, it might be better to push it first, so people who choose to stay on an older version until they find newer models have a more eco-friendly version.
#1305 will not be merged before …
They can always do …
You might also want to think about other architectures.