use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler #1314

Merged
merged 3 commits into ggerganov:master on May 9, 2023

Conversation

@Safari77 (Contributor) commented May 4, 2023

Tested with a 13B model.
You might also want to think about other architectures.

[screenshots: pause1, pause2]

@ggerganov (Owner)

How does the eval time (ms / run) change with this?

@Safari77 (Contributor Author) commented May 4, 2023

> How does the eval time (ms / run) change with this?

Results could differ if the CPU had to throttle after reaching 85 °C, but in my case the difference was 60 °C vs 70 °C.

with pause:

llama_print_timings:        load time =   491.01 ms
llama_print_timings:      sample time =   112.94 ms /   381 runs   (    0.30 ms per run)
llama_print_timings: prompt eval time =  2892.00 ms /    27 tokens (  107.11 ms per token)
llama_print_timings:        eval time = 99059.15 ms /   381 runs   (  260.00 ms per run)
llama_print_timings:       total time = 110540.60 ms

Performance counter stats for './main -n 2048 -c 2048 -t 12 --ignore-eos --interactive-first --color --instruct -s 1 -m
 ../gpt4-x-alpaca-13b-native-ggml-q4_0.bin':

      1 220 185,57 msec task-clock:u                     #   11,025 CPUs utilized
                 0      context-switches:u               #    0,000 /sec
                 0      cpu-migrations:u                 #    0,000 /sec
           345 203      page-faults:u                    #  282,910 /sec
 3 145 904 218 579      cpu_core/cycles/u                #    2,578 G/sec                    (98,56%)
 2 769 978 718 128      cpu_atom/cycles/u                #    2,270 G/sec                    (85,03%)
 1 937 590 809 996      cpu_core/instructions/u          #    1,588 G/sec                    (98,56%)
 2 150 075 610 108      cpu_atom/instructions/u          #    1,762 G/sec                    (85,03%)
   114 483 109 780      cpu_core/branches/u              #   93,824 M/sec                    (98,56%)
   113 048 578 832      cpu_atom/branches/u              #   92,649 M/sec                    (85,03%)
       472 475 920      cpu_core/branch-misses/u         #  387,216 K/sec                    (98,56%)
       538 736 029      cpu_atom/branch-misses/u         #  441,520 K/sec                    (85,03%)
18 682 482 544 249      cpu_core/slots:u/                #   15,311 G/sec                    (98,56%)
 2 044 175 523 462      cpu_core/topdown-retiring/u      #     10,9% Retiring                (98,56%)
    35 809 640 118      cpu_core/topdown-bad-spec/u      #      0,2% Bad Speculation         (98,56%)
    80 932 620 092      cpu_core/topdown-fe-bound/u      #      0,4% Frontend Bound          (98,56%)
16 535 620 003 457      cpu_core/topdown-be-bound/u      #     88,4% Backend Bound           (98,56%)
   224 977 900 355      cpu_core/topdown-heavy-ops/u     #      1,2% Heavy Operations       #      9,7% Light Operations        (98,56%)
    20 098 273 461      cpu_core/topdown-br-mispredict/u #      0,1% Branch Mispredict      #      0,1% Machine Clears          (98,56%)
    15 221 484 837      cpu_core/topdown-fetch-lat/u     #      0,1% Fetch Latency          #      0,4% Fetch Bandwidth         (98,56%)
 3 903 133 516 034      cpu_core/topdown-mem-bound/u     #     20,9% Memory Bound           #     67,6% Core Bound              (98,56%)

without pause:

llama_print_timings:        load time =   494.46 ms
llama_print_timings:      sample time =   113.41 ms /   383 runs   (    0.30 ms per run)
llama_print_timings: prompt eval time =  2892.17 ms /    27 tokens (  107.12 ms per token)
llama_print_timings:        eval time = 100403.52 ms /   383 runs   (  262.15 ms per run)
llama_print_timings:       total time = 105522.82 ms

 Performance counter stats for './main -n 2048 -c 2048 -t 12 --ignore-eos --interactive-first --color --instruct -s 1 -m
 ../gpt4-x-alpaca-13b-native-ggml-q4_0.bin':

      1 236 239,64 msec task-clock:u                     #   11,699 CPUs utilized          
                 0      context-switches:u               #    0,000 /sec                   
                 0      cpu-migrations:u                 #    0,000 /sec                   
           345 993      page-faults:u                    #  279,875 /sec                   
 3 188 758 738 410      cpu_core/cycles/u                #    2,579 G/sec                    (98,53%)
 2 722 421 590 622      cpu_atom/cycles/u                #    2,202 G/sec                    (87,66%)
12 756 258 981 589      cpu_core/instructions/u          #   10,319 G/sec                    (98,53%)
 3 434 808 721 694      cpu_atom/instructions/u          #    2,778 G/sec                    (87,66%)
 2 992 087 196 730      cpu_core/branches/u              #    2,420 G/sec                    (98,53%)
   446 389 930 549      cpu_atom/branches/u              #  361,087 M/sec                    (87,66%)
       474 845 600      cpu_core/branch-misses/u         #  384,105 K/sec                    (98,53%)
       525 726 624      cpu_atom/branch-misses/u         #  425,263 K/sec                    (87,66%)
19 040 516 217 918      cpu_core/slots:u/                #   15,402 G/sec                    (98,53%)
9 876 182 751 945      cpu_core/topdown-retiring/u      #     51,8% Retiring                (98,53%)
    75 459 706 605      cpu_core/topdown-bad-spec/u      #      0,4% Bad Speculation         (98,53%)
   107 405 880 272      cpu_core/topdown-fe-bound/u      #      0,6% Frontend Bound          (98,53%)
 9 011 728 924 394      cpu_core/topdown-be-bound/u      #     47,3% Backend Bound           (98,53%)
    79 615 892 727      cpu_core/topdown-heavy-ops/u     #      0,4% Heavy Operations       #     51,4% Light Operations        (98,53%)
    33 803 030 664      cpu_core/topdown-br-mispredict/u #      0,2% Branch Mispredict      #      0,2% Machine Clears          (98,53%)
    15 939 455 442      cpu_core/topdown-fetch-lat/u     #      0,1% Fetch Latency          #      0,5% Fetch Bandwidth         (98,53%)
 3 402 714 615 342      cpu_core/topdown-mem-bound/u     #     17,8% Memory Bound           #     29,4% Core Bound              (98,53%)

@ggerganov (Owner)

Interesting, can you suggest an alternative for arm64 that I can try?

@Safari77 (Contributor Author) commented May 4, 2023

> Interesting, can you suggest an alternative for arm64 that I can try?

I do not have arm64 to test with, but the Linux kernel's arch/arm64/include/asm/cmpxchg.h uses sevl and wfe; try what combination works for you...

Memory can be marked as clobbered, so the compiler does not optimize anything away:

__asm__ __volatile__("pause\n" : : : "memory"); 
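For context, a minimal sketch of how such a hint could be wired into ggml's spin-wait macros on x86 (the ggml_lock_* names appear in the diff below); the ggml_spin_relax helper name is made up for illustration, and this is not necessarily the exact merged change:

#if defined(__x86_64__) || defined(__i386__)
// hint that we are in a spin-wait loop: the core backs off for a few cycles,
// which lowers power draw and frees resources for the sibling hyperthread
static inline void ggml_spin_relax(void) {
    __asm__ __volatile__("pause" : : : "memory");
}
#define ggml_lock_lock(x) ggml_spin_relax()
#else
#define ggml_lock_lock(x) UNUSED(x)
#endif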

@ggerganov (Owner)

Tried a few variations of these, but not sure how to measure the effect:

-- a/ggml.c
+++ b/ggml.c
@@ -11123,7 +11123,21 @@ typedef int ggml_lock_t;
 
 #define ggml_lock_init(x)    UNUSED(x)
 #define ggml_lock_destroy(x) UNUSED(x)
+#ifdef __ARM_NEON
+static inline void ggml_lock_lock(void * x);
+inline void ggml_lock_lock(void * x) {
+    UNUSED(x);
+    for (int i = 0; i < 10; ++i) {
+        __asm __volatile("nop");
+    }
+    __asm __volatile(
+            "  sev\n"
+            "  wfe\n"
+            );
+}
+#else
 #define ggml_lock_lock(x)    UNUSED(x)
+#endif
 #define ggml_lock_unlock(x)  UNUSED(x)
 
 #define GGML_LOCK_INITIALIZER 0

Anyway, for x86, if other people can confirm that the performance is not degraded, we can merge the proposed pause.

@j-f1 (Collaborator) commented May 4, 2023

It looks like the ARM version may be yield.
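A comparable hedged sketch for arm64 (yield is the AArch64 spin-loop hint, roughly analogous to x86 pause; the wrapper name is only illustrative):

#if defined(__aarch64__)
// AArch64 spin-loop hint; unlike wfe it never blocks, it only yields
// pipeline resources for a moment
static inline void ggml_spin_relax(void) {
    __asm__ __volatile__("yield" : : : "memory");
}
#endif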

@Safari77 (Contributor Author) commented May 4, 2023

> Tried a few variations of these, but not sure how to measure the effect:

Maybe yield is more suitable for arm64. I used sensors to check the temperature, and perf stat -- ./main ... for the performance counter stats.

@Alumniminium commented May 5, 2023

Ryzen 4600H in a Huawei Matebook 14 - no temperature difference, both hit 85 °C instantly (6c/12t).
Ran this 3 times each to make sure.

12 Threads

pause:
llama_print_timings:        load time =  2793.53 ms
llama_print_timings:      sample time =   180.13 ms /   128 runs   (    1.41 ms per run)
llama_print_timings: prompt eval time =  2437.19 ms /    15 tokens (  162.48 ms per token)
llama_print_timings:        eval time = 26209.22 ms /   127 runs   (  206.37 ms per run)
llama_print_timings:       total time = 29236.74 ms
master:
llama_print_timings:        load time =  2799.41 ms
llama_print_timings:      sample time =   180.16 ms /   128 runs   (    1.41 ms per run)
llama_print_timings: prompt eval time =  2445.65 ms /    15 tokens (  163.04 ms per token)
llama_print_timings:        eval time = 26362.74 ms /   127 runs   (  207.58 ms per run)
llama_print_timings:       total time = 29396.40 ms

6 Threads

pause:
llama_print_timings:        load time =  2596.68 ms
llama_print_timings:      sample time =   156.47 ms /   128 runs   (    1.22 ms per run)
llama_print_timings: prompt eval time =  2237.88 ms /    15 tokens (  149.19 ms per token)
llama_print_timings:        eval time = 27610.40 ms /   127 runs   (  217.40 ms per run)
llama_print_timings:       total time = 30410.26 ms
master:
llama_print_timings:        load time =  2708.89 ms
llama_print_timings:      sample time =   164.13 ms /   128 runs   (    1.28 ms per run)
llama_print_timings: prompt eval time =  2353.16 ms /    15 tokens (  156.88 ms per token)
llama_print_timings:        eval time = 30370.40 ms /   127 runs   (  239.14 ms per run)
llama_print_timings:       total time = 33293.02 ms

3 Threads

pause:
llama_print_timings:        load time =  4126.54 ms
llama_print_timings:      sample time =   151.20 ms /   128 runs   (    1.18 ms per run)
llama_print_timings: prompt eval time =  3787.19 ms /    15 tokens (  252.48 ms per token)
llama_print_timings:        eval time = 35192.36 ms /   127 runs   (  277.11 ms per run)
llama_print_timings:       total time = 39516.05 ms
master:
llama_print_timings:        load time =  4108.10 ms
llama_print_timings:      sample time =   157.00 ms /   128 runs   (    1.23 ms per run)
llama_print_timings: prompt eval time =  3767.13 ms /    15 tokens (  251.14 ms per token)
llama_print_timings:        eval time = 35767.54 ms /   127 runs   (  281.63 ms per run)
llama_print_timings:       total time = 40080.61 ms

1 Thread

pause:
llama_print_timings:        load time = 11223.93 ms
llama_print_timings:      sample time =    63.93 ms /    64 runs   (    1.00 ms per run)
llama_print_timings: prompt eval time = 10867.14 ms /    15 tokens (  724.48 ms per token)
llama_print_timings:        eval time = 40488.44 ms /    63 runs   (  642.67 ms per run)
llama_print_timings:       total time = 51795.73 ms
master:
llama_print_timings:        load time = 10348.28 ms
llama_print_timings:      sample time =    60.87 ms /    64 runs   (    0.95 ms per run)
llama_print_timings: prompt eval time =  9994.98 ms /    15 tokens (  666.33 ms per token)
llama_print_timings:        eval time = 39635.09 ms /    63 runs   (  629.13 ms per run)
llama_print_timings:       total time = 50062.91 ms

@slaren (Collaborator) commented May 5, 2023

Testing on an i9 9900K (12 threads, cuBLAS enabled):

PR:

llama_print_timings: prompt eval time =  3289.73 ms /   631 tokens (    5.21 ms per token)
llama_print_timings:        eval time = 16655.04 ms /   127 runs   (  131.14 ms per run)
2.10 seconds per pass - ETA 22 minutes
CPU power usage: ~80W

Master:

llama_print_timings: prompt eval time =  3206.96 ms /   631 tokens (    5.08 ms per token)
llama_print_timings:        eval time = 16859.71 ms /   127 runs   (  132.75 ms per run)
2.17 seconds per pass - ETA 23 minutes
CPU power usage: ~100W

For me this has no downsides: it uses 20% less power and is as fast or faster.

@Midaychi commented May 5, 2023

Are you sure you wouldn't get the same results just by reducing your number of threads to your logical cores?
Hyperthreading isn't doing much if your logical cores are already pinned at 100%.

@Safari77 (Contributor Author) commented May 6, 2023

with -t 6: 70 ms per token
36,6% Memory Bound # 32,2% Core Bound
12% of CPU time used in ggml_graph_compute_thread

with -t 14 (autoselected): 92 ms per token
19,9% Memory Bound # 67,7% Core Bound
30% of CPU time used in ggml_graph_compute_thread

My system has only two memory channels.

@sw (Contributor) commented May 6, 2023

This seems like something the operating system / standard library should handle for us, though. C11 introduced <threads.h>, but compiler support seems to be lackluster. Could we achieve the same thing by using cnd_wait?
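For reference, a minimal sketch of what a cnd_wait-based wait could look like with C11 <threads.h>; the work_signal_t type and both functions are hypothetical and not how ggml currently synchronizes its workers:

#include <stdbool.h>
#include <threads.h>

typedef struct {
    mtx_t mtx;
    cnd_t cnd;
    bool  ready;
} work_signal_t;   // hypothetical helper, not part of ggml.c

// worker side: sleep in the kernel instead of spinning
static void wait_for_work(work_signal_t * s) {
    mtx_lock(&s->mtx);
    while (!s->ready) {              // loop guards against spurious wakeups
        cnd_wait(&s->cnd, &s->mtx);
    }
    s->ready = false;
    mtx_unlock(&s->mtx);
}

// main-thread side: publish work and wake one waiter
static void signal_work(work_signal_t * s) {
    mtx_lock(&s->mtx);
    s->ready = true;
    cnd_signal(&s->cnd);
    mtx_unlock(&s->mtx);
}

The trade-off would be wakeup latency on every synchronization point, which may be why a busy-wait plus a pause hint remains attractive for per-node synchronization here.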

@slaren (Collaborator) commented May 6, 2023

There have been a few attempts to do that, but this seems to be the first that successfully reduces power usage without affecting performance.

@slaren (Collaborator) commented May 9, 2023

Since nobody has reported degraded performance with this PR, I think this should be merged. But I am concerned that if #1305 is merged at the same time, and one of the two PRs causes performance issues for some people, it may be harder to find which one is the source of the problem. So it may be better to delay merging this until a few days after #1305.

@DannyDaemonic (Contributor)

I tried this myself and didn't notice any performance loss. I did a couple runs and if anything, it seemed to be slightly faster but not by any significant amount. (Within the margin of error/variance that I was seeing.)

I'm thinking at least some of the pushback we're seeing about #1305 is from people who download models already quantized rather than quantizing them themselves, so the breaking change will mean they can't upgrade until they can find newer models. If this does save energy, it might be better to push it first, so people who choose to stay on an older version until they find newer models have a more eco-friendly build.

@ggerganov (Owner) commented May 9, 2023

#1305 will not be merged before ~~Friday~~ Thursday, so maybe get this in for a couple of days before that

slaren merged commit 9f8dbc4 into ggerganov:master on May 9, 2023
@Safari77 (Contributor Author) commented May 9, 2023

> Since nobody has reported degraded performance with this PR, I think this should be merged. But I am concerned that if #1305 is merged at the same time, and one of the two PRs causes performance issues for some people, it may be harder to find which one is the source of the problem. So it may be better to delay merging this until a few days after #1305.

They can always do git bisect to find the problematic commit.
