use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler #1314

Merged
merged 3 commits into ggerganov:master on May 9, 2023

Conversation

@Safari77 (Contributor) commented May 4, 2023

Tested with a 13B model.
You might also want to think about other architectures.

[screenshots: pause1, pause2]

@ggerganov (Owner)

How does the eval time (ms / run) change with this?

@Safari77 (Contributor Author) commented May 4, 2023

> How does the eval time (ms / run) change with this?

Results could differ if the CPU had to throttle after reaching 85 °C, but in my case the difference was 60 °C vs 70 °C.

with pause:

llama_print_timings:        load time =   491.01 ms
llama_print_timings:      sample time =   112.94 ms /   381 runs   (    0.30 ms per run)
llama_print_timings: prompt eval time =  2892.00 ms /    27 tokens (  107.11 ms per token)
llama_print_timings:        eval time = 99059.15 ms /   381 runs   (  260.00 ms per run)
llama_print_timings:       total time = 110540.60 ms

Performance counter stats for './main -n 2048 -c 2048 -t 12 --ignore-eos --interactive-first --color --instruct -s 1 -m
 ../gpt4-x-alpaca-13b-native-ggml-q4_0.bin':

      1 220 185,57 msec task-clock:u                     #   11,025 CPUs utilized
                 0      context-switches:u               #    0,000 /sec
                 0      cpu-migrations:u                 #    0,000 /sec
           345 203      page-faults:u                    #  282,910 /sec
 3 145 904 218 579      cpu_core/cycles/u                #    2,578 G/sec                    (98,56%)
 2 769 978 718 128      cpu_atom/cycles/u                #    2,270 G/sec                    (85,03%)
 1 937 590 809 996      cpu_core/instructions/u          #    1,588 G/sec                    (98,56%)
 2 150 075 610 108      cpu_atom/instructions/u          #    1,762 G/sec                    (85,03%)
   114 483 109 780      cpu_core/branches/u              #   93,824 M/sec                    (98,56%)
   113 048 578 832      cpu_atom/branches/u              #   92,649 M/sec                    (85,03%)
       472 475 920      cpu_core/branch-misses/u         #  387,216 K/sec                    (98,56%)
       538 736 029      cpu_atom/branch-misses/u         #  441,520 K/sec                    (85,03%)
18 682 482 544 249      cpu_core/slots:u/                #   15,311 G/sec                    (98,56%)
 2 044 175 523 462      cpu_core/topdown-retiring/u      #     10,9% Retiring                (98,56%)
    35 809 640 118      cpu_core/topdown-bad-spec/u      #      0,2% Bad Speculation         (98,56%)
    80 932 620 092      cpu_core/topdown-fe-bound/u      #      0,4% Frontend Bound          (98,56%)
16 535 620 003 457      cpu_core/topdown-be-bound/u      #     88,4% Backend Bound           (98,56%)
   224 977 900 355      cpu_core/topdown-heavy-ops/u     #      1,2% Heavy Operations       #      9,7% Light Operations        (98,56%)
    20 098 273 461      cpu_core/topdown-br-mispredict/u #      0,1% Branch Mispredict      #      0,1% Machine Clears          (98,56%)
    15 221 484 837      cpu_core/topdown-fetch-lat/u     #      0,1% Fetch Latency          #      0,4% Fetch Bandwidth         (98,56%)
 3 903 133 516 034      cpu_core/topdown-mem-bound/u     #     20,9% Memory Bound           #     67,6% Core Bound              (98,56%)

without pause:

llama_print_timings:        load time =   494.46 ms
llama_print_timings:      sample time =   113.41 ms /   383 runs   (    0.30 ms per run)
llama_print_timings: prompt eval time =  2892.17 ms /    27 tokens (  107.12 ms per token)
llama_print_timings:        eval time = 100403.52 ms /   383 runs   (  262.15 ms per run)
llama_print_timings:       total time = 105522.82 ms

 Performance counter stats for './main -n 2048 -c 2048 -t 12 --ignore-eos --interactive-first --color --instruct -s 1 -m
 ../gpt4-x-alpaca-13b-native-ggml-q4_0.bin':

      1 236 239,64 msec task-clock:u                     #   11,699 CPUs utilized          
                 0      context-switches:u               #    0,000 /sec                   
                 0      cpu-migrations:u                 #    0,000 /sec                   
           345 993      page-faults:u                    #  279,875 /sec                   
 3 188 758 738 410      cpu_core/cycles/u                #    2,579 G/sec                    (98,53%)
 2 722 421 590 622      cpu_atom/cycles/u                #    2,202 G/sec                    (87,66%)
12 756 258 981 589      cpu_core/instructions/u          #   10,319 G/sec                    (98,53%)
 3 434 808 721 694      cpu_atom/instructions/u          #    2,778 G/sec                    (87,66%)
 2 992 087 196 730      cpu_core/branches/u              #    2,420 G/sec                    (98,53%)
   446 389 930 549      cpu_atom/branches/u              #  361,087 M/sec                    (87,66%)
       474 845 600      cpu_core/branch-misses/u         #  384,105 K/sec                    (98,53%)
       525 726 624      cpu_atom/branch-misses/u         #  425,263 K/sec                    (87,66%)
19 040 516 217 918      cpu_core/slots:u/                #   15,402 G/sec                    (98,53%)
9 876 182 751 945      cpu_core/topdown-retiring/u      #     51,8% Retiring                (98,53%)
    75 459 706 605      cpu_core/topdown-bad-spec/u      #      0,4% Bad Speculation         (98,53%)
   107 405 880 272      cpu_core/topdown-fe-bound/u      #      0,6% Frontend Bound          (98,53%)
 9 011 728 924 394      cpu_core/topdown-be-bound/u      #     47,3% Backend Bound           (98,53%)
    79 615 892 727      cpu_core/topdown-heavy-ops/u     #      0,4% Heavy Operations       #     51,4% Light Operations        (98,53%)
    33 803 030 664      cpu_core/topdown-br-mispredict/u #      0,2% Branch Mispredict      #      0,2% Machine Clears          (98,53%)
    15 939 455 442      cpu_core/topdown-fetch-lat/u     #      0,1% Fetch Latency          #      0,5% Fetch Bandwidth         (98,53%)
 3 402 714 615 342      cpu_core/topdown-mem-bound/u     #     17,8% Memory Bound           #     29,4% Core Bound              (98,53%)

@ggerganov (Owner)

Interesting, can you suggest an alternative for arm64 that I can try?

@Safari77 (Contributor Author) commented May 4, 2023

> Interesting, can you suggest an alternative for arm64 that I can try?

I do not have arm64 to test with, but the Linux kernel's arch/arm64/include/asm/cmpxchg.h uses sevl and wfe; try what combination works for you...

Memory can be marked as clobbered, so the compiler does not optimize anything away:

__asm__ __volatile__("pause\n" : : : "memory"); 
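For context, a minimal sketch of how such a hint could be wired into ggml's spin-wait macros on x86 (the ggml_lock_* names appear in the diff below); the ggml_spin_relax helper name is made up for illustration, and this is not necessarily the exact merged change:

#if defined(__x86_64__) || defined(__i386__)
// hint that we are in a spin-wait loop: the core backs off for a few cycles,
// which lowers power draw and frees resources for the sibling hyperthread
static inline void ggml_spin_relax(void) {
    __asm__ __volatile__("pause" : : : "memory");
}
#define ggml_lock_lock(x) ggml_spin_relax()
#else
#define ggml_lock_lock(x) UNUSED(x)
#endif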

@ggerganov (Owner)

Tried a few variations of these, but not sure how to measure the effect:

-- a/ggml.c
+++ b/ggml.c
@@ -11123,7 +11123,21 @@ typedef int ggml_lock_t;
 
 #define ggml_lock_init(x)    UNUSED(x)
 #define ggml_lock_destroy(x) UNUSED(x)
+#ifdef __ARM_NEON
+static inline void ggml_lock_lock(void * x);
+inline void ggml_lock_lock(void * x) {
+    UNUSED(x);
+    for (int i = 0; i < 10; ++i) {
+        __asm __volatile("nop");
+    }
+    __asm __volatile(
+            "  sev\n"
+            "  wfe\n"
+            );
+}
+#else
 #define ggml_lock_lock(x)    UNUSED(x)
+#endif
 #define ggml_lock_unlock(x)  UNUSED(x)
 
 #define GGML_LOCK_INITIALIZER 0

Anyway, for x86, if other people can confirm that the performance is not degraded, we can merge the proposed pause.

@j-f1 (Collaborator) commented May 4, 2023

It looks like the ARM version may be yield.
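A comparable hedged sketch for arm64 (yield is the AArch64 spin-loop hint, roughly analogous to x86 pause; the wrapper name is only illustrative):

#if defined(__aarch64__)
// AArch64 spin-loop hint; unlike wfe it never blocks, it only yields
// pipeline resources for a moment
static inline void ggml_spin_relax(void) {
    __asm__ __volatile__("yield" : : : "memory");
}
#endif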

@Safari77 (Contributor Author) commented May 4, 2023

> Tried a few variations of these, but not sure how to measure the effect:

Maybe yield is more suitable for arm64. I used sensors to check the temperature, and perf stat -- ./main ... for the performance counter stats.

@Alumniminium commented May 5, 2023

Ryzen 4600H in a Huawei Matebook 14 - no temperature difference, both hit 85 °C instantly (6c/12t).
Ran this 3 times each to make sure.

12 Threads

pause:
llama_print_timings:        load time =  2793.53 ms
llama_print_timings:      sample time =   180.13 ms /   128 runs   (    1.41 ms per run)
llama_print_timings: prompt eval time =  2437.19 ms /    15 tokens (  162.48 ms per token)
llama_print_timings:        eval time = 26209.22 ms /   127 runs   (  206.37 ms per run)
llama_print_timings:       total time = 29236.74 ms
master:
llama_print_timings:        load time =  2799.41 ms
llama_print_timings:      sample time =   180.16 ms /   128 runs   (    1.41 ms per run)
llama_print_timings: prompt eval time =  2445.65 ms /    15 tokens (  163.04 ms per token)
llama_print_timings:        eval time = 26362.74 ms /   127 runs   (  207.58 ms per run)
llama_print_timings:       total time = 29396.40 ms

6 Threads

pause:
llama_print_timings:        load time =  2596.68 ms
llama_print_timings:      sample time =   156.47 ms /   128 runs   (    1.22 ms per run)
llama_print_timings: prompt eval time =  2237.88 ms /    15 tokens (  149.19 ms per token)
llama_print_timings:        eval time = 27610.40 ms /   127 runs   (  217.40 ms per run)
llama_print_timings:       total time = 30410.26 ms
master:
llama_print_timings:        load time =  2708.89 ms
llama_print_timings:      sample time =   164.13 ms /   128 runs   (    1.28 ms per run)
llama_print_timings: prompt eval time =  2353.16 ms /    15 tokens (  156.88 ms per token)
llama_print_timings:        eval time = 30370.40 ms /   127 runs   (  239.14 ms per run)
llama_print_timings:       total time = 33293.02 ms

3 Threads

pause:
llama_print_timings:        load time =  4126.54 ms
llama_print_timings:      sample time =   151.20 ms /   128 runs   (    1.18 ms per run)
llama_print_timings: prompt eval time =  3787.19 ms /    15 tokens (  252.48 ms per token)
llama_print_timings:        eval time = 35192.36 ms /   127 runs   (  277.11 ms per run)
llama_print_timings:       total time = 39516.05 ms
master:
llama_print_timings:        load time =  4108.10 ms
llama_print_timings:      sample time =   157.00 ms /   128 runs   (    1.23 ms per run)
llama_print_timings: prompt eval time =  3767.13 ms /    15 tokens (  251.14 ms per token)
llama_print_timings:        eval time = 35767.54 ms /   127 runs   (  281.63 ms per run)
llama_print_timings:       total time = 40080.61 ms

1 Thread

pause:
llama_print_timings:        load time = 11223.93 ms
llama_print_timings:      sample time =    63.93 ms /    64 runs   (    1.00 ms per run)
llama_print_timings: prompt eval time = 10867.14 ms /    15 tokens (  724.48 ms per token)
llama_print_timings:        eval time = 40488.44 ms /    63 runs   (  642.67 ms per run)
llama_print_timings:       total time = 51795.73 ms
master:
llama_print_timings:        load time = 10348.28 ms
llama_print_timings:      sample time =    60.87 ms /    64 runs   (    0.95 ms per run)
llama_print_timings: prompt eval time =  9994.98 ms /    15 tokens (  666.33 ms per token)
llama_print_timings:        eval time = 39635.09 ms /    63 runs   (  629.13 ms per run)
llama_print_timings:       total time = 50062.91 ms

@slaren (Collaborator) commented May 5, 2023

Testing on an i9 9900K (12 threads, cuBLAS enabled):

PR:

llama_print_timings: prompt eval time =  3289.73 ms /   631 tokens (    5.21 ms per token)
llama_print_timings:        eval time = 16655.04 ms /   127 runs   (  131.14 ms per run)
2.10 seconds per pass - ETA 22 minutes
CPU power usage: ~80W

Master:

llama_print_timings: prompt eval time =  3206.96 ms /   631 tokens (    5.08 ms per token)
llama_print_timings:        eval time = 16859.71 ms /   127 runs   (  132.75 ms per run)
2.17 seconds per pass - ETA 23 minutes
CPU power usage: ~100W

For me this has no downsides: it uses 20% less power and is as fast or faster.

@Midaychi commented May 5, 2023

Are you sure you wouldn't get the same results just by reducing your number of threads to your logical cores?
Hyperthreading isn't doing much if your logical cores are already pinned at 100%.

@Safari77 (Contributor Author) commented May 6, 2023

with -t 6: 70 ms per token
36,6% Memory Bound # 32,2% Core Bound
12% of CPU time used in ggml_graph_compute_thread

with -t 14 (autoselected): 92 ms per token
19,9% Memory Bound # 67,7% Core Bound
30% of CPU time used in ggml_graph_compute_thread

My system has only two memory channels.

@sw (Contributor) commented May 6, 2023

This seems like something the operating system / standard library should handle for us, though. C11 introduced <threads.h>, but compiler support seems to be lackluster. Could we achieve the same thing by using cnd_wait?
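For reference, a minimal sketch of what a cnd_wait-based wait could look like with C11 <threads.h>; the work_signal_t type and both functions are hypothetical and not how ggml currently synchronizes its workers:

#include <stdbool.h>
#include <threads.h>

typedef struct {
    mtx_t mtx;
    cnd_t cnd;
    bool  ready;
} work_signal_t;   // hypothetical helper, not part of ggml.c

// worker side: sleep in the kernel instead of spinning
static void wait_for_work(work_signal_t * s) {
    mtx_lock(&s->mtx);
    while (!s->ready) {              // loop guards against spurious wakeups
        cnd_wait(&s->cnd, &s->mtx);
    }
    s->ready = false;
    mtx_unlock(&s->mtx);
}

// main-thread side: publish work and wake one waiter
static void signal_work(work_signal_t * s) {
    mtx_lock(&s->mtx);
    s->ready = true;
    cnd_signal(&s->cnd);
    mtx_unlock(&s->mtx);
}

The trade-off would be wakeup latency on every synchronization point, which may be why a busy-wait plus a pause hint remains attractive for per-node synchronization here.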

@slaren (Collaborator) commented May 6, 2023

There have been a few attempts to do that, but this seems to be the first that successfully reduces power usage without affecting performance.

@slaren (Collaborator) commented May 9, 2023

Since nobody has reported degraded performance with this PR, I think this should be merged. But I am concerned that if #1305 is merged at the same time, and one of the two PRs causes performance issues for some people, it may be harder to find which one is the source of the problem. So it may be better to delay merging this until a few days after #1305.

@DannyDaemonic (Contributor)

I tried this myself and didn't notice any performance loss. I did a couple runs and if anything, it seemed to be slightly faster but not by any significant amount. (Within the margin of error/variance that I was seeing.)

I'm thinking at least some of the pushback we're seeing about #1305 is from people who download models already quantized rather than quantizing them themselves, so the breaking change will mean they can't upgrade until they can find newer models. If this does save energy, it might be better to push it first, so people who choose to stay on an older version until they find newer models have a more eco-friendly build.

@ggerganov (Owner) commented May 9, 2023

#1305 will not be merged before ~~Friday~~ Thursday, so maybe get this in for a couple of days before that

slaren merged commit 9f8dbc4 into ggerganov:master on May 9, 2023
@Safari77 (Contributor Author) commented May 9, 2023

> Since nobody has reported degraded performance with this PR, I think this should be merged. But I am concerned that if #1305 is merged at the same time, and one of the two PRs causes performance issues for some people, it may be harder to find which one is the source of the problem. So it may be better to delay merging this until a few days after #1305.

They can always do git bisect to find the problematic commit.
