
llama : Metal inference #1642

Merged
merged 49 commits on Jun 4, 2023

Conversation

ggerganov
Owner

@ggerganov ggerganov commented May 29, 2023

Add full GPU inference of LLaMA on Apple Silicon using Metal

Demo

M1 Pro + 7B LLaMA:

llama-metal-0.mp4

M2 Max + 7B LLaMA:

llama-metal-1-lq.mp4

M2 Max + 13B LLaMA:

llama-metal-13B-0-lq.mp4

M2 Max + 65B LLaMA:

llama-metal-65B-0-lq.mp4

Details

  • The ggml API is extended in ggml-metal.h
  • The Metal shaders / kernels are implemented in ggml-metal.metal
  • This PR implements support only for Q4_0, but all other quantizations can easily be added in the future
  • Works well with mmap to avoid model data duplication in memory. Still, there are a few memory improvements that can be made in the future to reduce memory usage when Metal is enabled
  • The core of the implementation is contained in the ggml_metal_graph_compute() function. It is analogous to the CPU-only ggml_graph_compute(), and its purpose is to evaluate a ggml_cgraph on the GPU in a similar way
  • The implemented shaders currently focus on quantized Matrix x Vector multiplication, which is what LLM text generation normally needs. For tasks that involve Matrix x Matrix products (for example, prompt ingestion or perplexity computation) we don't have an efficient implementation yet, so we fall back to the CPU / ANE
  • There is a clean separation of the implementation: the new ggml-metal.h, ggml-metal.m and ggml-metal.metal files are optional, and all Metal-related code is contained within them. Third-party apps can decide whether to include, modify or ignore them (see the sketch after this list)
  • The proposed implementation can easily be extended to other backends, such as CUDA, by following the same pattern demonstrated in this PR
  • Optionally, we now have support for exporting static computation graphs. Creation and usage are demonstrated in the metal example
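To make the separation concrete, here is a minimal C sketch (not taken from the PR itself) of how a third-party app might drive the Metal backend. The function names follow ggml-metal.h as introduced here, but the exact signatures (for example, whether ggml_metal_init() or ggml_metal_add_buffer() take extra arguments) may have changed since, so treat this as an illustration rather than a reference.

```c
#include "ggml.h"
#include "ggml-metal.h"

// Hypothetical helper: evaluate an already-built ggml graph on the GPU.
// model_data/model_size describe the (possibly mmap'ed) weight buffer.
void run_graph_on_gpu(struct ggml_context * ctx_eval, struct ggml_cgraph * gf,
                      void * model_data, size_t model_size) {
    struct ggml_metal_context * ctx_metal = ggml_metal_init();

    // Map the existing host allocations as Metal buffers - with mmap the
    // model weights are not duplicated in memory.
    ggml_metal_add_buffer(ctx_metal, "data", model_data, model_size);
    ggml_metal_add_buffer(ctx_metal, "eval", ggml_get_mem_buffer(ctx_eval), ggml_get_mem_size(ctx_eval));

    // The Metal analogue of ggml_graph_compute(): evaluate the whole graph on the GPU.
    ggml_metal_graph_compute(ctx_metal, gf);

    // Copy the output tensor back to host memory for further processing.
    ggml_metal_get_tensor(ctx_metal, gf->nodes[gf->n_nodes - 1]);

    ggml_metal_free(ctx_metal);
}
```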

Usage

  • Add LLAMA_METAL=1 to your make command or -DLLAMA_METAL=ON to your cmake command.
  • Add -ngl 1 to the ./main command-line arguments to enable GPU inference
$ make clean
$ LLAMA_METAL=1 make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1

I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o ggml-metal.o -o main  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o ggml-metal.o -o quantize  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-metal.o -o quantize-stats  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-metal.o -o perplexity  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-metal.o -o embedding  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o ggml-metal.o -o vdot  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders

====  Run ./main -h for help.  ====

main: build = 653 (db3db9e)
main: seed  = 1685893102
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x120a06020
ggml_metal_init: loaded kernel_mul                            0x120a065a0
ggml_metal_init: loaded kernel_mul_row                        0x120a06bd0
ggml_metal_init: loaded kernel_scale                          0x120a070f0
ggml_metal_init: loaded kernel_silu                           0x120a07610
ggml_metal_init: loaded kernel_relu                           0x120a07b30
ggml_metal_init: loaded kernel_soft_max                       0x120a081e0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x120a08840
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x120a08ec0
ggml_metal_init: loaded kernel_rms_norm                       0x120a09570
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x120a09dd0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x120a0a7a0
ggml_metal_init: loaded kernel_rope                           0x120a0b090
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x120a0b920
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x120a0c1b0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3616.07 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =   768.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   258.00 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0


 I believe the meaning of life is to be happy.
That's what I would call my philosophy on how to live life, that's what I want people to remember me for.
I was actually diagnosed with a tumor when I was 17 years old and had a very long surgery in order to get it removed.

llama_print_timings:        load time =  1685.43 ms
llama_print_timings:      sample time =    45.70 ms /    64 runs   (    0.71 ms per token)
llama_print_timings: prompt eval time =   342.51 ms /     8 tokens (   42.81 ms per token)
llama_print_timings:        eval time =  3079.50 ms /    63 runs   (   48.88 ms per token)
llama_print_timings:       total time =  4816.85 ms

Implementation process of this PR (archive)

  • Export a ggml computation graph of a LLaMA model:

    ./bin/main -m ../models/7B/ggml-model-q4_0.bin --export

    This creates the llama.ggml file which contains the computation graph

  • We will now load it with a separate tool and attempt to evaluate with Metal:

    ./bin/mtl llama.ggml
  • Implement the entire network layer by layer, comparing the CPU and GPU results (a rough parity-check sketch follows this list)

    • GET_ROWS_Q4_0
    • RMS_NORM
    • MUL
    • MUL_MAT
    • RESHAPE
    • TRANSPOSE
    • ROPE
    • VIEW
    • CPY
    • SCALE
    • DIAG_MASK_INF
    • SOFT_MAX
    • SILU
  • Optimize the kernels to achieve at the very least parity with CPU-only speed

  • Adjust dynamic shapes before evaluating the graph (i.e. n_past, N)

  • Simplify encoder dispatch code, reduce duplication

  • Add basic text-generation example
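For reference, here is a rough, hypothetical sketch of how such a per-layer parity check can be done once the exported graph has been re-created with ggml_graph_import(): evaluate the graph once on the CPU and once with Metal, then compare the output element-wise. The API details (for example, the ggml_graph_compute() signature of that era) are approximate, and the output tensor is assumed to be F32.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "ggml.h"
#include "ggml-metal.h"

// Compare the CPU and Metal results for the final node of an imported graph.
static int check_cpu_vs_metal(struct ggml_context * ctx_eval,
                              struct ggml_metal_context * ctx_metal,
                              struct ggml_cgraph * gf) {
    struct ggml_tensor * out = gf->nodes[gf->n_nodes - 1];
    const int64_t n = ggml_nelements(out);

    // CPU reference result
    ggml_graph_compute(ctx_eval, gf);
    float * ref = (float *) malloc(n * sizeof(float));
    memcpy(ref, out->data, n * sizeof(float));

    // Metal result, copied back to host memory
    ggml_metal_graph_compute(ctx_metal, gf);
    ggml_metal_get_tensor(ctx_metal, out);

    // Element-wise comparison of the two outputs
    double max_err = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        const double err = fabs(((const float *) out->data)[i] - ref[i]);
        if (err > max_err) max_err = err;
    }
    free(ref);

    printf("max abs diff (CPU vs Metal): %g\n", max_err);
    return max_err < 1e-3 ? 0 : 1;
}
```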


Robots

🤖 Generated by Copilot at 324e823

Summary

🍎📝🚀

This pull request adds Metal support for llama, a library for tensor manipulation and computation graph export/import. It introduces a new CMake option LLAMA_METAL and a new header file ggml-metal.h that enable GPU acceleration of llama expressions on Apple devices. It also improves the readability, consistency, and usability of the existing code and documentation, and adds some new features and examples. It fixes a bug in the main example program and adds a new metal example program that demonstrates how to evaluate a statically exported ggml computation graph with Metal.

If you want to use llama with Metal
You can now do so with this pull request, all
You need is to set LLAMA_METAL
And then you can export your ggml
To a file or a graph that is special

Walkthrough

  • Add Metal support for llama, a GPU backend for Apple devices (link, link, link, link, link, link, link, link, link, link, link, link, link, link, link)
  • Fix a bug in the example program main.cpp that used subtraction instead of addition to compute the sum of two numbers (link)
  • Add a command-line option --export to the example program main.cpp that allows exporting the computation graph to a file named llama.ggml (link, link, link)
  • Add a function llama_eval_export that exports a static computation graph for a context of 511 and a batch size of 1 using llama_eval_internal (link, link)
  • Change the logic of the function ggml_graph_import to parse the arguments of the tensor before creating it, and to handle different cases of view operations differently (link, link)
  • Change the logic of the function ggml_nbytes to handle cases where the tensor is not contiguous in memory (link)
  • Add a call to ggml_scratch_save and ggml_scratch_load to the functions ggml_view_1d, ggml_view_2d, ggml_view_3d and ggml_view_4d to preserve the scratch memory state when creating a new tensor for the offset (link, link, link, link)
  • Add a call to ggml_set_name to the functions ggml_view_2d, ggml_view_3d and ggml_view_4d to assign a name to the result tensor for debugging purposes (link, link, link)
  • Add a call to ggml_set_name to the function llama_eval_internal to assign a name to the tensor Vcur for debugging purposes (link)
  • Add a parameter cgraph_fname to the function llama_eval_internal that allows exporting the computation graph to a file if not null (link, link, link)
  • Add a variable eop to the function ggml_graph_import that stores the enum value of the operation code for convenience (link)
  • Add a const qualifier to the variables mean and x0 in the functions ggml_compute_forward_rms_norm_f32 and ggml_compute_forward_rope_f32 to indicate that they are not modified after initialization (link, link, link)
  • Change the return type of the function ggml_nrows from int to int64_t to match the type of the ne field of the ggml_tensor struct (link)
  • Change the visibility of the functions ggml_is_transposed and ggml_is_contiguous from static inline to public by adding them to the ggml.h header file (link, link)
  • Increase the width of the last column in the format strings of the functions ggml_graph_export_leaf and ggml_graph_export_node to accommodate longer tensor names (link, link)
  • Comment out two assertions in the function ggml_graph_export that check the work buffer size of the computation graph, because they are not valid when exporting a graph with Metal support (link)
  • Remove an empty line from the function ggml_graph_export for consistency (link)
  • Remove the declaration of the variable cur from the function llama_eval_internal because it is declared later in the same scope (link)
  • Replace the variable inpL with cur in the function llama_eval_internal to reflect the previous changes in the tensor creation logic (link, link)
  • Remove an empty line from the function llama_eval_internal for consistency (link)
  • Add an empty line to the function llama_eval_internal for readability (link)
  • Format the call to llama_model_load in the function llama_init to use multiple lines and indentation for readability (link)
  • Format the declarations of the functions ggml_init and ggml_free in the ggml.h header file to use multiple lines and indentation for readability (link)
  • Format the target link libraries command for llama to use multiple lines and indentation for readability (link)
  • Align the spacing of the memory requirements expressions in the function llama_model_load_internal for readability (link)
  • Align the spacing of the CMake options for llama to make them more consistent and readable (link)
  • Rename the variable GGML_CUDA_SOURCES to GGML_SOURCES_CUDA to match the naming convention of other source variables in the CMake file (link, link)
  • Add a subdirectory metal to the examples CMake file if LLAMA_METAL is enabled (link)
  • Add an empty line to the README.md file for readability (link)
  • Add empty lines to the Makefile to separate different conditional blocks for readability (link, link, link)
  • Add comments to mark the end of the conditional blocks in the Makefile (link, link, link)

@ggerganov added the performance (Speed related topics) label on May 29, 2023
examples/mtl/mtl.metal — outdated review thread (resolved)
@ggerganov
Owner Author

Ok, the Q4 mul mat kernel is next - very important to get this right.
If we can hit that bullseye, the rest of the dominoes will fall like a house of cards. Checkmate

@philipturner

philipturner commented May 30, 2023

Ok, the Q4 mul mat kernel is next - very important to get this right.

A bit of advice: when I made the kernel above, I ran the CPU-side script over a dozen times per change to the Metal code, until I was confident I had found the maximum achievable bandwidth. Although this over-estimates actual performance, it removes all noise, so you can focus on relative performance: "Does this change make it slightly faster or slightly slower?"

Then it's very similar to training a neural network. Incrementally descend the performance slope until reaching whatever Metal shader works best for you.
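A minimal sketch of that measurement loop (not philipturner's actual script; the workload callback is a placeholder for "dispatch the Metal kernel and wait for completion"):

```c
#include <time.h>

typedef void (*workload_fn)(void);

// Run the same workload many times and keep only the best observed bandwidth.
// Taking the maximum over-estimates real-world throughput, but it removes
// scheduling noise, so relative comparisons between kernel variants stay meaningful.
static double best_bandwidth_gbs(workload_fn run_workload, double bytes_moved, int n_runs) {
    double best = 0.0;
    for (int i = 0; i < n_runs; ++i) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_workload();                       // e.g. dispatch the kernel and wait
        clock_gettime(CLOCK_MONOTONIC, &t1);
        const double sec = (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        const double gbs = bytes_moved / sec / 1e9;
        if (gbs > best) {
            best = gbs;                       // keep the best run only
        }
    }
    return best;
}
```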

@jason-hulkman

I'm considering purchasing the Mac Studio with M2 Ultra (76-core GPU, 192 GB). I'm curious about the performance of your 65B 4-bit model. Could you provide some details? Does it run about the same as an A6000 (9–13 tokens/s on 65B 4-bit)?

@philipturner

I'm considering purchasing the Mac Studio with M2 Ultra 76 core 192GB.

Wouldn't it be cheaper to just purchase access to GPT-4 through the OpenAI API? If the goal is the highest-quality LLM models available.

@soleblaze

soleblaze commented Jun 25, 2023

I started working on my benchmark app. I'll publish some alpha results once I get it set up to benchmark every quant and param value for a given model and put the results in a table.

Wouldn't it be cheaper to just purchase access to GPT-4 through the OpenAI API? If the goal is the highest-quality LLM models available

If you can get GPT-4 access, that is. That said, gpt-3.5-turbo is still better than any local LLM and is much cheaper. Using GPU instances like RunPod is also way cheaper for non-24/7 use vs. building even a mid-level setup. I was looking at an AMD 6950 XT to mess with AMD support and decided that, since it's not officially supported with ROCm, I'll be using Azure AMD instances instead. I can get about 300 hours for the price of that card.

Running your own hardware doesn't make sense from a cost perspective unless you're literally doing it 24/7. Even then I'm not sure where the break-even point is due to power bills.

Ofc, it’s not like “makes sense from a cost perspective” is always a priority with hobbies.

@soleblaze

soleblaze commented Jun 27, 2023

My benchmark app can go through some models in a directory, but eventually dies with an out-of-memory error. This appears to be an issue with llama-cpp-python. I don't think the CPU thread setting is working properly. I removed the prompt eval times from this, as they are much slower than what I get running llama.cpp directly. The eval times appear to be in line with running llama.cpp directly.

I may spawn llama.cpp directly, or I'll look into fixing llama-cpp-python. Not sure yet, but I'll have time next week to work on that and flesh this out more.

Here's what I have running it against a few 65b models:

Racing Llama Benchmark

System Information:
OS: MacOS
ARCH: arm64
CPU: Apple M2 Ultra - 24 cores (16 performance and 8 efficiency)
GPU: Apple M2 Ultra - 76 cores
RAM: 192 GB

Runs: 10
llama-cpp-python version: 0.1.66
CPU Threads: 15
GPU Acceleration: True
Seed: -1
Prompt: ### Human: You are an AI being benchmarked. You want to be helpful and provide a useful response that can be repeated. What would you suggest is the best way to benchmark the response times of a large language model?

Assistant:

Eval Tokens per second:

| Model | Params | Quant | Fastest | Slowest | Mean | Median |
|---|---|---|---|---|---|---|
| airoboros (gpt4 1.3) | 65B | q4_0 | 10.84 | 10.27 | 10.73 | 10.77 |
| airoboros (gpt4 1.3) | 65B | q5_K_M | 9.57 | 8.36 | 9.32 | 9.45 |
| alpaca lora | 65B | q5_K_M | 8.6 | 6.98 | 7.87 | 7.95 |
| dromedary lora | 65B | q4_K_M | 10.3 | 9 | 9.92 | 10.24 |
| dromedary lora | 65B | q5_K_M | 9.61 | 8.59 | 9.22 | 9.30 |
| gpt4 alpaca lora_mlp | 65B | q4_K_M | 10.16 | 9.07 | 9.61 | 9.70 |
| gpt4 alpaca lora_mlp | 65B | q5_K_M | 9.32 | 8.31 | 8.97 | 8.98 |
| guanaco | 65B | q4_K_M | 10.26 | 9.53 | 10.11 | 10.21 |
| guanaco | 65B | q5_K_M | 9.56 | 9.14 | 9.37 | 9.41 |

@x4080

x4080 commented Jun 28, 2023

@soleblaze Wow, you have the greatest M2 Ultra config, congrats. Do you know how it compares to something like an Nvidia 4090? Though maybe a 65B model can't even run on a 4090 because of the RAM requirements?

@philipturner

philipturner commented Jun 28, 2023

$200 more for 5x less bandwidth. Not 5% less, 5x less.

| | 4 x RTX 4090 | M2 Ultra GPU | M2 Ultra ANE | A100 | H100 |
|---|---|---|---|---|---|
| Cost | $6400 | $6600 | $6600 | $10,000 | $40,000 |
| Dense FP16 TFLOPS | 1321.2 | 25.6 | 31.6 | 311.84 | 989.5 |
| Bandwidth | 4032 GB/s | 800 GB/s | 400 GB/s | 2039 GB/s | 3350 GB/s |
| RAM | 96 GB | 192 GB | 192 GB | 80 GB | 80 GB |

@soleblaze

soleblaze commented Jun 28, 2023

I think that cost comparison is a bit misleading, considering you'd also need a motherboard that can handle the 4 cards, two power supplies, two electrical circuits, fast RAM, and a CPU that won't bottleneck 4 cards. I'm not sure where multi-GPU support stands on this, and whether the cards would need to use the PCI bus to share a lot of data. That said, I would never argue that the M2 Ultra is the better buy for this use case.

The main things the M2 has going for it are power efficiency and a small form factor. I'm guessing that if llama.cpp gets the MFA stuff philipturner is working on, it could hit 3080–4080 levels of performance. I should have my benchmark app at the point where it'd be useful to do a comparison when that happens.

It would be nice to put a 2x 3090 box on that list. IMO that's the best performance per dollar, and I'm not sure a realistic home use case would go over the 48 GB of RAM. Plus you'd get NVLink support.

@philipturner

I think that cost comparison is a bit misleading, considering you’d also need a motherboard

Exactly. To buy into the CUDA ecosystem, you have to set up a Windows PC with a massive box and a 500 W power supply. I am all for using existing hardware, which I already own, to do the computations, not for getting new hardware unless it can be built for free (my end goal with nanotech: build a personal supercluster).

It would be nice to put a 2x 3090 box on that list. IMO that’s the best performance per dollar and I’m not sure a realistic home use would go over the 48GB of ram. Plus you’d get nvlink support.

You're implying that a 2-GPU system costs $6,000, factoring in the CPU and box?

@ggerganov
Owner Author

Does 4x GPUs really offer 4x the bandwidth?

If I remember correctly, with multiple GPUs the inference speed does not seem to scale proportionally, although I haven't had the chance to test it (cc @JohannesGaessler)

@philipturner

You can shard the feedforward and attention layers straightforwardly. The bottleneck could be the latency-bound process of broadcasting the result vector to the peers for the next feedforward.

If I remember correctly, with multiple GPUs the inference speed does not seem to scale proportionally, although I haven't had the chance to test

Amdahl’s Law

@soleblaze

soleblaze commented Jun 28, 2023

You're implying that a 2-GPU system costs $6,000, factoring in the CPU and box?

Lol, no. Well, I'm sure there are some boutique builders with offerings at that level. My point was more to show how much less it would cost. I don't think you start hitting the large non-GPU costs of a build until you go to 3 or 4 GPUs.

I am really curious about the performance difference with multiple GPUs. It looks like runpod goes up to 8x 4090, so I'll put it on my list to look at. Guessing runpod will be a good starting point to compare hardware performance differences.

@JohannesGaessler
Collaborator

Does 4x GPUs really offer 4x the bandwidth?

You do get 4x the bandwidth; the problem is actually utilizing it. For large tensors like the matrix multiplications the scaling should be roughly linear, but for all of the small tensors the overhead of moving data between GPUs is larger than just doing the calculation on a single GPU. This limits how much of the program you can actually parallelize, and the parallelization itself introduces overhead (currently the CUDA synchronization logic for multi-GPU settings still needs a lot of optimization). One possible way to improve this would be to fuse tensors where applicable, so that you have one large tensor instead of many small tensors, which can then be handled more efficiently (this would be beneficial in general).

There is also the issue that code utilizing multiple GPUs simply takes more work to develop and maintain; right now only the matrix multiplications using the weights can be parallelized with the CUDA implementation (~67% of the runtime). The matrix multiplications using the KV cache could also be parallelized, for another ~20% of the runtime.
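To make that limit concrete, a quick Amdahl's-law estimate using the figures above (ignoring synchronization overhead):

$$
S(4) = \frac{1}{(1 - 0.67) + \frac{0.67}{4}} \approx 2.0,
\qquad
S_{\infty} = \frac{1}{1 - 0.67} \approx 3.0
$$

So with only the weight matrix multiplications parallelized, 4 GPUs cap out around a 2x end-to-end speedup (3x in the limit of infinitely many GPUs); parallelizing the KV-cache matrix multiplications as well (~87% of the runtime) would raise the 4-GPU ceiling to roughly 1 / (0.13 + 0.87/4) ≈ 2.9x.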

@philipturner

| | 2 x RTX 4090 | M2 Max GPU | M2 Max ANE |
|---|---|---|---|
| Cost | $3200 | $3000 | $3000 |
| Dense FP16 TFLOPS | 660.6 | 12.8 | 15.8 |
| Bandwidth | 2016 GB/s | 400 GB/s | 200 GB/s |
| RAM | 48 GB | 96 GB | 96 GB |

@soleblaze

soleblaze commented Jun 28, 2023

Since we're talking about performance and manufacturers' max specs, rather than asking my questions about what matters and what's worth benchmarking here, I opened a discussion: #2038. Can y'all give me input on this?

I think y'all have the most knowledge about what we actually care about, and I want people to have a way to run useful benchmarks on their systems instead of trusting that the manufacturer maximums are achievable (for instance, I don't really trust that 800 GB/s is achievable on Apple silicon Ultra chips for a single workload).

@okpatil4u

Does it really compare, though? You may have to add the performance of both the CPU and the Neural Engine of an M2 Ultra system together to benchmark bang-for-the-buck performance.

I am also not sure about the 660.6/2 TFLOPS FP16 figure. The Wikipedia page shows 82 TFLOPS at half precision. Maybe I am missing the source.

Same with the M2 Ultra: the Wikipedia page shows 27.2 TFLOPS FP32 performance. Does that mean that FP16 performance is 2 x 27.2 TFLOPS?

@philipturner

You may have to add the performance of both the CPU and the Neural Engine of an M2 Ultra system together to benchmark bang-for-the-buck performance.

They cannot be used simultaneously without hideous latency.

I am also not sure about the 660.6/2 TFLOPS FP16 figure. The Wikipedia page shows 82 TFLOPS at half precision. Maybe I am missing the source.

330 is for the tensor cores; 82.5 is for the shader cores.

Same with the M2 Ultra: the Wikipedia page shows 27.2 TFLOPS FP32 performance. Does that mean that FP16 performance is 2 x 27.2 TFLOPS?

That is the theoretical max ALU throughput. The SIMD MATMUL FMADD32 instruction performs 32 float ops in 18 cycles, while the 16-bit version takes 17 cycles. So max FP32 matmul is 24.2 TFLOPS and max FP16 matmul is 25.6 TFLOPS.
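For reference, the arithmetic behind those figures, assuming the 27.2 TFLOPS theoretical max corresponds to the ideal 2 FLOPs per ALU per cycle:

$$
27.2 \cdot \frac{32}{2 \cdot 18} \approx 24.2\ \text{TFLOPS (FP32 matmul)},
\qquad
27.2 \cdot \frac{32}{2 \cdot 17} \approx 25.6\ \text{TFLOPS (FP16 matmul)}
$$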

@okpatil4u

okpatil4u commented Jun 29, 2023 via email

@PaddyPatPat

@okpatil4u, the ggml.ai homepage shows a screen recording of "Simultaneously running 4 instances of 13B LLaMA + Whisper Small on a single M1 Pro". I take this to mean that you can run multiple models at once. I assume there are complications to this, though.

@philipturner

It's actually quite efficient (theoretically) because you perform 4 batched inferences. The latency is the same as 1 inference, until the batch size becomes so large it's compute-bound.

@x4080

x4080 commented Aug 2, 2023

When using a Mac, is the prompt processing less efficient compared to using NVIDIA? For example, in summarization the text to be summarized becomes the prompt, and in my case (M2) it takes a long time to process, while people with NVIDIA hardware don't seem to have that problem.

@philipturner

That's because Metal Performance Shaders has a very inefficient GEMM, which is a compute-bound operation. Token decoding (tokens/second) is a memory-bound operation.
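As a rough illustration of the memory-bound side (assuming the ~3.6 GB Q4_0 7B weight buffer from the log earlier in this PR and the M2 Max's nominal 400 GB/s): each generated token has to stream essentially all of the weights once, so

$$
t_{\text{token}} \gtrsim \frac{3.6\ \text{GB}}{400\ \text{GB/s}} \approx 9\ \text{ms}
\quad\Rightarrow\quad
\lesssim 110\ \text{tokens/s},
$$

whereas prompt processing multiplies the same weights against many tokens at once, making it limited by GEMM throughput (compute) rather than bandwidth.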

@ghost

ghost commented Aug 2, 2023

That's because Metal Performance Shaders has a very inefficient GEMM, which is a compute-bound operation. Token decoding (tokens/second) is a memory-bound operation.

Maybe the ANE, accessible through CoreML (ONNX export, for example) or libane, could be used to accelerate GEMM?

@philipturner

The ANE is designed for convolutions, so its GEMM throughput is ~25% of the advertised TFLOPS. On everything besides A-series chips, it's slower than the GPU. The solution is a more performant GPU GEMM library.

@x4080

x4080 commented Aug 19, 2023

Can we use CLBlast to speed up prompt ingestion? It apparently supports M1 and M2.

@ggerganov
Owner Author

Unlikely. The next performance jump will come from quantum matrix multiplication.

@philipturner

Can we use CLBlast to speed up prompt ingestion? It apparently supports M1 and M2.

CLBlast is slower than Metal Performance Shaders. It is only able to reach 28% ALU utilization and is unable to use half precision.

The next performance jump will come from quantum matrix multiplication.

Yes, using quantum computers to multiply Hermitian matrices and solve eigenvalue problems in under $O(n^3)$ time. Until then, use GPUs to simulate quantum chemistry.
