
llama : add Mixtral support #4381

Closed
fakerybakery opened this issue Dec 8, 2023 · 62 comments · Fixed by #4406
Labels
enhancement (New feature or request), high priority (Very important issue), model (Model specific)

Comments

@fakerybakery

Hi,
Please add support for Mistral's MoE model, Mixtral.

@fakerybakery added the enhancement label on Dec 8, 2023
@ddh0
Contributor

ddh0 commented Dec 8, 2023

Seconded!

EDIT: There is an early implementation here: https://github.com/dzhulgakov/llama-mistral

@fakerybakery
Author

Does the hacky implementation support quantization?

@ddh0
Contributor

ddh0 commented Dec 8, 2023

Does the hacky implementation support quantization?

It barely supports inference, lol

@lxe

lxe commented Dec 8, 2023

What's the general effort in modifying gguf-py / convert.py to support quantizing/converting this Mistral MoE architecture?

@lxe

lxe commented Dec 8, 2023

Here's a hacked together conversion script: lxe@2dd8944

@irony

irony commented Dec 8, 2023

Here's a hacked together conversion script: 2dd8944

Nice!

@mayfer

mayfer commented Dec 8, 2023

Here's a hacked together conversion script: 2dd8944

conversion worked! inference fails with "error loading model: unknown model architecture: 'moe'", any tips?

@fakerybakery
Author

Hi @lxe, so your script supports conversion but not inference?

@lxe

lxe commented Dec 8, 2023

It's a work in progress. It can convert to gguf format (with wrong layer names) but there's no logic for quantizing under 8 bits or loading and running the model.

Yet.

@clemens98

So how much RAM or VRAM is needed to run it?

@leedrake5
Contributor

leedrake5 commented Dec 9, 2023

Here's a hacked together conversion script: 2dd8944

conversion worked! inference fails with "error loading model: unknown model architecture: 'moe'", any tips?

Same here. Mystified by what the problem is here. Apple silicon.

Update: I'm dumb, I just downloaded the script. With all the files it creates a 93 GB half-precision model. I've got a system that in theory can run this (128 GB M3 Max) but still get an error:

error loading model: unknown model architecture: 'moe'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '~/GitHub/text-generation-webui/models/ggml-model-f16.gguf'

@mayfer

mayfer commented Dec 9, 2023

It requires changes to llama.cpp. Each model type has some custom code; it's not all generic. Check out the source code and see what kind of custom handling there is for Qwen etc.

@ggerganov added the high priority and model labels on Dec 9, 2023
@ggerganov changed the title from "Mixtral MOE" to "llama : add Mixtral support" on Dec 9, 2023
@fakerybakery
Author

@lxe just curious why your quantization script doesn't support int4?

@ciekawy

ciekawy commented Dec 10, 2023

This is probably related? #2672

@someone13574

Posting here for visibility: We know the general implementation of the MoE layer.

From the Mistral discord:
(two screenshots from the Mistral Discord, dated 2023-12-10, describing the MoE layer implementation)
They link to here.
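
For anyone skimming, that description boils down to a per-layer top-2 router over 8 feed-forward experts. Here is a rough NumPy sketch of the math (my own illustration, not llama.cpp code; the shapes and names like gate_w and experts are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_ffn(x, gate_w, experts, top_k=2):
    """MoE feed-forward for a single token.
    x: (d_model,) hidden state; gate_w: (n_experts, d_model) router weights;
    experts: list of 8 callables, each an ordinary FFN."""
    logits = gate_w @ x                    # one router score per expert
    top = np.argsort(logits)[-top_k:]      # keep the 2 highest-scoring experts
    weights = softmax(logits[top])         # renormalize over just those 2
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

Each transformer layer has its own router and its own 8 experts, so the selection can change per layer and per token.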

@someone13574

someone13574 commented Dec 10, 2023

There is also an (unofficial) attempt at implementing it in Hugging Face Transformers here

@fakerybakery
Author

I hope it can be quantized down to a reasonable size. What are the chances that it can eventually be run on a consumer-level laptop with reasonable results?

@antirez

antirez commented Dec 10, 2023

I wonder if for this model llama.cpp could modify the routing to produce at least N tokens with the currently selected 2 experts, and only after N tokens check the routing again and, if needed, load another two experts, and so forth. That way we would only need to keep 2 experts in memory at a time, and amortize the cost of "swapping" models over N tokens. It remains to be seen whether the routing really is that unstable, but it seems unlikely.

@Dampfinchen

Dampfinchen commented Dec 10, 2023

I wonder if for this model llama.cpp could modify the routing to produce at least N tokens with the currently selected 2 experts, and only after N tokens check the routing again and, if needed, load another two experts, and so forth. That way we would only need to keep 2 experts in memory at a time, and amortize the cost of "swapping" models over N tokens. It remains to be seen whether the routing really is that unstable, but it seems unlikely.

Yes, as far as I understand it, it only needs the resources of a ~14B model, so it shouldn't be too hard to run. I guess the main catch is that you need all the experts to be in RAM so they can be loaded as quickly as possible.

@ddh0
Contributor

ddh0 commented Dec 10, 2023

I am hoping for an option that will let me only load models into memory when they're used, and keep them on disk when not in use. I know this would be super slow, but it'd let me and many other people run it when we otherwise would have no hope, barring other developments.

@easp

easp commented Dec 10, 2023

My understanding is that there are 8 experts + a router at every layer. It's making choices about which experts to use at every layer, not up-front.

Memory bandwidth + FLOPs are those of a ~14B model, but RAM is needed for all the weights.
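
To put rough numbers on that (back-of-the-envelope only; the ~47B total / ~13B active parameter counts and the bits-per-weight figures below are approximations I'm assuming, not measurements):

```python
TOTAL_PARAMS  = 46.7e9   # all 8 experts per layer + shared attention/embeddings (approx.)
ACTIVE_PARAMS = 12.9e9   # 2 experts per layer + shared weights, per token (approx.)

for fmt, bits_per_weight in [("f16", 16), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
    resident = TOTAL_PARAMS  * bits_per_weight / 8 / 2**30   # weights that must sit in RAM
    per_tok  = ACTIVE_PARAMS * bits_per_weight / 8 / 2**30   # weights actually read per token
    print(f"{fmt}: ~{resident:.0f} GiB resident, ~{per_tok:.0f} GiB touched per token")
```

So per token it behaves like a ~13B model for bandwidth and compute, but the full ~47B of weights still has to live somewhere.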

@Dampfinchen

I am hoping for an option that will let me only load models into memory when they're used, and keep them on disk when not in use. I know this would be super slow, but it'd let me and many other people run it when we otherwise would have no hope, barring other developments.

I think each expert gets chosen on a per-token basis. So it has to happen very fast, and unfortunately it would put an enormous strain on an SSD. I don't think it's feasible, but options are always welcome.

But hey that's just me speculating!

@bergkvist

bergkvist commented Dec 10, 2023

I guess it is fine to use the mmap syscall for the full model with all the experts, and only the experts actually being used will be loaded into memory by your OS.
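
(As far as I know, llama.cpp already mmaps model files by default; there's a --no-mmap flag to turn it off.) A tiny stand-alone illustration of why that helps; the path is a placeholder and this is plain Python, not the llama.cpp loader:

```python
import mmap

# Map the file without reading it up front; the OS pages regions into RAM
# only when they are actually touched.
with open("ggml-model-f16.gguf", "rb") as f:     # placeholder path
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm[:4])    # reading the magic bytes (b'GGUF') faults in just this page
    mm.close()
```

The caveat is that the router picks experts per token and per layer, so over a long generation most expert pages end up resident anyway unless memory pressure evicts them.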

@khimaros
Contributor

seems like TheBloke has model weights in the works: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF

@leedrake5
Contributor

Looks like this might be the inference code.

@ciekawy

ciekawy commented Dec 12, 2023

So this PR actually works, also on Apple Metal, giving 10 tokens/s: #4406

https://www.loom.com/share/7b25b0da48bd4543a86c1d223f876597?sid=54c6dc1a-65ac-4067-ab29-a2371f146868


@clemens98

Does the new update support Mixtral x8?

@ciekawy

ciekawy commented Dec 12, 2023

Does the new update support Mixtral x8?

The PR is not yet merged

@DutchEllie

Someone in the PR thread was able to offload some layers onto their RX 6700, but I haven't gotten that to work for my RX 7900 XTX. In this project all I got was a flash of the screen and a frozen GPU; with Ooba it loaded the layers, but inference called into ggml-cuda.c, which broke and dumped core.

Obviously, for both I used the custom branch.

@ZihaoTan

You can try the MLX framework if you are using Apple Silicon: https://github.com/ml-explore/mlx-examples/tree/main/mixtral.

@clemens98

clemens98 commented Dec 13, 2023

Someone in the PR thread was able to offload some layers onto their RX 6700, but I haven't gotten that to work for my RX 7900 XTX.

So far it has almost always worked with CLBlast on my RX 7900 XT.
Are you asking about Mixtral 8x7B in particular?

@DutchEllie

So far it has almost always worked with CLBlast on my RX 7900 XT. Are you asking about Mixtral 8x7B in particular?

Yes, particularly Mixtral 8x7B. I thought the ROCm version was the hipBLAS one? That's the one I compiled. For Ooba I used the llama-cpp-python package and swapped out the included llama.cpp project with the mixtral branch from here, then compiled and installed the package with the hipBLAS implementation. That's when I got errors.

@clemens98

Yes, particularly Mixtral 8x7B. I thought the ROCm version was the hipBLAS one?

It is, but I never managed to get ROCm to work at all.

@DutchEllie

It is, but I never managed to get ROCm to work at all.

You bring up an interesting point, namely that I have no idea whether I've been using the hipBLAS version or the CLBlast version all this time, even before Mixtral. I believe, but am unsure, that I checked whether normal models worked with my hipBLAS-compiled binary, but again, I forget.
When I get to the office soon, I might remote into my workstation just to check two things: one, whether the hipBLAS version works at all, even for older models, and two, whether using CLBlast fixes it.

@DutchEllie

So I followed the ROCm installation first, but as you said, that didn't really work well. Not even the normal Mistral v0.1 model would run properly; same error as with Mixtral. It uses ggml-cuda.c and crashes with a core dump. This happens with both llama.cpp and llama-cpp-python in Ooba.

Using the CLBlast version works better. I see significant improvements on both CPU and GPU (CPU went from 3 tps to about 6, GPU from crashing to 10 tps). However, I don't think I've been using CLBlast all this time, as the llama-cpp-python version installed by Ooba's AMD requirements is a ROCm version. Also, it might be because I can only offload 20/33 layers, but initial loading is very slow:

# First time loading prompt
llama_print_timings:        load time =   22956.41 ms
llama_print_timings:      sample time =      29.61 ms /   100 runs   (    0.30 ms per token,  3377.01 tokens per second)
llama_print_timings: prompt eval time =   22956.23 ms /    25 tokens (  918.25 ms per token,     1.09 tokens per second)
llama_print_timings:        eval time =    9082.68 ms /    99 runs   (   91.74 ms per token,    10.90 tokens per second)
llama_print_timings:       total time =   32351.86 ms
Output generated in 32.70 seconds (3.06 tokens/s, 100 tokens, context 25, seed 740123037)

# Same prompt, second time.
llama_print_timings:        load time =   22956.41 ms
llama_print_timings:      sample time =      29.58 ms /   100 runs   (    0.30 ms per token,  3380.89 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    9129.46 ms /   100 runs   (   91.29 ms per token,    10.95 tokens per second)
llama_print_timings:       total time =    9455.29 ms
Output generated in 9.84 seconds (10.16 tokens/s, 100 tokens, context 25, seed 1015765663)

Weird how ROCm doesn't work at all.

@ChandanVerma

llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloaded 22/33 layers to GPU
llm_load_tensors: VRAM used: 32353.06 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 44.00 MB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 117.85 MiB
llama_new_context_with_model: VRAM scratch buffer: 114.54 MiB
llama_new_context_with_model: total VRAM used: 32511.60 MiB (model: 32353.06 MiB, context: 158.54 MiB)

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 1.000, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

[end of text]

llama_print_timings: load time = 7587.29 ms
llama_print_timings: sample time = 226.28 ms / 799 runs ( 0.28 ms per token, 3530.96 tokens per second)
llama_print_timings: prompt eval time = 11221.46 ms / 110 tokens ( 102.01 ms per token, 9.80 tokens per second)
llama_print_timings: eval time = 116379.25 ms / 798 runs ( 145.84 ms per token, 6.86 tokens per second)
llama_print_timings: total time = 128094.46 ms

@ciekawy

ciekawy commented Dec 13, 2023

That's OT, but it would be exciting to know more about the experts and to be able to fine-tune a particular expert.

@ciekawy

ciekawy commented Dec 13, 2023

You can try the MLX framework if you are using Apple Silicon: https://github.com/ml-explore/mlx-examples/tree/main/mixtral.

@ZihaoTan with this PR I'm able to run the quantized GGUF on 36 GB (24 GB allocated to the GPU)

@itsdotscience

So I followed the ROCm installation first, but as you said, that didn't really work well. Not even the normal Mistral v0.1 model would run properly; same error as with Mixtral. It uses ggml-cuda.c and crashes with a core dump. This happens with both llama.cpp and llama-cpp-python in Ooba.

Using the CLBlast version works better. I see significant improvements on both CPU and GPU (CPU went from 3 tps to about 6, GPU from crashing to 10 tps). However, I don't think I've been using CLBlast all this time, as the llama-cpp-python version installed by Ooba's AMD requirements is a ROCm version. Also, it might be because I can only offload 20/33 layers, but initial loading is very slow:

# First time loading prompt
llama_print_timings:        load time =   22956.41 ms
llama_print_timings:      sample time =      29.61 ms /   100 runs   (    0.30 ms per token,  3377.01 tokens per second)
llama_print_timings: prompt eval time =   22956.23 ms /    25 tokens (  918.25 ms per token,     1.09 tokens per second)
llama_print_timings:        eval time =    9082.68 ms /    99 runs   (   91.74 ms per token,    10.90 tokens per second)
llama_print_timings:       total time =   32351.86 ms
Output generated in 32.70 seconds (3.06 tokens/s, 100 tokens, context 25, seed 740123037)

# Same prompt, second time.
llama_print_timings:        load time =   22956.41 ms
llama_print_timings:      sample time =      29.58 ms /   100 runs   (    0.30 ms per token,  3380.89 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    9129.46 ms /   100 runs   (   91.29 ms per token,    10.95 tokens per second)
llama_print_timings:       total time =    9455.29 ms
Output generated in 9.84 seconds (10.16 tokens/s, 100 tokens, context 25, seed 1015765663)

Weird how ROCm doesn't work at all.

Set the env var HSAKMT_DEBUG_LEVEL=7 and see what it spits out. It should show you what it's doing, if anything, on the GPU side.

@irony

irony commented Dec 13, 2023

Great work!!

@JohnGalt1717

Any indication when the docker containers will be updated?

@DutchEllie

DutchEllie commented Dec 13, 2023

Set the env var HSAKMT_DEBUG_LEVEL=7 and see what it spits out. It should show you what it's doing, if anything, on the GPU side.

I did that. I did the following:

  1. I pulled the very freshest version of this repo's master branch.
  2. I ran the following command to compile: make LLAMA_HIPBLAS=1 -j48
  3. I ran the following command to test: ./main -m ../text-generation-webui/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -e -p "<s>[INST] How fast can a Toyota Supra go? [/INST]" --min-p 0.05 --top-p 1.0 -n 10 -ngl 10

At that point, nvtop reports 100% GPU usage and the power draw goes to about 137/327 watts. The screen briefly flashes black once and KDE gives me a scary "Desktop effects were restarted due to a graphics reset" message.
After that, the llama.cpp program hangs seemingly forever. I had to kill it.

This is the output of the program with the variable enabled like you said.
log from failure.txt

I did check using the completely default Oobabooga packages; that definitely does use the actual ROCm version, not the OpenCL version. When it loads a model (one that actually works), it will say "Using ROCm for GPU acceleration." or something similar. I would copy-paste a log to prove it, but you'll have to take my word for it. Why? Because llama.cpp messed with my GPU so much that I think I need to reboot; everything is going kinda crazy lmao.

Edit:
One reboot later, Ooba works again. Here you go, proof it's using ROCm:

... truncated ...
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =   86.04 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.05 MiB
... truncated ...

@clemens98

clemens98 commented Dec 13, 2023

It did load the model, but I can't input anything, and when I did press enter I got greeted with the space-invaders bug.

I really hope it's just a side effect of running out of VRAM and not a bad sign of what's to come.

Well, the GPU driver is dead. I don't know if it was caused by llama.cpp or just a ROCm conflict.

I removed everything ROCm-related, and with sudo modprobe everything works again.

@65a
Contributor

65a commented Dec 14, 2023

ROCm/hipBLAS inference works fine. Card is a W7900/gfx1100.

Timings:

llama_print_timings:        load time =   35904.56 ms
llama_print_timings:      sample time =     548.19 ms /   137 runs   (    4.00 ms per token,   249.91 tokens per second)
llama_print_timings: prompt eval time =   12865.74 ms /   986 tokens (   13.05 ms per token,    76.64 tokens per second)
llama_print_timings:        eval time =    4512.59 ms /   136 runs   (   33.18 ms per token,    30.14 tokens per second)
llama_print_timings:       total time =   40969.37 ms

Model is a 4x7B Mixtral Q8 quant; it seems to get about 30 tok/s.

@DutchEllie

ROCm/hipBLAS inference works fine. Card is a W7900/gfx1100.

That's a relief, to know it's just something on my side. I guess it's off to figure out how the hell to get it working, then. I assume you're not doing anything different from me to compile and run the code?

@65a
Contributor

65a commented Dec 14, 2023

@DutchEllie I had a working ROCm environment already. I recommend debugging your setup with the distro/package-specific forum (there's a lot that can go wrong there, and I've needed to file various bugs). It's difficult for a dev/maintainer to figure out whether a patch is bad or a setup is bad if debugging starts with the environment. The classic troubleshooting question is always: "Has it ever worked?" Get to yes on that, and then you can test things more clearly.

@clemens98

I really hope they get CLBlast to work. ROCm is hard to get working and apparently also drops support for older generations extremely fast.

@DutchEllie

DutchEllie commented Dec 15, 2023

I was able to get ROCm "working" using llama-cpp-python on the latest version by adding the GPU targets.
The command I ran was: CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DAMDGPU_TARGETS=gfx1100" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

However, you can see I still have issues with actually getting results.
I have not tested running an older version with this method, though.

Update: I get normal working results when I set the context size to anything <32768!
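
For reference, a minimal llama-cpp-python call of the kind I mean (the model path and layer count are placeholders for your own setup):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf",  # placeholder path
    n_ctx=4096,         # anything below 32768 behaved normally for me
    n_gpu_layers=20,    # however many layers fit in your VRAM
)
out = llm("[INST] How fast can a Toyota Supra go? [/INST]", max_tokens=64)
print(out["choices"][0]["text"])
```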

@clemens98

clemens98 commented Dec 15, 2023

I still have to use sudo modprobe amdgpu every time I start my PC to get my GPU detected. I am not touching ROCm again.

In case anyone else has this problem: DKMS (or something like that) blacklists the AMD driver in every installation attempt.
