
llama : add Mixtral support #4381

Closed
fakerybakery opened this issue Dec 8, 2023 · 62 comments · Fixed by #4406
Labels
enhancement (New feature or request), high priority (Very important issue), model (Model specific)

Comments

@fakerybakery

Hi,
Please add support for Mistral's MoE model, Mixtral.

@fakerybakery added the enhancement label on Dec 8, 2023
@ddh0
Contributor

ddh0 commented Dec 8, 2023

Seconded!

EDIT: There is an early implementation here: https://github.com/dzhulgakov/llama-mistral

@fakerybakery
Author

Does the hacky implementation support quantization?

@ddh0
Contributor

ddh0 commented Dec 8, 2023

Does the hacky implementation support quantization?

It barely supports inference, lol

@lxe

lxe commented Dec 8, 2023

What's the general effort in modifying gguf-py / convert.py to support quantizing/converting this Mistral MoE architecture?

@lxe

lxe commented Dec 8, 2023

Here's a hacked together conversion script: lxe@2dd8944

@irony

irony commented Dec 8, 2023

Here's a hacked together conversion script: 2dd8944

Nice!

@mayfer

mayfer commented Dec 8, 2023

Here's a hacked together conversion script: 2dd8944

conversion worked! inference fails with "error loading model: unknown model architecture: 'moe'", any tips?

@fakerybakery
Author

Hi @lxe, so your script supports conversion but not inference?

@lxe

lxe commented Dec 8, 2023

It's a work in progress. It can convert to gguf format (with wrong layer names) but there's no logic for quantizing under 8 bits or loading and running the model.

Yet.

@clemens98

So how much RAM or VRAM is needed to run it?

@leedrake5
Contributor

leedrake5 commented Dec 9, 2023

Here's a hacked together conversion script: 2dd8944

conversion worked! inference fails with "error loading model: unknown model architecture: 'moe'", any tips?

Same here. Mystified by what the problem is here. Apple silicon.

Update: I'm dumb, I just downloaded the script. With all the files it creates a 93 GB half-precision model. I've got a system that in theory can run this (128 GB M3 Max) but still get an error:

error loading model: unknown model architecture: 'moe'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '~/GitHub/text-generation-webui/models/ggml-model-f16.gguf'

@mayfer

mayfer commented Dec 9, 2023

It requires changes to llama.cpp. Each model type has some custom code; it's not all generic. Check out the source code and see what kind of custom handling there is for Qwen etc.

@ggerganov added the high priority and model labels on Dec 9, 2023
@ggerganov changed the title from "Mixtral MOE" to "llama : add Mixtral support" on Dec 9, 2023
@fakerybakery
Author

@lxe just curious why your quantization script doesn't support int4?

@ciekawy

ciekawy commented Dec 10, 2023

This is probably related? #2672

@someone13574

Posting here for visibility: We know the general implementation of the MoE layer.

From the Mistral discord:
(two screenshots from the Mistral Discord, dated 2023-12-10, describing the MoE layer implementation)
They link to here.
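
For anyone skimming, that description boils down to a per-layer top-2 router over 8 feed-forward experts. Here is a rough NumPy sketch of the math (my own illustration, not llama.cpp code; the shapes and names like gate_w and experts are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_ffn(x, gate_w, experts, top_k=2):
    """MoE feed-forward for a single token.
    x: (d_model,) hidden state; gate_w: (n_experts, d_model) router weights;
    experts: list of 8 callables, each an ordinary FFN."""
    logits = gate_w @ x                    # one router score per expert
    top = np.argsort(logits)[-top_k:]      # keep the 2 highest-scoring experts
    weights = softmax(logits[top])         # renormalize over just those 2
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

Each transformer layer has its own router and its own 8 experts, so the selection can change per layer and per token.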

@someone13574

someone13574 commented Dec 10, 2023

There is also an (unofficial) attempt at implementing it in Hugging Face Transformers here

@fakerybakery
Author

I hope it can be quantized down to a reasonable size. What are the chances that it can eventually be run on a consumer-level laptop with reasonable results?

@antirez

antirez commented Dec 10, 2023

I wonder if for this model llama.cpp could modify the routing to produce at least N tokens with the currently selected 2 experts, and only after N tokens check the routing again and, if needed, load another two experts, and so forth. That way we would only need to keep 2 experts in memory at a time, and amortize the cost of "swapping" models over N tokens. It remains to be seen whether the routing really is that unstable, but it seems unlikely.

@Dampfinchen

Dampfinchen commented Dec 10, 2023

I wonder if for this model llama.cpp could modify the routing to produce at least N tokens with the currently selected 2 experts, and only after N tokens check the routing again and, if needed, load another two experts, and so forth. That way we would only need to keep 2 experts in memory at a time, and amortize the cost of "swapping" models over N tokens. It remains to be seen whether the routing really is that unstable, but it seems unlikely.

Yes, as far as I understand it, it only needs the resources of a ~14B model, so it shouldn't be too hard to run. I guess the main catch is that you need all the experts to be in RAM so they can be loaded as quickly as possible.

@ddh0
Contributor

ddh0 commented Dec 10, 2023

I am hoping for an option that will let me only load models into memory when they're used, and keep them on disk when not in use. I know this would be super slow, but it'd let me and many other people run it when we otherwise would have no hope, barring other developments.

@easp

easp commented Dec 10, 2023

My understanding is that there are 8 experts + a router at every layer. It's making choices about which experts to use at every layer, not up-front.

Memory bandwidth + FLOPs are those of a ~14B model, but RAM is needed for all the weights.
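
To put rough numbers on that (back-of-the-envelope only; the ~47B total / ~13B active parameter counts and the bits-per-weight figures below are approximations I'm assuming, not measurements):

```python
TOTAL_PARAMS  = 46.7e9   # all 8 experts per layer + shared attention/embeddings (approx.)
ACTIVE_PARAMS = 12.9e9   # 2 experts per layer + shared weights, per token (approx.)

for fmt, bits_per_weight in [("f16", 16), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
    resident = TOTAL_PARAMS  * bits_per_weight / 8 / 2**30   # weights that must sit in RAM
    per_tok  = ACTIVE_PARAMS * bits_per_weight / 8 / 2**30   # weights actually read per token
    print(f"{fmt}: ~{resident:.0f} GiB resident, ~{per_tok:.0f} GiB touched per token")
```

So per token it behaves like a ~13B model for bandwidth and compute, but the full ~47B of weights still has to live somewhere.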

@Dampfinchen

I am hoping for an option that will let me only load models into memory when they're used, and keep them on disk when not in use. I know this would be super slow, but it'd let me and many other people run it when we otherwise would have no hope, barring other developments.

I think each expert gets chosen on a per-token basis. So it has to happen very fast, and unfortunately it would put an enormous strain on an SSD. I don't think it's feasible, but options are always welcome.

But hey that's just me speculating!

@bergkvist

bergkvist commented Dec 10, 2023

I guess it is fine to use the mmap syscall for the full model with all the experts, and only the experts actually being used will be loaded into memory by your OS.
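
(As far as I know, llama.cpp already mmaps model files by default; there's a --no-mmap flag to turn it off.) A tiny stand-alone illustration of why that helps; the path is a placeholder and this is plain Python, not the llama.cpp loader:

```python
import mmap

# Map the file without reading it up front; the OS pages regions into RAM
# only when they are actually touched.
with open("ggml-model-f16.gguf", "rb") as f:     # placeholder path
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm[:4])    # reading the magic bytes (b'GGUF') faults in just this page
    mm.close()
```

The caveat is that the router picks experts per token and per layer, so over a long generation most expert pages end up resident anyway unless memory pressure evicts them.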

@khimaros
Contributor

seems like TheBloke has model weights in the works: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF

@leedrake5
Contributor

Looks like this might be the inference code.

@ciekawy

ciekawy commented Dec 12, 2023

So this PR actually works, also on Apple Metal, giving 10 tokens/s: #4406

https://www.loom.com/share/7b25b0da48bd4543a86c1d223f876597?sid=54c6dc1a-65ac-4067-ab29-a2371f146868


@clemens98

Does the new update support Mixtral x8?

@ciekawy

ciekawy commented Dec 12, 2023

Does the new update support Mixtral x8?

The PR is not yet merged

@DutchEllie

Someone in the PR thread was able to offload some layers onto their RX 6700, but I haven't gotten that to work for my RX 7900 XTX. In this project all I got was a flash of the screen and a frozen GPU; with Ooba it loaded the layers, but inference called into ggml-cuda.c, which broke and dumped core.

Obviously, for both I used the custom branch.

@ZihaoTan

You can try the MLX framework if you are using Apple Silicon: https://github.com/ml-explore/mlx-examples/tree/main/mixtral.

@clemens98

clemens98 commented Dec 13, 2023

Someone in the PR thread was able to offload some layers onto their RX 6700, but I haven't gotten that to work for my RX 7900 XTX.

So far it has almost always worked with CLBlast on my RX 7900 XT.
Are you asking about Mixtral 8x7B in particular?

@DutchEllie

So far it has almost always worked with CLBlast on my RX 7900 XT. Are you asking about Mixtral 8x7B in particular?

Yes, particularly Mixtral 8x7B. I thought the ROCm version was the hipBLAS one? That's the one I compiled. For Ooba I used the llama-cpp-python package and swapped out the included llama.cpp project with the mixtral branch from here, then compiled and installed the package with the hipBLAS implementation. That's when I got errors.

@clemens98

Yes, particularly Mixtral 8x7B. I thought the ROCm version was the hipBLAS one?

It is, but I never managed to get ROCm to work at all.

@DutchEllie

It is, but I never managed to get ROCm to work at all.

You bring up an interesting point, namely that I have no idea whether I've been using the hipBLAS version or the CLBlast version all this time, even before Mixtral. I believe, but am unsure, that I checked whether normal models worked with my hipBLAS-compiled binary, but again, I forget.
When I get to the office soon, I might remote into my workstation just to check two things: one, whether the hipBLAS version works at all, even for older models, and two, whether using CLBlast fixes it.

@DutchEllie

So I followed the ROCm installation first, but as you said, that didn't really work well. Not even the normal Mistral v0.1 model would run properly; same error as with Mixtral. It uses ggml-cuda.c and crashes with a core dump. This happens with both llama.cpp and llama-cpp-python in Ooba.

Using the CLBlast version works better. I see significant improvements on both CPU and GPU (CPU went from 3 tps to about 6, GPU from crashing to 10 tps). However, I don't think I've been using CLBlast all this time, as the llama-cpp-python version installed by Ooba's AMD requirements is a ROCm version. Also, it might be because I can only offload 20/33 layers, but initial loading is very slow:

# First time loading prompt
llama_print_timings:        load time =   22956.41 ms
llama_print_timings:      sample time =      29.61 ms /   100 runs   (    0.30 ms per token,  3377.01 tokens per second)
llama_print_timings: prompt eval time =   22956.23 ms /    25 tokens (  918.25 ms per token,     1.09 tokens per second)
llama_print_timings:        eval time =    9082.68 ms /    99 runs   (   91.74 ms per token,    10.90 tokens per second)
llama_print_timings:       total time =   32351.86 ms
Output generated in 32.70 seconds (3.06 tokens/s, 100 tokens, context 25, seed 740123037)

# Same prompt, second time.
llama_print_timings:        load time =   22956.41 ms
llama_print_timings:      sample time =      29.58 ms /   100 runs   (    0.30 ms per token,  3380.89 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    9129.46 ms /   100 runs   (   91.29 ms per token,    10.95 tokens per second)
llama_print_timings:       total time =    9455.29 ms
Output generated in 9.84 seconds (10.16 tokens/s, 100 tokens, context 25, seed 1015765663)

Weird how ROCm doesn't work at all.

@ChandanVerma

llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloaded 22/33 layers to GPU
llm_load_tensors: VRAM used: 32353.06 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 44.00 MB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 117.85 MiB
llama_new_context_with_model: VRAM scratch buffer: 114.54 MiB
llama_new_context_with_model: total VRAM used: 32511.60 MiB (model: 32353.06 MiB, context: 158.54 MiB)

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 1.000, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

[end of text]

llama_print_timings: load time = 7587.29 ms
llama_print_timings: sample time = 226.28 ms / 799 runs ( 0.28 ms per token, 3530.96 tokens per second)
llama_print_timings: prompt eval time = 11221.46 ms / 110 tokens ( 102.01 ms per token, 9.80 tokens per second)
llama_print_timings: eval time = 116379.25 ms / 798 runs ( 145.84 ms per token, 6.86 tokens per second)
llama_print_timings: total time = 128094.46 ms

@ciekawy

ciekawy commented Dec 13, 2023

That's OT, but it would be exciting to know more about the experts and to be able to fine-tune a particular expert.

@ciekawy

ciekawy commented Dec 13, 2023

You can try the MLX framework if you are using Apple Silicon: https://github.com/ml-explore/mlx-examples/tree/main/mixtral.

@ZihaoTan with this PR I'm able to run the quantized GGUF on 36 GB (24 GB allocated to the GPU)

@itsdotscience

So I followed the ROCm installation first, but as you said, that didn't really work well. Not even the normal Mistral v0.1 model would run properly; same error as with Mixtral. It uses ggml-cuda.c and crashes with a core dump. This happens with both llama.cpp and llama-cpp-python in Ooba.

Using the CLBlast version works better. I see significant improvements on both CPU and GPU (CPU went from 3 tps to about 6, GPU from crashing to 10 tps). However, I don't think I've been using CLBlast all this time, as the llama-cpp-python version installed by Ooba's AMD requirements is a ROCm version. Also, it might be because I can only offload 20/33 layers, but initial loading is very slow:

# First time loading prompt
llama_print_timings:        load time =   22956.41 ms
llama_print_timings:      sample time =      29.61 ms /   100 runs   (    0.30 ms per token,  3377.01 tokens per second)
llama_print_timings: prompt eval time =   22956.23 ms /    25 tokens (  918.25 ms per token,     1.09 tokens per second)
llama_print_timings:        eval time =    9082.68 ms /    99 runs   (   91.74 ms per token,    10.90 tokens per second)
llama_print_timings:       total time =   32351.86 ms
Output generated in 32.70 seconds (3.06 tokens/s, 100 tokens, context 25, seed 740123037)

# Same prompt, second time.
llama_print_timings:        load time =   22956.41 ms
llama_print_timings:      sample time =      29.58 ms /   100 runs   (    0.30 ms per token,  3380.89 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    9129.46 ms /   100 runs   (   91.29 ms per token,    10.95 tokens per second)
llama_print_timings:       total time =    9455.29 ms
Output generated in 9.84 seconds (10.16 tokens/s, 100 tokens, context 25, seed 1015765663)

Weird how ROCm doesn't work at all.

Set the env var HSAKMT_DEBUG_LEVEL=7 and see what it spits out. It should show you what it's doing, if anything, on the GPU side.

@irony

irony commented Dec 13, 2023

Great work!!

@JohnGalt1717

Any indication when the docker containers will be updated?

@DutchEllie

DutchEllie commented Dec 13, 2023

Set the env var HSAKMT_DEBUG_LEVEL=7 and see what it spits out. It should show you what it's doing, if anything, on the GPU side.

I did that. I did the following:

  1. I pulled the very freshest version of this repo's master branch.
  2. I ran the following command to compile: make LLAMA_HIPBLAS=1 -j48
  3. I ran the following command to test: ./main -m ../text-generation-webui/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -e -p "<s>[INST] How fast can a Toyota Supra go? [/INST]" --min-p 0.05 --top-p 1.0 -n 10 -ngl 10

At that point, nvtop reports 100% GPU usage and the power draw goes to about 137/327 watts. The screen briefly flashes black once and KDE gives me a scary "Desktop effects were restarted due to a graphics reset" message.
After that, the llama.cpp program hangs seemingly forever. I had to kill it.

This is the output of the program with the variable enabled like you said.
log from failure.txt

I did check using the completely default Oobabooga packages; that definitely does use the actual ROCm version, not the OpenCL version. When it loads a model (one that actually works), it will say "Using ROCm for GPU acceleration." or something similar. I would copy-paste a log to prove it, but you'll have to take my word for it. Why? Because llama.cpp messed with my GPU so much that I think I need to reboot; everything is going kinda crazy lmao.

Edit:
One reboot later, Ooba works again. Here you go, proof it's using ROCm:

... truncated ...
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =   86.04 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.05 MiB
... truncated ...

@clemens98

clemens98 commented Dec 13, 2023

It did load the model, but I can't input anything, and when I did press enter I got greeted with the space-invaders bug.

I really hope it's just a side effect of running out of VRAM and not a bad sign of what's to come.

Well, the GPU driver is dead. I don't know if it was caused by llama.cpp or just a ROCm conflict.

I removed everything ROCm-related, and with sudo modprobe everything works again.

@65a
Contributor

65a commented Dec 14, 2023

ROCm/hipBLAS inference works fine. Card is a W7900/gfx1100.

Timings:

llama_print_timings:        load time =   35904.56 ms
llama_print_timings:      sample time =     548.19 ms /   137 runs   (    4.00 ms per token,   249.91 tokens per second)
llama_print_timings: prompt eval time =   12865.74 ms /   986 tokens (   13.05 ms per token,    76.64 tokens per second)
llama_print_timings:        eval time =    4512.59 ms /   136 runs   (   33.18 ms per token,    30.14 tokens per second)
llama_print_timings:       total time =   40969.37 ms

Model is a 4x7B Mixtral Q8 quant; it seems to get about 30 tok/s.

@DutchEllie

ROCm/hipBLAS inference works fine. Card is a W7900/gfx1100.

That's a relief, to know it's just something on my side. I guess it's off to figure out how the hell to get it working, then. I assume you're not doing anything different from me to compile and run the code?

@65a
Contributor

65a commented Dec 14, 2023

@DutchEllie I had a working ROCm environment already. I recommend debugging your setup with the distro/package-specific forum (there's a lot that can go wrong there, and I've needed to file various bugs). It's difficult for a dev/maintainer to figure out whether a patch is bad or a setup is bad if debugging starts with the environment. The classic troubleshooting question is always: "Has it ever worked?" Get to yes on that, and then you can test things more clearly.

@clemens98

I really hope they get CLBlast to work. ROCm is hard to get working and apparently also drops support for older generations extremely fast.

@DutchEllie

DutchEllie commented Dec 15, 2023

I was able to get ROCm "working" using llama-cpp-python on the latest version by adding the GPU targets.
The command I ran was: CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DAMDGPU_TARGETS=gfx1100" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

However, you can see I still have issues with actually getting results.
I have not tested running an older version with this method, though.

Update: I get normal working results when I set the context size to anything <32768!
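
For reference, a minimal llama-cpp-python call of the kind I mean (the model path and layer count are placeholders for your own setup):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf",  # placeholder path
    n_ctx=4096,         # anything below 32768 behaved normally for me
    n_gpu_layers=20,    # however many layers fit in your VRAM
)
out = llm("[INST] How fast can a Toyota Supra go? [/INST]", max_tokens=64)
print(out["choices"][0]["text"])
```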

@clemens98

clemens98 commented Dec 15, 2023

I still have to use sudo modprobe amdgpu every time I start my PC to get my GPU detected. I am not touching ROCm again.

In case anyone else has this problem: DKMS (or something like that) blacklists the AMD driver in every installation attempt.
