llama : add Mixtral support #4381
Seconded! EDIT: There is an early implementation here: https://github.com/dzhulgakov/llama-mistral
Does the hacky implementation support quantization?
It barely supports inference, lol
What's the general effort in modifying gguf-py / convert.py to support quantizing/converting this Mistral MoE architecture?
Here's a hacked-together conversion script: lxe@2dd8944
Nice!
Conversion worked! Inference fails with "error loading model: unknown model architecture: 'moe'"; any tips?
Hi @lxe, so your script supports conversion but not inference?
It's a work in progress. It can convert to GGUF format (with wrong layer names), but there's no logic for quantizing under 8 bits or for loading and running the model. Yet.
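For anyone wondering what such a conversion script boils down to, here is a minimal sketch using the gguf Python package from the llama.cpp repo. The architecture string, the tensor-name mapping, and the checkpoint filename are placeholders; the name mapping is exactly the part that was still wrong at this point.

```python
# Minimal sketch of a checkpoint -> GGUF conversion loop with the gguf package.
# The arch string, the name mapping, and the shard filename are placeholders,
# not the convention the eventual Mixtral PR settled on.
import gguf
import torch

ARCH = "llama"  # placeholder; the hacked script wrote "moe", which llama.cpp did not recognize

writer = gguf.GGUFWriter("mixtral-f16.gguf", ARCH)
writer.add_block_count(32)  # number of transformer layers

state_dict = torch.load("consolidated.00.pt", map_location="cpu")  # example shard name
for ckpt_name, tensor in state_dict.items():
    # The real work is renaming checkpoint tensors (per-expert w1/w2/w3, the
    # router/gate, shared attention weights) to whatever the loader expects.
    gguf_name = ckpt_name  # placeholder 1:1 mapping
    writer.add_tensor(gguf_name, tensor.to(torch.float16).numpy())

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```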
So how much RAM or VRAM is needed to run it?
Same here, mystified by what the problem is. Apple Silicon. Update: I'm dumb, I just downloaded the script. With all files it creates a 93 GB half-precision model. I've got a system that in theory can run this (128 GB M3 Max) but still get an error:
It requires changes to llama.cpp. Each model type has some custom code; it's not all generic. Check out the source code and see what kind of custom stuff there is for Qwen etc.
@lxe just curious why your quantization script doesn't support int4?
This is probably related? #2672
Posting here for visibility: we know the general implementation of the MoE layer. From the Mistral Discord:
There is also an (unofficial) attempt at implementing it in Hugging Face Transformers here.
I hope it can be quantized down to a reasonable size. What are the chances that it can eventually be run on a consumer-level laptop with reasonable results?
I wonder if llama.cpp could modify the routing for this model to produce at least N tokens with the currently selected 2 experts, and only after N tokens check the routing again and, if needed, load the other two experts, and so forth. That way we would only need to keep 2 experts in memory at a time, and the cost of "swapping" models would be amortized over N tokens. It remains to be seen whether the routing is really that unstable, but it seems unlikely.
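To make that proposal concrete, here is a rough sketch of the control flow; everything in it is an illustrative stub, not llama.cpp code:

```python
# Sketch of the "sticky routing" idea: re-run the router only every N tokens,
# so at most two experts need to be resident and the cost of swapping expert
# weights is amortized over N tokens. All helpers are illustrative stubs.
import random

N = 16            # re-route only every N tokens
loaded = None     # ids of the two experts currently resident in memory

def route(token_idx):                    # stub router: pretend it picks a top-2 set
    random.seed(token_idx // N)          # deterministic per block, just for the demo
    return tuple(sorted(random.sample(range(8), 2)))

def load_experts(ids):                   # stub: this is where the expensive swap would go
    print(f"swapping in experts {ids}")

def run_two_experts(ids, token_idx):     # stub: weighted sum of the two expert FFNs
    return token_idx

for i in range(64):
    if i % N == 0:                       # reconsider the routing only every N tokens
        wanted = route(i)
        if wanted != loaded:
            load_experts(wanted)
            loaded = wanted
    _ = run_two_experts(loaded, i)
```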
Yes, as far as I understand it, it only needs the compute resources of a 14B model, so it shouldn't be too hard to run. I guess the main catch is that you need all experts to be in RAM so they can be loaded as quickly as possible.
I am hoping for an option that will let me load models into memory only when they're used, and keep them on disk when not used. I know this would be super slow, but it'd let me and many other people run it when we otherwise would have no hope, barring other developments.
My understanding is that there are 8 experts plus a router at every layer. It's making choices about which experts to use at every layer, not up front. Memory bandwidth and FLOPs are those of a ~14B model, but RAM is needed for all the weights.
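Some rough back-of-the-envelope numbers, assuming the commonly cited ~47B total / ~13B active parameters for Mixtral 8x7B and approximate bits-per-weight for the llama.cpp quant formats:

```python
# Approximate memory footprint vs. per-token compute for Mixtral 8x7B.
total_params  = 46.7e9   # all 8 experts + shared attention/embeddings (approx.)
active_params = 12.9e9   # ~2 experts + shared weights actually used per token (approx.)

for name, bits_per_weight in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    gib = total_params * bits_per_weight / 8 / 2**30
    print(f"{name:7s} ~{gib:4.0f} GiB of weights must be resident")
# F16 comes out around ~87 GiB, consistent with the ~93 GB file mentioned above,
# while a 4-bit quant lands in the ~26 GiB range.

print(f"active/total ≈ {active_params / total_params:.0%}, so per-token compute is like a ~13B dense model")
```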
I think each expert gets chosen on a per-token basis, so it has to happen very fast, and unfortunately it would put an enormous strain on an SSD. I don't think it's feasible, but options are always welcome. But hey, that's just me speculating!
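For reference, the per-token, per-layer top-2 routing being described looks roughly like this. It's a generic sparse-MoE sketch in numpy with tiny random weights, not the actual Mixtral or llama.cpp code (the real FFN is gated, not a plain ReLU MLP):

```python
# Generic top-2 MoE feed-forward sketch: at every layer a small router scores
# the 8 experts for EACH token, the best 2 are run, and their outputs are
# blended by a softmax over the selected router scores.
import numpy as np

n_expert, n_used, d, d_ff = 8, 2, 64, 256   # tiny dims, illustration only
rng = np.random.default_rng(0)
router_w = rng.standard_normal((d, n_expert)) * 0.02
experts = [(rng.standard_normal((d, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d)) * 0.02) for _ in range(n_expert)]

def moe_ffn(x):                                 # x: hidden state of ONE token
    logits = x @ router_w                       # one score per expert
    top = np.argsort(logits)[-n_used:]          # the 2 experts chosen for this token
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                # softmax over the selected experts
    out = np.zeros_like(x)
    for weight, e in zip(w, top):
        w_in, w_out = experts[e]
        out += weight * (np.maximum(x @ w_in, 0.0) @ w_out)   # toy ReLU FFN
    return out

y = moe_ffn(rng.standard_normal(d))   # the next token may well pick different experts
```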
I guess it is fine to use the
Seems like TheBloke has model weights in the works: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF
Looks like this might be the inference code.
So actually this PR works, also on Apple Metal, giving 10 tokens/s: #4406 https://www.loom.com/share/7b25b0da48bd4543a86c1d223f876597?sid=54c6dc1a-65ac-4067-ab29-a2371f146868
Does the new update support Mixtral 8x7B?
The PR is not yet merged.
Someone in the PR thread was able to offload some layers onto their RX 6700, but I haven't gotten that to work for my RX 7900 XTX. In this project all I got was a flashed screen and a frozen GPU; with Ooba it loaded the layers, but for inference it called into ggml-cuda.c, breaking it and dumping core. Obviously for both I used the custom branch.
You can try the MLX framework if you are using Apple Silicon: https://github.com/ml-explore/mlx-examples/tree/main/mixtral
So far it has almost always worked with CLBlast on my RX 7900 XT.
Yes, particularly Mixtral 8x7B. I thought the ROCm version was the hipBLAS one? That's the one I compiled. For Ooba I used the llama-cpp-python package and swapped out the included llama.cpp project with the Mixtral branch from here, then compiled and installed the package with the hipBLAS implementation. That's when I got errors.
It is, but I never managed to get ROCm to work at all.
You bring up an interesting point: I have no idea whether I have been using the hipBLAS version or the CLBlast version all this time, even before Mixtral. I believe, but am unsure, that I checked whether normal models worked with my hipBLAS-compiled binary, but again, I forget.
So I followed the ROCm installation first, but as you said, that didn't really work well. Not even the normal Mistral v0.1 model would run properly; same error as with Mixtral: it uses ggml-cuda.c and crashes, core dumped. Happens with both llama.cpp and llama-cpp-python in Ooba. Using the CLBlast version works better: I see significant improvements on both CPU and GPU (CPU went from 3 tps to about 6, GPU from crashing to 10 tps). However, I don't think I've been using CLBlast all this time, as the llama-cpp-python version installed by Ooba's AMD requirements is a ROCm version. Also, it might be because I can only offload 20/33 layers, but initial loading is very slow:
Weird how ROCm doesn't work at all.
llm_load_tensors: offloading 22 repeating layers to GPU
system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
[end of text]
llama_print_timings: load time = 7587.29 ms
That's OT, but it would be exciting to know more about the experts and to be able to fine-tune a particular expert.
@ZihaoTan with this PR I'm able to run a quantized GGUF on 36 GB (24 GB allocated to the GPU).
Set the env var HSAKMT_DEBUG_LEVEL=7 and see what it spits out. It should show you what it's doing, if anything, on the GPU side.
Great work!!
Any indication when the Docker containers will be updated?
I did that. I did the following:
At that point, this is the output of the program with the variable enabled, like you said. I did check using the completely default Oobabooga packages; that definitely does use the actual ROCm versions, not the OpenCL version. When it loads a model (that actually works) it will say "Using ROCm for GPU acceleration." or something similar. I would copy-paste a log to prove it, but you'll have to take my word for it. Why? Because llama.cpp messed with my GPU so much that I think I need to reboot; everything is going kinda crazy lmao. Edit:
It did load the model, but I can't input anything. I really hope it's just a side effect of running out of VRAM and not a bad sign of what is to come. Well, the GPU driver is dead. I removed everything ROCm-related, and with sudo modprobe everything works again.
ROCm/hipBLAS inference works fine. Card is a W7900 (gfx1100). Timings:
Model is a 4x7B Mixtral Q8 quant; it seems to get about 30 tok/s.
That's a relief, to know it's just something on my side. I guess it's off to figure out how the hell to get it working now. I don't assume you're doing anything different from me to compile and run the code?
@DutchEllie I had a working ROCm environment already. I recommend debugging your setup with the distro/package-specific forum (there's a lot that can go wrong there, and I've needed to file various bugs). It's difficult for a dev/maintainer trying to figure out whether a patch is bad or a setup is bad when the debugging starts with the environment. The classic troubleshooting question is always: "Has it ever worked?" Get to yes on that, and then you can test things more clearly.
I really hope they get CLBlast to work. ROCm is hard to get working and apparently also drops support for older generations extremely fast.
I was able to get ROCm "working" using llama-cpp-python on the latest version by adding the GPU targets. However, you can see I still have issues with actually getting results. Update: I get normal working results when I set the context size to anything below 32768!
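For others hitting the same thing, this is roughly what those knobs look like through llama-cpp-python; the model path and the numbers are placeholders, the point is n_ctx and n_gpu_layers:

```python
# Example of the two knobs discussed above: context size and GPU layer offload.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,         # context size; the full 32768 is what caused problems above
    n_gpu_layers=20,    # partial offload; raise until you run out of VRAM
)
out = llm("Q: What does a mixture-of-experts router do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```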
I still have to use sudo modprobe amdgpu every time I start my PC to get my GPU detected; I am not touching ROCm again. In case anyone else has this problem: DKMS (something like that) blacklists the AMD driver in every installation attempt.
Hi,
Please add support for Mistral's MoE model, Mixtral.