Mamba #694

JoaoVictorVP · 2024-04-25T13:44:51Z

Is mamba supported already in the current version of llama.cpp this library uses?

martindevans · 2024-04-25T13:47:32Z

The binaries in the latest release (0.11.1) are a little too old. The ones in the master branch were compiled after that PR was merged, so in theory they should include mamba support. I'd be interested to hear how that goes if you try it!

JoaoVictorVP · 2024-04-25T14:34:16Z

Oh, I'm using the 0.11.2 (from nuget). I tried copying the binaries from the master and replacing the /bin ones with it.
Surprisingly it loaded the model which was not being the case before. But crashed without any printed errors and with exit code -1073740791 after 3 seconds I started an inference, not sure if it is because I'm using the same 0.11.2 or not. Just after the inferenced started and before crashing, it also outputted this to the console:

GGML_ASSERT: D:\a\LLamaSharp\LLamaSharp\llama.cpp:10282: n_threads > 0
(I tested with different configs over 'Threads' in ModelParams)

Is a new build of the nuget packet from master needed in this case?

martindevans · 2024-04-25T14:37:16Z

I tried copying the binaries from the master

That won't work I'm afraid. The llama.cpp API is unstable so every time the binaries are updated there are various internal changes on the C# side to work with the changed API. You always need to use the correc set of binaries with the correct version of the C# code.

JoaoVictorVP · 2024-04-25T15:13:21Z

Yep. Just compiled the main package and the CPU version and the results were the same, same exit code and assertion log.
Maybe I could inspect the sources this weekend to try finding any cause for this, or do you have any ideas in mind for why this is happening?

martindevans · 2024-04-25T15:32:11Z

I don't have any ideas at the moment. I know Mamba is a bit of an unusual archtecture just because I've seen various comments inside llama.cpp about how certain APIs needs to be adjusted for Mamba, or don't quite make sense in a Mamba context. We'd definitely be interested in any investigations/PRs for Mamba support!

JoaoVictorVP · 2024-04-25T15:41:14Z

Ops, correction.

It actually worked, I suspected it was because nuget was caching the package (0.11.2) from the remote nuget (because I built the project using the 0.11.2 version), then I deleted the cache and now it works.

The outputs are very strange tho, but I suspect this is because I'm not formatting the inputs yet (for the tests), see here:

Also, the token limit is not working so I did my own with the output transformer to test here.

(This is a very small model as well, but compared to something like Phi3 it is very crude)
(On the other hands, even with weird responses the initial token time do not increase absurdly like with Phi3 model, so it seems like at least a partial win)

AsakusaRinne · 2024-04-26T09:08:13Z

I suspected it was because nuget was caching the package (0.11.2) from the remote nuget (because I built the project using the 0.11.2 version), then I deleted the cache and now it works.

Yes, nuget caches the package and will not take your compiled one if it has the same version tag.

The outputs are very strange

That's an unexpected thing. What prompt were you using? If you have cmake installed on your PC, you could also try to run the same model and prompt directly in llama.cpp to see if the output is still in a mess.

JoaoVictorVP · 2024-04-27T14:34:26Z

That's an unexpected thing. What prompt were you using? If you have cmake installed on your PC, you could also try to run the same model and prompt directly in llama.cpp to see if the output is still in a mess.

About this, the model was one of the unique I was able to find in hugging face in GGUF that was actually mamba (MambaHermes 3B).

I tested it with the same formatting with the same processor I made for Phi3 and it also "kinda worked" (the responses were then very short, but more coherent). I also got it working a little better with the version quantized with 6-bits instead of 4.

But I realized something a little strange, there is something on the implementation of llama.cpp that makes models run progressively slower? I thought it was because I was using transformed-based models before, but even with mamba the time for initial token is many times increasing absurdly with each message (like, from 1 second to the first token to 5, then 10, then 26, etc).
(I'm asking because later I tested the same Phi3 model [not the mamba yet] in LM Studio and the time per first token was not changing so much, more like 1-3 seconds per message at max)

One of my tests where they performed reasonably well:
Q6 https://gist.github.com/JoaoVictorVP/92f6f30ad9d3c3dc343fdf0d7685685f
Q4 https://gist.github.com/JoaoVictorVP/f4de9ee658108898eaefa2c58c37938d

AsakusaRinne · 2024-04-28T07:43:50Z

there is something on the implementation of llama.cpp that makes models run progressively slower?

AFAIK, there's no such thing in llama.cpp. Could you please post the huggingface model link here so that we can try to reproduce this case?

(I'm asking because later I tested the same Phi3 model [not the mamba yet] in LM Studio and the time per first token was not changing so much, more like 1-3 seconds per message at max)

Though LM studio is not open-source, if I remember correctly, it also uses llama.cpp as the backend. As you mentioned above, phi-3 works well in LM studio while mamba becomes slower in llama.cpp. It doesn't indicate that it's llama.cpp's problem, but also probably the model's problem. Could you please try mamba in LM-studio, or try phi-3 with llama.cpp/LLamaSharp?

martindevans · 2024-04-28T15:07:29Z

You'll get a progressive slowdown if you are using a stateless executor, and submitting larger and larger chat history each time. The stateful executors internally store the chat history and should be around the same time for every token. I'm not sure exactly how the situation there differs for Mamba, but it should be roughly the same afaik.

martindevans · 2024-06-26T00:27:56Z

I'll close this one now, since mamba is now supported. If there's still problems please don't hesitate to re-open or to create new issues :)

martindevans closed this as completed Jun 26, 2024

martindevans added the enhancement New feature or request label Jun 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mamba #694

Mamba #694

JoaoVictorVP commented Apr 25, 2024

martindevans commented Apr 25, 2024

JoaoVictorVP commented Apr 25, 2024

martindevans commented Apr 25, 2024

JoaoVictorVP commented Apr 25, 2024

martindevans commented Apr 25, 2024

JoaoVictorVP commented Apr 25, 2024 •

edited

Loading

AsakusaRinne commented Apr 26, 2024

JoaoVictorVP commented Apr 27, 2024 •

edited

Loading

AsakusaRinne commented Apr 28, 2024

martindevans commented Apr 28, 2024

martindevans commented Jun 26, 2024

Mamba #694

Mamba #694

Comments

JoaoVictorVP commented Apr 25, 2024

martindevans commented Apr 25, 2024

JoaoVictorVP commented Apr 25, 2024

martindevans commented Apr 25, 2024

JoaoVictorVP commented Apr 25, 2024

martindevans commented Apr 25, 2024

JoaoVictorVP commented Apr 25, 2024 • edited Loading

AsakusaRinne commented Apr 26, 2024

JoaoVictorVP commented Apr 27, 2024 • edited Loading

AsakusaRinne commented Apr 28, 2024

martindevans commented Apr 28, 2024

martindevans commented Jun 26, 2024

JoaoVictorVP commented Apr 25, 2024 •

edited

Loading

JoaoVictorVP commented Apr 27, 2024 •

edited

Loading