Bump llama-cpp-python to 0.2.23 (NVIDIA & CPU-only, no AMD, no Metal) #4924

Merged (1 commit) on Dec 14, 2023

Conversation

oobabooga (Owner) commented on Dec 14, 2023:

Adds Mixtral support.

Compiled using GitHub Actions workflows at https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels

The AMD and Metal workflows are failing, so I only have the NVIDIA and CPU wheels for now.

@mjameson commented:
Awesome, many thanks!!

oobabooga deleted the bump-llamacpp-mixtral branch on December 15, 2023 at 01:07.
@Fastmedic commented:
After this update, my token generation speed seems to be about 10x slower on my 3090 when running regular Llama models.

Also: https://www.reddit.com/r/Oobabooga/s/XqGCaA1Rtm

Ph0rk0z (Contributor) commented on Dec 15, 2023:

It has a problem offloading the KV cache. I forced it on and speed is back to normal, but I don't think most users can do that. All other (non-Mixtral) models will run at half speed.
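For context, a minimal sketch of what "forcing it on" could look like when loading a model directly with llama-cpp-python, assuming the installed build exposes the `offload_kqv` flag on `Llama` (newer releases do); the model path and layer count are placeholders:

```python
# Hypothetical example: keep the KV cache on the GPU when loading a model.
# Assumes the installed llama-cpp-python exposes `offload_kqv`; the model
# path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-13b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,    # offload all layers to the GPU
    offload_kqv=True,   # also keep the KV cache on the GPU
)
```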

oobabooga (Owner, Author) commented:
That's a bit of a conundrum, because the previous version does not support Mixtral. @Ph0rk0z, is it necessary to recompile llama-cpp-python to apply this fix, or can it be monkeypatched?

Ph0rk0z (Contributor) commented on Dec 15, 2023:

It's not the compiled lib, just the Python files. You can edit them under site-packages, I think. It's not a big patch; it's posted in the reddit thread.
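For illustration, a sketch of one way such a monkeypatch could look, assuming the installed llama_cpp accepts `offload_kqv` as a keyword argument; the actual patch in the reddit thread may differ:

```python
# Hypothetical monkeypatch: force offload_kqv=True for every Llama instance,
# applied before any model is loaded. Assumes the installed llama_cpp build
# accepts `offload_kqv` as a keyword argument; the real patch may differ.
import functools
import llama_cpp

_original_init = llama_cpp.Llama.__init__

@functools.wraps(_original_init)
def _patched_init(self, *args, **kwargs):
    kwargs.setdefault("offload_kqv", True)  # keep the KV cache on the GPU
    _original_init(self, *args, **kwargs)

llama_cpp.Llama.__init__ = _patched_init
```

Because it only wraps the Python-level constructor, this can be applied at runtime without rebuilding the CUDA wheels.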
