Inference on the model #1
Comments
Hiya. Yes, right this way: https://twitter.com/rowancrowe/status/1632676722612269057 Basically, clone https://github.com/shawwn/llama and use that for inference instead. Note that it uses FP16 weights, not int8, so the memory requirements are twice those of the int8-quantized model. Personally, I'm skeptical that the model can be quantized to int8 without harming its performance, and I don't need it anyway. Maybe I'll make it an option, but until then, you might want to try https://github.com/tloen/llama-int8 instead. (Note that you'll probably need to merge my improved sampler if you're seeing repetitive, low-quality outputs.) Also note that the repo is set up to use a context window of 2048, which will probably run out of memory on most video cards, so change "2048" to "512" in model.py if needed. (I'm not sure why this causes an OOM, since the default in example.py is 512, but I have no way to reproduce the bug.) Have fun!
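To put rough numbers on the memory remarks above, here is a back-of-the-envelope sketch. It is only an estimate under stated assumptions: nominal parameter counts, 2 bytes per FP16 value and 1 byte per int8 value, and a key/value cache that is preallocated for the full context window. The layer count (32) and hidden size (4096) used for the 7B case are assumptions, not figures from this thread, and activations and framework overhead are ignored.

```python
def weight_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory needed just to hold the weights."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, dim: int, max_batch_size: int,
                   max_seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a preallocated key/value cache (assumed layout:
    one key tensor and one value tensor per layer, each of shape
    batch x seq_len x dim)."""
    return 2 * n_layers * max_batch_size * max_seq_len * dim * bytes_per_elem


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB for display."""
    return n_bytes / 2**30


if __name__ == "__main__":
    n_params = 7_000_000_000  # nominal "7B"; an assumption
    print(f"FP16 weights: {gib(weight_bytes(n_params, 2)):.1f} GiB")
    print(f"int8 weights: {gib(weight_bytes(n_params, 1)):.1f} GiB")
    for seq_len in (2048, 512):
        cache = kv_cache_bytes(n_layers=32, dim=4096,
                               max_batch_size=1, max_seq_len=seq_len)
        print(f"KV cache at context {seq_len}: {gib(cache):.2f} GiB per sequence")
```

Under these assumptions the FP16 weights take exactly twice the int8 figure, and the cache grows linearly with both context length and batch size, so dropping the context window from 2048 to 512 cuts the cache to a quarter.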
Hey Shawn, not relevant, but it would be cool to wire this up somehow
Run it on a home desktop PC: https://github.com/randaller/llama-chat
This is implemented here: https://github.com/jorahn/llama-int8
@jorahn Nice, 13B working on my 3090
Hi @shawwn, I've implemented your repetition_penalty and top_k sampler in my repo (https://github.com/randaller/llama-chat) and it works great, so I would just like to say thank you very much!
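For readers unfamiliar with the two sampler ideas mentioned here, a minimal standard-library sketch follows. It is not the code from either repository: the function names are hypothetical, and the penalty rule (divide positive logits by the penalty, multiply negative ones) is one common convention assumed for illustration.

```python
import math
import random


def apply_repetition_penalty(logits, prev_tokens, penalty):
    """Make already-generated tokens less likely.

    Assumed convention: positive logits are divided by `penalty`,
    non-positive logits are multiplied by it (penalty > 1).
    """
    out = list(logits)
    for t in set(prev_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token id from the k highest-scoring logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, -1.0, 0.5]
    history = [0, 2]  # token ids generated so far
    penalized = apply_repetition_penalty(logits, history, penalty=2.0)
    print(penalized)
    print(sample_top_k(penalized, k=2, rng=random.Random(0)))
```

Applying the penalty before the top-k cut means a heavily repeated token can fall out of the candidate set entirely, which is one plausible reason such a sampler reduces repetitive output.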
Contributing to this project with chat would enable people to run it on basically any web server (assuming they had enough RAM); 7B only uses ~4 GB
Hi, could someone shed some light on how this model can be loaded and used for inference? I know this is early and everybody might be a little vague on this but still, only for educational purposes.