Inference on the model #1

ajaysurya1221 · 2023-03-06T13:27:03Z

Hi, could someone shed some light on how this model can be loaded and used for inference? I know this is early and everybody might be a little vague on this but still, only for educational purposes.

shawwn · 2023-03-07T21:02:45Z

Hiya. Yes, right this way: https://twitter.com/rowancrowe/status/1632676722612269057

Basically, clone https://github.com/shawwn/llama and use that for inferencing instead.

Note that it's using FP16 weights, not int8, so the memory requirements are 2x of the int8 quantized model. But personally I'm skeptical that the model can be quantized to int8 without harming its performance, and I don't need it anyway. Maybe I'll make it an option, but until then, you might want to try https://github.com/tloen/llama-int8 instead. (Note that you'll probably need to merge my improved sampler if you're seeing repetitive, low-quality outputs.)
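For a rough sense of the 2x figure, here is a back-of-envelope sketch. It assumes weights-only memory is simply parameter count times bytes per parameter (2 for FP16, 1 for int8) and uses nominal parameter counts of 7e9 and 13e9. Those counts and the helper name `weight_memory_gib` are illustrative assumptions, not taken from either repo, and activations and caches are ignored.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

# Nominal parameter counts (assumption: the real checkpoints differ slightly).
NOMINAL_PARAMS = {"7B": 7e9, "13B": 13e9}

for name, n_params in NOMINAL_PARAMS.items():
    fp16 = weight_memory_gib(n_params, 2)  # FP16: 2 bytes per weight
    int8 = weight_memory_gib(n_params, 1)  # int8: 1 byte per weight
    print(f"{name}: fp16 ~{fp16:.1f} GiB, int8 ~{int8:.1f} GiB")
```

Actual usage will be higher than these numbers, since the estimate leaves out everything except the weights.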

Also note that the repo is set up to use a context window of 2048, which will probably run out of memory on most video cards. So change "2048" to "512" in model.py if needed. (I'm not sure why this causes an OOM, since the default in example.py is 512, but I have no way to reproduce the bug.)
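One plausible contributor to the extra memory, sketched under assumptions: if the model pre-allocates key/value caches sized by the maximum sequence length, the cache grows linearly with the context window. The numbers below assume 7B-like dimensions (32 layers, hidden size 4096), FP16 cache entries, and a batch size of 1. The function name `kv_cache_gib` and all defaults are hypothetical and may not match what model.py actually allocates.

```python
def kv_cache_gib(max_seq_len: int, batch_size: int = 1,
                 n_layers: int = 32, dim: int = 4096,
                 bytes_per_value: int = 2) -> float:
    """Approximate size of pre-allocated key + value caches, in GiB."""
    keys_and_values = 2  # one cache for keys, one for values, per layer
    total_bytes = (keys_and_values * n_layers * batch_size
                   * max_seq_len * dim * bytes_per_value)
    return total_bytes / 2**30

for seq_len in (512, 2048):
    print(f"max_seq_len={seq_len}: ~{kv_cache_gib(seq_len):.2f} GiB per batch element")
```

Under these assumptions the cache scales with both `max_seq_len` and `batch_size`, so a large maximum batch size multiplies the cost of going from 512 to 2048.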

Have fun!

johndpope · 2023-03-08T02:37:28Z

Hey Shawn, not relevant - but would be cool to wire up this somehow
https://github.com/patrikzudel/PatrikZeros-ChatGPT-API-UI

randaller · 2023-03-09T10:19:00Z

Run it on a home desktop PC: https://github.com/randaller/llama-chat

jorahn · 2023-03-10T20:42:07Z

> Note that it's using FP16 weights, not int8, so the memory requirements are 2x of the int8 quantized model. But personally I'm skeptical that the model can be quantized to int8 without harming its performance, and I don't need it anyway. Maybe I'll make it an option, but until then, you might want to try https://github.com/tloen/llama-int8 instead. (Note that you'll probably need to merge my improved sampler if you're seeing repetitive, low-quality outputs.)

This is implemented here: https://github.com/jorahn/llama-int8

Straafe · 2023-03-10T23:22:21Z

@jorahn Nice, 13B working on my 3090

randaller · 2023-03-11T12:16:33Z

Hi @shawwn, I've implemented your repetition_penalty and top_k sampler in my repo (https://github.com/randaller/llama-chat) and it works great, so I'd just like to say thank you very much!!!
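For readers unfamiliar with these two knobs, here is a minimal, dependency-free sketch of the general technique. It is not the code from either repo. It assumes a common form of repetition penalty (divide positive logits of already-seen tokens by the penalty, multiply negative ones), followed by sampling from the softmax over the `top_k` highest logits. The function name, signature, and default values are hypothetical.

```python
import math
import random

def sample_next_token(logits, previous_tokens, top_k=40,
                      repetition_penalty=1.2, temperature=1.0, rng=random):
    """Pick a token id from `logits` (a list of floats indexed by token id)."""
    adjusted = list(logits)
    # Make tokens that were already generated less likely.
    for tok in set(previous_tokens):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(adjusted))
    top = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:k]
    # Softmax over the survivors (shifted by the max for numerical stability).
    scaled = [adjusted[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

# With top_k=1 the choice is greedy, so the effect of the penalty is easy to see.
logits = [1.0, 3.0, 2.0]
print(sample_next_token(logits, previous_tokens=[], top_k=1))
print(sample_next_token(logits, previous_tokens=[1], top_k=1, repetition_penalty=2.0))
```

In the second call, token 1 has already been emitted, so its logit is halved and token 2 becomes the greedy pick.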

G2G2G2G · 2023-03-12T02:32:34Z

ggerganov/llama.cpp#23

ggerganov/llama.cpp#20

Contributing chat support to this project would enable people to run it on basically any web server (assuming it has enough RAM). 7B only uses ~4 GB.

ajaysurya1221 changed the title ~~Infernce on the model~~ Inference on the model Mar 6, 2023

drakejwong added a commit to drakejwong/llama that referenced this issue Mar 10, 2023

https://github.com/shawwn/llama-dl/issues/1#issuecomment-1458870564

f2356c2

drakejwong added a commit to drakejwong/llama that referenced this issue Mar 10, 2023

https://github.com/shawwn/llama-dl/issues/1#issuecomment-1458870564

a468a67
