How to run with -ngl parameter? #268

Closed · albertoZurini opened this issue May 23, 2023 · 9 comments
Labels: hardware (Hardware specific issue), performance

@albertoZurini commented May 23, 2023

Is your feature request related to a problem? Please describe.
I have a low-VRAM GPU and would like to use the Python binding. I can run LLaMA thanks to https://gist.github.com/rain-1/8cc12b4b334052a21af8029aa9c4fafc, but I don't understand whether the same is possible with this binding.

Describe the solution you'd like
I want to run the 13B model on my 3060.

Describe alternatives you've considered
https://gist.github.com/rain-1/8cc12b4b334052a21af8029aa9c4fafc

Additional context

@gjmulder added the hardware (Hardware specific issue) label on May 23, 2023
@gjmulder (Contributor) commented May 23, 2023

  1. Use a 5_1 quantized model. This lets you load the largest model on your GPU with the smallest amount of quality loss.
  2. Set n_ctx as you want. Note that increasing this parameter increases quality at the cost of performance (tokens per second) and VRAM.
  3. Run without the -ngl parameter and see how much free VRAM you have.
  4. Increase -ngl NN until you are using almost all of your VRAM (see the sketch after this list for the equivalent parameters in the Python binding).
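
In the Python binding the same knobs are exposed as keyword arguments on the Llama constructor. A minimal sketch, assuming a cuBLAS-enabled build of llama-cpp-python where n_gpu_layers corresponds to llama.cpp's -ngl flag (the model path and values are placeholders):

from llama_cpp import Llama

# n_gpu_layers mirrors -ngl: the number of layers to offload to the GPU.
# Start low, watch nvidia-smi, and raise it until VRAM is almost full.
llm = Llama(
    model_path="./models/7B/ggml-model-q5_1.bin",  # placeholder path to a 5_1 quantized model
    n_ctx=2048,        # context window; larger costs more VRAM and tokens/sec
    n_gpu_layers=32,   # equivalent of -ngl 32
)

output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])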

With a 7B model and an 8K context I can fit all the layers on the GPU in 6GB of VRAM. Similarly, the 13B model will fit in 11GB of VRAM:

llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 8196
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 1979.59 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 4633 MB
...................................................................................................
llama_init_from_file: kv self size  = 4098.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
$ nvidia-smi | grep python3
|    0   N/A  N/A   2222222      C   python3                                    5998MiB |

@albertoZurini (Author) commented May 23, 2023

I get this error when setting that parameter:

llama.cpp: loading model from LLaMA/13B/ggml-model-f16-q4_0.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file
Aborted (core dumped)

In particular, it fails on line 259 of llama_cpp.py.

@gjmulder (Contributor)

It looks like your model file is corrupt. Does it work with llama.cpp/main?

@albertoZurini (Author) commented May 24, 2023

Yes, it does. Could you please give me a hint on how to better debug the Python binding?

@gjmulder (Contributor)

That's really strange. The error 'std::runtime_error' looks like a C++ error.

DebuggingWithGdb maybe?
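
As a complement (not a substitute for a gdb backtrace of the C++ side), Python's standard-library faulthandler module can at least show the Python-side stack when the native code aborts. A minimal sketch, with the model path as a placeholder:

import faulthandler
faulthandler.enable()  # dump the Python traceback on SIGABRT/SIGSEGV and other fatal signals

from llama_cpp import Llama

# If the native loader aborts, the traceback shows which binding call triggered it.
llm = Llama(model_path="/models/ggml-old-vic13b-q4_0.bin")  # placeholder path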

@albertoZurini (Author) commented May 24, 2023

I think this may be an error due to the encoding of the file: I've tried downloading a pre-quantized model from https://huggingface.co/eachadea/ggml-vicuna-13b-1.1/tree/main and running it in Docker, but I'm getting a crash there as well:

llama.cpp: loading model from /models/ggml-old-vic13b-q4_0.bin
Illegal instruction (core dumped)

But this is strange: I followed the steps at the link in my first post, and they work just fine with llama.cpp, so I don't know why they don't work with the Python binding. How did you prepare the data?

@gjmulder (Contributor)

Try cloning llama-cpp-python, building the package locally as per the README.md, and then verifying whether the llama.cpp pulled in via llama-cpp-python works:

$ cd llama-cpp-python
$ cd vendor/llama.cpp
$ make -j
$ ./main -m /models/ggml-old-vic13b-q4_0.bin
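
If that binary loads the model, a quick follow-up check that the binding itself can load the same file might look like this (a minimal sketch, assuming the locally built llama-cpp-python is installed; the path is the one from the command above):

from llama_cpp import Llama

# If the model file matches the format expected by the vendored llama.cpp,
# this prints the usual llama_model_load_internal output instead of aborting.
llm = Llama(model_path="/models/ggml-old-vic13b-q4_0.bin")
print("model loaded OK")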

@albertoZurini (Author)

Thanks a lot, that was the issue: I was quantizing with a different version of llama.cpp.

@Huge (Contributor) commented May 29, 2023

I struggled with the same problem last week. It is caused by the recent breaking change to llama.cpp from ggerganov/llama.cpp#1305, right?
