Using metal and n_gpu_layers produces no tokens #30

Open
jasonw247 opened this issue Jan 4, 2024 · 5 comments

Comments

@jasonw247

I'm running the example script with a few different models:

use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

pub fn llama_predict() -> Result<String, anyhow::Error> {
    
    // metal seems to give really bad results
    let model_options = ModelOptions {
        // n_gpu_layers: 1,
        ..Default::default()
    };
    
    // let model_options = ModelOptions::default();

    let llama = LLama::new(
        "models/mistral-7b-instruct-v0.1.Q4_0.gguf".into(),
        &model_options,
    )
    .unwrap();

    let predict_options = PredictOptions {
        // top_k: 20,
        // top_p: 0.1,
        // f16_kv: true,

        token_callback: Some(Box::new(|token| {
            println!("token: {}", token);
            true
        })),
        ..Default::default()
    };

    // TODO: get this working on master. Metal support is flaky.
    let response = llama
        .predict(
            "what are the national animals of india".into(),
            predict_options,
        )
        .unwrap();
    println!("Response: {}", response);
    Ok(response)
}


#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_llama_cpp_rs() -> Result<(), anyhow::Error> {
        let response = llama_predict()?;
        println!("Response: {}", response);
        assert!(!response.is_empty());
        Ok(())
    }
}

When not using Metal (that is, without setting n_gpu_layers), the models generate tokens, e.g.:

token: ind
token: ian
token:  national
token:  animal
token:  is
token:  t
token: iger
token: 
Response: indian national animal is tiger
Response: indian national animal is tiger

When I set n_gpu_layers, no tokens are generated, e.g.:

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =   64.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 76.07 MiB
llama_new_context_with_model: max tensor size =   102.54 MiB
count 0
token:
token:
token:
token:
...
Response:
Response:

Is this a known behavior?
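
To tell "the callback fired but every token was blank" apart from "the callback never fired", the token stream can be tallied. This is only a minimal sketch: `TokenStats` is a hypothetical helper, not part of llama_cpp_rs, and wiring it into `token_callback` would need shared state whose exact form depends on the callback's trait bounds, which are assumed rather than verified here.

```rust
/// Hypothetical helper (not part of llama_cpp_rs): tallies the tokens
/// seen during one generation run.
#[derive(Debug, Default, PartialEq)]
struct TokenStats {
    /// Number of tokens passed to `record`.
    total: usize,
    /// Number of those tokens that were empty or whitespace-only.
    blank: usize,
}

impl TokenStats {
    /// Records one token as it arrives.
    fn record(&mut self, token: &str) {
        self.total += 1;
        if token.trim().is_empty() {
            self.blank += 1;
        }
    }

    /// True when at least one token arrived and every one of them was blank,
    /// which matches the symptom seen with n_gpu_layers set.
    fn all_blank(&self) -> bool {
        self.total > 0 && self.blank == self.total
    }
}

/// Convenience wrapper: builds the stats for a complete list of tokens.
fn summarize_tokens(tokens: &[&str]) -> TokenStats {
    let mut stats = TokenStats::default();
    for token in tokens {
        stats.record(token);
    }
    stats
}
```

Fed the CPU run above, `summarize_tokens` reports a single blank token (the trailing one), while a run that emits only empty tokens makes `all_blank()` return true.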

@pixelspark
Contributor

Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file ggml-metal.metal could not be found (it needs to be placed in the current working directory). After this the basic example works just fine for me (and actually uses the GPU) with a Mixtral GGUF model.
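
A quick way to rule this out before loading the model is to check for the shader file up front. The sketch below uses only the Rust standard library; the function names are made up for illustration, and it assumes (as described above) that the file is looked up in the current working directory under the name `ggml-metal.metal`.

```rust
use std::env;
use std::io;
use std::path::{Path, PathBuf};

/// Path where the Metal shader source is expected inside `dir`.
fn metal_shader_path(dir: &Path) -> PathBuf {
    dir.join("ggml-metal.metal")
}

/// Returns true if `ggml-metal.metal` exists as a regular file in `dir`.
fn metal_shader_present(dir: &Path) -> bool {
    metal_shader_path(dir).is_file()
}

/// Checks the current working directory, which is where the file
/// reportedly has to be placed.
fn check_cwd_for_metal_shader() -> io::Result<bool> {
    let cwd = env::current_dir()?;
    Ok(metal_shader_present(&cwd))
}
```

Calling `check_cwd_for_metal_shader()` before `LLama::new(...)` and printing a warning when it returns `Ok(false)` would make a missing file visible without enabling debug logging.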

@jasonw247
Author

I copied over the necessary Metal files; otherwise I would get an error. After copying the files, I ran into the issue of no tokens being generated.

@shaqq

shaqq commented Feb 11, 2024

> Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file ggml-metal.metal could not be found (it needs to be placed in the current working directory). After this the basic example works just fine for me (and actually uses the GPU) with a Mixtral GGUF model.

AFAIK it does: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#metal-build

@shaqq

shaqq commented Feb 11, 2024

llama-cpp-python requires the user to specify CMAKE_ARGS during pip install: https://llama-cpp-python.readthedocs.io/en/latest/install/macos/

Do users need to do something similar during cargo install for this crate?

@mikecvet

Reading through here, it seems like llama.cpp needs to be built with specific flags in order for metal support to work: ggerganov/llama.cpp#1642
