Issue with C++ inference for model meta-llama/Llama-2-70b-hf #1452

Open
DDDDDYTS opened this issue Jul 24, 2024 · 0 comments
I was trying the C++ versions of spec_infer and incr_decoding in FlexFlow/inference and found that, when using the model meta-llama/Llama-2-70b-hf, the following error occurred:

 inference/file_loader.cc:252: void load_attention_weights_v2(DT*, int, int, size_t, size_t, std::string, std::string, size_t, int) [with DT = __half; size_t = long unsigned int; std::string = std::__cxx11::basic_string<char>]: Assertion `false && "data size mismatch"' failed.

After checking issue #1154 and the source code of llama.cc and file_loader.cc, I realized the cause: in this model the number of attention heads and the number of key-value heads are not equal, but the fix from issue #1154 was only applied to the Python version.
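
To see why the loader hits the assertion, it helps to work through the weight shapes. Below is a minimal sketch, assuming the usual Llama-2-70b-hf config values (hidden_size 8192, 64 attention heads, 8 key-value heads); the numbers are illustrative and not taken from FlexFlow itself:

  #include <cstddef>
  #include <cstdio>

  int main() {
    // Assumed Llama-2-70b-hf config values (grouped-query attention)
    std::size_t hidden_size = 8192;
    std::size_t num_attention_heads = 64;
    std::size_t num_key_value_heads = 8;
    std::size_t head_dim = hidden_size / num_attention_heads; // 128

    // Q projection weight: hidden_size x (num_attention_heads * head_dim)
    std::size_t q_elems = hidden_size * num_attention_heads * head_dim;
    // K/V projection weight: hidden_size x (num_key_value_heads * head_dim)
    std::size_t kv_elems = hidden_size * num_key_value_heads * head_dim;

    // A loader that assumes num_key_value_heads == num_attention_heads
    // expects 8x more K/V elements than the checkpoint actually contains,
    // hence the "data size mismatch" assertion.
    std::printf("K/V elems expected if KV heads == Q heads: %zu\n", q_elems);
    std::printf("K/V elems actually in the checkpoint:      %zu\n", kv_elems);
    return 0;
  }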

I added a num_key_value_heads parameter to the config in FlexFlow/inference/models/llama.h and passed it to the FileDataLoader constructor in FlexFlow/inference/models/llama.cc, and inference then worked.

FlexFlow/inference/models/llama.h

LLAMAConfig(std::string const &model_config_file_path) {
      std::ifstream config_file(model_config_file_path);
      if (config_file.is_open()) {
        try {
          json model_config;
          config_file >> model_config;
          num_hidden_layers = model_config["num_hidden_layers"];
          vocab_size = model_config["vocab_size"];
          num_attention_heads = model_config["num_attention_heads"];
          hidden_size = model_config["hidden_size"];
          rms_norm_eps = model_config["rms_norm_eps"];
          intermediate_size = model_config["intermediate_size"];
          num_key_value_heads = model_config["num_key_value_heads"]; // modified: read the KV head count from the config
        } catch (json::exception const &e) {
          std::cerr << "Error parsing LLAMA config from JSON file: " << e.what()
                    << std::endl;
          assert(false);
        }
      } else {
        std::cerr << "Error opening JSON file " << model_config_file_path
                  << std::endl;
        assert(false);
      }
      // max_seq_len = BatchConfig::MAX_SEQ_LENGTH;
      // max_num_tokens = BatchConfig::MAX_NUM_TOKENS;
      max_beam_width = BeamSearchBatchConfig::MAX_BEAM_WIDTH;
      max_beam_depth = BeamSearchBatchConfig::MAX_BEAM_DEPTH;
    }
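
For configs that do not carry a num_key_value_heads entry (plain multi-head-attention checkpoints), the parse could fall back to num_attention_heads. This is a minimal sketch using the same nlohmann::json object as above; the fallback is my own suggestion, not part of the patch:

          // Hypothetical fallback: default the KV head count to the Q head
          // count when the config omits num_key_value_heads (standard MHA).
          num_key_value_heads =
              model_config.value("num_key_value_heads", num_attention_heads);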

FlexFlow/inference/models/llama.cc


  FileDataLoader *fileloader = new FileDataLoader(
      "",
      weight_file_path,
      llama_config.num_attention_heads,
      llama_config.num_key_value_heads, // modified: pass the KV head count to the loader
      llama_config.hidden_size,
      llama_config.hidden_size / llama_config.num_attention_heads,
      ff.config.tensor_parallelism_degree,
      use_full_precision);
