
llama2 70B cause OOM #83

Open
congdamaS opened this issue Aug 24, 2023 · 4 comments

Comments

congdamaS commented Aug 24, 2023

When testing with Llama 2 70B, the required memory is too large (>250 GB).
This issue does not occur with the original (English) lm-evaluation-harness.
What settings are needed to evaluate Llama 2 (70B)?
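
For a rough sense of scale, the weights alone dominate the footprint. The sketch below (my own back-of-the-envelope numbers, weights only, no activations or KV cache) would explain a >250 GB footprint if the model is being loaded in full precision, while float16 roughly halves it:

# Rough, weights-only memory estimate for a 70B-parameter model
# (activations, KV cache, and framework overhead are ignored).
params = 70e9  # assumed parameter count

for dtype_name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{dtype_name}: ~{params * bytes_per_param / 1e9:.0f} GB for weights alone")

# Prints approximately: float32 ~280 GB, float16 ~140 GB, int8 ~70 GB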

mkshing commented Oct 11, 2023

@congdamaS we're planning to test 70B models soon, so we'll get back to you after that. Thanks!

yumemio commented Dec 6, 2023

+1 on this. I'm evaluating an unquantized 7B model (stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b), but this eval script is eating 26GB of VRAM. Running inference with the same model using a bare-minimum transformers snippet consumes about 15GB. Does this script load anything other than the model itself onto the GPU?
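
For anyone comparing these numbers, here is roughly how the transformers-only figure can be reproduced (a minimal sketch, assuming a single CUDA device; the reported peak covers the resident float16 weights plus forward activations):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto"
)

# Reset peak stats after loading, run one forward pass, then read the peak.
# The peak includes the currently allocated weights plus forward activations.
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("test input", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")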

@dakotamahan-stability

Not sure which branch you're on, but

python main.py --model hf-causal-experimental --model_args pretrained=meta-llama/Llama-2-70b-chat-hf,dtype=float16,use_accelerate=True --no_cache --num_fewshot=25 --tasks arc_challenge

works just fine with 70B-parameter models on an A40 node.

yumemio commented Dec 7, 2023

Hi @dakotamahan-stability and thanks for the reply!

Sorry for the lack of information. I'm on the jp-stable branch (commit effdbea). Here's an example notebook that reproduces the issue:

Gist (example notebook)

I'm running this notebook on a Colab Pro+ VM. The eval script throws an OOM error when run with a V100 GPU (w/ 16.0 GB of VRAM):

Running loglikelihood requests
  0% 0/5595 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/lm-evaluation-harness/main.py", line 121, in <module>
    results = main(args, description_dict_path, output_path)
  File "/content/lm-evaluation-harness/main.py", line 96, in main
    results = evaluator.simple_evaluate(**eval_args)
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/evaluator.py", line 87, in simple_evaluate
    results = evaluate(
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/evaluator.py", line 287, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 980, in fn
    rem_res = getattr(self.lm, attr)(remaining_reqs)
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 193, in loglikelihood
    return self._loglikelihood_tokens(new_reqs)
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 303, in _loglikelihood_tokens
    self._model_call(batched_inps), dim=-1
  File "/content/lm-evaluation-harness/lm_eval/models/gpt2.py", line 120, in _model_call
    return self.gpt2(inps)[0]
...
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

What's strange is that the code below uses only around 14.3 GB of VRAM on the exact same machine:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Setup model and tokenizer
model_name = "stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")

def format_prompt(input_text):
    # System prompt: "You are a helpful assistant." / instruction: "Please answer the user's question."
    prompt_template = """<s>[INST] <<SYS>>\nあなたは役立つアシスタントです。\n<<SYS>>\n\nユーザの質問に答えてください。\n\n{input}[/INST]"""
    return prompt_template.format(input=input_text)

def generate_text(input_text):
    formatted_prompt = format_prompt(input_text)
    input_ids = tokenizer.encode(
        formatted_prompt,
        add_special_tokens=False,
        return_tensors="pt"
    )

    # Set seed for reproducibility
    seed = 23
    torch.manual_seed(seed)

    tokens = model.generate(
        input_ids.to(device=model.device),
        max_new_tokens=1024,
        temperature=0.99,
        top_p=0.95,
        do_sample=True,
    )

    # Remove the input tokens from the generated tokens before decoding
    output_tokens = tokens[0][len(input_ids[0]):]
    return tokenizer.decode(output_tokens, skip_special_tokens=True)

prompt = "もう冬ですね。最近は寝室が寒くて寝られません。どうすればいいですか?"  # "It's winter now. My bedroom has been too cold to sleep in lately. What should I do?"
generated_text = generate_text(prompt)
print(generated_text)

I'm wondering whether I've misconfigured the eval script, or whether the script is prefetching/preloading the dataset onto the GPU (which would make sense, given how short the prompt in the snippet above is).
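
One hunch: 26 GB is suspiciously close to 7 billion parameters × 4 bytes, i.e. float32 weights, whereas the ~14.3 GB above matches float16. Here is a small, hypothetical helper (not part of the harness) for checking what the eval script actually loads, e.g. called on its model object from a debugger or a temporary print:

import torch

def weight_footprint(model: torch.nn.Module) -> str:
    # Hypothetical debugging helper: summarize parameter count, dtypes,
    # and total weight memory of an already-loaded model.
    total_params = sum(p.numel() for p in model.parameters())
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    dtypes = {str(p.dtype) for p in model.parameters()}
    return (f"{total_params / 1e9:.1f}B params, dtypes={dtypes}, "
            f"weights ~{total_bytes / 1e9:.0f} GB")

# For the 7B model above, ~28 GB of weights would point to float32, ~14 GB to float16.

If it does turn out to be float32, passing dtype=float16 in --model_args (as in the command earlier in this thread) might be enough to close the gap.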
