
llama2 70B cause OOM #83

Open
congdamaS opened this issue Aug 24, 2023 · 4 comments

Comments

congdamaS commented Aug 24, 2023

When testing with Llama 2 70B, the required memory is too large (>250 GB).
This issue does not occur with the original (English) lm-evaluation-harness.
What settings are needed to evaluate Llama 2 (70B)?
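
For a rough sense of scale, the weights alone dominate the footprint. The sketch below (my own back-of-the-envelope numbers, weights only, no activations or KV cache) would explain a >250 GB footprint if the model is being loaded in full precision, while float16 roughly halves it:

# Rough, weights-only memory estimate for a 70B-parameter model
# (activations, KV cache, and framework overhead are ignored).
params = 70e9  # assumed parameter count

for dtype_name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{dtype_name}: ~{params * bytes_per_param / 1e9:.0f} GB for weights alone")

# Prints approximately: float32 ~280 GB, float16 ~140 GB, int8 ~70 GB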

mkshing commented Oct 11, 2023

@congdamaS we're planning to test 70B models soon, so we'll get back to you after that. Thanks!

yumemio commented Dec 6, 2023

+1 on this. I'm evaluating an unquantized 7B model (stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b), but this eval script is eating 26GB of VRAM. Running inference with the same model using a bare-minimum transformers snippet consumes about 15GB. Does this script load anything other than the model itself onto the GPU?
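
For anyone comparing these numbers, here is roughly how the transformers-only figure can be reproduced (a minimal sketch, assuming a single CUDA device; the reported peak covers the resident float16 weights plus forward activations):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto"
)

# Reset peak stats after loading, run one forward pass, then read the peak.
# The peak includes the currently allocated weights plus forward activations.
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("test input", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")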

@dakotamahan-stability

Not sure which branch you're on, but

python main.py --model hf-causal-experimental --model_args pretrained=meta-llama/Llama-2-70b-chat-hf,dtype=float16,use_accelerate=True --no_cache --num_fewshot=25 --tasks arc_challenge

works just fine with 70B-parameter models on an A40 node.

yumemio commented Dec 7, 2023

Hi @dakotamahan-stability and thanks for the reply!

Sorry for the lack of information. I'm on the jp-stable branch (commit effdbea). Here's an example notebook that reproduces the issue:

Gist (example notebook)

I'm running this notebook on a Colab Pro+ VM. The eval script throws an OOM error when run with a V100 GPU (w/ 16.0 GB of VRAM):

Running loglikelihood requests
  0% 0/5595 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/lm-evaluation-harness/main.py", line 121, in <module>
    results = main(args, description_dict_path, output_path)
  File "/content/lm-evaluation-harness/main.py", line 96, in main
    results = evaluator.simple_evaluate(**eval_args)
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/evaluator.py", line 87, in simple_evaluate
    results = evaluate(
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/evaluator.py", line 287, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 980, in fn
    rem_res = getattr(self.lm, attr)(remaining_reqs)
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 193, in loglikelihood
    return self._loglikelihood_tokens(new_reqs)
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 303, in _loglikelihood_tokens
    self._model_call(batched_inps), dim=-1
  File "/content/lm-evaluation-harness/lm_eval/models/gpt2.py", line 120, in _model_call
    return self.gpt2(inps)[0]
...
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

What's strange is that the code below uses only around 14.3 GB of VRAM on the exact same machine:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Setup model and tokenizer
model_name = "stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")

def format_prompt(input_text):
    # System prompt: "You are a helpful assistant." / instruction: "Please answer the user's question."
    prompt_template = """<s>[INST] <<SYS>>\nあなたは役立つアシスタントです。\n<<SYS>>\n\nユーザの質問に答えてください。\n\n{input}[/INST]"""
    return prompt_template.format(input=input_text)

def generate_text(input_text):
    formatted_prompt = format_prompt(input_text)
    input_ids = tokenizer.encode(
        formatted_prompt,
        add_special_tokens=False,
        return_tensors="pt"
    )

    # Set seed for reproducibility
    seed = 23
    torch.manual_seed(seed)

    tokens = model.generate(
        input_ids.to(device=model.device),
        max_new_tokens=1024,
        temperature=0.99,
        top_p=0.95,
        do_sample=True,
    )

    # Remove the input tokens from the generated tokens before decoding
    output_tokens = tokens[0][len(input_ids[0]):]
    return tokenizer.decode(output_tokens, skip_special_tokens=True)

prompt = "もう冬ですね。最近は寝室が寒くて寝られません。どうすればいいですか?"  # "It's winter now. My bedroom has been too cold to sleep in lately. What should I do?"
generated_text = generate_text(prompt)
print(generated_text)

I'm wondering whether I've misconfigured the eval script, or whether the script is prefetching/preloading the dataset onto the GPU (which would make sense, given how short the prompt in the snippet above is).
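
One hunch: 26 GB is suspiciously close to 7 billion parameters × 4 bytes, i.e. float32 weights, whereas the ~14.3 GB above matches float16. Here is a small, hypothetical helper (not part of the harness) for checking what the eval script actually loads, e.g. called on its model object from a debugger or a temporary print:

import torch

def weight_footprint(model: torch.nn.Module) -> str:
    # Hypothetical debugging helper: summarize parameter count, dtypes,
    # and total weight memory of an already-loaded model.
    total_params = sum(p.numel() for p in model.parameters())
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    dtypes = {str(p.dtype) for p in model.parameters()}
    return (f"{total_params / 1e9:.1f}B params, dtypes={dtypes}, "
            f"weights ~{total_bytes / 1e9:.0f} GB")

# For the 7B model above, ~28 GB of weights would point to float32, ~14 GB to float16.

If it does turn out to be float32, passing dtype=float16 in --model_args (as in the command earlier in this thread) might be enough to close the gap.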
