Description
Problem:
In the Llama 2 fine-tuning example located here: https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py
Running the 7B model with LoRA and ZeRO-3 enabled on 4x A100 GPUs, I run out of GPU memory at around iteration 215 with a batch size of 8.
I use a variation of the code provided here, but wanted to point this out so that others don't run into it.
The problem is in the loss aggregation step here:
loss_sum += loss
This consistently increases GPU memory usage every iteration, because the loss is accumulated as a CUDA tensor that still carries autograd history, rather than as a plain number.
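For context, the accumulation in the training loop looks roughly like this (a simplified sketch, not the template's exact code; model and train_dataloader stand in for the objects used there):

```python
import torch

loss_sum = torch.tensor(0.0, device="cuda")

for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss          # CUDA tensor with requires_grad=True
    model.backward(loss)
    model.step()

    # loss still carries a grad_fn, so this in-place add keeps autograd
    # history chained onto loss_sum; the retained references grow a
    # little every iteration until the GPU eventually runs out of memory.
    loss_sum += loss
```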
My environment:
4x A100 GPUs on a single g5.12xlarge node in AWS
Proposed Solution:
After changing loss to loss.item(), my issue went away. (This of course requires changes to how loss_sum is used down the line.)
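A minimal sketch of the change, assuming loss_sum can simply become a Python float (if a tensor is still needed later, for example for a distributed all-reduce, loss.detach() would drop the autograd history while keeping it on the GPU):

```python
loss_sum = 0.0

for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    model.backward(loss)
    model.step()

    # .item() copies the scalar to the CPU and detaches it from the
    # autograd graph, so per-iteration GPU memory stays flat.
    loss_sum += loss.item()
```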
Link
https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py