Description
Problem:
In the Llama 2 fine-tuning example located here: https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py
Running the 7B model with LoRA and ZeRO-3 enabled on 4x A100 GPUs, I run out of GPU memory at around iteration 215 with a batch size of 8.
I use a variation of the code provided here, but wanted to point this out so that others don't run into it.
The problem is in the loss aggregation step here:
loss_sum += loss
This consistently increases GPU memory usage every iteration, because the loss is accumulated as a CUDA tensor that still carries autograd history, rather than as a plain number.
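For context, the accumulation in the training loop looks roughly like this (a simplified sketch, not the template's exact code; model and train_dataloader stand in for the objects used there):

```python
import torch

loss_sum = torch.tensor(0.0, device="cuda")

for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss          # CUDA tensor with requires_grad=True
    model.backward(loss)
    model.step()

    # loss still carries a grad_fn, so this in-place add keeps autograd
    # history chained onto loss_sum; the retained references grow a
    # little every iteration until the GPU eventually runs out of memory.
    loss_sum += loss
```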
My environment:
4x A100 GPUs on a single g5.12xlarge node in AWS
Proposed Solution:
After changing loss to loss.item(), my issue went away. (This of course requires changes to how loss_sum is used down the line.)
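A minimal sketch of the change, assuming loss_sum can simply become a Python float (if a tensor is still needed later, for example for a distributed all-reduce, loss.detach() would drop the autograd history while keeping it on the GPU):

```python
loss_sum = 0.0

for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    model.backward(loss)
    model.step()

    # .item() copies the scalar to the CPU and detaches it from the
    # autograd graph, so per-iteration GPU memory stays flat.
    loss_sum += loss.item()
```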
Link
https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py