
CUDA memory increase with example code for llama2 training #40984

Closed
dtimokhin12 opened this issue Nov 6, 2023 · 1 comment
Labels: docs, triage

Comments

@dtimokhin12

Description

Problem:
In the llama2 example located here: https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py

Running with LoRA and ZeRO-3 enabled, using the 7B model on 4x A100 GPUs, the GPUs run out of memory at around iteration ~215 with a batch size of 8.

I use a variation of the code provided there, but wanted to point this out so that others don't run into it. The problem is in the loss aggregation step:

loss_sum += loss

This consistently increases GPU memory every iteration, because the loss is accumulated as a CUDA tensor that is still attached to the autograd graph, so each iteration's graph is retained instead of being freed.
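
To illustrate, here is a minimal, self-contained sketch (illustrative model and shapes, not the template code) of why accumulating the raw loss tensor grows GPU memory:

```python
# Each loss carries a grad_fn, so the running sum keeps every past
# iteration's autograd graph (and its saved activations) alive on the GPU.
import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device)
loss_sum = torch.zeros((), device=device)

for step in range(5):
    x = torch.randn(1024, 4096, device=device)
    loss = model(x).pow(2).mean()      # tensor attached to the autograd graph
    loss_sum += loss                   # retains the graph of every past step
    print(step, f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated")
```

The printed allocation grows every step even though the batch size is constant.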

My environment:
4x A100 GPUs, using 1 node of type g5.12xlarge in AWS

Proposed Solution:
After changing loss to loss.item(), my issue went away. (This of course requires changes to how loss_sum is used down the line.)
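
For reference, a sketch of the accumulation with the fix applied (model, optimizer, and train_loader are placeholders for the corresponding objects in the template, so this is illustrative rather than the exact code):

```python
loss_sum = 0.0                           # plain Python float, not a CUDA tensor
for step, batch in enumerate(train_loader):
    loss = model(**batch).loss
    loss.backward()                      # backward still uses the attached tensor
    optimizer.step()
    optimizer.zero_grad()
    loss_sum += loss.item()              # copies the scalar; the graph can be freed
mean_loss = loss_sum / (step + 1)        # downstream usage now operates on floats
```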

Link

https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py

dtimokhin12 added the docs and triage labels on Nov 6, 2023
@matthewdeng
Contributor

Hey @dtimokhin12, thanks for creating this issue and sharing your solution with the community!

I'm going to mark this as a duplicate of #40714. This will be fixed by #40940.
