[Templates] Fine-tuning template 04 OOMs due to GPU memory leak when using LoRA + V100s #40714
Comments
Some related issues found online:
+1, we are facing the same issue on a node with 4x A10 GPUs (AWS g5.12xlarge).
@mak-454 Did you use Ray? Or LoRA + DeepSpeed only?
@woshiyyya I am using https://github.com/ray-project/ray/blob/master/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py
@woshiyyya It seems to be going fine and GPU utilization seems to be stable after commenting out the line.
@mak-454 Interesting! Let us repro it on our side and get back to you.
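One common cause of linear per-step GPU memory growth in PyTorch training loops (offered here only as a hedged illustration; it is not confirmed to be the cause in this template) is accumulating the loss tensor itself, which keeps autograd history alive across steps. A minimal sketch of the pattern, where `model`, `optimizer`, and `dataloader` are illustrative names and not taken from finetune_hf_llm.py:

```python
import torch

# Illustrative loop fragment; `model`, `optimizer`, and `dataloader` are
# assumed to exist and are not taken from the template.
running_loss = 0.0
for step, batch in enumerate(dataloader):
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()

    # Leaky pattern: `running_loss += loss` keeps a reference to each step's
    # autograd history, so GPU memory grows with the number of steps.
    # Converting to a Python float before accumulating avoids that.
    running_loss += loss.item()
```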
What happened + What you expected to happen
Fine-tuning template 04 OOMs under specific circumstances.
Start an AWS or GCP V100 node.
Deploy the template.
Run ./run_llama_ft.sh --size=7b --lora. This will OOM after a while once GRAM (GPU memory) fills up. GRAM usage grows linearly with training steps, strongly suggesting a memory leak.
Attempts to reproduce this on AWS p4de or g5 instances failed, so this appears to be more or less specific to V100s.
Screenshot shows GRAM usage (orange) on runs with different context lengths (8, 4, 1, in that order) with LoRA on a V100. (The final run, with batch size 1, grows to roughly 100% before crashing.)
This is how it looks on some A100s, and how it should look: no linear increase, just a flat GRAM curve (orange).
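One way to quantify the growth (a hedged sketch, not part of the template) is to log PyTorch's CUDA memory counters once per training step; a roughly linear increase in allocated memory across steps is consistent with a leak rather than allocator fragmentation.

```python
import torch

def log_gpu_memory(step: int) -> None:
    """Print allocated and reserved CUDA memory (GiB) for the current device."""
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"step {step}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Call log_gpu_memory(step) once per step inside the training loop. On the
# failing V100 runs, `allocated` would be expected to climb steadily with the
# step count, while on the healthy A100 runs it should stay roughly flat.
```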
Versions / Dependencies
master
https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed
Reproduction script
See the reproduction steps under "What happened" above.
Issue Severity
None