[BUG] ZeRO3 - GPU memory leakage during backward operation while training a Huggingface PEFT model #3378
Comments
@suri-kunal, thanks for reporting this issue. Can you please add "memory_efficient_linear": false to the "zero_optimization" section of ds_config?
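For reference, here is roughly where that flag sits if the config is built as a Python dict before being passed to deepspeed.initialize (only the relevant keys are shown; the other sections stay unchanged):

ds_config = {
    # ... existing "optimizer", "scheduler", "fp16" sections unchanged ...
    "zero_optimization": {
        "stage": 3,
        # new flag: turn off the memory-efficient linear path under ZeRO-3
        "memory_efficient_linear": False,
        # ... rest of the existing zero_optimization settings ...
    },
}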
@suri-kunal, can you please open a new issue for this and share the stack trace there? Thanks!
I attached these graphs to prove that GPU memory was indeed increasing.
@suri-kunal, my request for this is because
I am getting the following error -
Traceback (most recent call last):
File "Task A - Summarization - Sweep with Deepspeed wo wandb.py", line 580, in <module>
main()
File "Task A - Summarization - Sweep with Deepspeed wo wandb.py", line 570, in main
training_loop(model_name, \
File "Task A - Summarization - Sweep with Deepspeed wo wandb.py", line 493, in training_loop
train_summarization(ds_config, \
File "Task A - Summarization - Sweep with Deepspeed wo wandb.py", line 252, in train_summarization
model_engine, _, train_dl, _ = deepspeed.initialize(model=model_zero_init,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 272, in __init__
self._configure_with_arguments(args, mpu)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1010, in _configure_with_arguments
self._config = DeepSpeedConfig(self.config, mpu)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/config.py", line 813, in __init__
self._initialize_params(copy.copy(self._param_dict))
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/config.py", line 832, in _initialize_params
self.zero_config = get_zero_config(param_dict)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/config.py", line 67, in get_zero_config
return DeepSpeedZeroConfig(**zero_config_dict)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/config_utils.py", line 62, in __init__
super().__init__(**data)
File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for DeepSpeedZeroConfig
memory_efficient_linear
extra fields not permitted (type=value_error.extra)
[2023-04-25 18:50:33,573] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143
[2023-04-25 18:50:33,574] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'Task A - Summarization - Sweep with Deepspeed wo wandb.py', '--local_rank=0'] exits with return code = 1
The new ds_config is -
{
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": 0,
"warmup_type": "linear"
}
},
"optimizer": {
"type": "Adam",
"params": {
"betas": [
0.9,
0.999
]
}
},
"fp16": {
"enabled": true,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_gather_16bit_weights_on_model_save": true,
"memory_efficient_linear": false
}
}
Can you please use the latest DeepSpeed?
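A quick way to confirm whether the installed DeepSpeed already knows this option (a minimal sketch; it assumes a recent release where DeepSpeedZeroConfig exposes memory_efficient_linear):

import deepspeed
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig

print(deepspeed.__version__)
# On an older release this raises the same pydantic "extra fields not permitted" error;
# on a recent release it constructs cleanly with the flag set.
cfg = DeepSpeedZeroConfig(stage=3, memory_efficient_linear=False)
print(cfg.memory_efficient_linear)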
Issue resolved!! Thanks for creating this library. Could you please look into #3377 as well?
@tjruwase, @tohtana I am sorry but the issue still seems to persist. I have even applied the suggested setting. The 'glamorous-sweep-1' graph shows ZeRO 3 with memory_efficient_linear set to True, and 'expert-sweep-1' shows ZeRO 3 with memory_efficient_linear set to False. Surprisingly, GPU utilization remains the same for both, as shown in the following chart -
@suri-kunal, thanks for sharing this update. We will investigate further.
Any update on this issue? |
@suri-kunal I'm trying to reproduce the problem but haven't succeeded. One thing you may need to fix is that you call
Closing this issue because we don't have an update. Feel free to reopen if you still have this issue. |
@stas00, @tjruwase - Tagging you here since I have seen you working on ZeRO3 extensively. Apologies if I shouldn't do this.
Describe the bug
I am fine-tuning a LoRA model on top of BioBART-v2-base using DeepSpeed and the Huggingface PEFT library on a T4 instance. I am not using the Huggingface Trainer class, as I wanted to learn how to integrate DeepSpeed with arbitrary code. To benchmark how different ZeRO configurations behave, I ran the code with the following configurations -
Baseline -
ZeRO 2 -
and ZeRO 3 -
Training learning curves match perfectly for Baseline and ZeRO 2, but I am getting
RuntimeError: CUDA out of memory
when I try to use ZeRO 3.
To Reproduce
Steps to reproduce the behavior:
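Roughly, the setup follows the pattern below (a minimal sketch rather than the actual script; the model id, LoRA hyperparameters, dataset, and cut-down ds_config are placeholders):

import deepspeed
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

class DummySummarizationDataset(Dataset):
    # Placeholder standing in for the real tokenized summarization dataset.
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {
            "input_ids": torch.ones(32, dtype=torch.long),
            "attention_mask": torch.ones(32, dtype=torch.long),
            "labels": torch.ones(32, dtype=torch.long),
        }

# Placeholder hub id for BioBART-v2-base; the real script loads the actual checkpoint.
base_model = AutoModelForSeq2SeqLM.from_pretrained("GanjinZero/biobart-v2-base")
lora_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(base_model, lora_config)

# Cut-down stand-in for the full ds_config shown elsewhere in this thread.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 3e-4}},
    "zero_optimization": {"stage": 3, "memory_efficient_linear": False},
}

# DeepSpeed wraps the model and builds the training dataloader.
model_engine, _, train_dl, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    training_data=DummySummarizationDataset(),
    config=ds_config,
)

for batch in train_dl:
    batch = {k: v.to(model_engine.device) for k, v in batch.items()}
    loss = model_engine(**batch).loss
    model_engine.backward(loss)  # GPU memory growth is observed around this call under ZeRO-3
    model_engine.step()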
StackTrace -
Expected behavior
GPU Utilization should not increase
ds_report output
Please run ds_report to give us details about your setup.
Screenshots
As you can see, GPU usage with ZeRO 3 keeps increasing compared to ZeRO 2. I tried using model_engine.empty_partition_cache() as well, but I got an error that the empty_partition_cache attribute doesn't exist for model_engine.
ZeRO3 GPU Usage -
ZeRO2 GPU Usage -
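For what it's worth, a guarded version of that call looks like this (a small sketch that assumes the model_engine returned by deepspeed.initialize; empty_partition_cache only exists on newer DeepSpeed releases, so the hasattr check avoids the AttributeError):

import torch

# Release gathered ZeRO-3 parameter partitions if the engine supports it,
# then clear the CUDA caching allocator as a fallback.
if hasattr(model_engine, "empty_partition_cache"):
    model_engine.empty_partition_cache()
torch.cuda.empty_cache()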
System info (please complete the following information):
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Docker context
Are you using a specific docker image that you can share?
Additional context
Add any other context about the problem here.