
Unable to save checkpoints #50

Open
canamika27 opened this issue Jun 15, 2023 · 0 comments

Comments

@canamika27

Hi Team,

I was trying to fine-tune OpenLLaMA 7B on a 20 GB A100 with LoRA, using batch_size = 1 and max_seq_length = 256, but while saving checkpoints through Hugging Face `transformers.Trainer` I am getting a CUDA out-of-memory error.

From what I observed, the model plus a batch took about 10 GB of VRAM in total, and that stayed constant throughout training. But when the Trainer tries to save a checkpoint at a given step, it fails with CUDA OOM.
When I ran the same fine-tuning code on Meta's LLaMA 7B, it worked fine, and checkpoints were saved without any memory overhead.

As per #1 (comment), if OpenLLaMA 7B has the same model size and architecture as Meta's LLaMA 7B, why am I facing CUDA OOM? Ideally the behavior should be the same for both.
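For context, here is a rough back-of-the-envelope estimate of how tight a 20 GB card is for a 7B-parameter model. The numbers are illustrative, not measured from this repo, and the helper function is hypothetical; the point is that if checkpoint saving briefly materializes a second copy of the fp16 weights on the GPU, a 20 GB card has essentially no headroom.

```python
def model_mem_gb(n_params: int, bytes_per_param: int) -> float:
    """Rough memory needed just to hold the weights.

    Ignores activations, optimizer state, LoRA adapters, and CUDA
    allocator overhead, so real usage is higher.
    """
    return n_params * bytes_per_param / 1024**3


# 7B parameters at different precisions:
fp16_gb = model_mem_gb(7_000_000_000, 2)  # fp16: 2 bytes per param
int8_gb = model_mem_gb(7_000_000_000, 1)  # int8: 1 byte per param
print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```

Under these assumptions, fp16 weights alone are around 13 GB, so a transient extra copy during saving would push past 20 GB even with a steady-state footprint of 10 GB.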

It would be great if someone could look into this and help me out.
