I was trying to finetune OpenLLaMA-7B on a 20 GB A100 with LoRA, batch size = 1 and max_seq_length = 256, but when the checkpoints are saved through the Hugging Face `transformers.Trainer` I get a CUDA out-of-memory error.
From what I observed, the model and batch together took around 10 GB of VRAM, and usage stayed constant throughout training, but when the Trainer tries to save a checkpoint at a given step it fails with CUDA OOM.
When I ran the same finetuning code against Meta's LLaMA-7B, it worked fine and the checkpoints were saved without any memory overhead.
As per #1 (comment), if OpenLLaMA-7B has the same model size and architecture as Meta's LLaMA-7B, why am I hitting CUDA OOM? Ideally it should behave the same for both.
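One possible explanation (an assumption on my part, not something I have confirmed in the Trainer source): if the save path materialises an extra full copy of the base-model weights on the GPU, a 7B model easily blows past a 20 GB budget even though steady-state training only uses ~10 GB. A back-of-the-envelope calculation:

```python
def state_dict_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Size in GiB of one serialized copy of the weights."""
    return n_params * bytes_per_param / 1024**3

# An extra fp16 copy of a 7B-parameter model:
fp16_copy = state_dict_size_gb(7e9, 2)  # roughly 13 GiB
# An fp32 copy (e.g. if weights were upcast before saving):
fp32_copy = state_dict_size_gb(7e9, 4)  # roughly 26 GiB
```

Either copy, on top of the ~10 GB already resident, would exceed 20 GB, which matches an OOM that occurs only at checkpoint time.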
If anyone can look into this and help me out, I'd appreciate it.
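As a possible workaround, here is a sketch of saving only the trainable parameters instead of the full model (assuming a standard LoRA setup where only the adapter weights have `requires_grad=True`; the function name is mine, not a library API). Moving each tensor to CPU before serialisation means the save step allocates no extra GPU memory:

```python
import torch

def lora_only_state_dict(model: torch.nn.Module) -> dict:
    # Keep only the trainable parameters (with LoRA, the adapter weights);
    # detach().cpu() copies each tensor off the GPU so the subsequent
    # torch.save does not need any additional GPU memory.
    return {
        name: param.detach().cpu()
        for name, param in model.named_parameters()
        if param.requires_grad
    }

# usage: torch.save(lora_only_state_dict(model), "adapter_checkpoint.pt")
```

For 7B-scale models the adapter-only checkpoint is a few hundred MB at most, versus ~13 GB for a full fp16 state dict.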