
Unable to save checkpoints #50

Open
canamika27 opened this issue Jun 15, 2023 · 0 comments

Comments

@canamika27

Hi Team,

I was trying to fine-tune OpenLLaMA 7B on a 20 GB A100 with LoRA, using batch_size = 1 and max_seq_length = 256, but while saving checkpoints through Hugging Face `transformers.Trainer` I am getting a CUDA out-of-memory error.

From what I observed, the model plus a batch took about 10 GB of VRAM in total, and that stayed constant throughout training. But when the Trainer tries to save a checkpoint at a given step, it fails with CUDA OOM.
When I ran the same fine-tuning code on Meta's LLaMA 7B, it worked fine, and checkpoints were saved without any memory overhead.

As per #1 (comment), if OpenLLaMA 7B has the same model size and architecture as Meta's LLaMA 7B, why am I facing CUDA OOM? Ideally the behavior should be the same for both.
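For context, here is a rough back-of-the-envelope estimate of how tight a 20 GB card is for a 7B-parameter model. The numbers are illustrative, not measured from this repo, and the helper function is hypothetical; the point is that if checkpoint saving briefly materializes a second copy of the fp16 weights on the GPU, a 20 GB card has essentially no headroom.

```python
def model_mem_gb(n_params: int, bytes_per_param: int) -> float:
    """Rough memory needed just to hold the weights.

    Ignores activations, optimizer state, LoRA adapters, and CUDA
    allocator overhead, so real usage is higher.
    """
    return n_params * bytes_per_param / 1024**3


# 7B parameters at different precisions:
fp16_gb = model_mem_gb(7_000_000_000, 2)  # fp16: 2 bytes per param
int8_gb = model_mem_gb(7_000_000_000, 1)  # int8: 1 byte per param
print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```

Under these assumptions, fp16 weights alone are around 13 GB, so a transient extra copy during saving would push past 20 GB even with a steady-state footprint of 10 GB.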

It would be great if someone could look into this and help me out.
