Add torch_empty_cache_steps to TrainingArguments #31546

aliencaocao · 2024-06-22T05:41:19Z

What does this PR do?

Following up on #31530, this adds an optional TrainingArguments field, torch_empty_cache_steps, that calls torch.cuda.empty_cache() in the training loop every N steps, so that those who need it to avoid OOMs can use it.
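For readers who want to see the idea in isolation, here is a minimal sketch of emptying the cache every N steps. It is not the code from this PR: `run_training`, `train_step`, and `empty_cuda_cache` are hypothetical names made up for illustration, and the only real PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.empty_cache()`. The import is guarded so the sketch also runs on a machine without PyTorch or without a GPU.

```python
from typing import Callable, Optional


def empty_cuda_cache() -> None:
    """Release cached, unused GPU memory if PyTorch with CUDA is available."""
    try:
        import torch
    except ImportError:
        return  # PyTorch is not installed: nothing to do
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_training(
    num_steps: int,
    train_step: Callable[[int], None],
    torch_empty_cache_steps: Optional[int] = None,
    empty_cache: Callable[[], None] = empty_cuda_cache,
) -> int:
    """Run `num_steps` steps and return how many times the cache was emptied.

    Hypothetical stand-in for a training loop; the real Trainer may differ.
    """
    if torch_empty_cache_steps is not None and torch_empty_cache_steps <= 0:
        raise ValueError("torch_empty_cache_steps must be a positive integer or None")
    emptied = 0
    for step in range(1, num_steps + 1):
        train_step(step)
        # Only empty the cache when the option is set and the step is a multiple of N.
        if torch_empty_cache_steps is not None and step % torch_empty_cache_steps == 0:
            empty_cache()
            emptied += 1
    return emptied
```

Leaving `torch_empty_cache_steps` as `None` keeps the loop unchanged, which matches the intent that the behaviour is opt-in. Emptying the cache has a cost, so a larger N trades lower overhead for a higher peak in reserved memory.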

I also added it to the docs on reducing VRAM usage and changed the table for mixed precision training a bit. Let me know if you would prefer that change to be isolated in another PR.

Not sure if a test is needed: we can't really 'verify' that it is being called other than by monitoring VRAM usage?
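One possible way to check the call without measuring VRAM is to pass in a spy and count invocations. The sketch below assumes the cache-emptying function can be injected; `train` and its parameters are hypothetical and do not reflect how the Trainer is actually structured. It uses `unittest.mock.Mock` from the standard library.

```python
from typing import Callable, Optional
from unittest.mock import Mock


def train(num_steps: int, empty_cache: Callable[[], None], every_n: Optional[int] = None) -> None:
    """Hypothetical loop: call `empty_cache` after every `every_n`-th step."""
    for step in range(1, num_steps + 1):
        # A real training step would run here.
        if every_n is not None and step % every_n == 0:
            empty_cache()


# The spy records each call, so the test does not depend on a GPU being present.
spy = Mock()
train(num_steps=12, empty_cache=spy, every_n=5)
assert spy.call_count == 2  # called after steps 5 and 10
```

This only checks that the hook fires on schedule; it says nothing about how much memory is actually released.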

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@muellerzr @amyeroberts

amyeroberts

Thanks for adding this!

Overall I think this looks OK, but let's get @muellerzr's opinion

src/transformers/training_args.py

muellerzr

Thanks, overall this looks good to me, bar one nit for our docs!

src/transformers/training_args.py

HuggingFaceDocBuilderDev · 2024-07-01T19:19:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts

Looks great - thanks for adding!

muellerzr · 2024-07-03T11:03:12Z

Test failures look unrelated, so we're okay to merge here I think.

src/transformers/trainer.py

muellerzr

Well done! Thanks!

aliencaocao added 3 commits June 22, 2024 13:33

Add torch_empty_cache_steps to TrainingArguments

b20828d

Fix formatting

ee04109

Add torch_empty_cache_steps to docs on single gpu training

32d0c04

amyeroberts reviewed Jun 23, 2024

src/transformers/training_args.py (outdated, resolved)

aliencaocao added 2 commits June 25, 2024 00:35

Remove check for torch_empty_cache_steps <= max_steps

e9299ba

Merge branch 'refs/heads/main' into empty-cache-arg

d465098

muellerzr approved these changes Jul 1, 2024

src/transformers/training_args.py (outdated, resolved)

Captalize Tip

c24bd3d

amyeroberts approved these changes Jul 2, 2024

muellerzr requested changes Jul 4, 2024

src/transformers/trainer.py (outdated, resolved)

Be device agnostic

ca4107c

muellerzr approved these changes Jul 4, 2024

Fix linting

fcae626

muellerzr merged commit 43ffb78 into huggingface:main Jul 4, 2024
21 checks passed

aliencaocao deleted the empty-cache-arg branch July 4, 2024 22:19
