
Make the TF dummies even smaller #24071

Merged: 6 commits into main on Jun 7, 2023

Conversation

@Rocketknight1 (Member)

cc @ydshieh - this will probably break some things, but if I can make it work it should reduce the memory usage during building a lot
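
For context, a minimal sketch of what the dummies do during building (an illustration only, not the library's actual build code; the model class, config, and token ids below are arbitrary):

```python
import tensorflow as tf
from transformers import BertConfig, TFBertModel

# Keras layers create their weights lazily, on the first call. The library
# therefore "builds" a freshly constructed model by running a tiny dummy
# batch through it once; the smaller that batch, the lower the peak memory
# used during building.
config = BertConfig()
model = TFBertModel(config)  # most weights are not allocated yet

# A minimal dummy batch of shape (1, 2): one sequence of two arbitrary tokens.
dummy_inputs = {
    "input_ids": tf.constant([[7, 6]], dtype=tf.int32),
    "attention_mask": tf.constant([[1, 1]], dtype=tf.int32),
}
_ = model(dummy_inputs)  # the first call creates the weights with tiny activations
```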

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@Rocketknight1 (Member, Author)

Adding @frostming's fix for Keras 2.13 to this PR as well
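
(For context: Keras 2.13 moved its internal modules under the keras.src namespace, so code that imports Keras internals needs a version-dependent fallback. The following is a sketch of the general pattern, not necessarily the exact change in this PR:)

```python
# Illustrative only: data_adapter is just an example of a Keras internal
# whose import path changed in Keras 2.13.
try:
    from keras.src.engine import data_adapter  # Keras >= 2.13
except ImportError:
    from keras.engine import data_adapter  # Keras <= 2.12
```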

Rocketknight1 marked this pull request as ready for review June 7, 2023 11:31
@Rocketknight1 (Member, Author)

Tests are passing now, pinging @ydshieh and @amyeroberts for a quick review! Unfortunately it's quite hard for me to test whether this PR will fix the entire memory issue in the overnight CI, but we'll try this fix, and if the memory issues remain I'll try some other things too.

@ydshieh (Collaborator) commented Jun 7, 2023

I can trigger a run.

@amyeroberts (Collaborator) left a comment

LGTM 👍 Thanks for iterating on this

For my own understanding: do we know why the new build logic is triggering the OOM issues (what's the difference from before)? Would these failures happen if just one model were being tested, or do they only happen after N models have been tested?

@ydshieh (Collaborator) commented Jun 7, 2023

It's more a matter of several processes running in parallel: on CircleCI, 8 pytest processes.

@ydshieh (Collaborator) commented Jun 7, 2023

I have the same question about why the memory usage increases that much. Previously, we didn't really use batch size 1 in the dummies, if I remember correctly.

@ydshieh (Collaborator) commented Jun 7, 2023

A run is triggered here.

If you need more changes and more runs to check, you can update the branch

https://github.com/huggingface/transformers/tree/run_even_lower_tf_dummy_memory

on top of this PR branch.

@Rocketknight1 (Member, Author)

@amyeroberts I don't have a very good intuition for this, actually. I think it's some combination of:

  • The test runners were already at 90%+ memory usage before any of these PRs, and the tests run in parallel, as @ydshieh said, which means small perturbations could push them over the limit.
  • The update changed the shapes of dummies a bit - they should be smaller on average, especially after this PR, but maybe they ended up a little larger for some high-memory models and that caused the issues.

It's also possible that the update sped up building, by removing unnecessary build ops left over from TF 1 and by not unnecessarily passing dummies when models were already built. Speeding up builds means tests spend more of their time in the actual model calls; if peak memory usage occurs during those calls and lots of tests run in parallel, then more tests being in their model calls simultaneously might result in higher peak memory usage for the test server.

This is all purely speculative on my part, though - I can't reproduce the problem locally and the nature of the parallel tests makes it hard to point to a single culprit for an OOM error!
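
As a purely illustrative sketch of the "don't pass dummies when the model is already built" point above (not the actual transformers implementation):

```python
import tensorflow as tf

def build_if_needed(model: tf.keras.Model, dummy_inputs) -> None:
    """Run the dummy batch through the model only if its weights don't exist yet.

    Illustrative helper, not part of the library: skipping the extra forward
    pass avoids allocating activations for models that are already built.
    """
    if getattr(model, "built", False):  # Keras sets `built` once weights exist
        return
    model(dummy_inputs)
```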

@Rocketknight1 (Member, Author)

@ydshieh the new run seems to be passing - there's an issue with one of the vit-mae tests that I can't reproduce locally and that doesn't seem related to this change, but I think this PR resolves most of the problems!

@ydshieh (Collaborator) commented Jun 7, 2023

@Rocketknight1 Unfortunately, it doesn't pass. We have to go to the Resource tab and check the memory usage.

[Screenshot: CircleCI memory usage, 2023-06-07 14:54]

And if you click "Download the full output as a file", you will see worker gw7 crashed and worker restarting disabled.

😢 😭

@ydshieh (Collaborator) commented Jun 7, 2023

Well, to be more sure, I can revert PR #23234 on another branch, so it would be main without that PR, and run the tests. The goal is to make sure no other PRs contribute to the OOM issue.

Do you want me to do this?

@Rocketknight1 (Member, Author)

No, I'm pretty confident that the change to the dummies is the cause!

@Rocketknight1 (Member, Author)

@ydshieh Can we reduce the number of parallel workers by 1 for these tests? I think the speed boost from these PRs (plus some others I have planned) should compensate for any slowdown, and it would be good to be able to make small changes without breaking fragile parallel tests like these.
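
(For illustration only: reducing the worker count amounts to lowering the -n value passed to pytest-xdist. A schematic invocation from Python, assuming pytest and pytest-xdist are installed and using a hypothetical test path, rather than the real CircleCI config:)

```python
import pytest

# Run the suite with 7 parallel workers instead of 8, leaving each worker a
# little more memory headroom. "tests/models" is an example path.
exit_code = pytest.main(["-n", "7", "tests/models", "-q"])
```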

@ydshieh (Collaborator) commented Jun 7, 2023

Let me open a PR for that :-)

ydshieh mentioned this pull request Jun 7, 2023
@ydshieh (Collaborator) left a comment

OK

@sgugger (Collaborator) left a comment

Thanks!

@Rocketknight1 (Member, Author)

(rebasing onto @ydshieh's PR to test everything in combination)

Rocketknight1 merged commit 1fc832b into main Jun 7, 2023
Rocketknight1 deleted the even_lower_tf_dummy_memory branch June 7, 2023 15:23
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* Let's see if we can use the smallest possible dummies

* Make GPT-2's dummies a little longer

* Just use (1,2) as the default shape

* Update other dummies in sync

* Correct imports for Keras 2.13

* Shrink the Wav2Vec2 dummies