LibriSpeech Conformer Workload OOMs or NCCL Errors When Run With Multiple Trials #663
Comments
What is the version of the GPU driver? We had a similar issue a few months ago: #497
Our error is deterministically reproducible. Here is the nvidia-smi output in the Docker container:
Ok, I see; sorry I didn't catch that it was consistently OOMing. We can try to clear the cache between trials. @chandramouli-sastry @pomonam @msaroufim do you have any other ideas?
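For concreteness, a minimal sketch of what clearing cached GPU memory between trials could look like; the helper name and the idea of calling it after each trial finishes are assumptions, not the repo's actual fix:

```python
import gc

import torch


def reset_cuda_memory():
    """Best-effort release of GPU memory held over from the previous trial (illustrative)."""
    gc.collect()  # drop Python references to tensors from the finished trial
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
        torch.cuda.reset_peak_memory_stats()  # restart peak-memory bookkeeping
```

Note that `torch.cuda.empty_cache()` only releases blocks the caching allocator no longer holds references to, so anything still reachable from the previous trial has to be garbage-collected first (hence the `gc.collect()` call).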
Thank you for the suggestion! We've confirmed that adding
This trial corresponds to the following hyperparameter set:
We observed that changing
Ok, thanks for the update. I will make a fix to reset the CUDA memory in our repo. Also, I assume you're running these to produce some baseline logs? I checked in logs on our dev branch for under
Thank you so much! Yes, we wanted to sanity-check the baseline on our setup. Thank you for pointing us towards the logs; they will indeed be helpful :)
It seems like @pomonam was unable to reproduce this issue with dropout.
Sorry for the delayed response; unfortunately, we've found that while
We consistently observe an OOM error when running one of the NAdamW baselines on LibriSpeech Conformer with multiple trials in PyTorch on 8 V100s with 16GB each. This run is for the external ruleset. The first trial runs through successfully, but any subsequent trial will OOM.
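For reference, the kind of per-trial check that makes this growth visible is logging the CUDA allocator statistics at trial boundaries; a minimal sketch (the helper and its call site in the runner are illustrative assumptions, not part of the workload code):

```python
import torch


def log_trial_memory(trial_index: int) -> None:
    """Print CUDA allocator statistics at a trial boundary (illustrative only)."""
    gib = 1024 ** 3
    print(
        f"[trial {trial_index}] "
        f"allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
        f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB, "
        f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB"
    )
```

If the reserved memory does not return to roughly zero after the first trial finishes, the second trial starts with far less than 16GB to work with, which matches the failure pattern described here.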
If we try to resume a multi-trial run, we will observe an NCCL error. This occurs even if we delete the `trial_2` folder (while the `trial_1` folder remains intact).

Description
As discussed above, we observe an OOM error when running LibriSpeech Conformer with the NAdamW baseline with multiple trials on 8 V100s with 16GB each. This is an example of the OOM we observe on a subsequent trial:
Alternatively, if we try to resume the multi-trial run, we will observe the NCCL error:
cc @anana10c @mikerabbat @tsunghsienlee @yuchenhao @shintaro-iwasaki
Steps to Reproduce
In the Docker container, run:
Source or Possible Fix
We are not aware of a possible fix for this issue. We suspect there may be a memory leak in the PyTorch LibriSpeech Conformer workload. Please let us know how to proceed. Thanks in advance!
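One way to substantiate the suspected leak is to enumerate the CUDA tensors that are still reachable after a trial has finished; a minimal sketch, assuming it is run between trials on each rank (the helper is illustrative, not part of the workload code):

```python
import gc

import torch


def report_lingering_cuda_tensors() -> None:
    """List CUDA tensors still reachable after a trial; a large total suggests a leak."""
    total_bytes = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                total_bytes += obj.element_size() * obj.nelement()
        except Exception:
            # Some tracked objects raise on attribute access; skip them.
            continue
    print(f"Reachable CUDA tensor memory: {total_bytes / 1024 ** 3:.2f} GiB")
    print(torch.cuda.memory_summary(abbreviated=True))
```

If this total stays high after `trial_1` completes, something in the workload or the trial loop is keeping references to the previous trial's parameters, activations, or optimizer state alive.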