CUDA error: device-side assert triggered while training Marian MT #14798
Comments
It's sadly impossible for us to reproduce this error given the message above. From the error message, I'm quite sure that you are using a sequence length that is too long. Could you make sure you cut the input sequences to the maximum length of Marian?
@patrickvonplaten Thanks for the reply. From the logs, I tried to print the length of the input_ids from the batch the training fails on:
and it prints:
The max length in the config of this model is 512. Could you recommend whether there is a flag to enforce this length, or should I preprocess my data to a certain length? Thanks again for the help :)
Could you try to simply add:
to your command for this input:
It is set to 1024 by default.
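The truncation discussed above can be sketched in plain Python. This is only an illustration, not the script's own preprocessing: `truncate_example` is a hypothetical helper, the 512 limit is the config value quoted in this thread, and `eos_id=0` is an assumption about the end-of-sequence id that should be checked against the tokenizer in use.

```python
from typing import Dict, List

MAX_LEN = 512  # max length quoted from the model config in this thread


def truncate_example(
    example: Dict[str, List[int]],
    max_len: int = MAX_LEN,
    eos_id: int = 0,  # assumption: verify against tokenizer.eos_token_id
) -> Dict[str, List[int]]:
    """Clip every token-id list to max_len, ending clipped lists with EOS."""
    clipped = {}
    for key, ids in example.items():
        if len(ids) > max_len:
            # Keep the first max_len - 1 tokens and re-append EOS so the
            # sequence still terminates properly.
            ids = ids[: max_len - 1] + [eos_id]
        clipped[key] = list(ids)
    return clipped


example = {"input_ids": list(range(1, 601)), "labels": [5, 6, 0]}
print({k: len(v) for k, v in truncate_example(example).items()})
```

In practice the same effect is usually obtained by passing `truncation=True` together with `max_length` when calling a Hugging Face tokenizer, rather than clipping ids by hand.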
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: transformers-4.13.0.dev0
Who can help
@patrickvonplaten
@sgugger
@LysandreJik
Information
Model I am using (Bert, XLNet ...):
The problem arises when using:
The tasks I am working on is:
Training marianMT on EMEA custom dataset
To reproduce
Steps to reproduce the behavior:
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 621, in <module>
main()
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 538, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1471, in train
self._total_loss_scalar += tr_loss.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fe03f1d3e1c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x125 (0x7fe042e6d345 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fe042e704e8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7fe042e71df8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xcc9d4 (0x7fe0d47a29d4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x9609 (0x7fe0d6295609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe0d6055293 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Debugging Logs:
print(inputs['labels'].shape) : torch.Size([8, 94])
print(inputs['input_ids'].shape) : torch.Size([8, 70])
print(inputs['decoder_input_ids'].shape) : torch.Size([8, 94])
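The assertion `srcIndex < srcSelectDimSize` fires when an index used in a lookup is not smaller than the size of the dimension being indexed. For an embedding-based model, the usual candidates are a token id at or above the vocabulary size, or a sequence longer than the position table. A minimal sketch of a pre-flight check for both follows. `find_out_of_range` is a hypothetical helper, the batch is represented as plain nested lists rather than tensors, and `vocab_size` and `max_positions` are assumed to be read from the model config by the caller.

```python
from typing import Dict, List, Sequence


def find_out_of_range(
    batch: Dict[str, Sequence[Sequence[int]]],
    vocab_size: int,
    max_positions: int,
    ignore_id: int = -100,  # label value conventionally skipped by the loss
) -> List[str]:
    """Report rows that are too long and token ids outside [0, vocab_size)."""
    problems = []
    for name, rows in batch.items():
        for r, row in enumerate(rows):
            if len(row) > max_positions:
                problems.append(f"{name}[{r}]: length {len(row)} > {max_positions}")
            for c, tok in enumerate(row):
                if name == "labels" and tok == ignore_id:
                    continue  # padding marker in labels, never embedded
                if not 0 <= tok < vocab_size:
                    problems.append(f"{name}[{r}][{c}]: id {tok} outside [0, {vocab_size})")
    return problems


# Toy batch with one over-long row and one id equal to the vocabulary size.
toy = {"input_ids": [[1, 2, 5]], "labels": [[-100, 3]]}
for line in find_out_of_range(toy, vocab_size=5, max_positions=2):
    print(line)
```

With real tensors, the rows could be obtained with `.tolist()` before calling the helper. Re-running the failing step on CPU, or with the environment variable `CUDA_LAUNCH_BLOCKING=1`, generally gives a traceback that points at the offending lookup instead of a later, unrelated line.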
Expected behavior
Training of model complete