
CUDA error: device-side assert triggered while training Marian MT #14798

Closed

manchandasahil opened this issue Dec 16, 2021 · 4 comments
@manchandasahil

manchandasahil commented Dec 16, 2021

Environment info

  • transformers version: transformers-4.13.0.dev0
  • Platform: linux
  • Python version: 3.8
  • PyTorch version (GPU?): torch==1.11.0a0+b6df043 GPU
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: one node multigpu

Who can help

@patrickvonplaten
@sgugger
@LysandreJik

Information

Model I am using (Bert, XLNet ...): MarianMT (Helsinki-NLP/opus-mt-en-es)

The problem arises when using:

  • the official example scripts: (give details below)
    → NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)
    Training MarianMT on a custom EMEA dataset

To reproduce

Steps to reproduce the behavior:

  1. Clone the latest transformers repo
  2. /opt/conda/bin/python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 /data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py --train_file /data/atc_tenant/NMT/smancha5/EMEA.en-es.train.json --model_name_or_path Helsinki-NLP/opus-mt-en-es --do_train --source_lang=en --target_lang=es --output_dir=/data/atc_tenant/NMT/model1/ --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --cache_dir=/data/atc_tenant/NMT/cache/

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.

Traceback (most recent call last):
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 621, in
main()
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 538, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1471, in train
self._total_loss_scalar += tr_loss.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fe03f1d3e1c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x125 (0x7fe042e6d345 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fe042e704e8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7fe042e71df8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xcc9d4 (0x7fe0d47a29d4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x9609 (0x7fe0d6295609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe0d6055293 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Debugging Logs:
print(inputs['labels'].shape) : torch.Size([8, 94])
print(inputs['input_ids'].shape) : torch.Size([8, 70])
print(inputs['decoder_input_ids'].shape) : torch.Size([8, 94])
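
For reference, a minimal debugging sketch (a hypothetical helper, assuming access to the MarianMT `model` built in run_translation.py and the failing `inputs` batch) that checks whether any lookup index is out of range for the model's embedding tables, which is what the indexSelectLargeIndex assert above points at:

```python
def check_batch(inputs, model):
    """Hypothetical debugging helper: verify that every lookup index in the
    batch fits the model's embedding tables (token ids vs. vocab size,
    sequence length vs. positional table size)."""
    vocab_size = model.config.vocab_size
    max_pos = model.config.max_position_embeddings
    for key in ("input_ids", "decoder_input_ids", "labels"):
        if key not in inputs:
            continue
        ids = inputs[key]
        valid = ids[ids >= 0]  # drop the -100 padding used for labels
        print(key, "max id:", int(valid.max()), "/ vocab:", vocab_size,
              "| seq len:", ids.shape[-1], "/ max positions:", max_pos)
        assert int(valid.max()) < vocab_size, f"{key} contains an out-of-range token id"
        assert ids.shape[-1] <= max_pos, f"{key} is longer than max_position_embeddings"
```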

Expected behavior

Training of the model completes.

@patrickvonplaten
Contributor

It's sadly impossible for us to reproduce this error given the message above. From the error message, I'm quite sure that you are using a sequence length which is too long. Could you make sure you cut the input sequences to the maximum length of Marian?
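
For example, truncating with the tokenizer directly (a minimal sketch, not the exact preprocessing in run_translation.py; the 512 here is an assumed Marian limit):

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# truncation=True cuts every sequence down to max_length before it reaches the model
batch = tokenizer(
    ["a very long English sentence ..."],
    max_length=512,
    truncation=True,
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # second dimension is now at most 512
```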

@manchandasahil
Author

manchandasahil commented Dec 16, 2021

@patrickvonplaten Thanks for the reply. From the logs, I printed the shapes of the tensors in the batch that training fails on, and they are:
print(inputs['labels'].shape) : torch.Size([8, 94])
print(inputs['input_ids'].shape) : torch.Size([8, 70])
print(inputs['decoder_input_ids'].shape) : torch.Size([8, 94])

The max length in the config of this model is 512. Is there a flag to enforce this length, or should I preprocess my data to a certain length?

Thanks again for the help :)

@patrickvonplaten
Contributor

Could you try to simply add:

--max_source_length 512

to your command? It corresponds to this argument in run_translation.py:

max_source_length: Optional[int] = field(

which is set to 1024 by default.
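
For reference, the launch command from the reproduction steps with only that flag appended (everything else unchanged) would be:

```bash
/opt/conda/bin/python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 \
  /data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py \
  --train_file /data/atc_tenant/NMT/smancha5/EMEA.en-es.train.json \
  --model_name_or_path Helsinki-NLP/opus-mt-en-es \
  --do_train --source_lang=en --target_lang=es \
  --output_dir=/data/atc_tenant/NMT/model1/ \
  --per_device_train_batch_size=8 --per_device_eval_batch_size=4 \
  --overwrite_output_dir --predict_with_generate \
  --cache_dir=/data/atc_tenant/NMT/cache/ \
  --max_source_length 512
```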

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
