
CUDA error: device-side assert triggered while training Marian MT #14798

Closed

manchandasahil opened this issue Dec 16, 2021 · 4 comments
@manchandasahil

manchandasahil commented Dec 16, 2021

Environment info

  • transformers version: transformers-4.13.0.dev0
  • Platform: linux
  • Python version: 3.8
  • PyTorch version (GPU?): torch==1.11.0a0+b6df043 GPU
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: one node multigpu

Who can help

@patrickvonplaten
@sgugger
@LysandreJik

Information

Model I am using (Bert, XLNet ...): MarianMT (Helsinki-NLP/opus-mt-en-es)

The problem arises when using:

  • the official example scripts: (give details below)
    → NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)
    Training MarianMT on a custom EMEA dataset

To reproduce

Steps to reproduce the behavior:

  1. Clone the latest transformers repo
  2. /opt/conda/bin/python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 /data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py --train_file /data/atc_tenant/NMT/smancha5/EMEA.en-es.train.json --model_name_or_path Helsinki-NLP/opus-mt-en-es --do_train --source_lang=en --target_lang=es --output_dir=/data/atc_tenant/NMT/model1/ --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --cache_dir=/data/atc_tenant/NMT/cache/

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.

Traceback (most recent call last):
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 621, in
main()
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 538, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1471, in train
self._total_loss_scalar += tr_loss.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fe03f1d3e1c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x125 (0x7fe042e6d345 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fe042e704e8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7fe042e71df8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xcc9d4 (0x7fe0d47a29d4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x9609 (0x7fe0d6295609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe0d6055293 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Debugging Logs:
print(inputs['labels'].shape) : torch.Size([8, 94])
print(inputs['input_ids'].shape) : torch.Size([8, 70])
print(inputs['decoder_input_ids'].shape) : torch.Size([8, 94])
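
For reference, a minimal debugging sketch (a hypothetical helper, assuming access to the MarianMT `model` built in run_translation.py and the failing `inputs` batch) that checks whether any lookup index is out of range for the model's embedding tables, which is what the indexSelectLargeIndex assert above points at:

```python
def check_batch(inputs, model):
    """Hypothetical debugging helper: verify that every lookup index in the
    batch fits the model's embedding tables (token ids vs. vocab size,
    sequence length vs. positional table size)."""
    vocab_size = model.config.vocab_size
    max_pos = model.config.max_position_embeddings
    for key in ("input_ids", "decoder_input_ids", "labels"):
        if key not in inputs:
            continue
        ids = inputs[key]
        valid = ids[ids >= 0]  # drop the -100 padding used for labels
        print(key, "max id:", int(valid.max()), "/ vocab:", vocab_size,
              "| seq len:", ids.shape[-1], "/ max positions:", max_pos)
        assert int(valid.max()) < vocab_size, f"{key} contains an out-of-range token id"
        assert ids.shape[-1] <= max_pos, f"{key} is longer than max_position_embeddings"
```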

Expected behavior

Training of the model completes.

@patrickvonplaten
Contributor

It's sadly impossible for us to reproduce this error given the message above. From the error message, I'm quite sure that you are using a sequence length which is too long. Could you make sure you cut the input sequences to the maximum length of Marian?
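
For example, truncating with the tokenizer directly (a minimal sketch, not the exact preprocessing in run_translation.py; the 512 here is an assumed Marian limit):

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# truncation=True cuts every sequence down to max_length before it reaches the model
batch = tokenizer(
    ["a very long English sentence ..."],
    max_length=512,
    truncation=True,
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # second dimension is now at most 512
```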

@manchandasahil
Author

manchandasahil commented Dec 16, 2021

@patrickvonplaten Thanks for the reply. From the logs, I printed the shapes of the tensors in the batch that training fails on, and they are:
print(inputs['labels'].shape) : torch.Size([8, 94])
print(inputs['input_ids'].shape) : torch.Size([8, 70])
print(inputs['decoder_input_ids'].shape) : torch.Size([8, 94])

The max length in the config of this model is 512. Is there a flag to enforce this length, or should I preprocess my data to a certain length?

Thanks again for the help :)

@patrickvonplaten
Contributor

Could you try to simply add:

--max_source_length 512

to your command? It corresponds to this argument in run_translation.py:

max_source_length: Optional[int] = field(

which is set to 1024 by default.
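
For reference, the launch command from the reproduction steps with only that flag appended (everything else unchanged) would be:

```bash
/opt/conda/bin/python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 \
  /data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py \
  --train_file /data/atc_tenant/NMT/smancha5/EMEA.en-es.train.json \
  --model_name_or_path Helsinki-NLP/opus-mt-en-es \
  --do_train --source_lang=en --target_lang=es \
  --output_dir=/data/atc_tenant/NMT/model1/ \
  --per_device_train_batch_size=8 --per_device_eval_batch_size=4 \
  --overwrite_output_dir --predict_with_generate \
  --cache_dir=/data/atc_tenant/NMT/cache/ \
  --max_source_length 512
```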

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
