Longformer training : CUDA error: device-side assert triggered #10852

Closed

manchandasahil opened this issue Mar 22, 2021 · 5 comments

@manchandasahil

Environment info

  • transformers version:
  • Platform:
  • Python version: 3.7
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: sharded_ddp (fairscale)

Who can help

Information

Model I am using (Bert, XLNet ...): Longformer

The problem arises when using:

  • the official example scripts: (give details below)
  • [ x ] my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • [ x ] my own task or dataset: (give details below)

To reproduce

When I use the same configuration to train a model of type bert it works, but it does not work for longformer.
Steps to reproduce the behavior:
/opt/conda/bin/python -m torch.distributed.launch \
--nnodes=$WORLD_SIZE \
--node_rank=$RANK \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
--nproc_per_node=1 $SCRIPT \
--output_dir=$OUT_DIR \
--logging_dir=$OUT_DIR \
--tokenizer_name=$TOKENIZER \
--model_type=longformer --do_train --do_eval \
--cache_dir=$CACHE_DIR \
--overwrite_cache \
--validation_file=$EVAL_DATA \
--overwrite_output_dir \
--train_file=$TRAIN_DATA_FOLDER \
--dataset_name=$DATASET_NAME \
--line_by_line \
--learning_rate=${INIT_LR} \
--save_steps=${SAVE_STEPS} \
--max_seq_length=${BLOCK_SIZE} \
--gradient_accumulation_steps=${GRAD_ACCUM_STEPS} \
--fp16 \
--num_train_epochs=$EPOCHS \
--per_device_train_batch_size=$BATCH_SIZE_PER_GPU \
--local_rank=$LOCAL_RANK \
--train_dataset_info_path=$TRAIN_DATASET_INFO \
--test_dataset_info_path=$TEST_DATASET_INFO \
--sharded_ddp

Traceback (most recent call last):
File "/data/atc_tenant/bert_data/smancha5/run_mlm.py", line 661, in <module>
main()
File "/data/atc_tenant/bert_data/smancha5/run_mlm.py", line 465, in main
train_result = trainer.train(resume_from_checkpoint=model_path)
File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1003, in train
tr_loss += self.training_step(model, inputs)
File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1443, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1477, in compute_loss
outputs = model(**inputs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/fairscale/nn/data_parallel/sharded_ddp.py", line 218, in forward
return self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/transformers/models/longformer/modeling_longformer.py", line 1765, in forward
return_dict=return_dict,
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/transformers/models/longformer/modeling_longformer.py", line 1669, in forward
return_dict=return_dict,
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/transformers/models/longformer/modeling_longformer.py", line 1245, in forward
is_global_attn = is_index_global_attn.flatten().any().item()
RuntimeError: CUDA error: device-side assert triggered

(The same traceback, ending at the same is_global_attn line in modeling_longformer.py, is printed by each of the other worker processes.)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fc78c43d99b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xc10 (0x7fc78c680280 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc78c425dfd in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: + 0x5414e2 (0x7fc7c549d4e2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x19aaae (0x5603f8975aae in /opt/conda/bin/python)
frame #5: + 0xf2868 (0x5603f88cd868 in /opt/conda/bin/python)
frame #6: + 0x1f0d91 (0x5603f89cbd91 in /opt/conda/bin/python)
frame #7: + 0xf270d (0x5603f88cd70d in /opt/conda/bin/python)
frame #8: + 0x19aa90 (0x5603f8975a90 in /opt/conda/bin/python)
frame #9: + 0xf2868 (0x5603f88cd868 in /opt/conda/bin/python)
frame #10: + 0x1f0d91 (0x5603f89cbd91 in /opt/conda/bin/python)
frame #11: + 0xf2828 (0x5603f88cd828 in /opt/conda/bin/python)
frame #12: + 0x19aa90 (0x5603f8975a90 in /opt/conda/bin/python)
frame #13: + 0xf2868 (0x5603f88cd868 in /opt/conda/bin/python)
frame #14: + 0x1f0d91 (0x5603f89cbd91 in /opt/conda/bin/python)
frame #15: + 0x1688cb (0x5603f89438cb in /opt/conda/bin/python)
frame #16: _PyGC_CollectNoFail + 0x2a (0x5603f89cb79a in /opt/conda/bin/python)
frame #17: PyImport_Cleanup + 0x278 (0x5603f897ffa8 in /opt/conda/bin/python)
frame #18: Py_FinalizeEx + 0x61 (0x5603f89ea961 in /opt/conda/bin/python)
frame #19: Py_Main + 0x35e (0x5603f89f4cae in /opt/conda/bin/python)
frame #20: main + 0xee (0x5603f88bef2e in /opt/conda/bin/python)
frame #21: __libc_start_main + 0xe7 (0x7fc7f2cf3b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: + 0x1c327f (0x5603f899e27f in /opt/conda/bin/python)

(An equivalent c10::Error abort trace, differing only in pointer addresses, is printed by each of the remaining worker processes.)

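The RuntimeError above surfaces at the .item() call, but because CUDA kernels run asynchronously the device-side assert typically fires earlier, most often in an embedding lookup that receives an out-of-range index (a token id >= vocab_size, or a position id beyond max_position_embeddings when max_seq_length exceeds what the checkpoint supports). Re-running with CUDA_LAUNCH_BLOCKING=1, or briefly on CPU, usually turns the opaque assert into an explicit index error. A minimal pre-training sanity check along those lines, assuming the standard transformers Auto* APIs and a placeholder checkpoint name, might look like this:

import os

# Make CUDA calls synchronous so the failing kernel is reported at the real call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoConfig, AutoTokenizer

checkpoint = "allenai/longformer-base-4096"   # placeholder; substitute the actual $TOKENIZER / config
max_seq_length = 4096                         # whatever $BLOCK_SIZE is set to

config = AutoConfig.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Longformer is RoBERTa-derived, so usable positions are typically
# max_position_embeddings - 2; longer sequences make the position
# embedding lookup assert on GPU.
assert max_seq_length <= config.max_position_embeddings - 2, (
    max_seq_length, config.max_position_embeddings)

# Every token id produced by the tokenizer must index into the embedding table.
assert len(tokenizer) <= config.vocab_size, (len(tokenizer), config.vocab_size)
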
Expected behavior

@matteomedioli

Seems like my issue. Maybe this can help: #10832

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@happy-nlp

happy-nlp commented Dec 28, 2021

How to fix it? I've run into this issue too.

@akedjouadj

Also facing this issue.

@jonathanvevance

Any fix for this? Facing the same issue.
