Illegal memory error when training with multi-GPU #247

Open · desh2608 opened this issue Mar 11, 2022 · 39 comments

@desh2608 (Collaborator)

I am facing the following error when training with multiple GPUs (on the same node). I am not sure if this is icefall-related, but I thought maybe someone has seen it before? (I also tried running with CUDA_LAUNCH_BLOCKING=1 but got the same error message.)

# Running on r7n01
# Started at Fri Mar 11 13:48:01 EST 2022
# python conformer_ctc/train.py --world-size 4 
free gpu: 0 1 2 3

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2aab217dc2f2 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x2aab217d967b in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x2aab2156d1f9 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x2aab217c43a4 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x2aaaad8aecc9 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x2aaaad8a3c8a in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x2aaaad8caf22 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x2aaaad207e76 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa2121f (0x2aaaad8ce21f in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369f80 (0x2aaaad216f80 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b1ee (0x2aaaad2181ee in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x10fd35 (0x555555663d35 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #12: <unknown function> + 0x1aa047 (0x5555556fe047 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #13: <unknown function> + 0x110882 (0x555555664882 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #14: <unknown function> + 0x1102a9 (0x5555556642a9 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #15: <unknown function> + 0x110293 (0x555555664293 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #16: <unknown function> + 0x1130b8 (0x5555556670b8 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #17: <unknown function> + 0x1106ff (0x5555556646ff in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #18: <unknown function> + 0x1fba33 (0x55555574fa33 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x2685 (0x55555572c0d5 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #20: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #21: _PyFunction_Vectorcall + 0x534 (0x555555721754 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x4bf (0x555555729f0f in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #23: _PyFunction_Vectorcall + 0x1b7 (0x5555557213d7 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x71a (0x55555572a16a in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #26: _PyFunction_Vectorcall + 0x594 (0x5555557217b4 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x1517 (0x55555572af67 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #29: PyEval_EvalCode + 0x23 (0x555555721aa3 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #30: <unknown function> + 0x241382 (0x555555795382 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #31: <unknown function> + 0x252202 (0x5555557a6202 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #32: PyRun_StringFlags + 0x7a (0x5555557a8e4a in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #33: PyRun_SimpleStringFlags + 0x3c (0x5555557a8eac in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #34: Py_RunMain + 0x15b (0x5555557a981b in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #35: Py_BytesMain + 0x39 (0x5555557a9c69 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #36: __libc_start_main + 0xf5 (0x2aaaab616445 in /lib64/libc.so.6)
frame #37: <unknown function> + 0x1f7427 (0x55555574b427 in /home/hltcoe/draj/.conda/envs/scale/bin/python)

Traceback (most recent call last):
  File "conformer_ctc/train.py", line 787, in <module>
    main()
  File "conformer_ctc/train.py", line 775, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/exp/draj/mini_scale_2022/icefall/egs/spgispeech/ASR/conformer_ctc/train.py", line 701, in run
    train_one_epoch(
  File "/exp/draj/mini_scale_2022/icefall/egs/spgispeech/ASR/conformer_ctc/train.py", line 527, in train_one_epoch
    loss.backward()
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

When I train on a single GPU, it seems to work fine:

# Running on r7n07
# Started at Fri Mar 11 13:36:00 EST 2022
# python conformer_ctc/train.py --world-size 1 
free gpu: 0

2022-03-11 13:36:03,704 INFO [train.py:589] Training started
2022-03-11 13:36:03,705 INFO [train.py:590] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 100, 'reset_interval': 500, 'valid_interval': 25000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'weight_decay': 1e-06, 'warm_step': 80000, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '5ee082ea55f50e8bd42203ba266945ea5a236ab8', 'k2-git-date': 'Sat Feb 26 20:00:48 2022', 'lhotse-version': '1.0.0.dev+git.e6e73e4.dirty', 'torch-cuda-available': True, 'torch-cuda-version': '11.1', 'python-version': '3.8', 'icefall-git-branch': 'spgi', 'icefall-git-sha1': '0c27ba4-dirty', 'icefall-git-date': 'Tue Mar 8 15:01:58 2022', 'icefall-path': '/exp/draj/mini_scale_2022/icefall', 'k2-path': '/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/exp/draj/mini_scale_2022/lhotse/lhotse/__init__.py', 'hostname': 'r7n07', 'IP address': '10.1.7.7'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 20, 'start_epoch': 0, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_5000'), 'att_rate': 0.8, 'num_decoder_layers': 6, 'lr_factor': 5.0, 'seed': 42, 'manifest_dir': PosixPath('data/manifests'), 'enable_musan': True, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'max_duration': 150.0, 'num_buckets': 30, 'on_the_fly_feats': False, 'shuffle': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80}
2022-03-11 13:36:03,859 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_5000/Linv.pt
2022-03-11 13:36:04,019 INFO [train.py:638] About to create model
2022-03-11 13:36:08,869 INFO [asr_datamodule.py:295] About to get SPGISpeech dev cuts
2022-03-11 13:36:08,874 INFO [asr_datamodule.py:243] About to create dev dataset
2022-03-11 13:36:09,048 INFO [asr_datamodule.py:258] About to create dev dataloader
2022-03-11 13:36:09,049 INFO [train.py:735] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-03-11 13:36:14,049 INFO [train.py:697] epoch 0, learning rate 5.8593749999999995e-08
2022-03-11 13:36:15,186 INFO [train.py:532] Epoch 0, batch 0, loss[ctc_loss=7.717, att_loss=1.04, loss=2.376, over 3593.00 frames.], tot_loss[ctc_loss=7.717, att_loss=1.04, loss=2.376, over 3593.00 frames.], batch size: 13
@desh2608 (Collaborator, Author)

The error went away after reducing --max-duration in asr_datamodule.py to 100s, so it seems it was an OOM error that surfaced in a strange way.

@danpovey (Collaborator)

Hm. It might be worthwhile trying to debug that a bit, e.g. see if you can do
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1
and possibly the error might show up earlier.
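
A minimal sketch of setting the same variables from inside the script (my own illustration, assuming they are set before torch and k2 are imported so they are picked up when CUDA initializes):

# Sketch only: put this at the very top of train.py, before importing torch/k2.
import os

os.environ["K2_SYNC_KERNELS"] = "1"       # ask k2 to synchronize after its kernels
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make kernel launches synchronous so errors surface at the failing call

import torch  # noqa: E402  (imported after the env vars on purpose)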

@desh2608 (Collaborator, Author)

I get the same error even after adding export K2_SYNC_KERNELS=1 and export CUDA_LAUNCH_BLOCKING=1. I have k2 compiled in the debug mode. Is there some flag I can change to print more information?

@csukuangfj (Collaborator)

export K2_DISABLE_CHECKS=0 can enable extra checks.

You can use the steps in #142 (comment)
to debug the code with gdb.

@ahazned (Contributor) commented Apr 13, 2022

Hi, any updates on this issue?

I also get the same error on both single-gpu and multi-gpu setups unless I decrease "--max-duration" to 50.

I've also tried K2_SYNC_KERNELS=1 and CUDA_LAUNCH_BLOCKING=1 but the problem continues.

@danpovey (Collaborator)

How up-to-date is your code? We haven't seen this type of error for a while on our end.

@ahazned (Contributor) commented Apr 13, 2022

Hi Dan,

I cloned Icefall yesterday; my branch is up to date with 'origin/master', and the k2 details are below. By the way, I'm running egs/librispeech/ASR/pruned_transducer_stateless2/train.py on LibriSpeech 100 hours.

/tmp/icefall$ git status
On branch master
Your branch is up to date with 'origin/master'.

python3 -m k2.version
Collecting environment information...

k2 version: 1.14
Build type: Release
Git SHA1: 6833270cb228aba7bf9681fccd41e2b52f7d984c
Git date: Wed Mar 16 03:16:05 2022
Cuda used to build k2: 11.1
cuDNN used to build k2: 8.0.4
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 18.04.6 LTS
CMake version: 3.18.4
GCC version: 7.5.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 --expt-extended-lambda -gencode arch=compute_80,code=sm_80 --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.8.1
PyTorch is using Cuda: 11.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False

Here is what I got:

python3 pruned_transducer_stateless2/train.py --exp-dir=pruned_transducer_stateless2/exp_100h_ws1 --world-size 2 --num-epochs 40 --full-libri 0 --max-duration 300

/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/lhotse/dataset/sampling/bucketing.py:96: UserWarning: Lazy CutSet detected in BucketingSampler: we will read it into memory anyway. Please use lhotse.dataset.DynamicBucketingSampler instead.
warnings.warn(
/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/lhotse/dataset/sampling/bucketing.py:96: UserWarning: Lazy CutSet detected in BucketingSampler: we will read it into memory anyway. Please use lhotse.dataset.DynamicBucketingSampler instead.
warnings.warn(
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1616554793803/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f0c4b9b82f2 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f0c4b9b567b in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f0c4bc11219 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f0c4b9a03a4 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x6e0e5a (0x7f0ca2916e5a in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x6e0ef1 (0x7f0ca2916ef1 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1a974a (0x5568edb6a74a in /tmp/miniconda3/envs/k2/bin/python3)
frame #7: + 0x10f660 (0x5568edad0660 in /tmp/miniconda3/envs/k2/bin/python3)
frame #8: + 0x10f660 (0x5568edad0660 in /tmp/miniconda3/envs/k2/bin/python3)
frame #9: + 0x10faf5 (0x5568edad0af5 in /tmp/miniconda3/envs/k2/bin/python3)
frame #10: + 0x1a9727 (0x5568edb6a727 in /tmp/miniconda3/envs/k2/bin/python3)
frame #11: + 0x110632 (0x5568edad1632 in /tmp/miniconda3/envs/k2/bin/python3)
frame #12: + 0x110059 (0x5568edad1059 in /tmp/miniconda3/envs/k2/bin/python3)
frame #13: + 0x110043 (0x5568edad1043 in /tmp/miniconda3/envs/k2/bin/python3)
frame #14: + 0x112f68 (0x5568edad3f68 in /tmp/miniconda3/envs/k2/bin/python3)
frame #15: + 0x1104af (0x5568edad14af in /tmp/miniconda3/envs/k2/bin/python3)
frame #16: + 0x1fe1f3 (0x5568edbbf1f3 in /tmp/miniconda3/envs/k2/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x2681 (0x5568edb9a021 in /tmp/miniconda3/envs/k2/bin/python3)
frame #18: _PyEval_EvalCodeWithName + 0x260 (0x5568edb8d600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #19: _PyFunction_Vectorcall + 0x534 (0x5568edb8eb64 in /tmp/miniconda3/envs/k2/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x4c0 (0x5568edb97e60 in /tmp/miniconda3/envs/k2/bin/python3)
frame #21: _PyFunction_Vectorcall + 0x1b7 (0x5568edb8e7e7 in /tmp/miniconda3/envs/k2/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x71b (0x5568edb980bb in /tmp/miniconda3/envs/k2/bin/python3)
frame #23: _PyEval_EvalCodeWithName + 0x260 (0x5568edb8d600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #24: _PyFunction_Vectorcall + 0x594 (0x5568edb8ebc4 in /tmp/miniconda3/envs/k2/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1510 (0x5568edb98eb0 in /tmp/miniconda3/envs/k2/bin/python3)
frame #26: _PyEval_EvalCodeWithName + 0x260 (0x5568edb8d600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #27: PyEval_EvalCode + 0x23 (0x5568edb8eeb3 in /tmp/miniconda3/envs/k2/bin/python3)
frame #28: + 0x242622 (0x5568edc03622 in /tmp/miniconda3/envs/k2/bin/python3)
frame #29: + 0x2531d2 (0x5568edc141d2 in /tmp/miniconda3/envs/k2/bin/python3)
frame #30: PyRun_StringFlags + 0x7a (0x5568edc16e0a in /tmp/miniconda3/envs/k2/bin/python3)
frame #31: PyRun_SimpleStringFlags + 0x3c (0x5568edc16e6c in /tmp/miniconda3/envs/k2/bin/python3)
frame #32: Py_RunMain + 0x15b (0x5568edc177db in /tmp/miniconda3/envs/k2/bin/python3)
frame #33: Py_BytesMain + 0x39 (0x5568edc17c29 in /tmp/miniconda3/envs/k2/bin/python3)
frame #34: __libc_start_main + 0xe7 (0x7f0cd469fc87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: + 0x1f9ad7 (0x5568edbbaad7 in /tmp/miniconda3/envs/k2/bin/python3)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1616554793803/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f27956ae2f2 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f27956ab67b in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f2795907219 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f27956963a4 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x6e0e5a (0x7f27ec60ce5a in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x6e0ef1 (0x7f27ec60cef1 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1a974a (0x55953ec0f74a in /tmp/miniconda3/envs/k2/bin/python3)
frame #7: + 0x10f660 (0x55953eb75660 in /tmp/miniconda3/envs/k2/bin/python3)
frame #8: + 0x10f660 (0x55953eb75660 in /tmp/miniconda3/envs/k2/bin/python3)
frame #9: + 0x10faf5 (0x55953eb75af5 in /tmp/miniconda3/envs/k2/bin/python3)
frame #10: + 0x1a9727 (0x55953ec0f727 in /tmp/miniconda3/envs/k2/bin/python3)
frame #11: + 0x110632 (0x55953eb76632 in /tmp/miniconda3/envs/k2/bin/python3)
frame #12: + 0x110059 (0x55953eb76059 in /tmp/miniconda3/envs/k2/bin/python3)
frame #13: + 0x110043 (0x55953eb76043 in /tmp/miniconda3/envs/k2/bin/python3)
frame #14: + 0x112f68 (0x55953eb78f68 in /tmp/miniconda3/envs/k2/bin/python3)
frame #15: + 0x1104af (0x55953eb764af in /tmp/miniconda3/envs/k2/bin/python3)
frame #16: + 0x1fe1f3 (0x55953ec641f3 in /tmp/miniconda3/envs/k2/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x2681 (0x55953ec3f021 in /tmp/miniconda3/envs/k2/bin/python3)
frame #18: _PyEval_EvalCodeWithName + 0x260 (0x55953ec32600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #19: _PyFunction_Vectorcall + 0x534 (0x55953ec33b64 in /tmp/miniconda3/envs/k2/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x4c0 (0x55953ec3ce60 in /tmp/miniconda3/envs/k2/bin/python3)
frame #21: _PyFunction_Vectorcall + 0x1b7 (0x55953ec337e7 in /tmp/miniconda3/envs/k2/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x71b (0x55953ec3d0bb in /tmp/miniconda3/envs/k2/bin/python3)
frame #23: _PyEval_EvalCodeWithName + 0x260 (0x55953ec32600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #24: _PyFunction_Vectorcall + 0x594 (0x55953ec33bc4 in /tmp/miniconda3/envs/k2/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1510 (0x55953ec3deb0 in /tmp/miniconda3/envs/k2/bin/python3)
frame #26: _PyEval_EvalCodeWithName + 0x260 (0x55953ec32600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #27: PyEval_EvalCode + 0x23 (0x55953ec33eb3 in /tmp/miniconda3/envs/k2/bin/python3)
frame #28: + 0x242622 (0x55953eca8622 in /tmp/miniconda3/envs/k2/bin/python3)
frame #29: + 0x2531d2 (0x55953ecb91d2 in /tmp/miniconda3/envs/k2/bin/python3)
frame #30: PyRun_StringFlags + 0x7a (0x55953ecbbe0a in /tmp/miniconda3/envs/k2/bin/python3)
frame #31: PyRun_SimpleStringFlags + 0x3c (0x55953ecbbe6c in /tmp/miniconda3/envs/k2/bin/python3)
frame #32: Py_RunMain + 0x15b (0x55953ecbc7db in /tmp/miniconda3/envs/k2/bin/python3)
frame #33: Py_BytesMain + 0x39 (0x55953ecbcc29 in /tmp/miniconda3/envs/k2/bin/python3)
frame #34: __libc_start_main + 0xe7 (0x7f281e395c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: + 0x1f9ad7 (0x55953ec5fad7 in /tmp/miniconda3/envs/k2/bin/python3)

Traceback (most recent call last):
  File "pruned_transducer_stateless2/train.py", line 997, in <module>
    main()
  File "pruned_transducer_stateless2/train.py", line 988, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/tmp/icefall/egs/librispeech/ASR/pruned_transducer_stateless2/train.py", line 878, in run
    scan_pessimistic_batches_for_oom(
  File "/tmp/icefall/egs/librispeech/ASR/pruned_transducer_stateless2/train.py", line 964, in scan_pessimistic_batches_for_oom
    loss.backward()
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

@danpovey (Collaborator) commented Apr 13, 2022 via email

@ahazned (Contributor) commented Apr 13, 2022

Thanks. I tried, but unfortunately it doesn't help.

@danpovey (Collaborator)

It's supposed to make it print a more detailed error message, not fix the issue.

@danpovey (Collaborator)

Anyway I think a version of k2 from March 14th is not recent enough to run the pruned_transducer_stateless2 recipe.
You may have to compile k2 from scratch; or use a more recent version if you can find one.

@csukuangfj (Collaborator)

@ahazned
Are you able to run the unit tests of k2? You can follow https://k2-fsa.github.io/k2/installation/for_developers.html to run the tests.

@desh2608 (Collaborator, Author) commented Apr 14, 2022

@csukuangfj I have the most recent versions of k2 and icefall (all tests are passing), but still get this error for larger batch sizes (>100s when training with 4 GPUs with 12G mem each). I am trying to run a pruned_transducer_stateless2 model on SPGISpeech.

@danpovey (Collaborator) commented Apr 15, 2022

@desh2608 see if you can run the training inside cuda-gdb (but I'm not sure whether cuda-gdb is able to handle multiple training processes, and also whether it will be easy for you to install). If the problem can be reproduced with 1 job that might make it easier.
Also
export K2_SYNC_KERNELS=1
export K2_DISABLE_DEBUG=0
export CUDA_LAUNCH_BLOCKING=1
may help to make a problem visible easier.

@ahazned (Contributor) commented Apr 15, 2022

I successfully ran "pruned_transducer_stateless2/train.py" with "--max-duration=300" when I used a newer k2 (1.14, Git date: Wed Apr 13 00:46:49 2022). I use two GPUs with 24 GB of memory each.

But one interesting thing is that I get different WERs on "egs/yesno/ASR/tdnn/train.py" with different k2/PyTorch/CUDA combinations. Not sure if this is expected.

k2 version: 1.14 | Git date: Wed Mar 16 03:16:05 2022 | PyTorch version used to build k2: 1.8.1+cu111
%WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]

k2 version: 1.14 | Git date: Wed Apr 13 00:46:49 2022 | PyTorch version used to build k2: 1.11.0+cu102
%WER 2.50% [6 / 240, 5 ins, 1 del, 0 sub ]

k2 version: 1.14 | Git date: Wed Apr 13 00:46:49 2022 | PyTorch version used to build k2: 1.8.1+cu102
%WER 3.33% [8 / 240, 7 ins, 1 del, 0 sub ]

k2 version: 1.14 | Git date: Wed Apr 13 00:46:49 2022 | PyTorch version used to build k2: 1.11.0+cu113
%WER 2.50% [6 / 240, 5 ins, 1 del, 0 sub ]

@danpovey (Collaborator)

Different PyTorch versions may cause different random-number sequences; and there may be other reasons why they differ slightly. I think this is probably expected. The yesno data set is super tiny, so random noise is a larger factor than normal.

@ahazned (Contributor) commented Apr 15, 2022

Ok, thanks Dan.

@csukuangfj (Collaborator)

@desh2608

  1. How did you install k2?
  2. What is the output of python3 -m k2.version?
  3. What is the type of your GPU?
  4. If you compiled k2 from source, are you running on the same machine that you used to compile k2?

@desh2608 (Collaborator, Author) commented May 3, 2022

I think this is fixed now (although I don't know what fixed it). I just updated PyTorch from 1.8.1 to 1.10.1, pulled the latest k2 (v1.14), and compiled it from source in debug mode.

09:56 $ python -m k2.version
Collecting environment information...

k2 version: 1.14
Build type: Debug
Git SHA1: 1b29f0a946f50186aaa82df46a59f492ade9692b
Git date: Tue Apr 12 20:46:49 2022
Cuda used to build k2: 11.1
cuDNN used to build k2: 8.0.2
Python version used to build k2: 3.8
OS used to build k2: CentOS Linux release 7.5.1804 (Core)
CMake version: 3.22.1
GCC version: 7.2.0
CMAKE_CUDA_FLAGS:  --compiler-options -rdynamic --compiler-options -lineinfo -Wno-deprecated-gpu-targets  --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 --expt-extended-lambda -gencode arch=compute_80,code=sm_80 --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow
PyTorch version used to build k2: 1.10.1+cu111
PyTorch is using Cuda: 11.1
NVTX enabled: True
With CUDA: True
Disable debug: False
Sync kernels : True
Disable checks: False

After this upgrade, I am able to train with a batch size of 250s, where earlier I was getting the weird memory issues even with a batch size of 100 (using 8 V100 GPUs). Perhaps there was an issue with PyTorch 1.8.1? It's hard to say.

I still get a CUDA error when I try to use batch size 300, but from PyTorch discussion forums, it seems to be related to OOM, although I was hoping it would be caught by scan_pessimistic_batches_for_oom().

2022-05-03 10:28:06,656 INFO [asr_datamodule.py:289] (7/8) About to create dev dataloader
2022-05-03 10:28:06,656 INFO [train.py:926] (7/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-05-03 10:40:03,363 INFO [distributed.py:874] (0/8) Reducer buckets have been rebuilt in this iteration.
2022-05-03 10:40:03,367 INFO [distributed.py:874] (1/8) Reducer buckets have been rebuilt in this iteration.
2022-05-03 10:40:03,371 INFO [distributed.py:874] (5/8) Reducer buckets have been rebuilt in this iteration.
2022-05-03 10:40:03,371 INFO [distributed.py:874] (7/8) Reducer buckets have been rebuilt in this iteration.
2022-05-03 10:40:03,371 INFO [distributed.py:874] (6/8) Reducer buckets have been rebuilt in this iteration.
2022-05-03 10:40:03,371 INFO [distributed.py:874] (3/8) Reducer buckets have been rebuilt in this iteration.
2022-05-03 10:40:03,371 INFO [distributed.py:874] (4/8) Reducer buckets have been rebuilt in this iteration.
2022-05-03 10:40:03,371 INFO [distributed.py:874] (2/8) Reducer buckets have been rebuilt in this iteration.
2022-05-03 10:40:46,352 INFO [train.py:710] (6/8) Epoch 0, batch 0, loss[loss=0.8929, simple_loss=1.786, pruned_loss=6.343, over 7364.00 frames.], tot_loss[loss=0.8929, simple_loss=1.786, pruned_loss=6.343, over 7364.00 frames.], batch size: 37, lr: 3.00e-03
2022-05-03 10:40:46,353 INFO [train.py:710] (3/8) Epoch 0, batch 0, loss[loss=0.8258, simple_loss=1.652, pruned_loss=6.268, over 7487.00 frames.], tot_loss[loss=0.8258, simple_loss=1.652, pruned_loss=6.268, over 7487.00 frames.], batch size: 20, lr: 3.00e-03
2022-05-03 10:40:46,353 INFO [train.py:710] (4/8) Epoch 0, batch 0, loss[loss=0.8831, simple_loss=1.766, pruned_loss=6.333, over 7294.00 frames.], tot_loss[loss=0.8831, simple_loss=1.766, pruned_loss=6.333, over 7294.00 frames.], batch size: 31, lr: 3.00e-03
2022-05-03 10:40:46,353 INFO [train.py:710] (5/8) Epoch 0, batch 0, loss[loss=0.9224, simple_loss=1.845, pruned_loss=6.351, over 7392.00 frames.], tot_loss[loss=0.9224, simple_loss=1.845, pruned_loss=6.351, over 7392.00 frames.], batch size: 52, lr: 3.00e-03
2022-05-03 10:40:46,353 INFO [train.py:710] (7/8) Epoch 0, batch 0, loss[loss=0.8968, simple_loss=1.794, pruned_loss=6.348, over 7478.00 frames.], tot_loss[loss=0.8968, simple_loss=1.794, pruned_loss=6.348, over 7478.00 frames.], batch size: 29, lr: 3.00e-03
2022-05-03 10:40:46,353 INFO [train.py:710] (2/8) Epoch 0, batch 0, loss[loss=0.9219, simple_loss=1.844, pruned_loss=6.435, over 7405.00 frames.], tot_loss[loss=0.9219, simple_loss=1.844, pruned_loss=6.435, over 7405.00 frames.], batch size: 23, lr: 3.00e-03
2022-05-03 10:40:46,353 INFO [train.py:710] (1/8) Epoch 0, batch 0, loss[loss=0.8198, simple_loss=1.64, pruned_loss=6.221, over 7314.00 frames.], tot_loss[loss=0.8198, simple_loss=1.64, pruned_loss=6.221, over 7314.00 frames.], batch size: 32, lr: 3.00e-03
2022-05-03 10:40:46,353 INFO [train.py:710] (0/8) Epoch 0, batch 0, loss[loss=0.8963, simple_loss=1.793, pruned_loss=6.408, over 7483.00 frames.], tot_loss[loss=0.8963, simple_loss=1.793, pruned_loss=6.408, over 7483.00 frames.], batch size: 21, lr: 3.00e-03
Traceback (most recent call last):
  File "pruned_transducer_stateless2/train.py", line 978, in <module>
    main()
  File "pruned_transducer_stateless2/train.py", line 969, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/exp/draj/mini_scale_2022/icefall/egs/spgispeech/ASR/pruned_transducer_stateless2/train.py", line 882, in run
    train_one_epoch(
  File "/exp/draj/mini_scale_2022/icefall/egs/spgispeech/ASR/pruned_transducer_stateless2/train.py", line 676, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: invalid configuration argument

@danpovey (Collaborator) commented May 4, 2022

@csukuangfj I am thinking we should just make it the default that it prints out some details of the batch (e.g. dimensions and sentence-lengths at least; or perhaps the entire object), when we get an OOM error. This will make things like this easier to debug.

HOWEVER, desh, I'm not convinced that this actually is an OOM error. Try doing
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1
and rerunning, hopefully we'll get a more relevant stack trace.
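
A rough sketch of the first suggestion (printing some details of the batch when the loss computation fails); the names compute_loss, display_and_save_batch and scaler come from the recipe itself, but this wrapper and its exact placement inside train_one_epoch are only illustrative:

# Illustrative fragment: log the offending batch's shapes and save it before re-raising.
try:
    with torch.cuda.amp.autocast(enabled=params.use_fp16):
        loss, _ = compute_loss(
            params=params,
            model=model,
            sp=sp,
            batch=batch,
            is_training=True,
            warmup=0.0,
        )
    scaler.scale(loss).backward()
except RuntimeError:
    feats = batch["inputs"]
    logging.error(f"features shape: {feats.shape}")
    logging.error(f"num cuts: {len(batch['supervisions']['text'])}")
    display_and_save_batch(batch, params=params, sp=sp)
    raise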

@desh2608 (Collaborator, Author) commented May 4, 2022

HOWEVER, desh, I'm not convinced that this actually is an OOM error. Try doing export K2_SYNC_KERNELS=1 export CUDA_LAUNCH_BLOCKING=1 and rerunning, hopefully we'll get a more relevant stack trace.

Yeah, I already have the following variables set:

export K2_DISABLE_CHECKS=0
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1

but I didn't see any more details in the stack trace. I also printed out the batch when the error happened, but it looked similar to all other batches. I'll try to get it again when my current training ends and share the batch details here.

@danpovey (Collaborator) commented May 5, 2022

OK, thanks. It would be appreciated if you could help us debug this.
Something else you can try is to get a gdb stack trace:
gdb --args python3 [args]
(gdb) catch throw
(gdb) r
... this may give more info.

@wgb14 (Contributor) commented May 19, 2022

Posting the log from when I set --max-duration 300 while training on GigaSpeech:

2022-05-20 01:01:03,441 INFO [train_test.py:782] Training started
2022-05-20 01:01:03,449 INFO [train_test.py:792] Device: cuda:0
2022-05-20 01:01:03,496 INFO [train_test.py:801] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 20000, 'env_info': {'k2-version': '1.15.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ecfe7bd6d9189964bf3ff043038918d889a43185', 'k2-git-date': 'Tue May 10 10:57:55 2022', 'lhotse-version': '1.2.0.dev+git.a3d7b8e.clean', 'torch-version': '1.10.0', 'torch-cuda-available': True, 'torch-cuda-version': '11.1', 'python-version': '3.7', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f6ce135-dirty', 'icefall-git-date': 'Mon May 16 21:46:59 2022', 'icefall-path': '/userhome/user/guanbo/icefall_test', 'k2-path': '/opt/conda/lib/python3.7/site-packages/k2-1.15.1.dev20220519+cuda11.1.torch1.10.0-py3.7-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/userhome/user/guanbo/lhotse/lhotse/__init__.py', 'hostname': 'e0e708b00d794011ec09cda0e7275cb175f4-chenx8564-0', 'IP address': '10.229.82.57'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'start_batch': 0, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp_oom'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 8000, 'keep_last_k': 20, 'use_fp16': True, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 4, 'enable_spec_aug': False, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 500}
2022-05-20 01:01:03,496 INFO [train_test.py:803] About to create model
2022-05-20 01:01:04,027 INFO [train_test.py:807] Number of model parameters: 78648040
2022-05-20 01:01:09,835 INFO [asr_datamodule.py:399] About to get train_XL cuts
2022-05-20 01:01:09,835 INFO [asr_datamodule.py:229] Disable MUSAN
2022-05-20 01:01:09,836 INFO [asr_datamodule.py:271] Disable SpecAugment
2022-05-20 01:01:09,836 INFO [asr_datamodule.py:273] About to create train dataset
2022-05-20 01:01:09,836 INFO [asr_datamodule.py:301] Using DynamicBucketingSampler.
2022-05-20 01:01:12,470 INFO [asr_datamodule.py:316] About to create train dataloader
2022-05-20 01:01:12,471 INFO [asr_datamodule.py:406] About to get dev cuts
2022-05-20 01:01:12,918 INFO [asr_datamodule.py:347] About to create dev dataset
2022-05-20 01:01:12,925 INFO [asr_datamodule.py:366] About to create dev dataloader
2022-05-20 01:01:12,925 INFO [train_test.py:959] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-05-20 01:39:10,638 INFO [train_test.py:936] Saving batch to pruned_transducer_stateless2/exp_oom/batch-f2d3f761-0ba3-6279-14cb-056407437c3b.pt
2022-05-20 01:39:10,777 INFO [train_test.py:942] features shape: torch.Size([273, 170, 80])
2022-05-20 01:39:10,781 INFO [train_test.py:946] num tokens: 1377
Traceback (most recent call last):
  File "./pruned_transducer_stateless2/train_test.py", line 1011, in <module>
    main()
  File "./pruned_transducer_stateless2/train_test.py", line 1004, in main
    run(rank=0, world_size=1, args=args)
  File "./pruned_transducer_stateless2/train_test.py", line 863, in run
    params=params,
  File "./pruned_transducer_stateless2/train_test.py", line 977, in scan_pessimistic_batches_for_oom
    loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

And the batch.pt: batch-f2d3f761-0ba3-6279-14cb-056407437c3b.zip

@csukuangfj (Collaborator)

The attached batch works perfectly for me.

Here is the change I made to train.py to run it.

diff --git a/egs/gigaspeech/ASR/pruned_transducer_stateless2/train.py b/egs/gigaspeech/ASR/pruned_transducer_stateless2/train.py
index 83ae255..b69b6fc 100755
--- a/egs/gigaspeech/ASR/pruned_transducer_stateless2/train.py
+++ b/egs/gigaspeech/ASR/pruned_transducer_stateless2/train.py
@@ -833,6 +833,24 @@ def run(rank, world_size, args):
     if params.print_diagnostics:
         diagnostic = diagnostics.attach_diagnostics(model)

+    pt_file = "./batch-f2d3f761-0ba3-6279-14cb-056407437c3b.pt"
+    batch = torch.load(pt_file)
+    with torch.cuda.amp.autocast(enabled=params.use_fp16):
+        loss, _ = compute_loss(
+            params=params,
+            model=model,
+            sp=sp,
+            batch=batch,
+            is_training=True,
+            warmup=0.0,
+        )
+    loss.backward()
+    optimizer.step()
+    optimizer.zero_grad()
+    logging.info(f"loss: {loss}")
+
+    return
+
     gigaspeech = GigaSpeechAsrDataModule(args)

     train_cuts = gigaspeech.train_cuts()

The command for training is

./pruned_transducer_stateless2/train.py

The output is

2022-05-20 07:39:02,040 INFO [train.py:782] Training started
2022-05-20 07:39:02,044 INFO [train.py:792] Device: cuda:0
2022-05-20 07:39:02,052 INFO [train.py:801] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 20000, 'env_info': {'k2-version': '1.15.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f8d2dba06c000ffee36aab5b66f24e7c9809f116', 'k2-git-date': 'Thu Apr 21 12:20:34 2022', 'lhotse-version': '1.1.0.dev+missing.version.file', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '2900ed8-dirty', 'icefall-git-date': 'Thu May 19 12:51:07
2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-master-3', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-multi-22/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-master/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-1-0307195509-54c966b95f-rtpfq', 'IP address': '10.177.22.9'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'start_batch': 0, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 8000, 'keep_last_k': 20, 'use_fp16': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 500}
2022-05-20 07:39:02,052 INFO [train.py:803] About to create model
2022-05-20 07:39:02,471 INFO [train.py:807] Number of model parameters: 78648040
2022-05-20 07:39:07,589 INFO [train.py:850] loss: 26499.20703125

@pkufool (Collaborator) commented May 19, 2022

@wgb14 I think you can first try running with CUDA_LAUNCH_BLOCKING=1; it may give a more informative stack trace. You can also print out the error message in scan_pessimistic_batches_for_oom (i.e. the exception's message). I saw you were using max-duration=300 and the batch size is 273; there may be too much padding, and it raised an OOM.
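
For a rough sense of how much padding that implies (assuming the usual 10 ms frame shift): the features tensor is [273, 170, 80], i.e. 273 × 170 frames ≈ 464 s of padded input, while --max-duration caps the real audio at about 300 s, so at least a third of the tensor is padding.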

@csukuangfj (Collaborator)

I notice that you are using

  • torch 1.10.0 + CUDA 11.1 + Python 3.7

while I am using torch + CUDA 10.2 + Python 3.8. I will try to switch to CUDA 11.1 + Python 3.7 and run it again.

@pkufool (Collaborator) commented May 19, 2022

@csukuangfj I think we can print out the exception message here, even if it is not an OOM error.

            optimizer.zero_grad()
        except Exception as e:
            if "CUDA out of memory" in str(e):
                logging.error(
                    "Your GPU ran out of memory with the current "
                    "max_duration setting. We recommend decreasing "
                    "max_duration and trying again.\n"
                    f"Failing criterion: {criterion} "
                    f"(={crit_values[criterion]}) ..."
                )
            display_and_save_batch(batch, params=params, sp=sp)
            raise
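
For example, something along these lines (a sketch of the idea only, not a tested patch; the OOM-specific message is abbreviated here):

        except Exception as e:
            # Sketch: log the exception text itself first, so non-OOM CUDA
            # errors (e.g. "invalid configuration argument") show up as well.
            logging.error(f"Exception in scan_pessimistic_batches_for_oom: {e}")
            if "CUDA out of memory" in str(e):
                logging.error(
                    "Your GPU ran out of memory with the current "
                    "max_duration setting. We recommend decreasing "
                    "max_duration and trying again."
                )
            display_and_save_batch(batch, params=params, sp=sp)
            raise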

@pkufool (Collaborator) commented May 19, 2022

I notice that you are using

  • torch 1.10.0 + CUDA 11.1 + Python 3.7

while I am using torch + CUDA 10.2 + Python 3.8. I will try to switch to CUDA 11.1 + Python 3.7 and run it again.

BTW, he used mixed-precision training.

@wgb14 (Contributor) commented May 19, 2022

The error message in scan_pessimistic_batches_for_oom:

Failing criterion: max_num_cuts (=273) ...

And in my previous experiments,

export CUDA_LAUNCH_BLOCKING=1
export K2_SYNC_KERNELS=1

didn't give me any additional information.

@pkufool (Collaborator) commented May 20, 2022

The error message in scan_pessimistic_batches_for_oom:

Failing criterion: max_num_cuts (=273) ...

Is that printed out by str(e)?

And what is your GPU memory size? I think fangjun has a 32GB V100.

@wgb14 (Contributor) commented May 20, 2022

The error message in scan_pessimistic_batches_for_oom:

Failing criterion: max_num_cuts (=273) ...

Is that printed out by str(e)?

And what is your GPU memory size? I think fangjun has a 32GB V100.

No, this is from logging, after commenting out the line if "CUDA out of memory" in str(e):

2022-05-20 06:02:33,454 INFO [train_test.py:782] Training started
2022-05-20 06:02:33,462 INFO [train_test.py:792] Device: cuda:0
2022-05-20 06:02:33,501 INFO [train_test.py:801] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 20000, 'env_info': {'k2-version': '1.15.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ecfe7bd6d9189964bf3ff043038918d889a43185', 'k2-git-date': 'Tue May 10 10:57:55 2022', 'lhotse-version': '1.2.0.dev+git.a3d7b8e.clean', 'torch-version': '1.10.0', 'torch-cuda-available': True, 'torch-cuda-version': '11.1', 'python-version': '3.7', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f6ce135-dirty', 'icefall-git-date': 'Mon May 16 21:46:59 2022', 'icefall-path': '/userhome/user/guanbo/icefall_test', 'k2-path': '/opt/conda/lib/python3.7/site-packages/k2-1.15.1.dev20220519+cuda11.1.torch1.10.0-py3.7-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/userhome/user/guanbo/lhotse/lhotse/__init__.py', 'hostname': 'bad3b4500d7bf011ec09cda0e7275cb175f4-chenx8564-0', 'IP address': '10.229.82.7'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'start_batch': 0, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp_oom'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 8000, 'keep_last_k': 20, 'use_fp16': True, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 4, 'enable_spec_aug': False, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 500}
2022-05-20 06:02:33,502 INFO [train_test.py:803] About to create model
2022-05-20 06:02:34,003 INFO [train_test.py:807] Number of model parameters: 78648040
2022-05-20 06:02:38,928 INFO [asr_datamodule.py:399] About to get train_XL cuts
2022-05-20 06:02:38,929 INFO [asr_datamodule.py:229] Disable MUSAN
2022-05-20 06:02:38,929 INFO [asr_datamodule.py:271] Disable SpecAugment
2022-05-20 06:02:38,929 INFO [asr_datamodule.py:273] About to create train dataset
2022-05-20 06:02:38,929 INFO [asr_datamodule.py:301] Using DynamicBucketingSampler.
2022-05-20 06:02:41,685 INFO [asr_datamodule.py:316] About to create train dataloader
2022-05-20 06:02:41,686 INFO [asr_datamodule.py:406] About to get dev cuts
2022-05-20 06:02:42,191 INFO [asr_datamodule.py:347] About to create dev dataset
2022-05-20 06:02:42,199 INFO [asr_datamodule.py:366] About to create dev dataloader
2022-05-20 06:02:42,199 INFO [train_test.py:959] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-05-20 06:40:54,103 ERROR [train_test.py:983] Your GPU ran out of memory with the current max_duration setting. We recommend decreasing max_duration and trying again.
Failing criterion: max_num_cuts (=273) ...
2022-05-20 06:40:54,122 INFO [train_test.py:936] Saving batch to pruned_transducer_stateless2/exp_oom/batch-f2d3f761-0ba3-6279-14cb-056407437c3b.pt
2022-05-20 06:40:55,000 INFO [train_test.py:942] features shape: torch.Size([273, 170, 80])
2022-05-20 06:40:55,003 INFO [train_test.py:946] num tokens: 1377
Traceback (most recent call last):
  File "./pruned_transducer_stateless2/train_test.py", line 1011, in <module>
    main()
  File "./pruned_transducer_stateless2/train_test.py", line 1004, in main
    run(rank=0, world_size=1, args=args)
  File "./pruned_transducer_stateless2/train_test.py", line 863, in run
    params=params,
  File "./pruned_transducer_stateless2/train_test.py", line 977, in scan_pessimistic_batches_for_oom
    loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I'm also using Tesla V100-32GB

@csukuangfj (Collaborator)

@csukuangfj I think we can print out the exception message here, even if it is not an OOM error.

It should be printed by Python, i.e., the one at the end of the logs:

RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

@csukuangfj (Collaborator)

@wgb14 I can reproduce your error with torch 1.10.0 + CUDA 11.1.

Here is the log:

2022-05-20 08:40:40,258 INFO [train.py:782] Training started
2022-05-20 08:40:40,274 INFO [train.py:792] Device: cuda:0
2022-05-20 08:40:40,301 INFO [train.py:801] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 20000, 'env_info': {'k2-version': '1.15.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ecfe7bd6d9189964bf3ff043038918d889a43185', 'k2-git-date': 'Tue May 10 10:57:55 2022', 'lhotse-version': '1.1.0.dev+missing.version.file', 'torch-version': '1.10.0+cu111', 'torch-cuda-available': True, 'torch-cuda-version': '11.1', 'python-version': '3.7', 'icefall-git-branch': 'master', 'icefall-git-sha1': '2900ed8-dirty', 'icefall-git-date': 'Thu May 19 12:51:07
2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-master-3', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-master/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-master/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-7-0309102938-68688b4cbd-xhtcg', 'IP address': '10.48.32.137'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'start_batch': 0, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 8000, 'keep_last_k': 20, 'use_fp16': True, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 500}
2022-05-20 08:40:40,301 INFO [train.py:803] About to create model
2022-05-20 08:40:40,744 INFO [train.py:807] Number of model parameters: 78648040
Traceback (most recent call last):
  File "./pruned_transducer_stateless2/train.py", line 992, in <module>
    main()
  File "./pruned_transducer_stateless2/train.py", line 985, in main
    run(rank=0, world_size=1, args=args)
  File "./pruned_transducer_stateless2/train.py", line 847, in run
    loss.backward()
  File "/ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Note both --use-fp16=1 and --use-fp16=0 throw the same error.


I would suggest you switch to torch 1.10.0 + CUDA 10.2.

@csukuangfj (Collaborator)

I find that both @desh2608 and @ahazned are also using CUDA 11.1. Probably the issue is caused by CUDA 11.1. Switching to CUDA 10.2 may fix the issue, I think.
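
A quick way to confirm which combination a given environment is actually using (plain PyTorch calls, shown here only as a convenience; the k2 side is already covered by python3 -m k2.version):

# Print the torch/CUDA combination of the current environment.
import torch

print("torch:", torch.__version__)                # e.g. 1.10.0+cu111
print("CUDA used by torch:", torch.version.cuda)  # e.g. 11.1 or 10.2
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))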

@danpovey (Collaborator)

@csukuangfj since you can repro the issue, perhaps you could try running in cuda-gdb?
This could be caused by asking for too many threads or something like that, which could potentially be in our code (but also could be in Torch).

@danpovey (Collaborator)

... catch throw in cuda-gdb might show where in the C++ it's failing (if this is even necessary).
By default the exception gets caught by Python and printed out there, and it's that stack trace that we see.

@csukuangfj (Collaborator)

@csukuangfj since you can repro the issue, perhaps you could try running in cuda-gdb? This could be caused by asking for too many threads or something like that, which could potentially be in our code (but also could be in Torch).

Yes, I am trying it.

@csukuangfj (Collaborator)

Output of the following command:

cuda-gdb --args python3 ./pruned_transducer_stateless2/train.py
NVIDIA (R) CUDA Debugger
11.1 release
Portions Copyright (C) 2007-2020 NVIDIA Corporation
GNU gdb (GDB) 8.3.1
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(cuda-gdb) catch throw
Catchpoint 1 (throw)
(cuda-gdb) r
Starting program: /ceph-fj/fangjun/py37/bin/python3 ./pruned_transducer_stateless2/train.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 1354964]
[New Thread 0x7ffef1dad700 (LWP 1354967)]
[New Thread 0x7ffef15ac700 (LWP 1354968)]
.... omit [New Thread xxx] here ....
[Thread 0x7ffe962fd700 (LWP 1355056) exited]
[Thread 0x7ffe95afc700 (LWP 1355057) exited]
... omit [Thread xxx exited] here
2022-05-20 10:17:36,560 INFO [train.py:782] Training started
[New Thread 0x7ffec6d9b700 (LWP 1355068)]
2022-05-20 10:17:36,563 INFO [train.py:792] Device: cuda:0
2022-05-20 10:17:36,566 INFO [train.py:801] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_i
dx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, '
nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 20000, 'env_info': {'k2-versi
on': '1.15.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ecfe7bd6d9189964bf3ff043038918d889a43185', 'k2-git-date': 'Tue May 1
0 10:57:55 2022', 'lhotse-version': '1.1.0.dev+missing.version.file', 'torch-version': '1.10.0+cu111', 'torch-cuda-available': True, 'torch-cuda-vers
ion': '11.1', 'python-version': '3.7', 'icefall-git-branch': 'master', 'icefall-git-sha1': '2900ed8-dirty', 'icefall-git-date': 'Thu May 19 12:51:07
2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-master-3', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-master/k2/python/k2/__init__.
py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-master/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-0307200233-b554c565c-lf9qd',
'IP address': '10.177.74.201'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'start_batch': 0, 'ex
p_dir': PosixPath('pruned_transducer_stateless2/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epoch
s': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'sav
e_every_n': 8000, 'keep_last_k': 20, 'use_fp16': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200.0, 'bucketing_sampler': True, 'n
um_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num
_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'voc
ab_size': 500}
2022-05-20 10:17:36,566 INFO [train.py:803] About to create model
2022-05-20 10:17:37,034 INFO [train.py:807] Number of model parameters: 78648040
[New Thread 0x7ffec959c700 (LWP 1355069)]

[New Thread 0x7ffecbd9d700 (LWP 1355071)]
warning: Cuda API error detected: cudaLaunchKernel returned (0x1)

warning: Cuda API error detected: cudaPeekAtLastError returned (0x1)

warning: Cuda API error detected: cudaPeekAtLastError returned (0x1)

warning: Cuda API error detected: cudaGetLastError returned (0x1)

warning: Cuda API error detected: cudaLaunchKernel returned (0x9)

warning: Cuda API error detected: cudaGetLastError returned (0x9)

[Switching to Thread 0x7ffecbd9d700 (LWP 1355071)]

Thread 95 "python3" hit Catchpoint 1 (exception thrown), 0x00007ffff1ce2d1d in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(cuda-gdb)
(cuda-gdb) bt
#0  0x00007ffff1ce2d1d in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffef68b25eb in at::native::embedding_backward_cuda_kernel(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long
, int, bool, at::Tensor const&, at::Tensor const&, at::Tensor const&) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cu.so
#2  0x00007ffef688be07 in at::native::embedding_dense_backward_cuda(at::Tensor const&, at::Tensor const&, long, long, bool) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cu.so
#3  0x00007ffef7a9f269 in at::(anonymous namespace)::(anonymous namespace)::wrapper__embedding_dense_backward(at::Tensor const&, at::Tensor const&, l
ong, long, bool) () from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cu.so
#4  0x00007ffef7a9f2bd in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Ten
sor (at::Tensor const&, at::Tensor const&, long, long, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper__embedding_dense_backward>,
at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, long, long, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&
, long, long, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, long, bool) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cu.so
#5  0x00007fff475fc75c in at::_ops::embedding_dense_backward::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, long, bool)
 () from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#6  0x00007fff48f71375 in torch::autograd::VariableType::(anonymous namespace)::embedding_dense_backward(c10::DispatchKeySet, at::Tensor const&, at::
Tensor const&, long, long, bool) () from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#7  0x00007fff48f71914 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Ten
sor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, long, bool), &torch::autograd::VariableType::(anonymous namespace)::embedding_d
ense_backward>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, long, bool> >, at::Tensor
(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, long, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at
::Tensor const&, long, long, bool) () from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007fff4764f2f5 in at::_ops::embedding_dense_backward::call(at::Tensor const&, at::Tensor const&, long, long, bool) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007fff471525b4 in at::native::embedding_backward(at::Tensor const&, at::Tensor const&, long, long, bool, bool) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fff47bd4f57 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Ten
sor (at::Tensor const&, at::Tensor const&, long, long, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper__embedding_backward>,
at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, long, long, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor
const&, long, long, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, long, bool, bool) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007fff4764ec48 in at::_ops::embedding_backward::call(at::Tensor const&, at::Tensor const&, long, long, bool, bool) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007fff48ed3c71 in torch::autograd::generated::EmbeddingBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#13 0x00007fff495bcdc7 in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#14 0x00007fff495b802b in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::aut
ograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007fff495b8d5a in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007fff495b0779 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#17 0x00007ffff0c72963 in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()
   from /ceph-fj/fangjun/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#18 0x00007ffff1d0d6df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007ffff7bbb6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007ffff713f71f in clone () from /lib/x86_64-linux-gnu/libc.so.6
(cuda-gdb)

Looks like the error is from PyTorch (the throw happens inside at::native::embedding_dense_backward_cuda), not from k2.
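
Since the throw happens inside the CUDA embedding backward, one way to check PyTorch in isolation is to run just an embedding forward/backward on the GPU. This is only a sketch: the sizes are guesses based on the config above (vocab_size=500, decoder_dim=512), so a clean run does not completely rule PyTorch out, but a failure here would confirm it.

import torch
import torch.nn as nn

device = torch.device("cuda")
# Sizes are assumptions taken from the training config above.
emb = nn.Embedding(500, 512).to(device)
ids = torch.randint(0, 500, (8, 1000), device=device)  # arbitrary (batch, seq_len)
loss = emb(ids).sum()
loss.backward()  # exercises embedding_dense_backward_cuda
torch.cuda.synchronize()
print("embedding backward OK:", torch.__version__, torch.version.cuda)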

@yuekaizhang

I encountered the same issue as Desh with torch 1.7, k2 1.15, and CUDA 11.6. Updating to torch 1.11 and installing the latest k2 from source fixed it.
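
For reference, an upgrade along those lines would look roughly like this (wheel tags and the build flag are indicative rather than exact commands; pick the CUDA tag that matches your driver):

pip install torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone https://github.com/k2-fsa/k2.git
cd k2
export K2_MAKE_ARGS="-j6"
python3 setup.py install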
