Conformer RNN-T for Librispeech #1063

Closed
huangruizhe opened this issue May 16, 2023 · 3 comments

@huangruizhe
Contributor

Hello,

I was trying to run this model: #316

After downloading the model and checking out the corresponding git commit:

git clone [email protected]:csukuangfj/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19
git checkout fce7f3c

path_to_pretrained_asr_model="icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19"
./pruned_transducer_stateless2/decode.py \
  --bpe-model $path_to_pretrained_asr_model/data/lang_bpe_500/bpe.model \
  --exp-dir $path_to_pretrained_asr_model/exp/ \
  --epoch 999 \
  --avg 1 \
  --max-duration 100 \
  --decoding-method greedy_search

I got this error during decoding:

 ./pruned_transducer_stateless2/decode.py   --bpe-model $path_to_pretrained_asr_model/data/lang_bpe_500/bpe.model   --exp-dir $path_to_pretrained_asr_model/exp/   --epoch 999   --avg 1   --max-duration 100   --decoding-method greedy_search
2023-05-14 05:46:51,950 INFO [decode.py:699] Decoding started
2023-05-14 05:46:51,951 INFO [decode.py:705] Device: cpu
2023-05-14 05:46:51,955 INFO [decode.py:720] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '3b7f09fa35e72589914f67089c0da9f196a92ca4', 'k2-git-date': 'Mon May 8 14:58:45 2023', 'lhotse-version': '1.15.0.dev+git.6fcfced.clean', 'torch-version': '2.0.1+cu118', 'torch-cuda-available': False, 'torch-cuda-version': '11.8', 'python-version': '3.1', 'icefall-git-branch': 'HEAD', 'icefall-git-sha1': '36eacac-clean', 'icefall-git-date': 'Wed Aug 3 11:19:40 2022', 'icefall-path': '/fsx/users/huangruizhe/icefall', 'k2-path': '/fsx/users/huangruizhe/k2/k2/python/k2/__init__.py', 'lhotse-path': '/data/home/huangruizhe/miniconda3/envs/aligner/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'ip-10-200-79-136', 'IP address': '10.200.79.136'}, 'epoch': 999, 'iter': 0, 'avg': 1, 'exp_dir': PosixPath('/fsx/users/huangruizhe/audio_ruizhe/icefall_exp/conformer_rnnt/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19/exp'), 'bpe_model': '/fsx/users/huangruizhe/audio_ruizhe/icefall_exp/conformer_rnnt/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19/data/lang_bpe_500/bpe.model', 'lang_dir': PosixPath('data/lang_bpe_500'), 'decoding_method': 'greedy_search', 'beam_size': 4, 'beam': 20.0, 'ngram_lm_scale': 0.01, 'max_contexts': 8, 'max_states': 64, 'context_size': 2, 'max_sym_per_frame': 1, 'simulate_streaming': False, 'decode_chunk_size': 16, 'left_context': 64, 'num_paths': 200, 'nbest_scale': 0.5, 'dynamic_chunk_training': False, 'causal_convolution': False, 'short_chunk_size': 25, 'num_left_chunks': 4, 'full_libri': True, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 100, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'res_dir': PosixPath('/fsx/users/huangruizhe/audio_ruizhe/icefall_exp/conformer_rnnt/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19/exp/greedy_search'), 'suffix': 'epoch-999-avg-1-context-2-max-sym-per-frame-1', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2023-05-14 05:46:51,955 INFO [decode.py:722] About to create model
2023-05-14 05:46:52,321 INFO [checkpoint.py:112] Loading checkpoint from /fsx/users/huangruizhe/audio_ruizhe/icefall_exp/conformer_rnnt/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19/exp/epoch-999.pt
Traceback (most recent call last):
  File "/fsx/users/huangruizhe/icefall/egs/librispeech/ASR/./pruned_transducer_stateless2/decode.py", line 811, in <module>
    main()
  File "/data/home/huangruizhe/miniconda3/envs/aligner/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/fsx/users/huangruizhe/icefall/egs/librispeech/ASR/./pruned_transducer_stateless2/decode.py", line 743, in main
    load_checkpoint(f"{params.exp_dir}/epoch-{params.epoch}.pt", model)
  File "/fsx/users/huangruizhe/icefall/icefall/checkpoint.py", line 126, in load_checkpoint
    model.load_state_dict(checkpoint["model"], strict=strict)
  File "/data/home/huangruizhe/miniconda3/envs/aligner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transducer:
        size mismatch for encoder.encoder_embed.conv.0.weight: copying a param with shape torch.Size([512, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([8, 1, 3, 3]).
        size mismatch for encoder.encoder_embed.conv.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([8]).
        size mismatch for encoder.encoder_embed.out.weight: copying a param with shape torch.Size([512, 9728]) from checkpoint, the shape in current model is torch.Size([512, 2432]).

It seems the pretrained model does not match the model definition in the code, and the difference is quite large (512 vs. 8).
Could you suggest whether I have done something wrong? Thanks!

@desh2608
Collaborator

It seems you are using the decode script from the pruned_transducer_stateless2 recipe, instead of the transducer_stateless2 recipe. The former uses the k2 pruned RNNT loss, while the latter uses the torchaudio RNNT loss. I am not sure if there are differences in the encoder, but it might be worth trying the decode script from the correct recipe.
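For example, something along these lines may work (just a sketch; I am assuming transducer_stateless2/decode.py accepts the same flags as the pruned recipe's script, which I have not verified):

# Run from egs/librispeech/ASR in icefall; same flags as above, only the recipe directory changes.
path_to_pretrained_asr_model="icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19"
./transducer_stateless2/decode.py \
  --bpe-model $path_to_pretrained_asr_model/data/lang_bpe_500/bpe.model \
  --exp-dir $path_to_pretrained_asr_model/exp/ \
  --epoch 999 \
  --avg 1 \
  --max-duration 100 \
  --decoding-method greedy_search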

@csukuangfj
Collaborator

> It seems you are using the decode script from the pruned_transducer_stateless2 recipe, instead of the transducer_stateless2 recipe. The former uses the k2 pruned RNNT loss, while the latter uses the torchaudio RNNT loss. I am not sure if there are differences in the encoder, but it might be worth trying the decode script from the correct recipe.

Yes, I think you are right.

@huangruizhe
I suggest that you look at the RESULTS.md in the given PR to copy the decoding commands.
[Screenshot: decoding commands from RESULTS.md]

@huangruizhe
Contributor Author

Ah, sorry, that was a careless mistake on my part! The issue is fixed after using the correct recipe. Thanks!!
And thanks for the detailed pointers.
