Conformer RNN-T for Librispeech #1063

Closed
huangruizhe opened this issue May 16, 2023 · 3 comments

@huangruizhe
Contributor

Hello,

I was trying to run this model: #316

After downloading the model and checking out the corresponding git commit:

git clone [email protected]:csukuangfj/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19
git checkout fce7f3c

path_to_pretrained_asr_model="icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19"
./pruned_transducer_stateless2/decode.py \
  --bpe-model $path_to_pretrained_asr_model/data/lang_bpe_500/bpe.model \
  --exp-dir $path_to_pretrained_asr_model/exp/ \
  --epoch 999 \
  --avg 1 \
  --max-duration 100 \
  --decoding-method greedy_search

I got this error during decoding:

 ./pruned_transducer_stateless2/decode.py   --bpe-model $path_to_pretrained_asr_model/data/lang_bpe_500/bpe.model   --exp-dir $path_to_pretrained_asr_model/exp/   --epoch 999   --avg 1   --max-duration 100   --decoding-method greedy_search
2023-05-14 05:46:51,950 INFO [decode.py:699] Decoding started
2023-05-14 05:46:51,951 INFO [decode.py:705] Device: cpu
2023-05-14 05:46:51,955 INFO [decode.py:720] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '3b7f09fa35e72589914f67089c0da9f196a92ca4', 'k2-git-date': 'Mon May 8 14:58:45 2023', 'lhotse-version': '1.15.0.dev+git.6fcfced.clean', 'torch-version': '2.0.1+cu118', 'torch-cuda-available': False, 'torch-cuda-version': '11.8', 'python-version': '3.1', 'icefall-git-branch': 'HEAD', 'icefall-git-sha1': '36eacac-clean', 'icefall-git-date': 'Wed Aug 3 11:19:40 2022', 'icefall-path': '/fsx/users/huangruizhe/icefall', 'k2-path': '/fsx/users/huangruizhe/k2/k2/python/k2/__init__.py', 'lhotse-path': '/data/home/huangruizhe/miniconda3/envs/aligner/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'ip-10-200-79-136', 'IP address': '10.200.79.136'}, 'epoch': 999, 'iter': 0, 'avg': 1, 'exp_dir': PosixPath('/fsx/users/huangruizhe/audio_ruizhe/icefall_exp/conformer_rnnt/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19/exp'), 'bpe_model': '/fsx/users/huangruizhe/audio_ruizhe/icefall_exp/conformer_rnnt/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19/data/lang_bpe_500/bpe.model', 'lang_dir': PosixPath('data/lang_bpe_500'), 'decoding_method': 'greedy_search', 'beam_size': 4, 'beam': 20.0, 'ngram_lm_scale': 0.01, 'max_contexts': 8, 'max_states': 64, 'context_size': 2, 'max_sym_per_frame': 1, 'simulate_streaming': False, 'decode_chunk_size': 16, 'left_context': 64, 'num_paths': 200, 'nbest_scale': 0.5, 'dynamic_chunk_training': False, 'causal_convolution': False, 'short_chunk_size': 25, 'num_left_chunks': 4, 'full_libri': True, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 100, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'res_dir': PosixPath('/fsx/users/huangruizhe/audio_ruizhe/icefall_exp/conformer_rnnt/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19/exp/greedy_search'), 'suffix': 'epoch-999-avg-1-context-2-max-sym-per-frame-1', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2023-05-14 05:46:51,955 INFO [decode.py:722] About to create model
2023-05-14 05:46:52,321 INFO [checkpoint.py:112] Loading checkpoint from /fsx/users/huangruizhe/audio_ruizhe/icefall_exp/conformer_rnnt/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19/exp/epoch-999.pt
Traceback (most recent call last):
  File "/fsx/users/huangruizhe/icefall/egs/librispeech/ASR/./pruned_transducer_stateless2/decode.py", line 811, in <module>
    main()
  File "/data/home/huangruizhe/miniconda3/envs/aligner/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/fsx/users/huangruizhe/icefall/egs/librispeech/ASR/./pruned_transducer_stateless2/decode.py", line 743, in main
    load_checkpoint(f"{params.exp_dir}/epoch-{params.epoch}.pt", model)
  File "/fsx/users/huangruizhe/icefall/icefall/checkpoint.py", line 126, in load_checkpoint
    model.load_state_dict(checkpoint["model"], strict=strict)
  File "/data/home/huangruizhe/miniconda3/envs/aligner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transducer:
        size mismatch for encoder.encoder_embed.conv.0.weight: copying a param with shape torch.Size([512, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([8, 1, 3, 3]).
        size mismatch for encoder.encoder_embed.conv.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([8]).
        size mismatch for encoder.encoder_embed.out.weight: copying a param with shape torch.Size([512, 9728]) from checkpoint, the shape in current model is torch.Size([512, 2432]).

It seems the pretrained model does not match the model definition in the code, and the difference is quite large (512 vs. 8).
Could you suggest whether I have done something wrong? Thanks!

@desh2608
Collaborator

It seems you are using the decode script from the pruned_transducer_stateless2 recipe, instead of the transducer_stateless2 recipe. The former uses the k2 pruned RNNT loss, while the latter uses the torchaudio RNNT loss. I am not sure if there are differences in the encoder, but it might be worth trying the decode script from the correct recipe.
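For example, something along these lines may work (just a sketch; I am assuming transducer_stateless2/decode.py accepts the same flags as the pruned recipe's script, which I have not verified):

# Run from egs/librispeech/ASR in icefall; same flags as above, only the recipe directory changes.
path_to_pretrained_asr_model="icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19"
./transducer_stateless2/decode.py \
  --bpe-model $path_to_pretrained_asr_model/data/lang_bpe_500/bpe.model \
  --exp-dir $path_to_pretrained_asr_model/exp/ \
  --epoch 999 \
  --avg 1 \
  --max-duration 100 \
  --decoding-method greedy_search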

@csukuangfj
Collaborator

> It seems you are using the decode script from the pruned_transducer_stateless2 recipe, instead of the transducer_stateless2 recipe. The former uses the k2 pruned RNNT loss, while the latter uses the torchaudio RNNT loss. I am not sure if there are differences in the encoder, but it might be worth trying the decode script from the correct recipe.

Yes, I think you are right.

@huangruizhe
I suggest that you look at the RESULTS.md in the given PR to copy the decoding commands.
[Screenshot: decoding commands from RESULTS.md]

@huangruizhe
Contributor Author

Ah, sorry, that was a careless mistake on my part! The issue is fixed after using the correct recipe. Thanks!!
And thanks for the detailed pointers.
