
Decoding was hung #875

Closed
jiangj-dc opened this issue Nov 16, 2021 · 18 comments
@jiangj-dc

k2 commit: 86e5479
Icefall commit: d54828e73a620ecd6a87b801860e4fa71643f01d
Experiment: icefall/egs/librispeech/ASR

Training was done using the following command:
python3 conformer_ctc/train.py --world-size 1 --max-duration 50

Decoding was carried out with:
python3 conformer_ctc/decode.py --epoch 34 --avg 1 --max-duration 100 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_5000 --method ctc-decoding

The decoding hung.
The process hung in k2/csrc/intersect_dense_pruned.cu at:
if (state_map_.NumKeyBits() == 32) {
frames_.push_back(PropagateForward<32>(t, frames_.back().get()));
}
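
For reference, the hedged sketch below shows a minimal Python-level call that exercises this kernel through k2's Python wrapper; the shapes, beam values, and the random "network output" are illustrative only, not what conformer_ctc/decode.py actually uses.

    import torch
    import k2

    # Hypothetical small setup: 500-token vocab, 2 utterances, 100 frames each.
    vocab_size = 500
    N, T = 2, 100

    # Decoding graph H: the CTC topology, arc-sorted before intersection.
    H = k2.arc_sort(k2.ctc_topo(max_token=vocab_size - 1, modified=False))

    # Random log-probabilities standing in for the acoustic model output.
    log_probs = torch.randn(N, T, vocab_size).log_softmax(dim=-1)
    supervision_segments = torch.tensor(
        [[0, 0, T], [1, 0, T]], dtype=torch.int32
    )
    dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

    # This call dispatches into k2/csrc/intersect_dense_pruned.cu.
    lattice = k2.intersect_dense_pruned(
        H,
        dense_fsa_vec,
        search_beam=20.0,
        output_beam=8.0,
        min_active_states=30,
        max_active_states=10000,
    )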

I did a few tests to verify that k2 and icefall were working fine:

  1. python3 k2/python/tests/intersect_dense_pruned_test.py
  2. Downloaded the pre-trained model and ran decoding with it; that worked fine.

When I used conformer_ctc/pretrained.py to decode with the trained model, it ran without hanging but produced empty results for icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09/test_wavs/1089-134686-0001.wav.

Then I pulled the latest code as of 11/16/2021 and trained with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0
Decoding with epoch 0 did not hang, but the WER was 98.61.

@csukuangfj
Collaborator

Training was done using the following command:
python3 conformer_ctc/train.py --world-size 1 --max-duration 50

Decoding was carried out with:
python3 conformer_ctc/decode.py --epoch 34 --avg 1 --max-duration 100 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_5000 --method ctc-decoding

I think that is a known issue. To use ctc-decoding for a model with a vocab size of 5000, you have to use the modified CTC topo, i.e., change
https://github.com/k2-fsa/icefall/blob/68506609ad1b36a3a0faeb142d3ff54f0e3608d9/egs/librispeech/ASR/conformer_ctc/decode.py#L573-L577

to

        H = k2.ctc_topo(
            max_token=max_token_id,
            modified=True,
            device=device,
        )

The comment at k2-fsa/icefall#70 (comment) says that if you don't use the modified CTC topo but reduce --max-duration to 5, it also works.

(Note: the icefall documentation uses a model with vocab size 500 and --max-duration 300 for CTC decoding.)
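
Some background on why the topology matters: the standard CTC topology produced by k2.ctc_topo grows roughly quadratically in the number of arcs with the vocabulary size, while the modified topology grows roughly linearly, so with a 5000-token vocabulary the standard H becomes very large during intersection. A hedged sketch you can run to compare the two (the exact arc counts depend on your k2 version):

    import k2

    max_token_id = 4999  # highest BPE token id for a 5000-token vocab (0 is blank)

    H_standard = k2.ctc_topo(max_token=max_token_id, modified=False)
    H_modified = k2.ctc_topo(max_token=max_token_id, modified=True)

    # The standard topology has on the order of max_token_id**2 arcs,
    # the modified one on the order of max_token_id arcs, which is why
    # decoding with a large vocab and the standard topo can exhaust memory
    # or crawl inside intersect_dense_pruned.
    print("standard topo arcs:", H_standard.num_arcs)
    print("modified topo arcs:", H_modified.num_arcs)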

@csukuangfj
Collaborator

When I used conformer_ctc/pretrained.py to decode with the trained model, it ran without hanging but had empty results for icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09/test_wavs/1089-134686-0001.wav.

Does it also produce empty results for the other two test sound files and with other decoding methods?

@csukuangfj
Collaborator

Then I pulled latest code as of 11/16/2021 and trained with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0
For decoding with epoch 0, it wasn't hanging but with a WER of 98.61.

I need to test it locally. But from past experience, the WER is usually very high at epoch 0 with only the 100-hour subset of the training data.

@csukuangfj
Collaborator

Then I pulled latest code as of 11/16/2021 and trained with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0
For decoding with epoch 0, it wasn't hanging but with a WER of 98.61.

I just tested it locally. After epoch 0, the model has still not converged. Its CTC loss is still quite high, i.e., around 1.0, and its attention loss is also high, around 0.8.

If you train for more epochs, I believe the WER will become better.

@jiangj-dc
Author

I re-started training with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0

(1) decoding with
python3 ./conformer_ctc/decode.py --epoch ${EPOCH} --avg 1 --max-duration 50 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_500 --method ctc-decoding
epoch 0: WER of 97.62%
epoch 3: WER of 100%

(2) decoding with conformer_ctc/pretrained.py for test_wavs/1089-134686-0001.wav
epoch 0: THE
epoch 3: [empty]

Python 3.8.11
k2-1.10.dev20211116+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg
cudnn: 8.1.1

@csukuangfj
Collaborator

Could you show us the training log, i.e., the tensorboard log? I suspect that the model has not converged yet.

@jiangj-dc
Author

tensorboard

@pkufool
Collaborator

pkufool commented Nov 19, 2021

I think your model has not converged yet; the tot_ctc_loss is expected to be around 0.02, and your loss value is too high. Also, you have only trained for 120k steps; please train for more epochs.

@csukuangfj
Collaborator

@jiangj-apptek
As you are using only 1 GPU for training, please modify the lr_factor setting at
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/train.py#L212

            "lr_factor": 5.0,

You can use a smaller value for lr_factor, e.g., 0.8 or 1.0.
(If you don't, it won't converge even after 20 epochs)
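
For intuition, here is a hedged sketch of a Noam-style learning-rate schedule of the kind this recipe uses; the formula and warm-up value are assumptions based on the original Transformer schedule, not copied from train.py. lr_factor scales the whole curve, so with a single GPU (smaller effective batch) a factor of 5.0 gives a peak learning rate that is too aggressive.

    # A minimal sketch, assuming a Noam-style schedule; names are illustrative.
    def noam_lr(step: int, model_size: int = 512, lr_factor: float = 1.0,
                warmup: int = 80000) -> float:
        step = max(step, 1)
        return (lr_factor
                * model_size ** (-0.5)
                * min(step ** (-0.5), step * warmup ** (-1.5)))

    # lr_factor simply rescales the schedule; compare around the warm-up peak:
    print(noam_lr(80000, lr_factor=5.0))  # 5x larger than ...
    print(noam_lr(80000, lr_factor=1.0))  # ... the single-GPU-friendly value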


I just tested with the following training command (after setting lr_factor to 1.0)

./conformer_ctc/train.py \
  --exp-dir ./conformer_ctc/exp \
  --full-libri 0 \
  --world-size 1 \
  --max-duration 200 \
  --start-epoch 0 \
  --num-epochs 10

Its tensorboard log is at
https://tensorboard.dev/experiment/PQ9XVnNFQ2S2acMP6A05Zg/#scalars&_smoothingWeight=0

You can see that it starts to converge.

The WER using CTC decoding with --epoch 1 --avg 1 is

test-clean: 83.75
test-other: 87.0

[Two screenshots attached]

@jiangj-dc
Author

Modifying lr_factor DOES make a lot of sense because only one GPU is used here. I will try that and do more epochs. Thanks!

@jiangj-dc
Author

With lr_factor = 0.8, at epoch 11 I have:
ctc-decoding 12.59 best for test-clean
ctc-decoding 30.34 best for test-other
Thanks @csukuangfj!

@danpovey
Collaborator

Hm, those WERs still seem a bit high to me. I guess we'll see how they improve.
It's possible that the learning rate is too low now. I would have tried 1.5 or 2.0.

@csukuangfj
Collaborator

Hm, those WERs still seem a bit high to me.

It was trained for only 12 epochs with a 100-hour subset. Also, I am not sure whether model averaging was used.
I think the WER will continue to decrease if it is trained for more epochs.
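
For completeness, --avg N in decode.py averages the parameters of the last N epoch checkpoints (icefall ships a helper for this, average_checkpoints in icefall/checkpoint.py). The sketch below is a simplified, from-scratch approximation and assumes each epoch-N.pt stores the model weights under the key "model", as the icefall recipes do.

    import torch

    # Hedged sketch of checkpoint averaging; not the exact icefall helper.
    def average_checkpoints(filenames):
        n = len(filenames)
        avg = torch.load(filenames[0], map_location="cpu")["model"]
        for name in filenames[1:]:
            state = torch.load(name, map_location="cpu")["model"]
            for k in avg:
                avg[k] += state[k]
        for k in avg:
            if avg[k].is_floating_point():
                avg[k] /= n
            else:
                avg[k] //= n
        return avg

    # e.g., --epoch 40 --avg 20 would average epoch-21.pt through epoch-40.pt
    filenames = [f"conformer_ctc/exp/epoch-{i}.pt" for i in range(21, 41)]
    averaged_state_dict = average_checkpoints(filenames)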

@jiangj-dc
Author

python3 conformer_ctc/decode.py --epoch 40 --avg 20 --max-duration 50 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_500 --method ctc-decoding
ctc-decoding 8.55 best for test-clean
ctc-decoding 22.68 best for test-other

Will do lr_factor = 2.0.

@danpovey
Collaborator

danpovey commented Nov 23, 2021 via email

@jiangj-dc
Author

lr_factor = 1.5, d_model = 256 ("attention_dim")

python3 ./conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0

python3 conformer_ctc/decode.py --epoch 11 --avg 1 --max-duration 50 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_500 --method ctc-decoding

ctc-decoding 13.56 best for test-clean
ctc-decoding 31.03 best for test-other

@danpovey
Collaborator

OK, it looks like the model is not doing that great with so little data. We have definitely tuned the setup for more data.
How do the train and valid loss values compare?
(Note: for valid we use test mode, which should boost the loss, but it is also unseen data...)

@jiangj-dc
Author

Agreed. I used the 100-hour set as a sanity check and it passed.

[screenshot attached]
