Support CTC/AED option for Zipformer recipe (#1389)
* add attention-decoder loss option for zipformer recipe

* add attention-decoder-rescoring

* update export.py and pretrained_ctc.py

* update RESULTS.md
yaozengwei authored Jul 5, 2024
1 parent cbcac23 commit f76afff
Showing 9 changed files with 1,221 additions and 21 deletions.
egs/librispeech/ASR/RESULTS.md: 70 additions, 0 deletions
@@ -1,5 +1,75 @@
## Results

### zipformer (zipformer + CTC/AED)

See <https://github.com/k2-fsa/icefall/pull/1389> for more details.

[zipformer](./zipformer)

#### Non-streaming

##### large-scale model, number of model parameters: 174,319,650, i.e., 174.3 M

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-large-ctc-attention-decoder-2024-05-26>
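
If you want to fetch that Hugging Face repository programmatically (not part of the recipe itself), one option is `huggingface_hub`; the snippet below is a hedged convenience example, and `local_dir` is just an illustrative variable name:

```python
from huggingface_hub import snapshot_download

# Download the whole model repository (checkpoints, logs, decoding results)
# into the local Hugging Face cache and return its path.
local_dir = snapshot_download(
    repo_id="Zengwei/icefall-asr-librispeech-zipformer-large-ctc-attention-decoder-2024-05-26"
)
print(local_dir)
```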

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

Results of the CTC head:

| decoding method                       | test-clean (WER %) | test-other (WER %) | comment             |
|---------------------------------------|--------------------|--------------------|---------------------|
| ctc-decoding                          | 2.29               | 5.14               | --epoch 50 --avg 29 |
| attention-decoder-rescoring-no-ngram  | 2.1                | 4.57               | --epoch 50 --avg 29 |

The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
# For non-streaming model training:
./zipformer/train.py \
--world-size 4 \
--num-epochs 50 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-large \
--full-libri 1 \
--use-ctc 1 \
--use-transducer 0 \
--use-attention-decoder 1 \
--ctc-loss-scale 0.1 \
--attention-decoder-loss-scale 0.9 \
--num-encoder-layers 2,2,4,5,4,2 \
--feedforward-dim 512,768,1536,2048,1536,768 \
--encoder-dim 192,256,512,768,512,256 \
--encoder-unmasked-dim 192,192,256,320,256,192 \
--max-duration 1200 \
--master-port 12345
```
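
For orientation, a minimal sketch of the training objective these flags imply: with `--use-transducer 0`, only the CTC and attention-decoder terms remain, weighted by `--ctc-loss-scale` and `--attention-decoder-loss-scale`. The function below is illustrative, not the actual code in `zipformer/train.py`:

```python
import torch

def total_loss(
    ctc_loss: torch.Tensor,
    attention_decoder_loss: torch.Tensor,
    ctc_loss_scale: float = 0.1,
    attention_decoder_loss_scale: float = 0.9,
) -> torch.Tensor:
    # Weighted sum matching --ctc-loss-scale 0.1 and
    # --attention-decoder-loss-scale 0.9 in the command above.
    return (
        ctc_loss_scale * ctc_loss
        + attention_decoder_loss_scale * attention_decoder_loss
    )
```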

The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in ctc-decoding attention-decoder-rescoring-no-ngram; do
./zipformer/ctc_decode.py \
--epoch 50 \
--avg 29 \
--exp-dir zipformer/exp-large \
--use-ctc 1 \
--use-transducer 0 \
--use-attention-decoder 1 \
--attention-decoder-loss-scale 0.9 \
--num-encoder-layers 2,2,4,5,4,2 \
--feedforward-dim 512,768,1536,2048,1536,768 \
--encoder-dim 192,256,512,768,512,256 \
--encoder-unmasked-dim 192,192,256,320,256,192 \
--max-duration 100 \
--causal 0 \
--num-paths 100 \
--decoding-method $m
done
```
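
Conceptually, `attention-decoder-rescoring-no-ngram` extracts up to `--num-paths` hypotheses from the CTC lattice and lets the attention decoder re-rank them. Below is a simplified sketch of that re-ranking step; the hypothesis tuples and `attn_score_fn` are placeholders, and the real implementation in `zipformer/ctc_decode.py` works on k2 lattices instead:

```python
from typing import Callable, List, Tuple

def rescore_nbest(
    hypotheses: List[Tuple[List[int], float]],
    attn_score_fn: Callable[[List[int]], float],
) -> List[int]:
    """Return the token sequence with the best combined score.

    Each hypothesis is a (token_ids, ctc_log_prob) pair from n-best
    extraction over the CTC lattice (cf. --num-paths 100);
    attn_score_fn stands in for the attention decoder's sequence
    log-probability.
    """
    best_tokens, _ = max(
        hypotheses, key=lambda h: h[1] + attn_score_fn(h[0])
    )
    return best_tokens
```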


### zipformer (zipformer + pruned stateless transducer + CTC)

See <https://github.com/k2-fsa/icefall/pull/1111> for more details.