Add MMI training with word pieces as modelling unit. (#6)

* Fix an error in TDNN-LSTM training.
* WIP: Refactoring
* Refactor transformer.py
* Remove unused code.
* Minor fixes.
* Fix decoder padding mask.
* Add MMI training with word pieces.
* Remove unused files.
* Minor fixes.
* Refactoring.
* Minor fixes.
* Use pre-computed alignments in LF-MMI training.
* Minor fixes.
* Update decoding script.
* Add doc about how to check and use extracted alignments.
* Fix style issues.
* Fix typos.
* Fix style issues.
* Disable macOS tests for now.
k2-fsa · Oct 18, 2021 · 53b79fa
1 parent 4890e27
commit 53b79fa

Showing 30 changed files with 6,896 additions and 149 deletions.
diff --git a/.flake8 b/.flake8
@@ -4,8 +4,9 @@ statistics=true
 max-line-length = 80
 per-file-ignores =
     # line too long
-    egs/librispeech/ASR/conformer_ctc/conformer.py: E501,
+    egs/librispeech/ASR/*/conformer.py: E501,
 
 exclude =
   .git,
-  **/data/**
+  **/data/**,
+  icefall/shared/make_kn_lm.py
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -29,7 +29,9 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        os: [ubuntu-18.04, macos-10.15]
+        # os: [ubuntu-18.04, macos-10.15]
+        # disable macOS test for now.
+        os: [ubuntu-18.04]
         python-version: [3.6, 3.7, 3.8, 3.9]
         torch: ["1.8.1"]
         k2-version: ["1.9.dev20210919"]

diff --git a/egs/librispeech/ASR/conformer_ctc/README.md b/egs/librispeech/ASR/conformer_ctc/README.md
@@ -27,10 +27,10 @@ avg=15
   --bucketing-sampler 0 \
   --full-libri 1 \
   --exp-dir conformer_ctc/exp \
-  --lang-dir data/lang_bpe_5000 \
-  --ali-dir data/ali_5000
+  --lang-dir data/lang_bpe_500 \
+  --ali-dir data/ali_500
 ```
-and  you will get four files inside the folder `data/ali_5000`:
+and  you will get four files inside the folder `data/ali_500`:
 
 ```
 $ ls -lh data/ali_500
@@ -51,3 +51,27 @@ in `conformer_ctc/train.py`.
 Search `./conformer_ctc/asr_datamodule.py` for `preserve_id`.
 
 **TODO:** Add doc about how to use the extracted alignment in the other pull-request.
+
+### Step 3: Check your extracted alignments
+
+There is a file `test_ali.py` in `icefall/test` that can be used to test your
+alignments. It uses pre-computed alignments to modify a randomly generated
+`nnet_output` and it checks that we can decode the correct transcripts
+from the resulting `nnet_output`.
+
+You should get something like the following if you run that script:
+
+```
+$ ./test/test_ali.py
+['THE GOOD NATURED AUDIENCE IN PITY TO FALLEN MAJESTY SHOWED FOR ONCE GREATER DEFERENCE TO THE KING THAN TO THE MINISTER AND SUNG THE PSALM WHICH THE FORMER HAD CALLED FOR', 'THE OLD SERVANT TOLD HIM QUIETLY AS THEY CREPT BACK TO DWELL THAT THIS PASSAGE THAT LED FROM THE HUT IN THE PLEASANCE TO SHERWOOD AND THAT GEOFFREY FOR THE TIME WAS HIDING WITH THE OUTLAWS IN THE FOREST', 'FOR A WHILE SHE LAY IN HER CHAIR IN HAPPY DREAMY PLEASURE AT SUN AND BIRD AND TREE', "BUT THE ESSENCE OF LUTHER'S LECTURES IS THERE"]
+['THE GOOD NATURED AUDIENCE IN PITY TO FALLEN MAJESTY SHOWED FOR ONCE GREATER DEFERENCE TO THE KING THAN TO THE MINISTER AND SUNG THE PSALM WHICH THE FORMER HAD CALLED FOR', 'THE OLD SERVANT TOLD HIM QUIETLY AS THEY CREPT BACK TO GAMEWELL THAT THIS PASSAGE WAY LED FROM THE HUT IN THE PLEASANCE TO SHERWOOD AND THAT GEOFFREY FOR THE TIME WAS HIDING WITH THE OUTLAWS IN THE FOREST', 'FOR A WHILE SHE LAY IN HER CHAIR IN HAPPY DREAMY PLEASURE AT SUN AND BIRD AND TREE', "BUT THE ESSENCE OF LUTHER'S LECTURES IS THERE"]
+```
+
+### Step 4: Use your alignments in training
+
+Please refer to `conformer_mmi/train.py` for usage. Some useful
+functions are:
+
+- `load_alignments()`: loads alignments saved by `conformer_ctc/ali.py`
+- `convert_alignments_to_tensor()`: converts alignments to PyTorch tensors
+- `lookup_alignments()`: returns the alignments of utterances, given their cut IDs
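
Editorial note (not part of the commit): the check described in Step 3 can be illustrated with a small, dependency-free sketch. It assumes an alignment is a list of per-frame token IDs keyed by cut ID and that token ID 0 is the CTC blank. All function names here (`lookup_alignment`, `boost_aligned_tokens`, `collapse_alignment`, `greedy_ctc_decode`) are hypothetical and are not the helpers in `conformer_mmi/train.py` or `test_ali.py`. Greedy CTC decoding over plain Python lists is swapped in for whatever decoding the real script performs.

```python
import random

BLANK = 0  # assumption: token ID 0 is the CTC blank symbol


def lookup_alignment(alignments, cut_id):
    """Hypothetical helper: return the per-frame token IDs for one cut ID."""
    return alignments[cut_id]


def boost_aligned_tokens(ali, vocab_size, rng):
    """Create random per-frame scores, then make the aligned token win.

    Random scores lie in [-1, 0]; the aligned token gets 1.0, so it is
    always the per-frame maximum.
    """
    scores = [[rng.uniform(-1.0, 0.0) for _ in range(vocab_size)] for _ in ali]
    for frame_scores, token in zip(scores, ali):
        frame_scores[token] = 1.0
    return scores


def collapse_alignment(path):
    """Merge consecutive repeats, then drop blanks (standard CTC collapse)."""
    tokens = []
    prev = None
    for token in path:
        if token != prev and token != BLANK:
            tokens.append(token)
        prev = token
    return tokens


def greedy_ctc_decode(scores):
    """Take the best token in every frame and collapse the resulting path."""
    best_path = [max(range(len(f)), key=f.__getitem__) for f in scores]
    return collapse_alignment(best_path)


# Toy data: one utterance whose alignment spells the token sequence 3, 3, 5.
alignments = {"cut-1": [0, 3, 3, 0, 3, 5, 5, 0]}
ali = lookup_alignment(alignments, "cut-1")
nnet_output = boost_aligned_tokens(ali, vocab_size=6, rng=random.Random(0))

# The decoded tokens must match the tokens implied by the alignment.
assert greedy_ctc_decode(nnet_output) == collapse_alignment(ali)
```

If the assertion fails on real data, the alignment and the transcript disagree, which is the situation the Step 3 check is meant to catch.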
diff --git a/egs/librispeech/ASR/conformer_ctc/train.py b/egs/librispeech/ASR/conformer_ctc/train.py
@@ -129,7 +129,7 @@ def get_params() -> AttributeDict:
     """Return a dict containing training parameters.
 
     All training related parameters that are not passed from the commandline
-    is saved in the variable `params`.
+    are saved in the variable `params`.
 
     Commandline options are merged into `params` after they are parsed, so
     you can also access them via `params`.

diff --git a/egs/librispeech/ASR/conformer_mmi/__init__.py b/egs/librispeech/ASR/conformer_mmi/__init__.py