Add MMI training with word pieces as modelling unit. #6

Merged: 23 commits, Oct 18, 2021
5 changes: 3 additions & 2 deletions .flake8
@@ -4,8 +4,9 @@ statistics=true
max-line-length = 80
per-file-ignores =
# line too long
egs/librispeech/ASR/conformer_ctc/conformer.py: E501,
egs/librispeech/ASR/*/conformer.py: E501,

exclude =
.git,
**/data/**
**/data/**,
icefall/shared/make_kn_lm.py
4 changes: 3 additions & 1 deletion .github/workflows/test.yml
@@ -29,7 +29,9 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-18.04, macos-10.15]
# os: [ubuntu-18.04, macos-10.15]
# disable macOS test for now.
os: [ubuntu-18.04]
python-version: [3.6, 3.7, 3.8, 3.9]
torch: ["1.8.1"]
k2-version: ["1.9.dev20210919"]
30 changes: 27 additions & 3 deletions egs/librispeech/ASR/conformer_ctc/README.md
@@ -27,10 +27,10 @@ avg=15
--bucketing-sampler 0 \
--full-libri 1 \
--exp-dir conformer_ctc/exp \
--lang-dir data/lang_bpe_5000 \
--ali-dir data/ali_5000
--lang-dir data/lang_bpe_500 \
--ali-dir data/ali_500
```
and you will get four files inside the folder `data/ali_5000`:
and you will get four files inside the folder `data/ali_500`:

```
$ ls -lh data/ali_500
@@ -51,3 +51,27 @@ in `conformer_ctc/train.py`.
Search `./conformer_ctc/asr_datamodule.py` for `preserve_id`.

**TODO:** Add doc about how to use the extracted alignment in the other pull-request.

### Step 3: Check your extracted alignments

There is a file `test_ali.py` in `icefall/test` that can be used to test your
alignments. It uses pre-computed alignments to modify a randomly generated
`nnet_output` and it checks that we can decode the correct transcripts
from the resulting `nnet_output`.

You should get something like the following if you run that script:

```
$ ./test/test_ali.py
['THE GOOD NATURED AUDIENCE IN PITY TO FALLEN MAJESTY SHOWED FOR ONCE GREATER DEFERENCE TO THE KING THAN TO THE MINISTER AND SUNG THE PSALM WHICH THE FORMER HAD CALLED FOR', 'THE OLD SERVANT TOLD HIM QUIETLY AS THEY CREPT BACK TO DWELL THAT THIS PASSAGE THAT LED FROM THE HUT IN THE PLEASANCE TO SHERWOOD AND THAT GEOFFREY FOR THE TIME WAS HIDING WITH THE OUTLAWS IN THE FOREST', 'FOR A WHILE SHE LAY IN HER CHAIR IN HAPPY DREAMY PLEASURE AT SUN AND BIRD AND TREE', "BUT THE ESSENCE OF LUTHER'S LECTURES IS THERE"]
['THE GOOD NATURED AUDIENCE IN PITY TO FALLEN MAJESTY SHOWED FOR ONCE GREATER DEFERENCE TO THE KING THAN TO THE MINISTER AND SUNG THE PSALM WHICH THE FORMER HAD CALLED FOR', 'THE OLD SERVANT TOLD HIM QUIETLY AS THEY CREPT BACK TO GAMEWELL THAT THIS PASSAGE WAY LED FROM THE HUT IN THE PLEASANCE TO SHERWOOD AND THAT GEOFFREY FOR THE TIME WAS HIDING WITH THE OUTLAWS IN THE FOREST', 'FOR A WHILE SHE LAY IN HER CHAIR IN HAPPY DREAMY PLEASURE AT SUN AND BIRD AND TREE', "BUT THE ESSENCE OF LUTHER'S LECTURES IS THERE"]
```
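
The idea behind this check can be sketched in a few lines of plain Python. This is only an illustration, not the code in `test_ali.py`: it assumes an alignment is a list with one token ID per frame, that token ID 0 is the CTC blank, and it uses greedy decoding as a stand-in for whatever decoding the real test performs. The helper names `force_alignment` and `greedy_ctc_decode` are hypothetical.

```python
import random

BLANK = 0  # assumption: token ID 0 is the CTC blank symbol


def force_alignment(nnet_output, alignment, boost=10.0):
    """Make the aligned token the highest-scoring one at every frame."""
    for frame_scores, token in zip(nnet_output, alignment):
        frame_scores[token] = max(frame_scores) + boost
    return nnet_output


def greedy_ctc_decode(nnet_output):
    """Pick the best token per frame, merge repeats, then drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in nnet_output]
    tokens = []
    prev = None
    for t in best:
        if t != prev and t != BLANK:
            tokens.append(t)
        prev = t
    return tokens


random.seed(0)
vocab_size = 6
# A made-up frame-level alignment: one token ID per frame.
alignment = [0, 3, 3, 0, 3, 5, 5, 0, 2]
# Randomly generated scores with shape (num_frames, vocab_size).
nnet_output = [[random.random() for _ in range(vocab_size)] for _ in alignment]

decoded = greedy_ctc_decode(force_alignment(nnet_output, alignment))
print(decoded)  # [3, 3, 5, 2]
```

If the alignment is correct, the decoded token sequence matches the token sequence of the reference transcript, which is the property the test verifies.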

### Step 4: Use your alignments in training

Please refer to `conformer_mmi/train.py` for usage. Some useful
functions are:

- `load_alignments()` loads the alignments saved by `conformer_ctc/ali.py`.
- `convert_alignments_to_tensor()` converts the alignments to PyTorch tensors.
- `lookup_alignments()` returns the alignments of the utterances with the given cut IDs.
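
The lookup-by-cut-ID pattern described above can be sketched as follows. This is a dependency-free illustration that uses plain lists instead of PyTorch tensors. The names `to_padded_rows`, `lookup`, and `IGNORE_ID`, as well as the padding value, are assumptions made for this example and are not the signatures of the icefall functions.

```python
from typing import Dict, List, Tuple

IGNORE_ID = -100  # hypothetical padding value for frames past the end


def to_padded_rows(
    alis: Dict[str, List[int]],
) -> Tuple[Dict[str, int], List[List[int]]]:
    """Pad all alignments to a common length and index them by cut ID."""
    max_len = max(len(a) for a in alis.values())
    cut_ids = sorted(alis)
    index = {cut_id: i for i, cut_id in enumerate(cut_ids)}
    rows = [
        alis[cut_id] + [IGNORE_ID] * (max_len - len(alis[cut_id]))
        for cut_id in cut_ids
    ]
    return index, rows


def lookup(
    index: Dict[str, int],
    rows: List[List[int]],
    cut_ids: List[str],
    num_frames: int,
) -> List[List[int]]:
    """Return the alignments for a batch, truncated to num_frames."""
    return [rows[index[cut_id]][:num_frames] for cut_id in cut_ids]


# Made-up alignments keyed by cut ID.
alis = {"cut-b": [0, 7, 7, 0], "cut-a": [0, 4, 0, 0, 9, 9]}
index, rows = to_padded_rows(alis)
batch = lookup(index, rows, ["cut-b", "cut-a"], num_frames=5)
print(batch)  # [[0, 7, 7, 0, -100], [0, 4, 0, 0, 9]]
```

Keeping a cut-ID-to-row index means a training batch can fetch its alignments in whatever order the sampler produces the cuts.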
2 changes: 1 addition & 1 deletion egs/librispeech/ASR/conformer_ctc/train.py
@@ -129,7 +129,7 @@ def get_params() -> AttributeDict:
"""Return a dict containing training parameters.

All training related parameters that are not passed from the commandline
is saved in the variable `params`.
are saved in the variable `params`.

Commandline options are merged into `params` after they are parsed, so
you can also access them via `params`.
Empty file.