Support computing nbest oracle WER. #10
Conversation
@@ -56,6 +57,15 @@ def get_parser():
        "consecutive checkpoints before the checkpoint specified by "
        "'--epoch'. ",
    )

    parser.add_argument(
        "--scale",
If this scale is only used for the nbest-oracle mode, perhaps that should be clarified, e.g. via the name and the documentation? Right now it's a bit unclear whether this would affect other things.
I think it is also useful for other n-best rescoring methods, e.g., attention-decoder rescoring. Tuning this value can change the number of unique paths in an n-best list, which can potentially affect the final WER. I'm adding more documentation to clarify its usage.
A screenshot (not reproduced here) shows the nbest oracle WER with different scale values on the LibriSpeech test sets.
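The intuition behind tuning the scale can be sketched without k2: scaling lattice scores before sampling paths acts like a softmax temperature, so a smaller scale flattens the distribution and more unique paths show up in the n-best list. Below is a toy, pure-Python illustration; the sampling scheme, helper name, and score values are made up for the example and are not k2's actual path-sampling API.

```python
import math
import random

def sample_unique_paths(scores, scale, num_paths, seed=0):
    """Sample path indices from softmax(scale * scores) and count how
    many distinct ones appear -- a toy stand-in for n-best sampling."""
    rng = random.Random(seed)
    exps = [math.exp(scale * s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    unique = set()
    for _ in range(num_paths):
        r = rng.random()
        acc = 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                unique.add(i)
                break
        else:  # guard against floating-point rounding at the tail
            unique.add(len(probs) - 1)
    return len(unique)

scores = [10.0, 8.0, 6.0, 4.0, 2.0, 0.0]
# A smaller scale flattens the distribution, so more unique paths appear.
print(sample_unique_paths(scores, scale=1.0, num_paths=100))
print(sample_unique_paths(scores, scale=0.1, num_paths=100))
```

With `scale=1.0` the top-scoring path dominates and only a few distinct paths are sampled; with `scale=0.1` the distribution is nearly uniform and almost every path appears, which is the effect being tuned above.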
@@ -0,0 +1,27 @@
Only HLG decoding with the transformer encoder output is added.
Do we need to use the attention decoder for rescoring?
This is great-- thanks!
Regarding using the attention decoder for rescoring-- yes, I'd like you to add that, because this will probably
be a main feature of the tutorial, and I think having good results is probably worthwhile.
Also, use kaldifeat for feature extraction.
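The score combination behind attention-decoder rescoring can be sketched simply: each n-best path gets a total score interpolating its acoustic, n-gram LM, and attention-decoder scores, and the highest-scoring path wins. A minimal sketch, assuming a made-up tuple layout; only the scale names mirror the script's `--ngram-lm-scale` and `--attention-decoder-scale` flags:

```python
def rescore_nbest(paths, ngram_lm_scale=1.3, attention_decoder_scale=1.2):
    """Pick the best transcript from an n-best list.

    Each path is (transcript, am_score, ngram_lm_score, attn_score);
    this data layout is invented for the sketch.
    """
    def total(path):
        _, am, lm, attn = path
        return am + ngram_lm_scale * lm + attention_decoder_scale * attn
    return max(paths, key=total)[0]

paths = [
    ("hello word", -10.0, -5.0, -4.0),   # total: -21.3
    ("hello world", -11.0, -3.0, -2.0),  # total: -17.3
]
print(rescore_nbest(paths))  # prints "hello world"
```

Note how the path with the worse acoustic score still wins once the LM and attention-decoder scores are folded in; that is exactly why the relative scales need tuning.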
features = features.unsqueeze(0)
logging.info(f"Decoding started")
features = fbank(waves)
Replacing torchaudio.compliance.kaldi with kaldifeat, since it is easier to extract features for multiple sound files at the same time.
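To make the batch-extraction point concrete: once features are computed for several files at once, the per-file feature matrices have different frame counts and must be padded before they can be stacked into one batch. A minimal pure-Python sketch of that padding step; the helper name and pad value are placeholders, not kaldifeat's API (in the real script one would use torch padding utilities on the tensors kaldifeat returns):

```python
def pad_feature_batch(features_list, pad_value=-23.0):
    """Pad per-file feature matrices (lists of frame vectors of equal
    dimension) so they can be stacked into one batch; also return the
    true frame counts so padding can be masked out downstream."""
    num_frames = [len(f) for f in features_list]
    max_frames = max(num_frames)
    feat_dim = len(features_list[0][0])
    padded = [
        f + [[pad_value] * feat_dim] * (max_frames - len(f))
        for f in features_list
    ]
    return padded, num_frames
```

This is only the shape bookkeeping; the point of the change above is that kaldifeat does the per-file extraction for a whole list of waves in one call.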
Nice. I still have adding kaldifeat to Lhotse on my radar. I might remove all other kaldi-related feature extractors at the same time. But I think I won’t be able to do it before the tutorial.
Now it supports transcribing multiple files with LM rescoring and attention decoder rescoring. Ready for review.
Great!
Perhaps we can mention concretely where one might obtain this checkpoint, words.txt, and HLG.pt, if someone were to try to run this without having trained the system? E.g. a download location?
Could you please upload the following files:
I just added some detailed documentation showing how to download and use a pre-trained model, uploaded by @pkufool. You can find a preview by visiting the linked page. I will also create a colab notebook to show how to use the pre-trained model. Ready to merge.
Wow-- very nice and complete documentation!
Here are the logs using the CPU to transcribe the test waves. Useful if someone wants to compare.

(1) HLG decoding

$ CUDA_VISIBLE_DEVICES= ./conformer_ctc/pretrained.py \
    --checkpoint ./tmp/conformer_ctc/exp/pretraind.pt \
    --words-file ./tmp/conformer_ctc/data/lang_bpe/words.txt \
    --HLG ./tmp/conformer_ctc/data/lang_bpe/HLG.pt \
    ./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac \
    ./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac \
    ./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac

(2) HLG decoding + LM rescoring

$ CUDA_VISIBLE_DEVICES= ./conformer_ctc/pretrained.py \
    --checkpoint ./tmp/conformer_ctc/exp/pretraind.pt \
    --words-file ./tmp/conformer_ctc/data/lang_bpe/words.txt \
    --HLG ./tmp/conformer_ctc/data/lang_bpe/HLG.pt \
    --method whole-lattice-rescoring \
    --G ./tmp/conformer_ctc/data/lm/G_4_gram.pt \
    --ngram-lm-scale 0.8 \
    ./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac \
    ./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac \
    ./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac

(3) HLG decoding + LM rescoring + attention decoder rescoring

$ CUDA_VISIBLE_DEVICES= ./conformer_ctc/pretrained.py \
    --checkpoint ./tmp/conformer_ctc/exp/pretraind.pt \
    --words-file ./tmp/conformer_ctc/data/lang_bpe/words.txt \
    --HLG ./tmp/conformer_ctc/data/lang_bpe/HLG.pt \
    --method attention-decoder \
    --G ./tmp/conformer_ctc/data/lm/G_4_gram.pt \
    --ngram-lm-scale 1.3 \
    --attention-decoder-scale 1.2 \
    --lattice-score-scale 0.5 \
    --num-paths 100 \
    --sos-id 1 \
    --eos-id 1 \
    ./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac \
    ./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac \
    ./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac
BTW, for this thing where we transcribe the waves, it would be nice to know how much we are being affected by batches being too irregular. It should be possible to find out how big the WER impact of this is by changing the lhotse options for the sampler used in our test code.
The nbest oracle WER can help us evaluate different n-best rescoring methods, as it is the best WER we could get if we had a perfect rescoring method.
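That definition can be written down directly: for each utterance, a perfect rescorer would always pick the n-best hypothesis with the fewest word errors against the reference, and the oracle WER is the WER of those picks. A minimal sketch, assuming plain word-level edit distance; the function names are illustrative, not icefall's API:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def nbest_oracle_wer(refs, nbest_hyps):
    """Best achievable WER if, per utterance, we could always pick the
    n-best hypothesis closest to the reference transcript."""
    errors = sum(
        min(edit_distance(ref.split(), hyp.split()) for hyp in hyps)
        for ref, hyps in zip(refs, nbest_hyps)
    )
    total_words = sum(len(ref.split()) for ref in refs)
    return errors / total_words
```

Because the minimum is taken over the n-best list, a larger or more diverse list (e.g. from a smaller lattice score scale) can only lower this number, which is why the oracle WER is a useful ceiling for comparing rescoring methods.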