ESPNet integration: v0 #329
Merged, 23 commits, Jan 5, 2021

Conversation

mpariente (Collaborator):

So, this is a first iteration of ASR evaluation in the LibriMix recipe.

Thanks to @JorisCos for some of the groundwork.

I'll comment in the PR about the things I'm unsure about and where I'd hopefully like some help.
None of the code has been run yet and things will likely change place, but this is just a rough draft to see how easy it would be to do.

@raikarsagar @JorisCos @popcornell @mhu-coder

mpariente (Collaborator, Author) left a comment:

There are a few things left to do for this to be acceptable as a first step.

@raikarsagar, would you like to help? That would really be amazing! 😃

@@ -18,6 +19,17 @@
from asteroid.utils import tensors_to_device


def import_wer():
mpariente (Collaborator, Author):

This function can live somewhere else.
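For context, a minimal sketch of what such an optional-import helper could look like (the error message and exact behaviour are assumptions, not the PR's code):

```python
def import_wer():
    """Lazily import jiwer's `wer` so the ASR dependencies stay optional."""
    try:
        from jiwer import wer
    except ImportError:
        # Hypothetical message; the PR may word this differently.
        raise ImportError(
            "Computing WER requires jiwer. Install it with `pip install jiwer`."
        )
    return wer
```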

)

COMPUTE_METRICS = ["si_sdr", "sdr", "sir", "sar", "stoi"]
ASR_MODEL_PATH = (
mpariente (Collaborator, Author):

This will probably need to be a variable.

Collaborator:

It should be specified by the user, according to the recipe.

mpariente (Collaborator, Author):

Yes, and we'll have to make sure it was trained on LibriSpeech.

Collaborator:

I assumed that ASR models couldn't be used to perform cross-dataset evaluation, but is that true?

mpariente (Collaborator, Author):

There is at least the problem of the vocabulary. If the vocab of the second dataset is included in the first one, then it's probably possible. Samu can probably answer better than me.

return metric_list + ["wer"]


class MockTracker:
mpariente (Collaborator, Author):

This will live somewhere else (it follows WERTracker). It's there to enable switching WER on/off without too many ifs and code changes.
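As a rough illustration of the idea (only compute_wer is visible in the diff below; the other method names here are assumptions), a no-op tracker can simply mirror WERTracker's interface and report nothing:

```python
class MockTracker:
    """Drop-in stand-in for WERTracker when WER evaluation is disabled."""

    def __call__(self, *args, **kwargs):
        # Accept the same calls as WERTracker but record nothing.
        return {}

    def compute_wer(self, *args, **kwargs):
        # Report no WER-related numbers.
        return {}
```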

Comment on lines 128 to 132
def compute_wer(self, transcriptions):
# Is average WER the average over individual WER?
# Or computed overall with all S, D, I?
# Is it different than the average from call (probably)
return
mpariente (Collaborator, Author):

I would love an answer to that.

popcornell (Collaborator), Nov 17, 2020:

https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/bin/compute-wer.cc#L94
This is the way to go: compute it over the whole dataset (e.g. the test set) by summing the errors of all utterances, then normalizing by the total number of words. If you compute it per utterance and then average the WERs, consider this: you have one word in one sequence and you miss it, so that utterance has 100% WER. In another utterance you have 20 words and you get them all correct. If you take the average you get 50%, which is not really representative of the performance at this point.
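A small sketch of the difference (a hypothetical helper, not code from the PR or from Kaldi):

```python
def corpus_wer(per_utt_counts):
    """Corpus-level WER: sum errors (S + D + I) over utterances, divide by total words."""
    total_errors = sum(c["errors"] for c in per_utt_counts)
    total_words = sum(c["n_words"] for c in per_utt_counts)
    return total_errors / total_words

# The example above: a 1-word utterance that is fully wrong,
# and a 20-word utterance that is fully correct.
counts = [{"errors": 1, "n_words": 1}, {"errors": 0, "n_words": 20}]
print(corpus_wer(counts))  # 1/21 ≈ 0.048
print(sum(c["errors"] / c["n_words"] for c in counts) / len(counts))  # naive average: 0.5
```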

mpariente (Collaborator, Author):

This is what I thought, thanks!
How expensive is the computation of WER? At first, I had planned to accumulate all the hypotheses/transcripts and compute the overall WER at the end. I still log all of it, so we can compute the average WER at the end (supported by jiwer, which accounts for the total number of words, as we want).

BTW, even if it's less critical, we make this "mistake" with SI-SDR: 3 s examples and 12 s examples have the same weight in the averaged metric.

sw005320:

> This is what I thought, thanks!
> How expensive is the computation of WER? At first, I had planned to accumulate all the hypotheses/transcripts and compute the overall WER at the end. I still log all of it, so we can compute the average WER at the end (supported by jiwer, which accounts for the total number of words, as we want).

It depends on the internal algorithm (O(N) or O(N^2)), and I don't know the details of this library.
Some toolkits take a very long time to compute the WER of long sequences.
I think your idea would be fine.

Some people may want to know the breakdown numbers per utterance or per speaker, but this would be too much for now.
This is one reason we still use NIST sclite, which provides such analysis information, but it is not a pythonic solution.
(Another reason for using the NIST toolkit is that I want to stick to a standardized toolkit for the evaluation metric. It is a disaster if the scoring toolkit includes some errors or dialects.)

Also, I recommend you output the total word, insertion, deletion, and substitution counts, as well as the WER.
This is good debugging information.
I have often encountered the issue that people don't use the correct test set.
If we display the number of sentences and words, we can easily detect such mistakes.
Also, if we display the insertion, deletion, and substitution counts, we can easily detect what kind of issue causes the errors.
Speech enhancement or separation tends to increase deletions (if the speech is over-suppressed) and insertions (if there are interfering speakers).

mpariente (Collaborator, Author):

Thank you very much for your feedback!
We'll log the numbers (ins, del, sub, tot, WER) per utterance then, as is done for the other metrics.

mpariente (Collaborator, Author):

For reference:
Opened a PR in jiwer to access those numbers: jitsi/jiwer#35

Comment on lines 111 to 115
# Count the mixture output for each speaker
txt = self.predict_hypothesis(mix)
for tmp_id in wav_id:
input_wer += wer(truth=self.trans_dic[tmp_id], hypothesis=txt) / len(wav_id)
self.input_txt_list.append(dict(utt_id=tmp_id, text=txt))
mpariente (Collaborator, Author):

Compute the WER from the mixture's hypothesis against each ground truth. Might not be ideal; I don't know how it's done in ESPnet.

popcornell (Collaborator), Nov 17, 2020:

We should accumulate n_words, sub, del and ins for each utterance.
Remember we need to handle that computation for both speakers (or N speakers) in case we do not perform speaker adaptation: I think you count an insertion only after checking that the word is not in the transcripts of both speakers (and del and sub in a similar way).

mpariente (Collaborator, Author):

I see the point of doing that, but it seems out of scope for this PR, maybe.
For ins, the word order doesn't matter, but for del and sub it does, so that's tricky.

row_list.append(dict1)

df = pd.DataFrame(row_list)
df.to_csv(os.path.join(args.outdir, "annotations.csv"), index=False)
mpariente (Collaborator, Author):

Might want to change the name to match ESPNet's conventions.

Comment on lines 21 to 25
# TODO: check folders and outdir
#dev-clean test-clean train-clean-100 train-clean-360
for split in train-360 train-100 dev-clean test-clean; do
$python_path local/get_text.py --libridir $storage_dir/LibriSpeech --split $split --outdir data/$split
done
mpariente (Collaborator, Author):

The output directories don't have "clean" in the name. We might want either to call this four times without the loop, or to loop over the directories in the .py file.
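One possible way to do the latter (a sketch only; the mapping name and loop body are made up for illustration):

```python
# Hypothetical: map the recipe's local data folder names to the LibriSpeech split
# folders inside local/get_text.py, instead of looping in the shell script.
SPLIT_TO_LIBRISPEECH = {
    "train-360": "train-clean-360",
    "train-100": "train-clean-100",
    "dev": "dev-clean",
    "test": "test-clean",
}

for outdir_name, librispeech_split in SPLIT_TO_LIBRISPEECH.items():
    # e.g. read the transcripts under LibriSpeech/<librispeech_split>/
    # and write the gathered .csv to data/<outdir_name>/.
    ...
```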

@@ -17,3 +17,9 @@ cd LibriMix
cd $current_dir
$python_path local/create_local_metadata.py --librimix_dir $storage_dir/Libri$n_src"Mix"


# TODO: check folders and outdir
mpariente (Collaborator, Author):

Here, we retrieve the txt files for each split and gather them in .csv files that we'll store in the local data folder.

Comment on lines 43 to 46
# FIXME: this is wrong to have it written in plain
train_dir=data/wav8k/min/train-360
valid_dir=data/wav8k/min/dev
test_dir=data/wav8k/min/test
mpariente (Collaborator, Author):

These paths should depend on the sample rate, not be hardcoded. From WHAM's recipes:

sr_string=$(($sample_rate/1000))
suffix=wav${sr_string}k/$mode

@@ -58,6 +160,8 @@ def main(conf):
sample_rate=conf["sample_rate"],
n_src=conf["train_conf"]["data"]["n_src"],
segment=None,
return_id=True,
# FIXME: ensure max mode for eval.
mpariente (Collaborator, Author):

New problem: we need the "max" version for ASR, but the other metrics should still be computed in "min" mode so that the results are consistent with what we had before.
We'll probably need to change the dataloader to return the lengths; any other solutions?

Collaborator:

I would use the max version all along. Otherwise you would not be able to, say, plot SI-SDR vs WER and compare the two directly.

mpariente (Collaborator, Author):

Yes, but SI-SDR results are usually reported on min; we cannot change that without the user knowing it. Otherwise, recipes inspired by this one might report wrong results afterwards.

@@ -18,6 +19,17 @@
from asteroid.utils import tensors_to_device


def import_wer():
try:
from jiwer import wer
mpariente (Collaborator, Author):

Since you're here @sw005320, what do you think about jiwer?
BTW, feel free to add any comments, advice, etc.

sw005320:

I checked the library and it looks good to me.
I was concerned about punctuation handling, upper/lower case, and non-linguistic special annotations, but the tool seems to handle everything.

mpariente (Collaborator, Author):

Again, thanks a lot! It's going to be much easier for us to use jiwer directly rather than a non-Python alternative.

popcornell self-requested a review on November 17, 2020, 10:02
mpariente (Collaborator, Author):

The PR was merged in jiwer, and hits, sub, ins, del are exposed in v2.2.0.
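A quick sketch of reading those counts with jiwer >= 2.2.0 (the sentences below are just placeholders):

```python
import jiwer

truth = "the cat sat on the mat"
hypothesis = "the cat sat on mat"

measures = jiwer.compute_measures(truth, hypothesis)
# The returned dict includes "wer", "hits", "substitutions", "deletions", "insertions".
print(measures["wer"], measures["hits"], measures["substitutions"],
      measures["deletions"], measures["insertions"])
```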

This shouldn't be too much work from here. @raikarsagar, could you give it a try, please?

sw005320 commented Nov 18, 2020:

FYI, this is the WER report format we often use in espnet.
If asteroid produces such information, people in espnet can easily follow up on the results.
Our script automatically generates the following markdown table after the ASR evaluation stage, using https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/scripts/utils/show_asr_result.sh.

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| decode_asr_lm_lm_train_lm_adam_bpe_valid.loss.ave_asr_model_valid.acc.ave/dev_clean | 2703 | 54402 | 98.0 | 1.8 | 0.2 | 0.3 | 2.3 | 27.6 |
| decode_asr_lm_lm_train_lm_adam_bpe_valid.loss.ave_asr_model_valid.acc.ave/dev_other | 2864 | 50948 | 95.0 | 4.3 | 0.6 | 0.5 | 5.5 | 45.1 |
| decode_asr_lm_lm_train_lm_adam_bpe_valid.loss.ave_asr_model_valid.acc.ave/test_clean | 2620 | 52576 | 97.9 | 1.9 | 0.2 | 0.3 | 2.4 | 28.2 |
| decode_asr_lm_lm_train_lm_adam_bpe_valid.loss.ave_asr_model_valid.acc.ave/test_other | 2939 | 52343 | 94.9 | 4.5 | 0.7 | 0.6 | 5.8 | 48.6 |

mpariente (Collaborator, Author):

We'll adopt this convention as well, thanks.

Does "S.Err" mean the percent of sentences without perfect score?

sw005320:

Does "S.Err" mean the percent of sentences without perfect score?

S.Err means the sentence error rate.
If there is any word/char/token error in the sentence, it counts as a sentence error.
This is very strict, but some applications require such a metric.

popcornell (Collaborator):

It seems that right now espnet2.bin.asr_inference.Speech2Text does not support batched audio:
see https://github.com/espnet/espnet/blob/554e65ee638ebdc8256647ce273cf6303ea72e39/espnet2/bin/asr_inference.py#L180

Because of this, the WER computation is quite slow even on GPU (but so is SDR right now with mir_eval).
Can you confirm, @JorisCos? I have a couple of workarounds in mind for this case.

JorisCos (Collaborator):

> It seems that right now espnet2.bin.asr_inference.Speech2Text does not support batched audio:
> see https://github.com/espnet/espnet/blob/554e65ee638ebdc8256647ce273cf6303ea72e39/espnet2/bin/asr_inference.py#L180
>
> Because of this, the WER computation is quite slow even on GPU (but so is SDR right now with mir_eval).
> Can you confirm, @JorisCos? I have a couple of workarounds in mind for this case.

I haven't tried it yet, but I wonder if we really need a workaround, since we only use espnet2.bin.asr_inference.Speech2Text in eval, and evaluation isn't done on batched audio. Or am I missing something?
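For reference, the unbatched, per-utterance decoding we are talking about looks roughly like this (a sketch; the config/model paths and file name are placeholders):

```python
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text

# Placeholder paths: point these at the pretrained LibriSpeech ASR model.
speech2text = Speech2Text(
    asr_train_config="exp/asr_train/config.yaml",
    asr_model_file="exp/asr_train/valid.acc.ave.pth",
    device="cuda",
)

# One separated source at a time; no batching.
speech, rate = sf.read("est_source1.wav")
nbest = speech2text(speech)  # n-best list of (text, tokens, token_ints, hypothesis)
print(nbest[0][0])           # best hypothesis text
```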

mpariente (Collaborator, Author):

We could have sped things up by using batch processing, because it seems slow otherwise.

mpariente mentioned this pull request on Dec 15, 2020