ESPNet integration: v0 #329
Merged, 23 commits, Jan 5, 2021

Conversation

mpariente (Collaborator):

So, this is a first iteration of ASR evaluation in the LibriMix recipe.

Thanks to @JorisCos for some of the groundwork.

I'll comment in the PR about the things I'm unsure about and where I'd hopefully like some help.
None of the code has been run yet and things will likely change place, but this is just a rough draft to see how easy it would be to do.

@raikarsagar @JorisCos @popcornell @mhu-coder

mpariente (Collaborator, Author) left a comment:

There are a few things left to do for this to be acceptable as a first step.

@raikarsagar, would you like to help? That would really be amazing! 😃

@@ -18,6 +19,17 @@
from asteroid.utils import tensors_to_device


def import_wer():
mpariente (Collaborator, Author):

This function can live somewhere else.
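For context, a minimal sketch of what such an optional-import helper could look like (the error message and exact behaviour are assumptions, not the PR's code):

```python
def import_wer():
    """Lazily import jiwer's `wer` so the ASR dependencies stay optional."""
    try:
        from jiwer import wer
    except ImportError:
        # Hypothetical message; the PR may word this differently.
        raise ImportError(
            "Computing WER requires jiwer. Install it with `pip install jiwer`."
        )
    return wer
```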

)

COMPUTE_METRICS = ["si_sdr", "sdr", "sir", "sar", "stoi"]
ASR_MODEL_PATH = (
mpariente (Collaborator, Author):

This will probably need to be a variable.

Collaborator:

It should be specified by the user, according to the recipe.

mpariente (Collaborator, Author):

Yes, and we'll have to make sure it was trained on LibriSpeech.

Collaborator:

I assumed that ASR models couldn't be used to perform cross-dataset evaluation, but is that true?

mpariente (Collaborator, Author):

There is at least the problem of the vocabulary. If the vocab of the second dataset is included in the first one, then it's probably possible. Samu can probably answer better than me.

return metric_list + ["wer"]


class MockTracker:
mpariente (Collaborator, Author):

This will live somewhere else (it follows WERTracker). It's there to enable switching WER on/off without too many ifs and code changes.
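As a rough illustration of the idea (only compute_wer is visible in the diff below; the other method names here are assumptions), a no-op tracker can simply mirror WERTracker's interface and report nothing:

```python
class MockTracker:
    """Drop-in stand-in for WERTracker when WER evaluation is disabled."""

    def __call__(self, *args, **kwargs):
        # Accept the same calls as WERTracker but record nothing.
        return {}

    def compute_wer(self, *args, **kwargs):
        # Report no WER-related numbers.
        return {}
```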

Comment on lines 128 to 132
def compute_wer(self, transcriptions):
# Is average WER the average over individual WER?
# Or computed overall with all S, D, I?
# Is it different than the average from call (probably)
return
mpariente (Collaborator, Author):

I would love an answer to that.

popcornell (Collaborator), Nov 17, 2020:

https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/bin/compute-wer.cc#L94
This is the way to go: compute it over the whole dataset (e.g. the test set) by summing the errors of all utterances, then normalizing by the total number of words. If you compute it per utterance and then average the WERs, consider this: you have one word in one sequence and you miss it, so that utterance has 100% WER. In another utterance you have 20 words and you get them all correct. If you take the average you get 50%, which is not really representative of the performance at this point.
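A small sketch of the difference (a hypothetical helper, not code from the PR or from Kaldi):

```python
def corpus_wer(per_utt_counts):
    """Corpus-level WER: sum errors (S + D + I) over utterances, divide by total words."""
    total_errors = sum(c["errors"] for c in per_utt_counts)
    total_words = sum(c["n_words"] for c in per_utt_counts)
    return total_errors / total_words

# The example above: a 1-word utterance that is fully wrong,
# and a 20-word utterance that is fully correct.
counts = [{"errors": 1, "n_words": 1}, {"errors": 0, "n_words": 20}]
print(corpus_wer(counts))  # 1/21 ≈ 0.048
print(sum(c["errors"] / c["n_words"] for c in counts) / len(counts))  # naive average: 0.5
```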

mpariente (Collaborator, Author):

This is what I thought, thanks!
How expensive is the computation of WER? At first, I had planned to accumulate all the hypotheses/transcripts and compute the overall WER at the end. I still log all of it, so we can compute the average WER at the end (supported by jiwer, which accounts for the total number of words, as we want).

BTW, even if it's less critical, we make this "mistake" with SI-SDR: 3 s examples and 12 s examples have the same weight in the averaged metric.

sw005320:

> This is what I thought, thanks!
> How expensive is the computation of WER? At first, I had planned to accumulate all the hypotheses/transcripts and compute the overall WER at the end. I still log all of it, so we can compute the average WER at the end (supported by jiwer, which accounts for the total number of words, as we want).

It depends on the internal algorithm (O(N) or O(N^2)), and I don't know the details of this library.
Some toolkits take a very long time to compute the WER of long sequences.
I think your idea would be fine.

Some people may want to know the breakdown numbers per utterance or per speaker, but this would be too much for now.
This is one reason we still use NIST sclite, which provides such analysis information, but it is not a pythonic solution.
(Another reason for using the NIST toolkit is that I want to stick to a standardized toolkit for the evaluation metric. It is a disaster if the scoring toolkit includes some errors or dialects.)

Also, I recommend you output the total word, insertion, deletion, and substitution counts, as well as the WER.
This is good debugging information.
I have often encountered the issue that people don't use the correct test set.
If we display the number of sentences and words, we can easily detect such mistakes.
Also, if we display the insertion, deletion, and substitution counts, we can easily detect what kind of issue causes the errors.
Speech enhancement or separation tends to increase deletions (if the speech is over-suppressed) and insertions (if there are interfering speakers).

mpariente (Collaborator, Author):

Thank you very much for your feedback!
We'll log the numbers (ins, del, sub, tot, WER) per utterance then, as is done for the other metrics.

mpariente (Collaborator, Author):

For reference:
Opened a PR in jiwer to access those numbers: jitsi/jiwer#35

Comment on lines 111 to 115
# Count the mixture output for each speaker
txt = self.predict_hypothesis(mix)
for tmp_id in wav_id:
input_wer += wer(truth=self.trans_dic[tmp_id], hypothesis=txt) / len(wav_id)
self.input_txt_list.append(dict(utt_id=tmp_id, text=txt))
mpariente (Collaborator, Author):

Compute the WER from the mixture's hypothesis against each ground truth. Might not be ideal; I don't know how it's done in ESPnet.

popcornell (Collaborator), Nov 17, 2020:

We should accumulate n_words, sub, del and ins for each utterance.
Remember we need to handle that computation for both speakers (or N speakers) in case we do not perform speaker adaptation: I think you count an insertion only after checking that the word is not in the transcripts of both speakers (and del and sub in a similar way).

mpariente (Collaborator, Author):

I see the point of doing that, but it seems out of scope for this PR, maybe.
For ins, the word order doesn't matter, but for del and sub it does, so that's tricky.

row_list.append(dict1)

df = pd.DataFrame(row_list)
df.to_csv(os.path.join(args.outdir, "annotations.csv"), index=False)
mpariente (Collaborator, Author):

Might want to change the name to match ESPNet's conventions.

Comment on lines 21 to 25
# TODO: check folders and outdir
#dev-clean test-clean train-clean-100 train-clean-360
for split in train-360 train-100 dev-clean test-clean; do
$python_path local/get_text.py --libridir $storage_dir/LibriSpeech --split $split --outdir data/$split
done
mpariente (Collaborator, Author):

The output directories don't have "clean" in the name. We might want either to call this four times without the loop, or to loop over the directories in the .py file.
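One possible way to do the latter (a sketch only; the mapping name and loop body are made up for illustration):

```python
# Hypothetical: map the recipe's local data folder names to the LibriSpeech split
# folders inside local/get_text.py, instead of looping in the shell script.
SPLIT_TO_LIBRISPEECH = {
    "train-360": "train-clean-360",
    "train-100": "train-clean-100",
    "dev": "dev-clean",
    "test": "test-clean",
}

for outdir_name, librispeech_split in SPLIT_TO_LIBRISPEECH.items():
    # e.g. read the transcripts under LibriSpeech/<librispeech_split>/
    # and write the gathered .csv to data/<outdir_name>/.
    ...
```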

@@ -17,3 +17,9 @@ cd LibriMix
cd $current_dir
$python_path local/create_local_metadata.py --librimix_dir $storage_dir/Libri$n_src"Mix"


# TODO: check folders and outdir
mpariente (Collaborator, Author):

Here, we retrieve the txt files for each split and gather them in .csv files that we'll store in the local data folder.

Comment on lines 43 to 46
# FIXME: this is wrong to have it written in plain
train_dir=data/wav8k/min/train-360
valid_dir=data/wav8k/min/dev
test_dir=data/wav8k/min/test
mpariente (Collaborator, Author):

These paths should depend on the sample rate, not be hardcoded. From WHAM's recipes:

sr_string=$(($sample_rate/1000))
suffix=wav${sr_string}k/$mode

@@ -58,6 +160,8 @@ def main(conf):
sample_rate=conf["sample_rate"],
n_src=conf["train_conf"]["data"]["n_src"],
segment=None,
return_id=True,
# FIXME: ensure max mode for eval.
mpariente (Collaborator, Author):

New problem: we need the "max" version for ASR, but the other metrics should still be computed in "min" mode so that the results are consistent with what we had before.
We'll probably need to change the dataloader to return the lengths; any other solutions?

Collaborator:

I would use the max version all along. Otherwise you would not be able to, say, plot SI-SDR vs WER and compare the two directly.

mpariente (Collaborator, Author):

Yes, but SI-SDR results are usually reported on min; we cannot change that without the user knowing it. Otherwise, recipes inspired by this one might report wrong results afterwards.

@@ -18,6 +19,17 @@
from asteroid.utils import tensors_to_device


def import_wer():
try:
from jiwer import wer
mpariente (Collaborator, Author):

Since you're here @sw005320, what do you think about jiwer?
BTW, feel free to add any comments, advice, etc.

sw005320:

I checked the library and it looks good to me.
I was concerned about punctuation handling, upper/lower case, and non-linguistic special annotations, but the tool seems to handle everything.

mpariente (Collaborator, Author):

Again, thanks a lot! It's going to be much easier for us to use jiwer directly rather than a non-Python alternative.

popcornell self-requested a review on November 17, 2020, 10:02
mpariente (Collaborator, Author):

The PR was merged in jiwer, and hits, sub, ins, del are exposed in v2.2.0.
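A quick sketch of reading those counts with jiwer >= 2.2.0 (the sentences below are just placeholders):

```python
import jiwer

truth = "the cat sat on the mat"
hypothesis = "the cat sat on mat"

measures = jiwer.compute_measures(truth, hypothesis)
# The returned dict includes "wer", "hits", "substitutions", "deletions", "insertions".
print(measures["wer"], measures["hits"], measures["substitutions"],
      measures["deletions"], measures["insertions"])
```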

This shouldn't be too much work from here. @raikarsagar, could you give it a try, please?

sw005320 commented Nov 18, 2020:

FYI, this is the WER report format we often use in espnet.
If asteroid produces such information, people in espnet can easily follow up on the results.
Our script automatically generates the following markdown table after the ASR evaluation stage, using https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/scripts/utils/show_asr_result.sh.

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| decode_asr_lm_lm_train_lm_adam_bpe_valid.loss.ave_asr_model_valid.acc.ave/dev_clean | 2703 | 54402 | 98.0 | 1.8 | 0.2 | 0.3 | 2.3 | 27.6 |
| decode_asr_lm_lm_train_lm_adam_bpe_valid.loss.ave_asr_model_valid.acc.ave/dev_other | 2864 | 50948 | 95.0 | 4.3 | 0.6 | 0.5 | 5.5 | 45.1 |
| decode_asr_lm_lm_train_lm_adam_bpe_valid.loss.ave_asr_model_valid.acc.ave/test_clean | 2620 | 52576 | 97.9 | 1.9 | 0.2 | 0.3 | 2.4 | 28.2 |
| decode_asr_lm_lm_train_lm_adam_bpe_valid.loss.ave_asr_model_valid.acc.ave/test_other | 2939 | 52343 | 94.9 | 4.5 | 0.7 | 0.6 | 5.8 | 48.6 |

mpariente (Collaborator, Author):

We'll adopt this convention as well, thanks.

Does "S.Err" mean the percent of sentences without perfect score?

sw005320:

Does "S.Err" mean the percent of sentences without perfect score?

S.Err means the sentence error rate.
If there is any word/char/token error in the sentence, it counts as a sentence error.
This is very strict, but some applications require such a metric.

popcornell (Collaborator):

It seems that right now espnet2.bin.asr_inference.Speech2Text does not support batched audio:
see https://github.com/espnet/espnet/blob/554e65ee638ebdc8256647ce273cf6303ea72e39/espnet2/bin/asr_inference.py#L180

Because of this, the WER computation is quite slow even on GPU (but so is SDR right now with mir_eval).
Can you confirm, @JorisCos? I have a couple of workarounds in mind for this case.

JorisCos (Collaborator):

> It seems that right now espnet2.bin.asr_inference.Speech2Text does not support batched audio:
> see https://github.com/espnet/espnet/blob/554e65ee638ebdc8256647ce273cf6303ea72e39/espnet2/bin/asr_inference.py#L180
>
> Because of this, the WER computation is quite slow even on GPU (but so is SDR right now with mir_eval).
> Can you confirm, @JorisCos? I have a couple of workarounds in mind for this case.

I haven't tried it yet, but I wonder if we really need a workaround, since we only use espnet2.bin.asr_inference.Speech2Text in eval, and evaluation isn't done on batched audio. Or am I missing something?
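For reference, the unbatched, per-utterance decoding we are talking about looks roughly like this (a sketch; the config/model paths and file name are placeholders):

```python
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text

# Placeholder paths: point these at the pretrained LibriSpeech ASR model.
speech2text = Speech2Text(
    asr_train_config="exp/asr_train/config.yaml",
    asr_model_file="exp/asr_train/valid.acc.ave.pth",
    device="cuda",
)

# One separated source at a time; no batching.
speech, rate = sf.read("est_source1.wav")
nbest = speech2text(speech)  # n-best list of (text, tokens, token_ints, hypothesis)
print(nbest[0][0])           # best hypothesis text
```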

mpariente (Collaborator, Author):

We could have sped things up by using batch processing, because it seems slow otherwise.

mpariente mentioned this pull request on Dec 15, 2020