ESPNet integration: v0 #329
Conversation
There are a few things left to do for this to be acceptable, as a first step.
@raikarsagar, would you like to help? That would really be amazing!! 😃
egs/librimix/ConvTasNet/eval.py
Outdated
@@ -18,6 +19,17 @@
from asteroid.utils import tensors_to_device


def import_wer():
This function can live somewhere else.
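For reference, a lazy-import helper like this might look roughly as follows (a sketch: only import_wer and jiwer appear in the diff, the error message and structure are assumptions):

def import_wer():
    # Import jiwer lazily so the ASR dependencies stay optional (sketch).
    try:
        from jiwer import wer
    except ImportError:
        raise ImportError(
            "Computing WER requires jiwer, please install it with `pip install jiwer`."
        )
    return wer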
)

COMPUTE_METRICS = ["si_sdr", "sdr", "sir", "sar", "stoi"]
ASR_MODEL_PATH = (
This will probably need to be a variable.
It should be specified by the user, according to the recipe.
Yes, and we'll have to make sure it was trained on LibriSpeech.
I assumed that ASR models couldn't be used to perform cross-dataset evaluation, but is this true?
There is the problem of the vocab, at least. If the vocab of the second dataset is included in the first one, then it's probably possible. Samu can probably answer better than me.
egs/librimix/ConvTasNet/eval.py
Outdated
return metric_list + ["wer"]


class MockTracker:
This will live somewhere else (it follows WERTracker). It's there to enable the WER/no-WER switch without too many ifs and code changes.
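As a rough illustration of that idea (a sketch, not the PR's code; every method name besides MockTracker/WERTracker is an assumption):

class MockTracker:
    """No-op stand-in for WERTracker when WER computation is disabled.

    It mirrors the WERTracker interface but does nothing, so the evaluation
    loop can call the tracker unconditionally instead of branching on a flag
    everywhere.
    """

    def __call__(self, *args, **kwargs):
        return {}

    def final_report(self):
        return {}

Usage would then be something like tracker = WERTracker(...) if compute_wer else MockTracker().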
egs/librimix/ConvTasNet/eval.py
Outdated
def compute_wer(self, transcriptions):
    # Is average WER the average over individual WER?
    # Or computed overall with all S, D, I?
    # Is it different than the average from call (probably)
    return
I would love an answer to that.
https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/bin/compute-wer.cc#L94
This is the way to go: compute it over the whole dataset (e.g. the test set) by summing the errors of all utterances, then normalize by the sum of n_words. If you compute it per utterance and then take the average of the WERs: say you have one word in one sequence and you miss it, you get 100% WER there. In another you have 20 words and you get them all correct. If you take the average you get 50%, which is not really representative of the performance at that point.
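To make that concrete, a small Python sketch of the two ways of averaging (the two-utterance numbers are made up, mirroring the 1-word/20-word example above):

# Utterance 1: 1 reference word, missed entirely -> 1 error out of 1 word.
# Utterance 2: 20 reference words, all recognized correctly -> 0 errors.
errors = [1, 0]
n_words = [1, 20]

# Average of per-utterance WERs: (1/1 + 0/20) / 2 = 0.5 -> a misleading 50%.
avg_of_wers = sum(e / n for e, n in zip(errors, n_words)) / len(errors)

# Corpus-level WER: sum of errors / sum of words = 1 / 21, about 4.8%.
corpus_wer = sum(errors) / sum(n_words)

print(avg_of_wers, corpus_wer)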
This is what I thought, thanks!
How expensive is the computation of WER? At first, I had planned to accumulate all the hypotheses/transcripts and compute the overall WER at the end. I still log all of it, so we can compute the average WER at the end (supported by jiwer, which accounts for the total number of words, as we want).
BTW, even if it's less critical, we make this "mistake" on SI-SDR: 3 s examples and 12 s examples have the same weight in the averaged metric.
> This is what I thought, thanks!
> How expensive is the computation of WER? At first, I had planned to accumulate all the hypotheses/transcripts and compute the overall WER at the end. I still log all of it, so we can compute the average WER at the end (supported by jiwer, which accounts for the total number of words, as we want).
It depends on the internal algorithm (O(N) or O(N^2)), and I don't know the details of this library.
Some toolkits take a very long time to compute the WER of long sequences.
I think your idea would be fine.
Some people may want to know the breakdown numbers per utterance or per speaker, but this would be too much for now.
This is one reason we still use NIST sclite, which provides such analysis information, but it is not a pythonic solution.
(Another reason for using the NIST toolkit is that I want to stick to a standardized toolkit for the evaluation metric. It is a disaster if the scoring toolkit includes errors or dialects.)
Also, I recommend you output the total word count and the insertion, deletion, and substitution counts, as well as the WER.
This is good debugging information.
I have often encountered the issue that people don't use the correct test set.
If we display the number of sentences and words, we can easily detect such mistakes.
Also, if we display the insertion, deletion, and substitution counts, we can easily detect what issue causes the error.
Speech enhancement or separation tends to increase deletions (if the speech is over-suppressed) and insertions (if there are interfering speakers).
Thank you very much for your feedback!
We'll log the numbers (ins, del, sub, tot, WER) per utterance then, as is done for the other metrics.
For reference:
Opened a PR in jiwer to access those numbers: jitsi/jiwer#35
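Once those counts are exposed, the usage could look roughly like this (a sketch; the compute_measures name and its keys are assumed from that jiwer PR and may differ in the released API):

from jiwer import compute_measures

measures = compute_measures(truth="the cat sat on the mat", hypothesis="the cat sat on mat")
# Assumed keys: wer, hits, substitutions, deletions, insertions.
print(measures["wer"], measures["substitutions"], measures["deletions"], measures["insertions"])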
egs/librimix/ConvTasNet/eval.py
Outdated
# Count the mixture output for each speaker
txt = self.predict_hypothesis(mix)
for tmp_id in wav_id:
    input_wer += wer(truth=self.trans_dic[tmp_id], hypothesis=txt) / len(wav_id)
    self.input_txt_list.append(dict(utt_id=tmp_id, text=txt))
This computes the WER from the mixture's hypothesis to each ground truth. It might not be ideal; I don't know how it's done in ESPNet.
We should accumulate n_words, sub, del and ins for each utterance.
Remember that we need to handle this computation for both speakers (or N speakers) in case we do not perform speaker adaptation: I think you count an insertion only after checking that the word is not in the transcripts of both speakers (and del and sub in a similar way).
I see the point of doing that, but it seems out of the scope of this PR, maybe.
For ins, the word order doesn't matter, but for del and sub it does, so that's tricky.
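For reference, a minimal sketch of what handling the speaker assignment could look like (this is not what the PR does; it just picks the hypothesis-to-reference permutation with the lowest average WER, which is only practical for a small number of speakers):

from itertools import permutations

from jiwer import wer


def best_permutation_wer(references, hypotheses):
    # Try every assignment of hypotheses to references and keep the best one.
    best = float("inf")
    for perm in permutations(hypotheses):
        score = sum(wer(truth=ref, hypothesis=hyp) for ref, hyp in zip(references, perm))
        best = min(best, score / len(references))
    return best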
row_list.append(dict1)

df = pd.DataFrame(row_list)
df.to_csv(os.path.join(args.outdir, "annotations.csv"), index=False)
Might want to change the name to match ESPNet's conventions.
# TODO: check folders and outdir
# dev-clean test-clean train-clean-100 train-clean-360
for split in train-360 train-100 dev-clean test-clean; do
    $python_path local/get_text.py --libridir $storage_dir/LibriSpeech --split $split --outdir data/$split
done
The output directories don't have clean in the name. We might either want to call this 4 times without the loop, or loop over the directories in the .py file.
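If we go for the loop-over-directories option in the .py file, a sketch of the split-name mapping could look like this (the directory names are assumed from the usual LibriSpeech layout, not taken from the PR):

# Hypothetical mapping from the recipe's split names to LibriSpeech directories.
SPLIT_TO_LIBRISPEECH = {
    "train-360": "train-clean-360",
    "train-100": "train-clean-100",
    "dev": "dev-clean",
    "test": "test-clean",
}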
@@ -17,3 +17,9 @@ cd LibriMix
cd $current_dir
$python_path local/create_local_metadata.py --librimix_dir $storage_dir/Libri$n_src"Mix"


# TODO: check folders and outdir
Here, we retrieve the txt files for each split and gather them in .csv files that we'll store in the local data folder.
egs/librimix/ConvTasNet/run.sh
Outdated
# FIXME: this is wrong to have it written in plain
train_dir=data/wav8k/min/train-360
valid_dir=data/wav8k/min/dev
test_dir=data/wav8k/min/test
These paths should depend on the sample rate, not be hardcoded. From WHAM's recipes:
sr_string=$(($sample_rate/1000))
suffix=wav${sr_string}k/$mode
egs/librimix/ConvTasNet/eval.py
Outdated
@@ -58,6 +160,8 @@ def main(conf):
    sample_rate=conf["sample_rate"],
    n_src=conf["train_conf"]["data"]["n_src"],
    segment=None,
    return_id=True,
    # FIXME: ensure max mode for eval.
New problem: we need the "max" version for ASR, but the other metrics should still be computed in "min" mode so that the results stay consistent with what we had before.
We'll probably need to change the dataloader to return the lengths; any other solutions?
I would use the max version all along. Otherwise you would not be able to, say, plot SI-SDR vs WER and compare the two directly.
Yes, but SI-SDR results are usually reported in min mode; we cannot change that without the user knowing it. Otherwise, recipes inspired by this one might report wrong results afterward.
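One possible compromise (a sketch only, assuming the loader can also return the "min"-mode length of each utterance and that Asteroid's get_metrics helper is used; the wer_tracker call signature below is an assumption): compute the separation metrics on the cropped "min" segment and run the ASR on the full "max" signals.

from asteroid.metrics import get_metrics

COMPUTE_METRICS = ["si_sdr", "sdr", "sir", "sar", "stoi"]


def evaluate_utterance(mix, sources, estimates, min_len, sample_rate, wer_tracker):
    # Separation metrics on the "min"-mode segment, to stay comparable with
    # previously reported numbers (inputs are numpy arrays here).
    sep_metrics = get_metrics(
        mix[..., :min_len],
        sources[..., :min_len],
        estimates[..., :min_len],
        sample_rate=sample_rate,
        metrics_list=COMPUTE_METRICS,
    )
    # WER on the full "max"-mode signals, so nothing is cut before the ASR model.
    asr_metrics = wer_tracker(mix=mix, clean=sources, estimate=estimates)
    return {**sep_metrics, **asr_metrics}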
egs/librimix/ConvTasNet/eval.py
Outdated
@@ -18,6 +19,17 @@
from asteroid.utils import tensors_to_device


def import_wer():
    try:
        from jiwer import wer
Since you're here @sw005320, what do you think about jiwer?
BTW, feel free to add any comments, advice, etc.
I checked the library and it looks good to me.
I was concerned about punctuation handling, upper/lower case, and non-linguistic special annotations (e.g., ), but the tool seems to handle everything.
Again, thanks a lot! It's going to make it much easier for us to use jiwer directly rather than a non-Python alternative.
The PR was merged in jiwer. This shouldn't be too much work from now on. @raikarsagar, could you give it a try, please?
FYI, this is a WER report format we often use in espnet.
We'll adopt this convention as well, thanks. Does "S.Err" mean the percentage of sentences without a perfect score?
S.Err means the sentence error rate.
It seems that right now … Because of this, the WER computation is quite slow even on GPU (but so is SDR right now, with mir_eval).
I haven't tried yet, but I wonder if we really need a workaround, since we are only using …
We could have sped things up by using batch processing, because it seems slow otherwise.
We'll get hits, sub, ins and del from there and compute our local WER and total WER from that, using a Counter.
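Roughly, the accumulation could look like this (a sketch with made-up per-utterance counts; the dict keys are illustrative and would follow whatever jiwer exposes):

from collections import Counter

# Made-up per-utterance counts, just to illustrate the accumulation.
per_utt = [
    {"hits": 18, "substitutions": 1, "deletions": 1, "insertions": 0, "n_words": 20},
    {"hits": 5, "substitutions": 0, "deletions": 0, "insertions": 2, "n_words": 5},
]

totals = Counter()
for counts in per_utt:
    totals.update(counts)

# Corpus-level WER: total errors normalized by the total number of reference words.
errors = totals["substitutions"] + totals["deletions"] + totals["insertions"]
print(errors / totals["n_words"])  # (1 + 1 + 2) / 25 = 0.16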
Enable "conscious" max testing for WER computation
It might be useful for other recipes.
So that's a first iteration for ASR evaluation in the LibriMix recipe.
Thanks to @JorisCos for some of the groundwork.
I'll comment in the PR about the things I'm unsure of and where I'd hopefully like some help.
None of the code has been run yet; things will likely change place, but this is a rough draft to see how easy it would be to do.
@raikarsagar @JorisCos @popcornell @mhu-coder