SPGISpeech recipe #334

Merged: 20 commits merged into k2-fsa:master on May 16, 2022

Conversation

@desh2608 (Collaborator)

This is a WIP PR for SPGISpeech. I am opening it early to discuss some issues I have been facing while training a transducer model.

I first trained a conformer CTC model on the data (without speed perturbation) on 4 GPUs for 20 epochs, and the training curve looked reasonable: tensorboard. I was able to get a 4% WER with CTC decoding.

I then tried training a pruned_transducer_stateless2 model (with speed perturbation), but the training curve looks weird: tensorboard. I am not sure if this is because of the 3x speed perturbation (I'm now training a model without speed perturbation to verify). I was hoping someone would be able to suggest what may be causing these periodic ups and downs.

Adding screenshot of training curve here:

[screenshot: training curve showing the periodic ups and downs]

(Please ignore the README files etc. for now since this is only a rough draft.)

@pzelasko (Collaborator) commented Apr 26, 2022

Try pre-shuffling the input cutset like this: `gunzip -c train_cuts.jsonl.gz | shuf | gzip -c > train_cuts_shuf.jsonl.gz`, then re-try training. DynamicBucketingSampler has a smallish reservoir-sampling shuffle buffer of 10k cuts, and the data is likely sorted by recording sessions / speakers / topics / etc., which is why you see these patterns. Pre-shuffling will likely help.

EDIT: alternatively, try setting the shuffle_buffer_size to something larger, like 100k+ cuts; it's possible the memory usage won't be too drastic (but it will increase noticeably).

EDIT2: you can also try splitting the dataset with cuts.split_lazy and reading it for training with CutSet.mux(CutSet.from_jsonl_lazy(p) for p in split_paths); I think that would take care of ensuring sufficient randomness.
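For concreteness, a rough sketch of the split-and-mux option (EDIT2) in lhotse might look like the following; the manifest path and chunk size are illustrative, and split_lazy's exact signature should be checked against your lhotse version:

```python
from lhotse import CutSet

# Lazily open the (possibly sorted) training manifest.
cuts = CutSet.from_jsonl_lazy("data/manifests/cuts_train.jsonl.gz")

# Write the cuts into several smaller manifests on disk; each split is itself lazy.
splits = cuts.split_lazy(output_dir="data/manifests/train_splits", chunk_size=500_000)

# At training time, interleave the splits at random. Combined with the sampler's
# shuffle buffer, this approximates a global shuffle without loading all cuts
# into memory.
train_cuts = CutSet.mux(*splits)
```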

@desh2608 (Collaborator, Author) commented Apr 26, 2022

@pzelasko Thanks for the suggestions. The pre-shuffling idea looks like the simplest, so I'll try that first and see how it goes.

@desh2608 (Collaborator, Author)

Pre-shuffling the training cuts seems to have fixed the issue. New training curve after 1 epoch (in blue):
[screenshot: training curves, with the new run in blue]

@pzelasko (Collaborator)

Very cool. Can you please compare the WER of both runs, at different epochs and after whole training? I'm curious if it actually affects the performance of the system in a significant way or not.

@desh2608 (Collaborator, Author) commented Apr 29, 2022

> Very cool. Can you please compare the WER of both runs, at different epochs and after whole training? I'm curious if it actually affects the performance of the system in a significant way or not.

For the original training (orange curve), the WERs on a small dev set (decoded with fast_beam_search) after the first 4 epochs are:

| Epoch | WER (old) | WER (new) |
|-------|-----------|-----------|
| 1     | 4.26      | 4.34      |
| 2     | 3.77      | 3.86      |
| 3     | 3.58      |           |
| 4     | 3.28      |           |

As a comparison, I got roughly 4% WER with a conformer-CTC model (but trained without speed perturbation). The corrected model hasn't trained far enough yet, so I haven't tried decoding with it, but the periodicity doesn't seem to have much (if any) impact on model performance.

Update: added WERs for the new model.

@desh2608 (Collaborator, Author) commented May 2, 2022

@csukuangfj I have been wondering recently about the batch size (max-duration) that we can use for the pruned transducer models. Even with 4 GPUs of 24G memory each, I am only able to use --max-duration 60; otherwise I run into OOM errors, which are thrown like this. With 8 GPUs of 32G memory each, I run into memory errors even with --max-duration 120. However, I have seen in the LibriSpeech recipe that you use batch sizes of up to 300 when training with 8 GPUs?

(Related to @wgb14's observation here)

@csukuangfj (Collaborator) commented May 2, 2022

> @csukuangfj I have been wondering recently about the batch size (max-duration) that we can use for the pruned transducer models. Even with 4 GPUs of 24G memory each, I am only able to use --max-duration 60; otherwise I run into OOM errors, which are thrown like this. With 8 GPUs of 32G memory each, I run into memory errors even with --max-duration 120. However, I have seen in the LibriSpeech recipe that you use batch sizes of up to 300 when training with 8 GPUs?
>
> (Related to @wgb14's observation here)

What's the distribution of your utterance durations? You may need to filter out short and long utterances.
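For reference, a duration-based filter in the style of the LibriSpeech recipe could look like the sketch below; the 1 s / 20 s bounds are placeholders and should be chosen from the actual duration statistics:

```python
from lhotse import load_manifest_lazy

train_cuts = load_manifest_lazy("data/manifests/cuts_train.jsonl.gz")

def remove_short_and_long_utt(c) -> bool:
    # Keep only cuts whose duration falls within a sensible range; the bounds
    # here are illustrative, not taken from this recipe.
    return 1.0 <= c.duration <= 20.0

train_cuts = train_cuts.filter(remove_short_and_long_utt)
```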

@desh2608 (Collaborator, Author) commented May 2, 2022

> What's the distribution of your utterance durations? You may need to filter out short and long utterances.

```
In [4]: cuts = load_manifest_lazy('data/manifests/cuts_train.jsonl.gz')

In [5]: cuts.describe()
Cuts count: 5886320
Total duration (hours): 15070.1
Speech duration (hours): 15070.1 (100.0%)
***
Duration statistics (seconds):
mean    9.2
std     2.8
min     4.6
25%     6.9
50%     8.9
75%     11.2
99%     16.0
99.5%   16.3
99.9%   16.6
max     16.7
```

I would say it's pretty uniform.

@danpovey (Collaborator) commented May 3, 2022

When you get an error, please print out the supervision object for the minibatch that fails. Then we can figure out what the issue is specifically, e.g. whether the batch is made of short utterances, long utterances, or whether there is a length mismatch.
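One way to capture that is to wrap the loss computation and dump the batch's supervisions before re-raising. A rough sketch follows; compute_loss stands in for whatever the training loop calls, and batch["supervisions"] is the supervision dict produced by lhotse's K2SpeechRecognitionDataset:

```python
import logging

try:
    loss, loss_info = compute_loss(
        params=params, model=model, batch=batch, is_training=True
    )
except RuntimeError:
    # batch["supervisions"] contains the cut IDs, start frames, frame counts and
    # texts of the failing minibatch; printing it shows whether the batch was made
    # of short utterances, long utterances, or had a length mismatch.
    logging.error(f"Failing batch supervisions: {batch['supervisions']}")
    raise
```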

@desh2608 (Collaborator, Author)

I finished training a pruned transducer model and here are the decoding results on the evaluation set:

| Decoding method      | val WER |
|----------------------|---------|
| greedy search        | 2.40    |
| beam search          | 2.24    |
| modified beam search | 2.30    |
| fast beam search     | 2.35    |

Here is the tensorboard: link

As a comparison, the SPGISpeech paper reports 2.6% using NeMo's conformer CTC and 2.3% using an ESPnet conformer encoder-decoder model. I believe the recipe is ready to be merged at this point.

@csukuangfj are there some instructions for how to prepare the model to be uploaded on HuggingFace? For instance, the above WERs are using --avg-last-n=10, so I suppose I need to average those models and prepare a "final" checkpoint that should be uploaded?

@csukuangfj (Collaborator)

> For instance, the above WERs are using --avg-last-n=10, so I suppose I need to average those models and prepare a "final" checkpoint that should be uploaded?

Please have a look at https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/export.py

You need to provide --iter xx --avg 10, where xx is your latest checkpoint-xx.pt.

After running export.py, you should get a file pretrained.pt, which can be uploaded to huggingface.

To upload files to huggingface, you first need to register an account. After creating an account, you can create a repo, clone it to your local computer, copy the files into your local clone, and then use git push to upload them to huggingface.

One thing to note is that you have to run sudo apt-get install git-lfs before cloning the repo from huggingface.
(*.pt files are tracked by git lfs by default in the cloned repo.)
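As an aside, instead of the git + git-lfs workflow, the huggingface_hub Python client can also push individual files once you are logged in (e.g. via huggingface-cli login). A hedged sketch with placeholder repo name and paths:

```python
from huggingface_hub import HfApi

api = HfApi()
# Upload the exported checkpoint to an existing model repo on the Hugging Face Hub.
# The repo_id and file paths below are placeholders, not the actual ones from this PR.
api.upload_file(
    path_or_fileobj="pruned_transducer_stateless2/exp/pretrained.pt",
    path_in_repo="exp/pretrained.pt",
    repo_id="your-username/icefall-asr-spgispeech-pruned-transducer-stateless2",
)
```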

@desh2608 (Collaborator, Author)

I have uploaded the pretrained model to HF. This PR is ready for review.

desh2608 changed the title from "[WIP] SPGISpeech recipe" to "SPGISpeech recipe" on May 13, 2022
@csukuangfj (Collaborator)

@desh2608

From https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/blob/main/log/modified_beam_search/errs-dev-beam_size_4-epoch-28-avg-15-beam-4.txt
there are some insertions at the end of utterances.

You can fix the insertion errors at the end of utterances by using #358 and re-running the decoding for greedy search and modified beam search. I think it will help reduce the WERs.

You need to
(1) Use https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py
(2) In your decode.py, use

```python
hyp_tokens = greedy_search_batch(
    model=model,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
)

hyp_tokens = modified_beam_search(
    model=model,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
    beam=params.beam_size,
)
```

That is, add encoder_out_lens=encoder_out_lens.

Review comment on the decode.py import block:

```python
import torch
import torch.nn as nn
from asr_datamodule import SPGISpeechAsrDataModule
from beam_search import (
```

Reviewer (Collaborator):

If you are going to synchronize with the latest master, please use

vimdiff ./decode.py  /path/to/librispeech/ASR/pruned_transducer_stateless2/decode.py

to find the differences.

PR author (Collaborator):

Okay thanks, I will synchronize and update.

Reviewer (Collaborator):

Please also fix the style issues reported by GitHub actions.

You can fix them locally by following
https://github.com/k2-fsa/icefall/blob/master/docs/source/contributing/code-style.rst

PR author (Collaborator):

I think the remaining style issues come from the LibriSpeech code that I have soft-linked.

Reviewer (Collaborator):

Ah, ok. Thanks! Merging.

@desh2608 (Collaborator, Author)

> @desh2608
>
> From https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/blob/main/log/modified_beam_search/errs-dev-beam_size_4-epoch-28-avg-15-beam-4.txt there are some insertions at the end of utterances.
>
> You can fix the insertion errors at the end of utterances by using #358 and re-running the decoding for greedy search and modified beam search. [...] That is, add encoder_out_lens=encoder_out_lens.

The WER for modified beam search improved from 2.30% to 2.24% with this change. WER remained unchanged for greedy search.

csukuangfj merged commit 5aafbb9 into k2-fsa:master on May 16, 2022
@csukuangfj (Collaborator)

@desh2608

Could you please upload a torchscript model to
https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/tree/main/exp

You can use export.py --jit=1 to obtain a torchscript model.

I would like to add the pre-trained torchscript model to
https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
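For reference, the resulting torchscript checkpoint can be loaded without any of the recipe's model code; a minimal sketch, assuming the cpu_jit.pt filename that export.py typically writes when --jit=1 is given:

```python
import torch

# Load the exported torchscript model; no icefall model definitions are needed.
model = torch.jit.load("cpu_jit.pt")
model.eval()
```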

@desh2608 (Collaborator, Author)

> @desh2608
>
> Could you please upload a torchscript model to https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/tree/main/exp
>
> You can use export.py --jit=1 to obtain a torchscript model.
>
> I would like to add the pre-trained torchscript model to https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition

I will do it this weekend (I'm on an internship right now and have limited access to the JHU cluster).

@jtrmal (Contributor) commented Jul 20, 2022 via email

@csukuangfj (Collaborator)

> I will do it this weekend (I'm on an internship right now and have limited access to the JHU cluster).

Thanks!

@desh2608 (Collaborator, Author)

@csukuangfj sorry it took a while (I just got back from my internship). I have uploaded a jitted model here.

@csukuangfj (Collaborator)

> @csukuangfj sorry it took a while (I just got back from my internship). I have uploaded a jitted model here.

Thanks!
