Extract framewise alignment information using CTC decoding #39

Merged · 17 commits · Oct 18, 2021

Conversation

csukuangfj
Collaborator

For the following test wave librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac,
the framewise alignment computed using the model from #17 is

(The first column is time in seconds; the second column is the BPE token, i.e., lattice.labels. Times have been scaled by the subsampling factor, which is 4.)

0.0, <blk>
0.04, <blk>
0.08, <blk>
0.12, <blk>
0.16, <blk>
0.2, <blk>
0.24, <blk>
0.28, <blk>
0.32, <blk>
0.36, <blk>
0.4, <blk>
0.44, <blk>
0.48, ▁THE
0.52, <blk>
0.56, <blk>
0.6, <blk>
0.64, ▁GOOD
0.68, <blk>
0.72, <blk>
0.76, <blk>
0.8, <blk>
0.84, <blk>
0.88, ▁NATURE

Lines of the form "x.xx, <blk>" are omitted below for display purposes.

1.12, D
1.16, D
1.32, ▁AUDIENCE
2.0, ▁IN
2.24, ▁PITY
2.64, ▁TO
2.88, ▁FALLEN
3.28, ▁MAJESTY
4.68, ▁SHOWED
5.04, ▁FOR
5.24, ▁ONCE
5.76, ▁GREATER
6.16, ▁DE
6.28, FER
6.32, FER
6.36, FER
6.4, ENCE
6.44, ENCE
6.72, ▁TO
6.88, ▁THE
7.04, ▁KING
7.88, ▁THAN
8.12, ▁TO
8.44, ▁MINISTER
9.64, ▁AND
9.96, ▁SUN
10.04, G
10.08, G
10.12, G
10.28, ▁THE
10.36, ▁P
10.4, ▁P
10.44, S
10.48, S
10.52, AL
10.56, AL
10.68, M
10.72, M
10.76, M
11.12, ▁WHICH
11.36, ▁THE
11.6, ▁FORMER
12.0, ▁HAD
12.28, ▁CALLED
12.6, ▁FOR

The alignment information for this wave from https://github.com/CorentinJ/librispeech-alignments is

8224-274384-0008 ",THE,GOOD,NATURED,AUDIENCE,IN,PITY,TO,FALLEN,MAJESTY,,SHOWED,FOR,ONCE,GREATER,DEFERENCE,TO,THE,KING,,THAN,TO,THE,MINISTER,,AND,SUNG,THE,PSALM,,WHICH,THE,FORMER,HAD,CALLED,FOR," 
"0.500,0.620,0.830,1.310,2.000,2.210,2.610,2.760,3.280,4.020,4.560,4.960,5.170,5.720,6.110,6.710,6.860,6.990,7.470,7.850,8.120,8.280,8.390,9.010,9.580,9.830,10.260,10.390,10.980,11.010,11.330,11.490,11.930,12.220,12.540,12.950,13.42" 
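That two-field format (a comma-separated word list, with empty entries marking silences, followed by the corresponding end times) can be turned into per-word intervals with a small parser. This is my own sketch, not code from the PR or from the librispeech-alignments repository; it assumes the k-th time is the end of the k-th entry, so a word starts where the previous entry ends:

```python
# Sketch: parse one utterance of the CorentinJ/librispeech-alignments format
# into (label, start_time, end_time) triples.  Empty entries in the word list
# denote silence; each time is the END of the corresponding entry, so an
# entry's start time is the previous entry's end time (0.0 for the first).

def parse_corentinj_alignment(words_field: str, times_field: str):
    words = words_field.strip('"').split(",")
    end_times = [float(t) for t in times_field.strip('"').split(",")]
    triples = []
    start = 0.0
    for word, end in zip(words, end_times):
        label = word if word else "<silence>"
        triples.append((label, start, end))
        start = end
    return triples


# A prefix of the utterance above: leading silence ends at 0.500, which is
# exactly where THE starts in the comparison below.
ali = parse_corentinj_alignment(
    '",THE,GOOD,NATURED,AUDIENCE"',
    '"0.500,0.620,0.830,1.310,2.000"',
)
```

Note that under this convention a word's end time is the next word's start time, which is relevant to the discussion of inserting blanks further down.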

The following table compares the alignment information obtained with this pull request against the alignment from
https://github.com/CorentinJ/librispeech-alignments:



| This PR (start, s) | Token | CorentinJ (start, s) |
|---|---|---|
| 0.48 | ▁THE | 0.500 |
| 0.64 | ▁GOOD | 0.620 |
| 0.88 | ▁NATURE | 0.830 |
| 1.12 | D | |
| 1.16 | D | |
| 1.32 | ▁AUDIENCE | 1.310 |
| 2.00 | ▁IN | 2.000 |
| 2.24 | ▁PITY | 2.210 |
| 2.64 | ▁TO | 2.610 |
| 2.88 | ▁FALLEN | 2.760 |
| 3.28 | ▁MAJESTY | 3.280 |
| | (silence) | 4.020 |
| 4.68 | ▁SHOWED | 4.560 |
| 5.04 | ▁FOR | 4.960 |
| 5.24 | ▁ONCE | 5.170 |
| 5.76 | ▁GREATER | 5.720 |
| 6.16 | ▁DE | 6.110 |
| 6.28 | FER | |
| 6.32 | FER | |
| 6.36 | FER | |
| 6.40 | ENCE | |
| 6.44 | ENCE | |
| 6.72 | ▁TO | 6.710 |
| 6.88 | ▁THE | 6.860 |
| 7.04 | ▁KING | 6.990 |
| | (silence) | 7.470 |
| 7.88 | ▁THAN | 7.850 |
| 8.12 | ▁TO | 8.120 |
| 8.28 | ▁THE | 8.280 |
| 8.44 | ▁MINISTER | 8.390 |
| | (silence) | 9.010 |
| 9.64 | ▁AND | 9.580 |
| 9.96 | ▁SUN | 9.830 |
| 10.04 | G | |
| 10.08 | G | |
| 10.12 | G | |
| 10.28 | ▁THE | 10.260 |
| 10.36 | ▁P | 10.390 |
| 10.40 | ▁P | |
| 10.44 | S | |
| 10.48 | S | |
| 10.52 | AL | |
| 10.56 | AL | |
| 10.68 | M | |
| 10.72 | M | |
| 10.76 | M | |
| | (silence) | 10.980 |
| 11.12 | ▁WHICH | 11.010 |
| 11.36 | ▁THE | 11.330 |
| 11.60 | ▁FORMER | 11.490 |
| 12.00 | ▁HAD | 11.930 |
| 12.28 | ▁CALLED | 12.220 |
| 12.60 | ▁FOR | 12.540 |
| | (end of FOR) | 12.950 |
| | (end of utterance) | 13.42 |

Since the model uses a subsampling factor of 4, the resolution of the alignment is 4 frames, i.e., 0.04 seconds,
as the frame shift is 0.01 seconds.
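The conversion between post-subsampling frame indices and the times listed above is a one-liner. The helper name and defaults below are mine, not from the PR; the constants match the setup described here (frame shift 0.01 s, subsampling factor 4):

```python
# Convert a post-subsampling frame index to seconds.
# With frame_shift = 0.01 s and subsampling_factor = 4, each output frame
# covers 0.04 s, which is the resolution of the alignment.

def frame_to_seconds(frame_index: int,
                     subsampling_factor: int = 4,
                     frame_shift: float = 0.01) -> float:
    return frame_index * subsampling_factor * frame_shift


# Frame 12 after subsampling corresponds to about 0.48 s, where ▁THE
# starts in the dump above.  Binary floating point is also why the raw
# dump contains values like 3.2800000000000002.
t = frame_to_seconds(12)
```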


To compare the alignment information in more detail, I selected the part of the wave corresponding to

4.68, ▁SHOWED    4.560
5.04, ▁FOR       4.960
5.24, ▁ONCE      5.170
5.76, ▁GREATER   5.720
6.16, ▁DE        6.110

The waveform and spectrogram of that part are shown in the following:

[Screenshot: waveform and spectrogram of the selected segment]

You can see that this pull request assigns 4.68 as the start time of SHOWED, which is closer to the actual start than the 4.560 from CorentinJ/librispeech-alignments.


@danpovey

Can we compute the alignment information ourselves using a pre-trained CTC model?
The reasons are:
(1) It is framewise (after subsampling), which is easier to use than word-level alignment.
(2) It is as accurate as the alignment computed with https://github.com/CorentinJ/librispeech-alignments, though I have compared only one wave so far.
(3) Users don't need to download extra alignment information, though they do have to pre-train a CTC model. For
datasets that don't have publicly available alignment information, this is the only way to go, I think.

@csukuangfj
Collaborator Author

csukuangfj commented Sep 9, 2021

The following shows the probabilities and log-probabilities of the alignment at each frame after subsampling.

You can see that the distribution is very spiky: the probability of the aligned token is almost always one.
Also, <blk> appears most of the time in the alignment, and tokens are rarely repeated,
even though CTC allows tokens to repeat.

If we use word alignment information, it is difficult, if not impossible, to insert blanks between words.

From lhotse-speech/lhotse#378 (comment)

We could add, say, -10 to arcs where the time is out of bounds. That should be more than enough to get training
started.

The word alignment from https://github.com/CorentinJ/librispeech-alignments assumes that a word's end time is the next word's start time. Furthermore, we would have to break words into tokens, which makes the implementation more complicated than using framewise alignment directly.

# time_in_seconds, token, log_prob, prob
 0.00, <blk>, -0.00003147, 1.0000
 0.04, <blk>, -0.00000691, 1.0000
 0.08, <blk>, -0.00001979, 1.0000
 0.12, <blk>, -0.00002837, 1.0000
 0.16, <blk>, -0.00002027, 1.0000
 0.20, <blk>, -0.00001824, 1.0000
 0.24, <blk>, -0.00001657, 1.0000
 0.28, <blk>, -0.00001097, 1.0000
 0.32, <blk>, -0.00000298, 1.0000
 0.36, <blk>, -0.00000155, 1.0000
 0.40, <blk>, -0.00000143, 1.0000
 0.44, <blk>, -0.00000107, 1.0000
 0.48, ▁THE, -0.00025388, 0.9997
 0.52, <blk>, -0.01130119, 0.9888
 0.56, <blk>, -0.00000012, 1.0000
 0.60, <blk>, -0.00000763, 1.0000
 0.64, ▁GOOD, -0.00003779, 1.0000
 0.68, <blk>, -0.00060409, 0.9994
 0.72, <blk>, 0.00000000, 1.0000
 0.76, <blk>, -0.00000036, 1.0000
 0.80, <blk>, 0.00000000, 1.0000
 0.84, <blk>, -0.00000215, 1.0000
 0.88, ▁NATURE, -0.00015877, 0.9998
 0.92, <blk>, -0.00000095, 1.0000
 0.96, <blk>, 0.00000000, 1.0000
 1.00, <blk>, -0.00000012, 1.0000
 1.04, <blk>, -0.00074550, 0.9993
 1.08, <blk>, -0.05740658, 0.9442
 1.12, D, -0.22990498, 0.7946
 1.16, D, -0.00000203, 1.0000
 1.20, <blk>, -0.01423326, 0.9859
 1.24, <blk>, 0.00000000, 1.0000
 1.28, <blk>, -0.00000143, 1.0000
 1.32, ▁AUDIENCE, -0.00001502, 1.0000
 1.36, <blk>, -0.00006163, 0.9999
 1.40, <blk>, 0.00000000, 1.0000
 1.44, <blk>, 0.00000000, 1.0000
 1.48, <blk>, -0.00000179, 1.0000
 1.52, <blk>, -0.00000572, 1.0000
 1.56, <blk>, -0.00000346, 1.0000
 1.60, <blk>, -0.00000119, 1.0000
 1.64, <blk>, -0.00000024, 1.0000
 1.68, <blk>, -0.00000048, 1.0000
 1.72, <blk>, -0.00000238, 1.0000
 1.76, <blk>, -0.00000417, 1.0000
 1.80, <blk>, -0.00000286, 1.0000
 1.84, <blk>, -0.00000143, 1.0000
 1.88, <blk>, -0.00000036, 1.0000
 1.92, <blk>, 0.00000000, 1.0000
 1.96, <blk>, -0.00033027, 0.9997
 2.00, ▁IN, -0.00001550, 1.0000
 2.04, <blk>, -0.00000024, 1.0000
 2.08, <blk>, 0.00000000, 1.0000
 2.12, <blk>, -0.00000012, 1.0000
 2.16, <blk>, 0.00000000, 1.0000
 2.20, <blk>, -0.00000620, 1.0000
 2.24, ▁PITY, -0.00004911, 1.0000
 2.28, <blk>, -0.00002813, 1.0000
 2.32, <blk>, 0.00000000, 1.0000
 2.36, <blk>, 0.00000000, 1.0000
 2.40, <blk>, -0.00000334, 1.0000
 2.44, <blk>, -0.00000632, 1.0000
 2.48, <blk>, -0.00000882, 1.0000
 2.52, <blk>, -0.00000465, 1.0000
 2.56, <blk>, -0.00000012, 1.0000
 2.60, <blk>, -0.00052927, 0.9995
 2.64, ▁TO, -0.00002944, 1.0000
 2.68, <blk>, -0.00027843, 0.9997
 2.72, <blk>, -0.00000143, 1.0000
 2.76, <blk>, -0.00005209, 0.9999
 2.80, <blk>, -0.00000036, 1.0000
 2.84, <blk>, -0.00000191, 1.0000
 2.88, ▁FALLEN, -0.01618837, 0.9839
 2.92, <blk>, -0.00004017, 1.0000
 2.96, <blk>, 0.00000000, 1.0000
 3.00, <blk>, -0.00000012, 1.0000
 3.04, <blk>, -0.00000739, 1.0000
 3.08, <blk>, -0.00076991, 0.9992
 3.12, <blk>, -0.00012767, 0.9999
 3.16, <blk>, -0.00000393, 1.0000
 3.20, <blk>, -0.00000691, 1.0000
 3.24, <blk>, -0.00000083, 1.0000
 3.28, ▁MAJESTY, -0.00005996, 0.9999
 3.32, <blk>, -0.00003171, 1.0000
 3.36, <blk>, -0.00000048, 1.0000
 3.40, <blk>, -0.00000072, 1.0000
 3.44, <blk>, -0.00000596, 1.0000
 3.48, <blk>, -0.00000262, 1.0000
 3.52, <blk>, -0.00000668, 1.0000
 3.56, <blk>, -0.00000703, 1.0000
 3.60, <blk>, -0.00000143, 1.0000
 3.64, <blk>, -0.00000083, 1.0000
 3.68, <blk>, -0.00000489, 1.0000
 3.72, <blk>, -0.00000632, 1.0000
 3.76, <blk>, -0.00000536, 1.0000
 3.80, <blk>, -0.00000060, 1.0000
 3.84, <blk>, -0.00000155, 1.0000
 3.88, <blk>, -0.00000215, 1.0000
 3.92, <blk>, -0.00000012, 1.0000
 3.96, <blk>, -0.00003135, 1.0000
 4.00, <blk>, -0.00002110, 1.0000
 4.04, <blk>, -0.00001812, 1.0000
 4.08, <blk>, -0.00001669, 1.0000
 4.12, <blk>, -0.00000894, 1.0000
 4.16, <blk>, 0.00000000, 1.0000
 4.20, <blk>, 0.00000000, 1.0000
 4.24, <blk>, 0.00000000, 1.0000
 4.28, <blk>, 0.00000000, 1.0000
 4.32, <blk>, -0.00001991, 1.0000
 4.36, <blk>, -0.00000250, 1.0000
 4.40, <blk>, -0.00000012, 1.0000
 4.44, <blk>, -0.00000012, 1.0000
 4.48, <blk>, -0.00000036, 1.0000
 4.52, <blk>, -0.00000024, 1.0000
 4.56, <blk>, -0.00000024, 1.0000
 4.60, <blk>, 0.00000000, 1.0000
 4.64, <blk>, -0.00002027, 1.0000
 4.68, ▁SHOWED, -0.00007963, 0.9999
 4.72, <blk>, -0.00177536, 0.9982
 4.76, <blk>, -0.00000513, 1.0000
 4.80, <blk>, -0.00001013, 1.0000
 4.84, <blk>, -0.00000942, 1.0000
 4.88, <blk>, -0.00000203, 1.0000
 4.92, <blk>, -0.00000024, 1.0000
 4.96, <blk>, 0.00000000, 1.0000
 5.00, <blk>, -0.00002503, 1.0000
 5.04, ▁FOR, -0.00002074, 1.0000
 5.08, <blk>, -0.00000525, 1.0000
 5.12, <blk>, -0.00000584, 1.0000
 5.16, <blk>, -0.00000083, 1.0000
 5.20, <blk>, -0.00009632, 0.9999
 5.24, ▁ONCE, -0.00003552, 1.0000
 5.28, <blk>, -0.00000346, 1.0000
 5.32, <blk>, 0.00000000, 1.0000
 5.36, <blk>, -0.00000024, 1.0000
 5.40, <blk>, -0.00000060, 1.0000
 5.44, <blk>, -0.00000191, 1.0000
 5.48, <blk>, -0.00000143, 1.0000
 5.52, <blk>, -0.00000072, 1.0000
 5.56, <blk>, -0.00000107, 1.0000
 5.60, <blk>, -0.00000107, 1.0000
 5.64, <blk>, -0.00000060, 1.0000
 5.68, <blk>, -0.00000012, 1.0000
 5.72, <blk>, -0.00000024, 1.0000
 5.76, ▁GREATER, -0.00002444, 1.0000
 5.80, <blk>, -0.00003076, 1.0000
 5.84, <blk>, 0.00000000, 1.0000
 5.88, <blk>, 0.00000000, 1.0000
 5.92, <blk>, -0.00000012, 1.0000
 5.96, <blk>, -0.00000370, 1.0000
 6.00, <blk>, -0.00000417, 1.0000
 6.04, <blk>, -0.00000393, 1.0000
 6.08, <blk>, -0.00000048, 1.0000
 6.12, <blk>, -0.01328650, 0.9868
 6.16, ▁DE, -0.00000274, 1.0000
 6.20, <blk>, -0.03788889, 0.9628
 6.24, <blk>, -0.00004351, 1.0000
 6.28, FER, -0.02246549, 0.9778
 6.32, FER, -0.00010347, 0.9999
 6.36, FER, -0.09728961, 0.9073
 6.40, ENCE, -0.00000823, 1.0000
 6.44, ENCE, -0.00153840, 0.9985
 6.48, <blk>, -0.00276101, 0.9972
 6.52, <blk>, -0.00000703, 1.0000
 6.56, <blk>, -0.00000882, 1.0000
 6.60, <blk>, -0.00001121, 1.0000
 6.64, <blk>, -0.00000679, 1.0000
 6.68, <blk>, -0.02932693, 0.9711
 6.72, ▁TO, -0.00000525, 1.0000
 6.76, <blk>, -0.04336526, 0.9576
 6.80, <blk>, -0.00000107, 1.0000
 6.84, <blk>, -0.01518593, 0.9849
 6.88, ▁THE, -0.00003123, 1.0000
 6.92, <blk>, -0.00050222, 0.9995
 6.96, <blk>, 0.00000000, 1.0000
 7.00, <blk>, -0.00000417, 1.0000
 7.04, ▁KING, -0.00004125, 1.0000
 7.08, <blk>, -0.00010740, 0.9999
 7.12, <blk>, -0.00000024, 1.0000
 7.16, <blk>, 0.00000000, 1.0000
 7.20, <blk>, -0.00000107, 1.0000
 7.24, <blk>, -0.00000167, 1.0000
 7.28, <blk>, -0.00000131, 1.0000
 7.32, <blk>, -0.00000107, 1.0000
 7.36, <blk>, -0.00000024, 1.0000
 7.40, <blk>, 0.00000000, 1.0000
 7.44, <blk>, -0.00003183, 1.0000
 7.48, <blk>, -0.00002265, 1.0000
 7.52, <blk>, -0.00001800, 1.0000
 7.56, <blk>, -0.00001657, 1.0000
 7.60, <blk>, -0.00000226, 1.0000
 7.64, <blk>, -0.00000024, 1.0000
 7.68, <blk>, 0.00000000, 1.0000
 7.72, <blk>, -0.00000012, 1.0000
 7.76, <blk>, -0.00000024, 1.0000
 7.80, <blk>, 0.00000000, 1.0000
 7.84, <blk>, -0.00002074, 1.0000
 7.88, ▁THAN, -0.00004506, 1.0000
 7.92, <blk>, -0.00075479, 0.9992
 7.96, <blk>, -0.00000131, 1.0000
 8.00, <blk>, -0.00001431, 1.0000
 8.04, <blk>, -0.00001860, 1.0000
 8.08, <blk>, -0.02900053, 0.9714
 8.12, ▁TO, -0.00001490, 1.0000
 8.16, <blk>, -0.30554244, 0.7367
 8.20, <blk>, -0.00000739, 1.0000
 8.24, <blk>, -0.00163338, 0.9984
 8.28, ▁THE, -0.00006485, 0.9999
 8.32, <blk>, -0.00049638, 0.9995
 8.36, <blk>, -0.00000036, 1.0000
 8.40, <blk>, -0.00018488, 0.9998
 8.44, ▁MINISTER, -0.00005007, 0.9999
 8.48, <blk>, -0.00001144, 1.0000
 8.52, <blk>, 0.00000000, 1.0000
 8.56, <blk>, -0.00000548, 1.0000
 8.60, <blk>, -0.00002360, 1.0000
 8.64, <blk>, -0.00000119, 1.0000
 8.68, <blk>, 0.00000000, 1.0000
 8.72, <blk>, -0.00000167, 1.0000
 8.76, <blk>, -0.00000203, 1.0000
 8.80, <blk>, -0.00000393, 1.0000
 8.84, <blk>, -0.00000727, 1.0000
 8.88, <blk>, -0.00000834, 1.0000
 8.92, <blk>, -0.00000024, 1.0000
 8.96, <blk>, -0.00003517, 1.0000
 9.00, <blk>, -0.00001943, 1.0000
 9.04, <blk>, -0.00001836, 1.0000
 9.08, <blk>, -0.00001597, 1.0000
 9.12, <blk>, -0.00001836, 1.0000
 9.16, <blk>, 0.00000000, 1.0000
 9.20, <blk>, 0.00000000, 1.0000
 9.24, <blk>, 0.00000000, 1.0000
 9.28, <blk>, 0.00000000, 1.0000
 9.32, <blk>, 0.00000000, 1.0000
 9.36, <blk>, -0.00002229, 1.0000
 9.40, <blk>, -0.00000095, 1.0000
 9.44, <blk>, 0.00000000, 1.0000
 9.48, <blk>, 0.00000000, 1.0000
 9.52, <blk>, -0.00000012, 1.0000
 9.56, <blk>, -0.00000119, 1.0000
 9.60, <blk>, -0.06021266, 0.9416
 9.64, ▁AND, -0.00002646, 1.0000
 9.68, <blk>, -0.00001132, 1.0000
 9.72, <blk>, -0.00000024, 1.0000
 9.76, <blk>, -0.00000036, 1.0000
 9.80, <blk>, -0.00000012, 1.0000
 9.84, <blk>, -0.00000024, 1.0000
 9.88, <blk>, -0.00000250, 1.0000
 9.92, <blk>, -0.00698115, 0.9930
 9.96, ▁SUN, -0.00033063, 0.9997
 10.00, <blk>, -0.02058383, 0.9796
 10.04, G, -0.45129657, 0.6368
 10.08, G, -0.00001228, 1.0000
 10.12, G, -0.00056930, 0.9994
 10.16, <blk>, -0.00321633, 0.9968
 10.20, <blk>, -0.00000143, 1.0000
 10.24, <blk>, -0.00706022, 0.9930
 10.28, ▁THE, -0.00005114, 0.9999
 10.32, <blk>, -0.02052602, 0.9797
 10.36, ▁P, -0.00117948, 0.9988
 10.40, ▁P, -0.00040618, 0.9996
 10.44, S, -0.00498766, 0.9950
 10.48, S, -0.05732632, 0.9443
 10.52, AL, -0.00001538, 1.0000
 10.56, AL, -0.01960917, 0.9806
 10.60, <blk>, -0.00001001, 1.0000
 10.64, <blk>, -0.01069422, 0.9894
 10.68, M, -0.00004792, 1.0000
 10.72, M, -0.00003147, 1.0000
 10.76, M, -0.55103242, 0.5764
 10.80, <blk>, -0.00017129, 0.9998
 10.84, <blk>, -0.00004732, 1.0000
 10.88, <blk>, -0.00000298, 1.0000
 10.92, <blk>, -0.00002944, 1.0000
 10.96, <blk>, 0.00000000, 1.0000
 11.00, <blk>, 0.00000000, 1.0000
 11.04, <blk>, 0.00000000, 1.0000
 11.08, <blk>, -0.00078945, 0.9992
 11.12, ▁WHICH, -0.00000989, 1.0000
 11.16, <blk>, -0.00007295, 0.9999
 11.20, <blk>, 0.00000000, 1.0000
 11.24, <blk>, -0.00000024, 1.0000
 11.28, <blk>, -0.00000048, 1.0000
 11.32, <blk>, -0.00032694, 0.9997
 11.36, ▁THE, -0.00002420, 1.0000
 11.40, <blk>, -0.02687876, 0.9735
 11.44, <blk>, -0.00000298, 1.0000
 11.48, <blk>, -0.00000358, 1.0000
 11.52, <blk>, -0.00000048, 1.0000
 11.56, <blk>, -0.00000632, 1.0000
 11.60, ▁FORMER, -0.00006521, 0.9999
 11.64, <blk>, -0.00000560, 1.0000
 11.68, <blk>, 0.00000000, 1.0000
 11.72, <blk>, 0.00000000, 1.0000
 11.76, <blk>, -0.00000048, 1.0000
 11.80, <blk>, -0.00000298, 1.0000
 11.84, <blk>, -0.00000167, 1.0000
 11.88, <blk>, -0.00000358, 1.0000
 11.92, <blk>, -0.00000024, 1.0000
 11.96, <blk>, -0.00019584, 0.9998
 12.00, ▁HAD, -0.00001574, 1.0000
 12.04, <blk>, -0.00001311, 1.0000
 12.08, <blk>, 0.00000000, 1.0000
 12.12, <blk>, -0.00000370, 1.0000
 12.16, <blk>, -0.00000191, 1.0000
 12.20, <blk>, 0.00000000, 1.0000
 12.24, <blk>, -0.00005305, 0.9999
 12.28, ▁CALLED, -0.00006389, 0.9999
 12.32, <blk>, -0.00027057, 0.9997
 12.36, <blk>, -0.00000715, 1.0000
 12.40, <blk>, -0.00000429, 1.0000
 12.44, <blk>, -0.00000572, 1.0000
 12.48, <blk>, -0.00000060, 1.0000
 12.52, <blk>, 0.00000000, 1.0000
 12.56, <blk>, -0.00000668, 1.0000
 12.60, ▁FOR, -0.00000465, 1.0000
 12.64, <blk>, -0.00280879, 0.9972
 12.68, <blk>, 0.00000000, 1.0000
 12.72, <blk>, -0.00000131, 1.0000
 12.76, <blk>, -0.00000978, 1.0000
 12.80, <blk>, -0.00001717, 1.0000
 12.84, <blk>, -0.00000274, 1.0000
 12.88, <blk>, -0.00003076, 1.0000
 12.92, <blk>, -0.00001824, 1.0000
 12.96, <blk>, -0.00001526, 1.0000
 13.00, <blk>, -0.00001466, 1.0000
 13.04, <blk>, -0.00001729, 1.0000
 13.08, <blk>, -0.00000358, 1.0000
 13.12, <blk>, -0.00000107, 1.0000
 13.16, <blk>, -0.00000048, 1.0000
 13.20, <blk>, 0.00000000, 1.0000
 13.24, <blk>, 0.00000000, 1.0000
 13.28, <blk>, -0.00000215, 1.0000
 13.32, <blk>, -0.00000119, 1.0000
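The per-frame dump above can be collapsed into the (start time, token) pairs shown earlier by keeping the first frame of each run of identical non-blank tokens. The helper below is my own sketch, not code from this PR; note that it cannot distinguish a genuine doubled token from a continued one unless a blank separates the two occurrences, which is fine here because, as noted above, tokens are rarely repeated:

```python
# Sketch: collapse a framewise CTC alignment (one token string per
# post-subsampling frame) into (start_time_in_seconds, token) pairs by
# taking the first frame of each run of identical non-blank tokens.

def collapse_ctc_alignment(tokens, subsampling_factor=4, frame_shift=0.01):
    out = []
    prev = "<blk>"
    for i, tok in enumerate(tokens):
        if tok != "<blk>" and tok != prev:
            out.append((i * subsampling_factor * frame_shift, tok))
        prev = tok
    return out


# A toy framewise alignment mimicking the start of the dump above:
# 12 blank frames, ▁THE at frame 12, 3 blanks, then ▁GOOD held for 2 frames.
frames = ["<blk>"] * 12 + ["▁THE"] + ["<blk>"] * 3 + ["▁GOOD", "▁GOOD"]
starts = collapse_ctc_alignment(frames)
# ▁THE starts near 0.48 s and ▁GOOD near 0.64 s, matching the dump.
```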

@danpovey
Collaborator

danpovey commented Sep 9, 2021

Sure, I think this approach makes sense. Certainly we will need to have scripts to compute alignments, at some point.

@csukuangfj csukuangfj changed the title Extract framewise alignment information using CTC decoding WIP: Extract framewise alignment information using CTC decoding Sep 23, 2021
@csukuangfj
Collaborator Author

Unlike features, I would propose storing framewise alignment information separately.
The reason is that we may try different modelling units, e.g., phones or BPE (with varying vocabulary sizes), each
of which yields a different framewise alignment.

We can have the following layout:

data/ali-500
|-- test_clean.pt
|-- test_other.pt
|-- train-960.pt
`-- valid.pt
data/ali-5000
|-- test_clean.pt
|-- test_other.pt
|-- train-960.pt
`-- valid.pt

where data/ali-500 contains the alignment when we use BPE with a vocab size equal to 500.

alignments: Dict[str, List[int]],

Alignments are indexed by utterance IDs, i.e., cut IDs.
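A minimal sketch of that layout follows. The PR stores the dictionaries as .pt files via torch.save; plain pickle is used here only to keep the sketch dependency-free, and the cut ID and token IDs are illustrative, not taken from the actual data:

```python
import os
import pickle
import tempfile

# Proposed structure: alignments are a Dict[str, List[int]] mapping cut IDs
# to per-frame token IDs (after subsampling).  One such dict per split, e.g.
# data/ali-500/valid.pt for BPE with a vocab size of 500.
alignments = {
    "8224-274384-0008": [0, 0, 0, 5, 0],  # hypothetical token IDs
}

# Round-trip through a file, as the training script would at startup.
path = os.path.join(tempfile.mkdtemp(), "valid.pt")
with open(path, "wb") as f:
    pickle.dump(alignments, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
# loaded["8224-274384-0008"] recovers the framewise alignment for that cut ID.
```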

@csukuangfj
Collaborator Author

The alignment does not occupy much memory. I think we can keep it in memory and look it up on the fly:

$ ls -lh data/ali-500/
total 61M
-rw-r--r-- 1 xxx xxx 1.1M Sep 23 20:26 test_clean.pt
-rw-r--r-- 1 xxx xxx 1.1M Sep 23 20:27 test_other.pt
-rw-r--r-- 1 xxx xxx  57M Sep 23 20:49 train-960.pt
-rw-r--r-- 1 xxx xxx 2.1M Sep 23 20:51 valid.pt

@pzelasko
Collaborator

Would there be interest to store the alignments in Cuts using the proposed mechanisms described in lhotse-speech/lhotse#393?

@danpovey
Collaborator

I'm personally OK with either method but I'll let Fangjun do whatever is easiest for him.

@csukuangfj
Collaborator Author

For this specific task, i.e., using alignment information in MMI training, I feel it is easier to store the alignment separately.
When we start training, we load the alignment information into memory; it is quite small, so the alignment for the whole training set can be kept in memory.

After getting a cut from the dataloader, we can use cut.id to look up its alignment.
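The lookup itself is trivial once the dict is in memory. This is a hypothetical helper of mine (the function name, cut IDs, and token IDs are not from the PR), sketching how a batch of cut IDs could be mapped to alignments during training:

```python
from typing import Dict, List

# Sketch: per-batch lookup of in-memory framewise alignments by cut ID,
# as could be done inside an MMI training loop.

def lookup_alignments(cut_ids: List[str],
                      alignments: Dict[str, List[int]]) -> List[List[int]]:
    # Cuts without a stored alignment get an empty list, so the caller
    # can decide to skip them (or fall back to plain MMI).
    return [alignments.get(cut_id, []) for cut_id in cut_ids]


ali = {"cut-1": [0, 0, 7], "cut-2": [0, 3]}
batch = lookup_alignments(["cut-2", "cut-1", "cut-404"], ali)
# → [[0, 3], [0, 0, 7], []]
```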


I agree the approach in lhotse-speech/lhotse#393 is more general. However, it needs more work, I think (I haven't figured out how it would be implemented).

@csukuangfj csukuangfj changed the title WIP: Extract framewise alignment information using CTC decoding Extract framewise alignment information using CTC decoding Sep 24, 2021