
WIP: Try to use multiple datasets with pruned transducer loss #245

Closed
wants to merge 10 commits

Conversation

csukuangfj
Collaborator

It also refactors the decoder and joiner to remove the extra nn.Linear() layer.

Will try #229 with this PR.

@csukuangfj csukuangfj changed the title WIP: Try to use multiple dataset with pruned transducer loss WIP: Try to use multiple datasets with pruned transducer loss Mar 9, 2022
@pkufool
Collaborator

pkufool commented Mar 10, 2022

If we remove the extra nn.Linear(), the encoder_out_dim should be greater than the vocab-size; otherwise it will cause an error in rnnt_loss_simple/smoothed. It's OK for this PR (i.e., encoder_out_dim=512, vocab-size=500), but I think we should add some documentation to clarify that.

[edit] I mean the extra nn.Linear() in the decoder. The extra nn.Linear() in the joiner is there to reduce parameters (if the vocab-size is large) and can be removed.
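
For illustration, a minimal sketch (toy shapes, not the actual k2 code) of why this constraint arises when the encoder output is fed directly into the loss: the simple loss gathers the channel indexed by each symbol id from the encoder output, so every symbol id must be a valid channel index, i.e. encoder_out_dim must be at least vocab-size.

    import torch

    B, S, T = 2, 3, 4
    vocab_size = 500
    encoder_out_dim = 256  # deliberately smaller than vocab_size

    am = torch.randn(B, T, encoder_out_dim)
    # Some symbol ids exceed encoder_out_dim - 1:
    symbols = torch.tensor([[499, 10, 300], [5, 450, 260]])

    # Mirrors the px_am gather in k2.get_rnnt_logprobs (simplified):
    px_am = torch.gather(
        am.unsqueeze(1).expand(B, S, T, encoder_out_dim),
        dim=3,
        index=symbols.reshape(B, S, 1, 1).expand(B, S, T, 1),
    )
    # Raises RuntimeError: index out of bounds, because ids like 499
    # have no corresponding channel to gather from.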

@csukuangfj
Collaborator Author

It's OK for this PR

Yes, for the librispeech recipe we are using a vocab size of 500, so the nn.Linear() layers in the decoder and joiner are not necessary. We can keep them for aishell.

I think we should add some documentation to clarify that.

I don't know the underlying reason. Maybe we should document it in k2.

@danpovey
Collaborator

Guys,
I noticed when experimenting with systems with d_model=256 that they actually perform poorly if the joiner input has dim=256; it's necessary to set it to dim=512 (I did not try larger).
So I know I previously said we should just let that dim be the same as d_model, but in fact I'm not so sure about this now.

@danpovey
Collaborator

... if we're using the pruned-loss training, it might be worthwhile trying with encoder-output-dim = 1024.

@csukuangfj
Collaborator Author

the encoder_out_dim should be greater than the vocab-size; otherwise it will cause an error in rnnt_loss_simple/smoothed

Can it be fixed on the k2 side so that we can use a larger encoder_out_dim without adding extra nn.Linear() layers?

@danpovey
Collaborator

Guys, I'm not so enthusiastic about avoiding the extra linear layer if it requires that embedding_dim >= vocab_size.
The problem is that it makes the joiner kind of meaningless: it is not able to learn a nontrivial function, because we are forcing its input to have a particular meaning.

@csukuangfj
Collaborator Author

... if we're using the pruned-loss training, it might be worthwhile trying with encoder-output-dim = 1024.

In that case, more than half of the encoder outputs are not used in k2.get_rnnt_logprobs() when vocab size is 500.
https://github.com/k2-fsa/k2/blob/master/k2/python/k2/rnnt_loss.py#L155

    px_am = torch.gather(
        am.unsqueeze(1).expand(B, S, T, C),
        dim=3,
        index=symbols.reshape(B, S, 1, 1).expand(B, S, T, 1),
    ).squeeze(
        -1
    )  # [B][S][T]

You can see that only the first half of the channels of am is used, since entries in symbols are less than 500 while C is 1024.
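
A small numeric check of this point (toy shapes, not the training code): zeroing out channels 500..1023 of am leaves px_am unchanged, so those channels receive no gradient from the simple loss.

    import torch

    B, S, T, C, vocab_size = 1, 3, 4, 1024, 500
    am = torch.randn(B, T, C)
    symbols = torch.randint(0, vocab_size, (B, S))

    def gather_px_am(am):
        return torch.gather(
            am.unsqueeze(1).expand(B, S, T, C),
            dim=3,
            index=symbols.reshape(B, S, 1, 1).expand(B, S, T, 1),
        ).squeeze(-1)  # [B][S][T]

    am_zeroed = am.clone()
    am_zeroed[..., vocab_size:] = 0.0
    # Channels >= vocab_size are never gathered, so the result is identical.
    assert torch.equal(gather_px_am(am), gather_px_am(am_zeroed))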

@danpovey
Collaborator

@csukuangfj I think it is a mistake to be confusing or identifying the encoder_output_dim with the vocabulary size; I think there should be a projection from one to the other. But actually, in my opinion, it might make more sense to conceptualize the encoder_output_dim as the "hidden dim" of the joiner, i.e. where the nonlinearity (tanh) takes place.

That is: we'd change it so the network would have output of dim==attention-dim (i.e. no linear projection at the output), and we could project that in different ways in the Transducer model:
(i) project from d_model to vocab_size for use in simple/pruned loss; we could perhaps have a version of the Decoder that projects directly to vocab_size for this purpose.
(ii) have a separate projection from d_model to joiner_dim (which we conceptualize as a hidden-dim of the joiner), and have a separate version of the decoder, sharing the embedding, that projects to joiner_dim. joiner_dim could be a bit larger, like 1024. i.e. we'd have decoder and simple_decoder, where decoder output-dim == joiner_dim and simple_decoder output-dim == vocab_size.
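
A minimal nn.Module sketch of the layout described above; all names here (TransducerHeads, simple_am_proj, am_proj, joiner_dim, ...) are hypothetical and only meant to make the two projection paths concrete. In this wording, decoder would project to joiner_dim and simple_decoder to vocab_size while sharing the embedding; the sketch collapses that into two linear heads on a common d_model output.

    import torch
    import torch.nn as nn

    class TransducerHeads(nn.Module):
        def __init__(self, d_model: int, joiner_dim: int, vocab_size: int):
            super().__init__()
            # (i) projections to vocab_size, used only for the simple loss / pruning ranges
            self.simple_am_proj = nn.Linear(d_model, vocab_size)
            self.simple_lm_proj = nn.Linear(d_model, vocab_size)
            # (ii) projections to the joiner's hidden dim, where the tanh nonlinearity lives
            self.am_proj = nn.Linear(d_model, joiner_dim)
            self.lm_proj = nn.Linear(d_model, joiner_dim)
            self.joiner_out = nn.Linear(joiner_dim, vocab_size)

        def joiner(self, am: torch.Tensor, lm: torch.Tensor) -> torch.Tensor:
            # am, lm: (..., joiner_dim), already projected by am_proj / lm_proj
            return self.joiner_out(torch.tanh(am + lm))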


@csukuangfj
Collaborator Author

Thanks! I see. Will make the change.

mask = make_pad_mask(lengths)
x = self.encoder(x, src_key_padding_mask=mask)  # (T, N, C)

x = x.permute(1, 0, 2)  # (T, N, C) -> (N, T, C)
Collaborator Author

The last nn.Linear() from the transformer model is removed.

if self.normalize_before:
    x = self.after_norm(x)

x = x.permute(1, 0, 2)  # (T, N, C) -> (N, T, C)
Collaborator Author

The last nn.Linear() from the conformer model is removed.

src = residual + self.ff_scale * self.dropout(self.feed_forward(src))
if not self.normalize_before:
    src = self.norm_ff(src)

Collaborator Author

The last nn.LayerNorm of the conformer encoder layer is also removed. Otherwise, when normalize_before is True:
(1) The output of the LayerNorm of the i-th encoder layer is fed into the input of the LayerNorm of the (i+1)-th encoder layer.

(2) The output of the LayerNorm of the last encoder layer is fed into the input of the LayerNorm in the conformer model.
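
A toy check of the back-to-back LayerNorm point: at initialization (affine weight=1, bias=0) the second LayerNorm is essentially a no-op on the output of the first, which is why keeping both is redundant.

    import torch
    import torch.nn as nn

    x = torch.randn(10, 512)
    ln1 = nn.LayerNorm(512)
    ln2 = nn.LayerNorm(512)

    # ln1(x) is already normalized per row, so normalizing it again changes almost nothing.
    assert torch.allclose(ln2(ln1(x)), ln1(x), atol=1e-4)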

"subsampling_factor": 4,
"attention_dim": 512,
"decoder_embedding_dim": 512,
"joiner_dim": 1024, # input dim of the joiner
Collaborator Author

Joiner dim is set to 1024.

boundary[:, 2] = y_lens
boundary[:, 3] = x_lens

simple_decoder_out = simple_decoder_linear(decoder_out)
Collaborator Author

Two nn.Linear() layers are used to transform the encoder output and decoder output for computing the simple loss.
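
A hedged sketch of this simple-loss path with dummy tensors. simple_encoder_linear / simple_decoder_linear stand in for the two nn.Linear() layers mentioned above; the k2.rnnt_loss_smoothed argument names follow my recollection of the k2 API at the time and should be checked against the k2 version in use.

    import k2
    import torch
    import torch.nn as nn

    N, T, S = 2, 20, 5
    encoder_out_dim, decoder_out_dim = 512, 512
    vocab_size, blank_id = 500, 0

    encoder_out = torch.randn(N, T, encoder_out_dim)      # from the conformer
    decoder_out = torch.randn(N, S + 1, decoder_out_dim)  # from the decoder
    symbols = torch.randint(1, vocab_size, (N, S))
    boundary = torch.zeros(N, 4, dtype=torch.int64)
    boundary[:, 2] = S  # y_lens
    boundary[:, 3] = T  # x_lens

    simple_encoder_linear = nn.Linear(encoder_out_dim, vocab_size)
    simple_decoder_linear = nn.Linear(decoder_out_dim, vocab_size)

    simple_encoder_out = simple_encoder_linear(encoder_out)
    simple_decoder_out = simple_decoder_linear(decoder_out)

    simple_loss, (px_grad, py_grad) = k2.rnnt_loss_smoothed(
        lm=simple_decoder_out,
        am=simple_encoder_out,
        symbols=symbols,
        termination_symbol=blank_id,
        boundary=boundary,
        return_grad=True,
    )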

am=simple_encoder_out, lm=simple_decoder_out, ranges=ranges
)

am_pruned = encoder_linear(am_pruned)
Collaborator Author

Two nn.Linear() layers are used to transform the pruned outputs to the dimension of joiner_dim.
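
Continuing the sketch from the simple-loss comment above (it reuses the tensors defined there), a hedged version of the pruned path as I read this hunk: the pruned outputs are projected to joiner_dim by encoder_linear / decoder_linear before the joiner. prune_range, the joiner internals, and the exact k2 argument names are assumptions to be checked against the k2 / icefall code.

    import k2
    import torch
    import torch.nn as nn

    joiner_dim = 1024
    prune_range = 3  # number of symbols kept per frame (assumed value)

    encoder_linear = nn.Linear(vocab_size, joiner_dim)
    decoder_linear = nn.Linear(vocab_size, joiner_dim)
    joiner_output_linear = nn.Linear(joiner_dim, vocab_size)

    ranges = k2.get_rnnt_prune_ranges(
        px_grad=px_grad, py_grad=py_grad, boundary=boundary, s_range=prune_range
    )
    am_pruned, lm_pruned = k2.do_rnnt_pruning(
        am=simple_encoder_out, lm=simple_decoder_out, ranges=ranges
    )

    # Project the pruned outputs to joiner_dim, combine, and compute the pruned loss.
    am_pruned = encoder_linear(am_pruned)
    lm_pruned = decoder_linear(lm_pruned)
    logits = joiner_output_linear(torch.tanh(am_pruned + lm_pruned))

    pruned_loss = k2.rnnt_loss_pruned(
        logits=logits,
        symbols=symbols,
        ranges=ranges,
        termination_symbol=blank_id,
        boundary=boundary,
    )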

@danpovey
Collaborator

Cool. We should experiment to see whether joiner_dim=512 or joiner_dim=1024 works better, e.g. with a few epochs. I imagine 1024 will be an easy win, but we'll see.

@danpovey
Collaborator

danpovey commented May 4, 2022

Why is this not merged yet? Was it worse? [oh, I see, this is not the latest pruned_transducer_stateless2 setup...]

@csukuangfj
Collaborator Author

Why is this not merged yet? Was it worse? [oh, I see, this is not the latest pruned_transducer_stateless2 setup...]

Closing via #312

@csukuangfj csukuangfj closed this May 4, 2022
@csukuangfj csukuangfj deleted the pruned-multi-dataset branch May 4, 2022 08:13