Model Training

irene024081 edited this page Apr 27, 2018 · 1 revision

CNN-RNN Model

The architecture contains one 1D convolutional layer, one simple RNN layer, and one Time Distributed dense layer. We trained this model on 100 hours of clean speech; its validation loss is around 120.
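The forward pass of this architecture can be sketched in NumPy. The layer sizes, filter width, and output dimension below are illustrative assumptions, not the project's actual hyperparameters:

```python
import numpy as np

def conv1d(x, w, b, stride=1):
    """Valid 1D convolution over time. x: (T, C_in); w: (K, C_in, C_out)."""
    K, _, c_out = w.shape
    T = (x.shape[0] - K) // stride + 1
    out = np.zeros((T, c_out))
    for t in range(T):
        window = x[t * stride : t * stride + K]              # (K, C_in)
        out[t] = np.tensordot(window, w, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0)  # ReLU activation

def simple_rnn(x, wx, wh, b):
    """Elman-style simple RNN. x: (T, C); returns hidden states (T, H)."""
    H = wh.shape[0]
    h = np.zeros(H)
    hs = np.zeros((x.shape[0], H))
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh + b)
        hs[t] = h
    return hs

def time_distributed_dense(h, w, b):
    """Same dense projection applied independently at every time step,
    followed by a softmax over the output (e.g. character) dimension."""
    logits = h @ w + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Chaining `conv1d`, `simple_rnn`, and `time_distributed_dense` over a sequence of spectrogram frames yields one probability distribution per time step, which is the shape a CTC-style speech model consumes.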

Bidirectional RNN model (BRNN)

A BRNN extends the RNN by processing the sequence in both the positive and negative time directions, so each time step can use both past and future context. The vanilla BRNN model was trained on the development set of clean speech and reached a plateau after 15 epochs of training. We apply the BRNN idea in combination with our other models.
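A minimal sketch of the bidirectional idea: run one RNN pass forward in time, one on the time-reversed input, and concatenate the per-step hidden states. The hidden size and weight shapes are assumptions for illustration:

```python
import numpy as np

def rnn_pass(x, wx, wh, b):
    """One directional pass of a simple RNN. x: (T, C) -> hidden states (T, H)."""
    h = np.zeros(wh.shape[0])
    hs = np.zeros((x.shape[0], wh.shape[0]))
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh + b)
        hs[t] = h
    return hs

def brnn(x, fwd_params, bwd_params):
    """Bidirectional RNN: forward pass plus a pass over the reversed input,
    re-aligned to forward time order and concatenated per step."""
    h_fwd = rnn_pass(x, *fwd_params)
    h_bwd = rnn_pass(x[::-1], *bwd_params)[::-1]   # reverse back to align with h_fwd
    return np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2H)
```

The output dimension doubles, since each time step now carries features from both directions.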

LSTM model

The architecture of our LSTM model contains 3 LSTM layers and 8 Time Distributed Dense layers, with each LSTM layer followed by 2 Time Distributed layers. We first trained the LSTM model on 100 hours of clean data; the validation loss is around 108, better than the CNN-RNN model. Training on 360 hours of clean data dropped the validation loss to 88.

We added model complexity by stacking additional LSTM layers. The validation loss on 360 hours of clean data dropped to 84.
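The LSTM recurrence and the stacking step above can be sketched as follows. The gate layout and hidden size are standard-but-assumed details, not taken from the project's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer(x, w, u, b):
    """One LSTM layer over time. x: (T, C); w: (C, 4H); u: (H, 4H); b: (4H,).
    Gate order assumed here: input, forget, cell candidate, output."""
    H = u.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    hs = np.zeros((x.shape[0], H))
    for t in range(x.shape[0]):
        z = x[t] @ w + h @ u + b
        i, f, g, o = z[:H], z[H:2 * H], z[2 * H:3 * H], z[3 * H:]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated cell-state update
        h = sigmoid(o) * np.tanh(c)                    # gated hidden output
        hs[t] = h
    return hs
```

Stacking is simply feeding one layer's hidden-state sequence into the next `lstm_layer` call, which is how the deeper variant above increases capacity.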

Baidu Deep Speech 2 Model

In a recent research paper by Baidu [1], researchers use a similar end-to-end deep learning approach and achieve state-of-the-art results on several speech recognition benchmarks.