This repository contains the code and experiments for my master's thesis, which applied Deep Learning techniques to the field of Speech Recognition. The main objective of the research was to analyze audio time-series data and develop a robust voice command recognition system using Neural Network models, targeting two datasets: one derived from Greek dialogues and the other from English voice commands.
The research tackled the problem of voice command recognition using three different approaches, implemented with Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks.
In the first experiment, a CNN-based model was designed to classify voice commands from the Greek and English datasets. Special attention was given to the imbalanced class distribution of the Greek dataset, which was addressed with data augmentation techniques to ensure reliable performance.
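The thesis notebooks contain the exact augmentation pipeline; as a minimal illustration of the idea, two common waveform-level augmentations (noise injection and time shifting) can be sketched as below. The function names and parameters are illustrative, not taken from the repository.

```python
import numpy as np

def add_noise(wave, noise_factor=0.005, rng=None):
    """Mix low-amplitude Gaussian noise into the waveform (illustrative augmentation)."""
    rng = rng or np.random.default_rng(0)
    return wave + noise_factor * rng.standard_normal(len(wave))

def time_shift(wave, shift):
    """Shift the waveform by `shift` samples, padding the vacated region with zeros."""
    shifted = np.zeros_like(wave)
    if shift >= 0:
        shifted[shift:] = wave[:len(wave) - shift]
    else:
        shifted[:shift] = wave[-shift:]
    return shifted
```

Applying such transforms to the minority-class recordings yields additional, slightly perturbed training examples, which is one standard way to counter class imbalance in audio classification.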
The second experiment aimed at building a speech-to-text system. A hybrid CNN-LSTM architecture was trained with Connectionist Temporal Classification (CTC) to align spoken audio with individual characters, enabling transcription of the voice commands into text.
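At inference time, a CTC model's frame-wise character predictions must be collapsed into a transcript. A minimal sketch of greedy CTC decoding (argmax per frame, merge repeats, drop blanks) is shown below; the blank index and the `id_to_char` mapping are assumptions for illustration, not the repository's actual configuration.

```python
import numpy as np

BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)

def ctc_greedy_decode(logits, id_to_char):
    """Greedy CTC decoding: best label per frame, collapse repeats, remove blanks."""
    best_path = np.argmax(logits, axis=-1)  # shape (time,): most likely label per frame
    decoded = []
    prev = None
    for idx in best_path:
        # Emit a character only when the label changes and is not the blank.
        if idx != prev and idx != BLANK:
            decoded.append(id_to_char[idx])
        prev = idx
    return "".join(decoded)
```

For example, a per-frame best path of `a a - a b b -` (with `-` the blank) collapses to `aab`, which is how CTC distinguishes a repeated letter from a letter held across several frames.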
In the third and final experiment, the CNN from the first experiment was extended to implement a voice command similarity search based on the cosine similarity metric. This approach allowed efficient retrieval and comparison of similar voice commands.
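The core of such a similarity search is comparing a query embedding against the stored command embeddings by cosine similarity. A minimal sketch, assuming the CNN has already produced fixed-length embedding vectors (the function name and shapes are illustrative):

```python
import numpy as np

def cosine_topk(query_emb, command_embs, k=3):
    """Return indices of the k stored commands most similar to the query,
    ranked by cosine similarity, together with all similarity scores."""
    q = query_emb / np.linalg.norm(query_emb)
    m = command_embs / np.linalg.norm(command_embs, axis=1, keepdims=True)
    sims = m @ q                         # cosine similarity against every stored command
    top = np.argsort(sims)[::-1][:k]     # highest-similarity indices first
    return top, sims
```

Because the vectors are normalized once, each query reduces to a single matrix-vector product, which is what makes retrieval over a command database efficient.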
- `1st_experiment.ipynb` includes the implementation of a voice command classifier for both datasets.
- `1st_experiment(greek data augmentation).ipynb` includes the implementation of the voice command classifier with data augmentation techniques, in order to improve classification accuracy on the Greek dataset.
- `2nd_experiment.ipynb` includes the implementation of an automatic speech recognition model using Connectionist Temporal Classification for both datasets.
- `3rd_experiment.ipynb` includes the implementation of similarity-based recognition and indexing of voice commands for both datasets.
These three experiments produced three reliable systems, all capable of recognizing voice commands with high accuracy. The models demonstrate strong potential for future applications in automatic speech recognition and voice-activated systems.