Speech-to-Image

Table of Contents
  1. Introduction
  2. Speech to Text
  3. Text to Image
  4. References

Introduction

In this project we translate speech into images in two steps. The speech signal is first converted to text using automatic speech recognition (ASR), and high-quality images are then generated from the text descriptions using StackGAN.

Step-1 Speech to Text

Automatic speech recognition (ASR) consists of transcribing audio speech segments into text. ASR can be treated as a sequence-to-sequence problem, where the audio can be represented as a sequence of feature vectors and the text as a sequence of characters, words, or subword tokens.
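
To make the sequence-to-sequence framing concrete, the sketch below turns an audio clip into a sequence of log-mel feature vectors. It is a minimal example assuming the librosa library is available; the frame and mel-band settings are illustrative choices, not values taken from this repository.

```python
import librosa
import numpy as np

def audio_to_features(path, sr=22050, n_mels=80):
    """Load an audio clip and return a (frames, n_mels) log-mel feature sequence."""
    y, sr = librosa.load(path, sr=sr)            # waveform at a fixed sample rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )                                            # (n_mels, frames) power spectrogram
    log_mel = librosa.power_to_db(mel)           # compress the dynamic range
    return log_mel.T.astype(np.float32)          # one feature vector per frame

features = audio_to_features("LJ001-0001.wav")   # any LJSpeech clip
print(features.shape)                            # (frames, 80)
```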

The dataset we are using is the LJSpeech dataset from the LibriVox project. It consists of short audio clips of a single speaker reading passages from 7 non-fiction books. Our model is similar to the original Transformer (both encoder and decoder) as proposed in the paper "Attention Is All You Need".
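
As a sketch of what such an encoder-decoder model looks like, the snippet below wires audio feature frames and character tokens through PyTorch's built-in nn.Transformer. The layer sizes and vocabulary size are illustrative assumptions, and positional encodings are omitted for brevity; this is not the repository's exact model.

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    """Sequence-to-sequence ASR: audio feature frames in, character tokens out."""

    def __init__(self, n_mels=80, vocab_size=32, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.src_proj = nn.Linear(n_mels, d_model)           # project feature frames
        self.tgt_embed = nn.Embedding(vocab_size, d_model)   # embed target characters
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)            # per-step token logits

    def forward(self, src, tgt):
        # src: (batch, frames, n_mels) features; tgt: (batch, chars) token ids.
        # Positional encodings are omitted here for brevity.
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.src_proj(src), self.tgt_embed(tgt), tgt_mask=causal)
        return self.out(h)                                   # (batch, chars, vocab)

model = SpeechTransformer()
logits = model(torch.randn(2, 200, 80), torch.randint(0, 32, (2, 30)))
print(logits.shape)  # torch.Size([2, 30, 32])
```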

Figure: Transformer model architecture

Step-2 Text to Image

Create three folders (test, weights, results_stage2) in your current working directory; a sketch for doing this programmatically follows the list below.

  1. weights (your model weights will be saved here)
  2. test (generated images from our stage I GAN)
  3. results_stage2 (generated images from our stage II GAN)
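
A minimal way to create all three folders (equivalent to doing it by hand):

```python
import os

for folder in ("test", "weights", "results_stage2"):
    os.makedirs(folder, exist_ok=True)  # no-op if the folder already exists
```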

About Dataset

Dataset Name: CUB_200_2011

Download from: https://data.caltech.edu/records/65de6-vp158

Text Embedding Model

Download the char-CNN-RNN text embeddings for birds from: https://github.com/hanzhanggit/StackGAN (a loading sketch follows the list below)

  1. char-CNN-RNN-embeddings.pickle — Dataframe of the pre-trained text embeddings.
  2. filenames.pickle — Dataframe containing the filenames of the images.
  3. class_info.pickle — Dataframe containing the class information for each image.
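
A minimal loading sketch, assuming the three pickles sit in a local data/ directory. These files were written under Python 2, so the encoding='latin1' argument is typically needed when unpickling them under Python 3:

```python
import pickle

def load_pickle(path):
    # These pickles were written under Python 2, so latin1 decoding is
    # usually needed to read them under Python 3.
    with open(path, "rb") as f:
        return pickle.load(f, encoding="latin1")

embeddings = load_pickle("data/char-CNN-RNN-embeddings.pickle")
filenames = load_pickle("data/filenames.pickle")
class_info = load_pickle("data/class_info.pickle")
print(len(filenames), "images")
```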

Architecture

  • Stage 1
    • Text Encoder Network
      • Maps the text description to a 1024-dimensional text embedding
      • Based on "Learning Deep Representations of Fine-Grained Visual Descriptions" [Arxiv Link]
    • Conditioning Augmentation Network
      • Adds randomness to the network (see the sketch after this list)
      • Produces more image-text pairs for training
    • Generator Network
    • Discriminator Network
    • Embedding Compressor Network
    • Outputs a 64x64 image

  • Stage 2
    • Text Encoder Network
    • Conditioning Augmentation Network
    • Generator Network
    • Discriminator Network
    • Embedding Compressor Network
    • Outputs a 256x256 image
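
The Conditioning Augmentation step deserves a closer look: rather than conditioning the generator on the raw 1024-dimensional embedding directly, StackGAN samples a smaller conditioning vector from a Gaussian whose mean and variance are predicted from the embedding (the reparameterization trick), which is what injects the randomness and yields extra image-text pairs. A minimal PyTorch sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample a conditioning vector c ~ N(mu(e), sigma(e)^2) from a text embedding e."""

    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        # A single linear layer predicts both the mean and the log-variance.
        self.fc = nn.Linear(embed_dim, cond_dim * 2)

    def forward(self, text_embedding):
        mu, log_var = self.fc(text_embedding).chunk(2, dim=-1)
        eps = torch.randn_like(mu)                # this is where randomness enters
        c = mu + torch.exp(0.5 * log_var) * eps   # reparameterization trick
        return c, mu, log_var                     # mu/log_var feed a KL regularizer

ca = ConditioningAugmentation()
c, mu, log_var = ca(torch.randn(4, 1024))
print(c.shape)  # torch.Size([4, 128])
```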

References

  1. Attention Is All You Need [Arxiv Link]
  2. Very Deep Self-Attention Networks for End-to-End Speech Recognition [Arxiv Link]
  3. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition [IEEE Xplore]
  4. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks [Arxiv Link]
