Speech-to-Image

Table of Contents
  1. Introduction
  2. Speech to Text
  3. Text to Image
  4. References

Introduction

In this project we translate speech into images in two steps. The speech signal is first converted to text using automatic speech recognition (ASR), and high-quality images are then generated from the text descriptions using StackGAN.

Step-1 Speech to Text

Automatic speech recognition (ASR) consists of transcribing audio speech segments into text. ASR can be treated as a sequence-to-sequence problem, where the audio can be represented as a sequence of feature vectors and the text as a sequence of characters, words, or subword tokens.
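
To make the sequence-to-sequence framing concrete, the sketch below turns an audio clip into a sequence of log-mel feature vectors. It is a minimal example assuming the librosa library is available; the frame and mel-band settings are illustrative choices, not values taken from this repository.

```python
import librosa
import numpy as np

def audio_to_features(path, sr=22050, n_mels=80):
    """Load an audio clip and return a (frames, n_mels) log-mel feature sequence."""
    y, sr = librosa.load(path, sr=sr)            # waveform at a fixed sample rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )                                            # (n_mels, frames) power spectrogram
    log_mel = librosa.power_to_db(mel)           # compress the dynamic range
    return log_mel.T.astype(np.float32)          # one feature vector per frame

features = audio_to_features("LJ001-0001.wav")   # any LJSpeech clip
print(features.shape)                            # (frames, 80)
```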

The dataset we are using is the LJSpeech dataset from the LibriVox project. It consists of short audio clips of a single speaker reading passages from 7 non-fiction books. Our model is similar to the original Transformer (both encoder and decoder) as proposed in the paper "Attention Is All You Need".
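
As a sketch of what such an encoder-decoder model looks like, the snippet below wires audio feature frames and character tokens through PyTorch's built-in nn.Transformer. The layer sizes and vocabulary size are illustrative assumptions, and positional encodings are omitted for brevity; this is not the repository's exact model.

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    """Sequence-to-sequence ASR: audio feature frames in, character tokens out."""

    def __init__(self, n_mels=80, vocab_size=32, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.src_proj = nn.Linear(n_mels, d_model)           # project feature frames
        self.tgt_embed = nn.Embedding(vocab_size, d_model)   # embed target characters
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)            # per-step token logits

    def forward(self, src, tgt):
        # src: (batch, frames, n_mels) features; tgt: (batch, chars) token ids.
        # Positional encodings are omitted here for brevity.
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.src_proj(src), self.tgt_embed(tgt), tgt_mask=causal)
        return self.out(h)                                   # (batch, chars, vocab)

model = SpeechTransformer()
logits = model(torch.randn(2, 200, 80), torch.randint(0, 32, (2, 30)))
print(logits.shape)  # torch.Size([2, 30, 32])
```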

Figure: Transformer model architecture

Step-2 Text to Image

Create three folders (test, weights, results_stage2) in your current working directory; a sketch for doing this programmatically follows the list below.

  1. weights (your model weights will be saved here)
  2. test (generated images from our stage I GAN)
  3. results_stage2 (generated images from our stage II GAN)
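
A minimal way to create all three folders (equivalent to doing it by hand):

```python
import os

for folder in ("test", "weights", "results_stage2"):
    os.makedirs(folder, exist_ok=True)  # no-op if the folder already exists
```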

About Dataset

Dataset Name: CUB_200_2011

Download from: https://data.caltech.edu/records/65de6-vp158

Text Embedding Model

Download the char-CNN-RNN text embeddings for birds from: https://github.com/hanzhanggit/StackGAN (a loading sketch follows the list below)

  1. char-CNN-RNN-embeddings.pickle — Dataframe of the pre-trained text embeddings.
  2. filenames.pickle — Dataframe containing the filenames of the images.
  3. class_info.pickle — Dataframe containing the class information for each image.
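
A minimal loading sketch, assuming the three pickles sit in a local data/ directory. These files were written under Python 2, so the encoding='latin1' argument is typically needed when unpickling them under Python 3:

```python
import pickle

def load_pickle(path):
    # These pickles were written under Python 2, so latin1 decoding is
    # usually needed to read them under Python 3.
    with open(path, "rb") as f:
        return pickle.load(f, encoding="latin1")

embeddings = load_pickle("data/char-CNN-RNN-embeddings.pickle")
filenames = load_pickle("data/filenames.pickle")
class_info = load_pickle("data/class_info.pickle")
print(len(filenames), "images")
```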

Architecture

  • Stage 1
    • Text Encoder Network
      • Maps the text description to a 1024-dimensional text embedding
      • Based on "Learning Deep Representations of Fine-Grained Visual Descriptions" [Arxiv Link]
    • Conditioning Augmentation Network
      • Adds randomness to the network (see the sketch after this list)
      • Produces more image-text pairs for training
    • Generator Network
    • Discriminator Network
    • Embedding Compressor Network
    • Outputs a 64x64 image

  • Stage 2
    • Text Encoder Network
    • Conditioning Augmentation Network
    • Generator Network
    • Discriminator Network
    • Embedding Compressor Network
    • Outputs a 256x256 image
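
The Conditioning Augmentation step deserves a closer look: rather than conditioning the generator on the raw 1024-dimensional embedding directly, StackGAN samples a smaller conditioning vector from a Gaussian whose mean and variance are predicted from the embedding (the reparameterization trick), which is what injects the randomness and yields extra image-text pairs. A minimal PyTorch sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample a conditioning vector c ~ N(mu(e), sigma(e)^2) from a text embedding e."""

    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        # A single linear layer predicts both the mean and the log-variance.
        self.fc = nn.Linear(embed_dim, cond_dim * 2)

    def forward(self, text_embedding):
        mu, log_var = self.fc(text_embedding).chunk(2, dim=-1)
        eps = torch.randn_like(mu)                # this is where randomness enters
        c = mu + torch.exp(0.5 * log_var) * eps   # reparameterization trick
        return c, mu, log_var                     # mu/log_var feed a KL regularizer

ca = ConditioningAugmentation()
c, mu, log_var = ca(torch.randn(4, 1024))
print(c.shape)  # torch.Size([4, 128])
```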

References

  1. Attention Is All You Need [Arxiv Link]
  2. Very Deep Self-Attention Networks for End-to-End Speech Recognition [Arxiv Link]
  3. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition [IEEE Xplore]
  4. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks [Arxiv Link]
