
Image Summarize

This project takes an input image and generates a one-sentence summary of its content. Inspired by the work in ImageCaption, it uses deep learning models to generate accurate captions for images.

Results
Screenshots of example captions generated by the model are included in the repository.

Output on Random Images

LSTM Output vs. GPT-1 Output: the repository includes screenshots comparing the captions generated by the two models on two random images.

Dataset

The project uses the Flickr30k dataset from Kaggle, which contains roughly 31,000 images, each paired with five human-written captions. These image-caption pairs are used for training and testing the model.
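For orientation, here is a minimal sketch of how the image-caption pairs could be wrapped in a PyTorch dataset. The CSV layout and column names are assumptions about the Kaggle Flickr30k export, not the project's actual preprocessing, which lives in the notebook.

    # Illustrative Flickr30k wrapper (assumed CSV with columns image_name and comment).
    import os
    import pandas as pd
    from PIL import Image
    from torch.utils.data import Dataset

    class Flickr30kCaptions(Dataset):
        def __init__(self, image_dir: str, captions_csv: str, transform=None):
            self.image_dir = image_dir
            self.pairs = pd.read_csv(captions_csv)   # one row per (image, caption) pair
            self.transform = transform

        def __len__(self) -> int:
            return len(self.pairs)

        def __getitem__(self, idx: int):
            row = self.pairs.iloc[idx]
            path = os.path.join(self.image_dir, row["image_name"])
            image = Image.open(path).convert("RGB")
            if self.transform:
                image = self.transform(image)
            return image, row["comment"]             # caption returned as raw text here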

Model Architecture

Image Encoder

The image encoder uses a pre-trained ResNet-50 model. The last layer of ResNet-50 is removed and replaced with a linear layer to map the image embeddings to the same size as the word embeddings. The output of the image encoder is used as the first token for the text decoder.
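The encoder described above can be sketched roughly as follows (assumed PyTorch/torchvision; names such as embed_size are illustrative, and freezing the backbone is an assumption rather than something stated for model.py):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class ImageEncoder(nn.Module):
        def __init__(self, embed_size: int):
            super().__init__()
            resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            # Drop ResNet-50's final classification layer, keep the pooled 2048-d features.
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])
            for p in self.backbone.parameters():
                p.requires_grad = False              # assumption: backbone kept frozen
            # Map the image features to the word-embedding dimension.
            self.proj = nn.Linear(resnet.fc.in_features, embed_size)

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            feats = self.backbone(images).flatten(1)  # (B, 2048)
            return self.proj(feats)                   # (B, embed_size), the decoder's first token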

Text Decoder

For text generation, two different architectures were implemented:

  1. LSTM (Long Short-Term Memory) for sequential text decoding.
  2. Stacked GPT-like Transformer Blocks, using a Transformer encoder with masks to simulate GPT architecture.

Both decoders are defined in the model.py file; a sketch of the GPT-style variant follows.
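As a rough illustration of the second option, the following sketch stacks Transformer encoder layers behind a causal mask so they behave like a GPT-style decoder. Hyperparameters and names are illustrative assumptions, not values taken from model.py.

    import torch
    import torch.nn as nn

    class GPTLikeDecoder(nn.Module):
        def __init__(self, vocab_size: int, embed_size: int,
                     num_layers: int = 4, num_heads: int = 8, max_len: int = 64):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, embed_size)
            self.pos_emb = nn.Embedding(max_len, embed_size)
            layer = nn.TransformerEncoderLayer(d_model=embed_size, nhead=num_heads,
                                               batch_first=True)
            # A Transformer *encoder* plus a causal mask acts like a GPT decoder stack.
            self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.head = nn.Linear(embed_size, vocab_size)

        def forward(self, image_embed: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
            # image_embed: (B, E) from the image encoder; tokens: (B, T) word ids.
            tok = self.tok_emb(tokens)                             # (B, T, E)
            x = torch.cat([image_embed.unsqueeze(1), tok], dim=1)  # image embedding as first token
            pos = torch.arange(x.size(1), device=x.device)
            x = x + self.pos_emb(pos)
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
            x = self.blocks(x, mask=mask)                          # causal self-attention
            return self.head(x)                                    # (B, T+1, vocab) next-word logits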

Training

To train the model, follow the step-by-step instructions in the ImageSummarize.ipynb notebook; a generic training-loop sketch is included after the notes below. During training:

  • The models reached approximately 40-43% accuracy when trained for 5-10 epochs.
  • Both LSTM and GPT models showed similar performance in generating image captions.
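The sketch below shows what one training epoch for such an encoder-decoder pair typically looks like (teacher forcing with cross-entropy loss). It is an illustration based on the interfaces sketched above, not code copied from the notebook.

    import torch
    import torch.nn as nn

    def train_epoch(encoder, decoder, loader, optimizer, device, pad_id: int = 0):
        criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
        encoder.train(); decoder.train()
        total_loss = 0.0
        for images, captions in loader:                        # captions: (B, T) token ids
            images, captions = images.to(device), captions.to(device)
            image_embed = encoder(images)                      # (B, embed_size)
            logits = decoder(image_embed, captions[:, :-1])    # (B, T, vocab)
            # Every position predicts the next word, so the full caption is the target.
            loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        return total_loss / len(loader)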

Training Results

Accuracy plots for both decoders are included in the repository: LSTM and GPT-1 accuracy after 5 epochs, and LSTM and GPT-1 accuracy after 10 epochs.

Issues During Training

  • Configuration file paths and variables must be set correctly before training.
  • Initial training on a MacBook M1 was slow due to hardware limitations.
  • Training was then moved to Google Colab to use its GPU, but sessions later fell back to CPU, which was also slow.
  • Training was finally completed on Kaggle's T4 x2 GPUs, where the models reached about 40% accuracy (a simple device-selection helper is sketched below).
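A device-selection helper covering the hardware mentioned above might look like this (an illustrative snippet, not code from the repository):

    import torch

    def pick_device() -> torch.device:
        if torch.cuda.is_available():            # Colab / Kaggle T4 GPUs
            return torch.device("cuda")
        if torch.backends.mps.is_available():    # Apple-silicon MacBook M1
            return torch.device("mps")
        return torch.device("cpu")               # slow fallback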

Running the Image Summarizer App

An Image Summarizer Generator App is provided to generate captions for images interactively. To run the app:

  1. Navigate to the app folder.
  2. Run the app.py file using the command:
    python app.py
    

App Features

  • Upload Image: choose an image to upload for caption generation.
  • Load Sample Image: the app can load a random sample image via the Unsplash API (see the sketch below the screenshots).
  • Preview: the uploaded or loaded image is shown in the GUI.
  • Model Selection: choose between the LSTM and GPT-1 models for caption generation.
  • Save: the generated caption is displayed, and the image together with its caption can be saved to a folder.
LSTM Output vs. GPT-1 Output: app screenshots showing captions generated with each model are included in the repository.
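The random sample image feature could be approximated as follows. The endpoint, response fields, and the UNSPLASH_ACCESS_KEY environment variable are assumptions about how the Unsplash API is used, not code taken from app.py.

    import io
    import os
    import requests
    from PIL import Image

    def load_sample_image() -> Image.Image:
        # Ask Unsplash for one random photo and download a medium-sized rendition.
        resp = requests.get(
            "https://api.unsplash.com/photos/random",
            params={"client_id": os.environ["UNSPLASH_ACCESS_KEY"]},
            timeout=10,
        )
        resp.raise_for_status()
        image_url = resp.json()["urls"]["regular"]
        image_bytes = requests.get(image_url, timeout=10).content
        return Image.open(io.BytesIO(image_bytes)).convert("RGB")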

Notes

  • Ensure all configurations (paths and variables) are correctly set before training or running the app.
  • Performance may vary depending on the computational resources available during training.
