This project takes an input image and generates a sentence summarizing the content of the image. Inspired by the work in ImageCaption, it focuses on using deep learning models to generate accurate captions for images.
Results: sample captions from the LSTM and GPT-1 decoders, shown side by side.
The project utilizes the Flickr30k dataset from Kaggle. This dataset provides a rich set of images with corresponding captions, which are used for training and testing the model.
The image encoder uses a pre-trained ResNet-50 model. The last layer of ResNet-50 is removed and replaced with a linear layer to map the image embeddings to the same size as the word embeddings. The output of the image encoder is used as the first token for the text decoder.
For text generation, two different architectures were implemented:
- LSTM (Long Short-Term Memory) for sequential text decoding.
- Stacked GPT-like Transformer blocks, implemented as a Transformer encoder with a causal mask to emulate the GPT architecture.
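The GPT-like variant can be sketched as below: a plain `nn.TransformerEncoder` made autoregressive with a causal mask, with the projected image embedding prepended as the first token. All names and hyperparameters here are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class GPTLikeDecoder(nn.Module):
    """Transformer encoder + causal mask, emulating a GPT-style decoder."""

    def __init__(self, vocab_size, embed_dim=256, n_heads=8, n_layers=4, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, embed_dim)
        self.pos = nn.Embedding(max_len, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, image_embed: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        B, T = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Prepend the image embedding as the first "token" of the sequence.
        x = torch.cat([image_embed.unsqueeze(1), x], dim=1)  # (B, T+1, D)
        # Causal mask: each position may only attend to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(T + 1)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (B, T+1, vocab_size)
```

The causal mask is what turns a bidirectional encoder stack into a left-to-right language model.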
The models are defined in the `model.py` file.
To train the model, follow the step-by-step instructions in the `ImageSummarize.ipynb` notebook. During training:
- The models reached approximately 40-43% accuracy over 5-10 epochs of training.
- Both LSTM and GPT models showed similar performance in generating image captions.
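The notebook's exact accuracy metric is not spelled out here; a common choice for caption models, shown as an assumed sketch, is per-token accuracy over non-padding positions:

```python
import torch

def token_accuracy(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> float:
    """Fraction of non-padding target tokens predicted correctly.

    logits:  (B, T, vocab_size) raw decoder outputs
    targets: (B, T) ground-truth token ids, padded with pad_id
    """
    preds = logits.argmax(dim=-1)          # (B, T) most probable token per step
    mask = targets != pad_id               # ignore padding positions
    correct = (preds == targets) & mask
    return correct.sum().item() / mask.sum().clamp(min=1).item()
```

A metric like this counts each position independently, so ~40% token accuracy can still correspond to fluent-looking captions.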
Accuracy plots: LSTM vs. GPT-1 accuracy for 5 epochs and for 10 epochs.
- Configuration file paths and variables must be properly set before training.
- Initial training on a MacBook M1 was slow due to hardware limitations.
- Training was then moved to Google Colab for its GPU resources, but sessions later fell back to CPU, which was also slow.
- Training was finally completed on Kaggle's T4 x2 GPUs, where the models achieved about 40% accuracy.
An Image Summarizer Generator App is provided to generate captions for images interactively. To run the app:
- Navigate to the `app` folder.
- Run the `app.py` file with the command `python app.py`.
- Upload Image: You can choose to upload an image for caption generation.
- Load Sample Image: The app can randomly load a sample image using the Unsplash API.
- The uploaded or generated image is previewed in the GUI.
- Users can choose between the LSTM or GPT-1 models for caption generation.
- The generated caption is displayed, and both the image and its caption can be saved to a folder.
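Caption generation with either model typically boils down to autoregressive decoding. The following is a hedged sketch of greedy decoding, assuming an encoder/decoder pair shaped like the ones described above (the function and parameter names are illustrative, not the app's actual API):

```python
import torch

@torch.no_grad()
def greedy_caption(encoder, decoder, image, itos, eos_id, max_len=20):
    """Greedy caption generation: start from the image embedding and
    repeatedly append the most probable next token until EOS.

    itos maps token ids back to words.
    """
    img_embed = encoder(image.unsqueeze(0))        # (1, D)
    tokens = torch.empty(1, 0, dtype=torch.long)   # empty caption so far
    for _ in range(max_len):
        logits = decoder(img_embed, tokens)        # (1, T+1, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1)     # most probable next token
        if next_id.item() == eos_id:
            break
        tokens = torch.cat([tokens, next_id.unsqueeze(1)], dim=1)
    return " ".join(itos[i] for i in tokens[0].tolist())
```

Swapping `argmax` for sampling from the softmax distribution would give more varied captions at the cost of determinism.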
App output examples: LSTM output vs. GPT-1 output.
- Ensure all configurations (paths and variables) are correctly set before training or running the app.
- Performance may vary depending on the computational resources available during training.