# Image Captioning Algorithm
Using inference on several large image-to-text models to generate image captions.
## Problem Statement
- Create an AI tool that generates captions for an image provided by the user, with the option to generate multiple captions per image.
- Provide an interface where the user can upload images and get AI-generated captions.
## Solution
### Pre-Processing
- Used the BLIP processor (`BlipProcessor` from Hugging Face `transformers`) to preprocess the images, as sketched below.
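A minimal sketch of this pre-processing step, assuming the `Salesforce/blip-image-captioning-large` checkpoint and one of the input images from the data files listed later:

```python
# Minimal pre-processing sketch: load an image and convert it into the
# tensor format the BLIP captioning model expects.
from PIL import Image
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")

image = Image.open("inputs/Image1.png").convert("RGB")

# The processor resizes and normalizes the image and returns PyTorch tensors.
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 384, 384])
```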
### Models
1. Train a model on the Flickr8k dataset
   - Train an image captioning model with a CNN + Transformer architecture on the Flickr8k dataset (see the architecture sketch after this list).
   - The Flickr8k dataset contains 8,092 images with 5 captions per image.
   - Disadvantages:
     - The dataset is small, so the model would not generalize well.
     - The model would not be able to generate good captions for images unlike those in the dataset.
2. The `nlpconnect/vit-gpt2-image-captioning` model
   - Use the `nlpconnect/vit-gpt2-image-captioning` model to generate captions for the images.
   - `nlpconnect/vit-gpt2-image-captioning` is one of the most downloaded models on the Hugging Face model hub, with over 1.1 million downloads.
   - The script used to generate the results: `nlpconnect-vit-gpt-2-captioning.py` (a sketch of these sampling experiments follows the list).
   - Results of the `nlpconnect/vit-gpt2-image-captioning` model:
     - `temperature` adjustments: https://docs.google.com/spreadsheets/d/1yz25PL-s2VbGVhij0wr9SMotuXexYrgGU-svi4aFIic/edit?usp=sharing
     - Experimentation with the `top_k` parameter: https://docs.google.com/spreadsheets/d/1Wfhxj-4AX5WQpGO2v_0B9e0NspoHOsERJw229llngbY/edit?usp=sharing
     - Results of the full test of selected `temperature` and `top_k` values: https://docs.google.com/spreadsheets/d/1Gb1XxMa3S2hSjamPjVTGfNRNz83Ff67HWuOXzQstEIU/edit?usp=sharing
   - The generated sentences were not good enough.
3. The `Salesforce/blip-image-captioning-large` model
   - BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation) is a pre-trained model for image captioning.
   - The model is trained on the MS COCO dataset.
   - It uses the large version of the ViT architecture.
   - This model was selected for the implementation, as the captions it generates are clearly better.
   - Script used for experimentation: `salesforce_blip_image_captioning_large.py`
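Option 1 was not implemented, but a minimal sketch of the CNN + Transformer architecture it describes might look like the following. This is PyTorch; the ResNet-50 backbone, layer sizes, and all names are illustrative assumptions, not the project's code:

```python
# Illustrative CNN-encoder / Transformer-decoder captioner for Flickr8k
# (option 1). Hyperparameters and names are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class CNNTransformerCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=40):
        super().__init__()
        # CNN encoder: a pretrained ResNet with its classifier head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 2048, 7, 7)
        self.proj = nn.Linear(2048, d_model)

        # Transformer decoder that attends over the CNN feature grid.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # Encode the image into a sequence of 49 spatial features.
        feats = self.cnn(images)                  # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)  # (B, 49, 2048)
        memory = self.proj(feats)                 # (B, 49, d_model)

        # Decode caption tokens with a causal mask (teacher forcing).
        tgt = self.embed(captions) + self.pos[:, :captions.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                   # (B, T, vocab_size)
```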
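The `temperature` / `top_k` experiments from option 2 boil down to varying the sampling parameters of `generate()`. A minimal sketch using the standard `transformers` API; the parameter grid here is illustrative, and the actual experiment script is `nlpconnect-vit-gpt-2-captioning.py`:

```python
# Sketch of the sampling experiments with nlpconnect/vit-gpt2-image-captioning:
# vary temperature and top_k and compare the generated captions.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_id = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_id)
processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("inputs/Image1.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

for temperature in (0.5, 1.0, 1.5):   # illustrative grid
    for top_k in (10, 50):
        output_ids = model.generate(
            pixel_values,
            do_sample=True,            # sample instead of greedy decoding
            temperature=temperature,   # sharpness of the token distribution
            top_k=top_k,               # sample only from the k most likely tokens
            max_length=16,
        )
        caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        print(f"temperature={temperature}, top_k={top_k}: {caption}")
```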
### Model evaluation
- Simple intuition was used to evaluate the outputs of the various models, as the difference in accuracy between them was clearly visible.
- For example, for the above image, the following sentences were generated by the `nlpconnect/vit-gpt2-image-captioning` and `Salesforce/blip-image-captioning-large` models:
  - `nlpconnect/vit-gpt2-image-captioning`: "a woman in brown outfit holding a white horse behind it in dark cloudy sky"
  - `Salesforce/blip-image-captioning-large`: "there are two horses that are standing next to each other in a field"
## Interface
1. CLI
   - A simple command-line application, run via `runner.py`.
   - Prompts the user for the location of the image folder and the number of captions to generate.
   - Generates the captions and reports the time taken to generate the results.
   - Suitable for batch caption generation, as it can process multiple images at once.
   - Selected for implementation.
2. GUI
   - A web application implemented with a framework such as Flask (see the sketch after this list).
   - Prompts the user to upload files, then generates and displays the captions in the web interface.
   - Requires UI design and file handling.
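The GUI option was not selected, but a minimal sketch of what it could look like in Flask follows. The `predict()` import assumes the interface described for `Model.py` below; its exact signature is an assumption:

```python
# Sketch of the (not implemented) Flask GUI option: upload an image,
# caption it, and render the captions back on the page.
from flask import Flask, request
from PIL import Image
from Model import predict  # assumed: predict(image, num_sequences) -> list[str]

app = Flask(__name__)

UPLOAD_FORM = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="image" accept="image/*">
  <input type="submit" value="Generate captions">
</form>
"""

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        # Read the uploaded file and run the captioning model on it.
        image = Image.open(request.files["image"].stream).convert("RGB")
        captions = predict(image, num_sequences=3)
        items = "".join(f"<li>{c}</li>" for c in captions)
        return UPLOAD_FORM + f"<ul>{items}</ul>"
    return UPLOAD_FORM

if __name__ == "__main__":
    app.run(debug=True)
```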
## Implementation
#### Data Files
📦listed_image_caption
┃
┗ 📂inputs
┣ 📜Image1.png
┣ 📜Image2.png
┗ 📜Image3.png
#### Code files
📦listed_image_caption
┃
┣ 📜Model.py
┣ 📜nlpconnect-vit-gpt-2-captioning.py
┣ 📜runner.py
┗ 📜salesforce_blip_image_captioning_large.py
##### Model.py
Implements `predict()`, which takes an image and the number of sequences as input and generates the captions (a sketch follows).
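A sketch of what `predict()` might look like given the selected BLIP model; the body below is inferred from the description above and is not the actual file contents:

```python
# Sketch of Model.py's predict(), assuming it wraps the selected
# Salesforce/blip-image-captioning-large checkpoint; the real file may differ.
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

def predict(image, num_sequences=1):
    """Generate num_sequences captions for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        num_beams=max(num_sequences, 3),     # beam search over several candidates
        num_return_sequences=num_sequences,  # one caption per returned beam
        max_length=30,
    )
    return [processor.decode(ids, skip_special_tokens=True) for ids in output_ids]
```

Beam search is one way to return several distinct captions; `num_return_sequences` must not exceed `num_beams`, hence the `max(...)`.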
##### runner.py
Implements the interface of the application (sketched below).
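A sketch of the CLI flow described in the Interface section, written from the description above; the prompts, file iteration, and timing details are assumptions:

```python
# Sketch of the runner.py CLI flow: prompt for a folder and a caption count,
# caption every image in the folder, and report the elapsed time.
import time
from pathlib import Path
from PIL import Image
from Model import predict  # assumed: predict(image, num_sequences) -> list[str]

folder = Path(input("Folder containing the images: "))
num_captions = int(input("Number of captions per image: "))

start = time.time()
for path in sorted(folder.glob("*.png")):
    image = Image.open(path).convert("RGB")
    for caption in predict(image, num_sequences=num_captions):
        print(f"{path.name}: {caption}")
print(f"Done in {time.time() - start:.1f}s")
```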
##### nlpconnect-vit-gpt-2-captioning.py
Experiments with the `nlpconnect/vit-gpt2-image-captioning` model.
##### salesforce_blip_image_captioning_large.py
Experiments with the `Salesforce/blip-image-captioning-large` model.
## Improvements and Conclusion
- Create a GUI using Flask or Tkinter; this would make the application more user friendly.
- The model can be improved by using a larger dataset and training it for longer.
- Runtimes can be improved by running inference on a GPU.
- Use a metric such as the BLEU score for model evaluation (see the sketch below).
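The BLEU-score improvement could be prototyped with NLTK; in this sketch the reference captions are illustrative placeholders, not real Flickr8k or COCO annotations:

```python
# Sketch of BLEU-based evaluation with NLTK: score a generated caption
# against human reference captions. The example strings are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "two horses standing next to each other in a field".split(),
    "a pair of horses stand together in a grassy field".split(),
]
candidate = "there are two horses that are standing next to each other in a field".split()

# Smoothing avoids zero scores when a higher-order n-gram has no match.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```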