This repository provides programs to (1) fine-tune a ViT + GPT2 model for image captioning and (2) demonstrate image captioning with the fine-tuned models.
Modal builds amazing infrastructure for data/ML apps in the cloud: you can build and use microservices in the cloud as if you were writing and running Python code locally. I was amazed by the ease of use and the design of this new service. I like Modal.
To run these programs, please register an account with Modal.
- Base model: nlpconnect/vit-gpt2-image-captioning
- Dataset for fine-tuning:
Download the COCO dataset from the official repositories to a shared volume on Modal.
The name of the shared volume is specified in model_training/config.py (default: image-caption-vol); a hypothetical sketch of this config appears after the listing below.
$ modal run model_training/download_coco_dataset.py
Check the shared volume.
$ modal volume ls image-caption-vol /coco
Directory listing of 'coco' in 'image-caption-vol'
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┓
┃ filename                     ┃ type ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━┩
│ train2017                    │ dir  │
│ annotations                  │ dir  │
│ val2017                      │ dir  │
└──────────────────────────────┴──────┘
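For reference, here is a minimal sketch of what model_training/config.py might contain; the variable names and values below are assumptions, so check the actual file in the repository.

# model_training/config.py -- hypothetical sketch, not the repository's
# actual contents.

# Name of the Modal shared volume holding datasets and checkpoints.
SHARED_VOLUME_NAME = "image-caption-vol"

# Mount point of the shared volume inside Modal containers.
SHARED_VOLUME_PATH = "/vol"

# Base model to fine-tune.
BASE_MODEL = "nlpconnect/vit-gpt2-image-captioning"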
- Split the RedCaps dataset into training, validation, and test sets, then save them to the shared volume and/or your Hugging Face repository (a sketch of how such a split can be produced appears after the listing below).
The name of the shared volume is specified in model_training/config.py (default: image-caption-vol).
$ modal run model_training/split_dataset.py --save-dir=red_caps --push-hub-rep=[YOUR-HF-ACCOUNT]/red_caps
Check the shared volume.
$ modal volume ls image-caption-vol /red_caps
Directory listing of '/red_caps' in 'image-caption-vol'
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┓
┃ filename          ┃ type ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━┩
│ test              │ dir  │
│ train             │ dir  │
│ dataset_dict.json │ file │
│ val               │ dir  │
└───────────────────┴──────┘
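For intuition, a split like the one above can be produced with the Hugging Face datasets library. The following is a rough sketch under assumed ratios, seed, and paths, not the actual contents of model_training/split_dataset.py.

from datasets import DatasetDict, load_dataset

# Load RedCaps and carve out train/val/test splits.
# The "all" config, 90/5/5 ratios, and seed are illustrative assumptions.
raw = load_dataset("red_caps", "all", split="train")
train_rest = raw.train_test_split(test_size=0.1, seed=42)
val_test = train_rest["test"].train_test_split(test_size=0.5, seed=42)

splits = DatasetDict({
    "train": train_rest["train"],
    "val": val_test["train"],
    "test": val_test["test"],
})

# Save to the shared volume mount; optionally push to the Hugging Face Hub.
splits.save_to_disk("/vol/red_caps")
# splits.push_to_hub("[YOUR-HF-ACCOUNT]/red_caps")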
- Build a subset of the dataset in the shared volume on Modal (a sketch follows the command below).
$ modal run model_training/build_dataset_subset.py --from-dataset-path=red_caps \
    --to-dataset-path=red-caps-5k-01 --num-train=3500 --num-val=500 --num-test=1000
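Conceptually, building the subset amounts to taking the first N examples of each saved split. A minimal sketch, assuming the splits live under the shared volume mount:

from datasets import DatasetDict, load_from_disk

# Load the saved splits from the shared volume (path is an assumption).
full = load_from_disk("/vol/red_caps")

# Keep only the first N examples per split; shuffle first if a random
# subset is preferred.
subset = DatasetDict({
    "train": full["train"].select(range(3500)),
    "val": full["val"].select(range(500)),
    "test": full["test"].select(range(1000)),
})

subset.save_to_disk("/vol/red-caps-5k-01")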
- Start the training on Modal (an outline of the training entry point follows the command below).
The default stub name is vit-gpt2-image-caption-train. Machine usage (e.g., GPU memory in use) is shown at https://modal.com/apps.
$ modal run model_training/train.py
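In outline, the training entry point is a Modal function attached to a stub. The sketch below assumes a GPU type, image contents, and a Modal API generation (modal.Stub / SharedVolume, which newer releases rename), so treat it as illustrative of model_training/train.py rather than its actual code.

import modal

# The stub name is what shows up at https://modal.com/apps.
stub = modal.Stub("vit-gpt2-image-caption-train")

image = modal.Image.debian_slim().pip_install(
    "transformers", "datasets", "torch", "evaluate"
)
volume = modal.SharedVolume().persist("image-caption-vol")

@stub.function(
    image=image,
    gpu="A10G",                       # GPU type is an assumption
    shared_volumes={"/vol": volume},  # dataset/checkpoint volume
    timeout=8 * 60 * 60,
)
def train():
    # Fine-tune nlpconnect/vit-gpt2-image-captioning on the prepared
    # dataset, e.g. with transformers' Seq2SeqTrainer (details elided).
    ...

@stub.local_entrypoint()
def main():
    train.remote()  # older Modal releases use train.call()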
- Check the training status with TensorBoard.
$ modal deploy model_training/tfboard_webapp.py
Access the URL displayed as "Created tensorboard_app => https://XXXXXX.modal.run" to open TensorBoard.
- Deploy the web endpoints for the demo.
$ modal deploy demo/vit_gpt2_image_caption.py
$ modal deploy demo/vit_gpt2_image_caption_webapp.py
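In outline, demo/vit_gpt2_image_caption.py wraps the captioning model in a Modal web endpoint. The sketch below is a guess at the shape: the decorator names vary across Modal versions, the JSON request format is an assumption, and the real demo would load the fine-tuned checkpoint rather than the base model.

import modal

stub = modal.Stub("vit-gpt2-image-caption")

image = modal.Image.debian_slim().pip_install(
    "transformers", "torch", "Pillow", "requests"
)

@stub.function(image=image, gpu="any")
@modal.web_endpoint(method="POST")
def generate_caption(item: dict):
    # For brevity the pipeline is built per request; a real deployment
    # would load the model once per container.
    import io
    import requests
    from PIL import Image as PILImage
    from transformers import pipeline

    # The demo would point at the fine-tuned checkpoint on the shared
    # volume instead of the base model named here.
    captioner = pipeline(
        "image-to-text", model="nlpconnect/vit-gpt2-image-captioning"
    )
    image_bytes = requests.get(item["image_url"]).content
    result = captioner(PILImage.open(io.BytesIO(image_bytes)))
    return {"caption": result[0]["generated_text"]}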
- Open the website and try the demo.
"Created wrapper => https://[YOUR_ACCOUNT]--vit-gpt2-image-caption-webapp-wrapper.modal.run"