A simple PyTorch implementation of a CLIP-based baseline for image-text matching.
This project provides a CLIP-based training and evaluation framework for image-text matching on the MS-COCO dataset.
We recommend the following dependencies.
- Python 3.8
- PyTorch 1.7.1
- NumPy (>1.19.5)
- TensorBoard
```
pip install -r requirements.txt
```
We follow the same split provided by VSE++.
Dataset images can be found here or here. Dataset splits and annotations can be found here.
The final data directory tree should be:
```
${DATAPATH}/
├── annotations/
│   ├── captions_train2014.json
│   ├── captions_val2014.json
│   ├── coco_train_ids.npy
│   ├── coco_dev_ids.npy
│   ├── coco_test_ids.npy
│   ├── coco_restval_ids.npy
│   └── ...
│
└── images/  # all images of MS-COCO
```
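As a quick sanity check of the layout, the split files can be loaded directly. This is only a sketch: it assumes each `*_ids.npy` file holds COCO caption (annotation) ids, one per caption, which appears to be the convention of the VSE++ splits.

```python
import json
import os

import numpy as np

DATAPATH = os.environ.get("DATAPATH", "/path/to/data")

# Assumption: each *_ids.npy file lists COCO caption (annotation) ids.
train_ids = np.load(os.path.join(DATAPATH, "annotations", "coco_train_ids.npy"))
print(f"{len(train_ids)} training captions")

# Map annotation id -> caption record from the official annotation file.
with open(os.path.join(DATAPATH, "annotations", "captions_train2014.json")) as f:
    anns = {a["id"]: a for a in json.load(f)["annotations"]}

first = anns[int(train_ids[0])]
print(first["image_id"], first["caption"])
```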
You can fine-tune the model by running:
ViT-B/32:
```
python main.py --batch_size 256 --epochs 5 --lr 1e-5 --warmup 500 --vision_model ViT-B/32 --dataset_root ${DATAPATH}
```
ViT-B/16:
```
python main.py --batch_size 128 --epochs 5 --lr 1e-5 --warmup 500 --vision_model ViT-B/16 --dataset_root ${DATAPATH}
```
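Under the hood, fine-tuning CLIP for retrieval typically minimizes a symmetric cross-entropy (InfoNCE) loss over the image-text pairs in each batch. The following is a minimal sketch of one such training step using the openai `clip` package; it illustrates the objective, not the exact contents of `main.py`:

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
# jit=False returns a regular nn.Module; cast to fp32 since CLIP ships
# fp16 weights on GPU, which keeps this sketch simple.
model, _preprocess = clip.load("ViT-B/32", device=device, jit=False)
model = model.float()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(images, texts):
    """images: preprocessed image batch (N, 3, 224, 224); texts: N raw strings."""
    image_features = F.normalize(model.encode_image(images.to(device)), dim=-1)
    text_features = F.normalize(
        model.encode_text(clip.tokenize(texts).to(device)), dim=-1)

    # Cosine similarity matrix, scaled by CLIP's learned temperature.
    logits = model.logit_scale.exp() * image_features @ text_features.t()

    # The i-th image matches the i-th caption: targets are the diagonal,
    # and we contrast in both directions (image-to-text and text-to-image).
    labels = torch.arange(images.size(0), device=device)
    loss = (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```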
You can evaluate the model by running:
```
python main.py --eval --resume ${MODELPATH} --vision_model ${VISIONMODEL}
```
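Evaluation reports Recall@K (R@K): the percentage of queries whose correct match is ranked within the top K retrieved items. Below is a minimal sketch of the metric, assuming a precomputed image-text similarity matrix whose ground-truth pairs lie on the diagonal (MS-COCO actually has five captions per image, which the real evaluation must account for):

```python
import torch

def recall_at_k(similarity, ks=(1, 5, 10)):
    """similarity[i, j]: score of image i vs. text j; text i matches image i."""
    ranking = similarity.argsort(dim=1, descending=True)
    # Position of the ground-truth text in each image's ranked list.
    targets = torch.arange(similarity.size(0)).unsqueeze(1)
    ranks = (ranking == targets).int().argmax(dim=1)
    return {f"R@{k}": 100.0 * (ranks < k).float().mean().item() for k in ks}

# Image-to-text uses the matrix as-is; text-to-image uses its transpose:
# print(recall_at_k(sims), recall_at_k(sims.t()))
```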
Results with the pretrained CLIP models (without fine-tuning):

| Vision model | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| RN50 | 49.10 | 73.04 | 82.02 | 28.56 | 53.00 | 64.54 |
| RN50x4 | 53.12 | 76.90 | 84.82 | 33.42 | 58.10 | 68.36 |
| RN50x16 | 55.24 | 78.68 | 86.60 | 35.45 | 60.05 | 70.12 |
| RN50x64 | 58.60 | 80.70 | 87.60 | 35.45 | 59.92 | 70.20 |
| RN101 | 49.56 | 74.48 | 82.38 | 30.65 | 55.47 | 66.06 |
| ViT-B/32 | 50.16 | 75.02 | 83.58 | 30.42 | 56.04 | 66.88 |
| ViT-B/16 | 52.38 | 76.86 | 84.76 | 33.05 | 58.49 | 69.16 |
| ViT-L/14 | 56.36 | 79.50 | 86.66 | 36.54 | 60.97 | 71.16 |
| ViT-L/14 (336px) | 58.06 | 81.12 | 87.92 | 37.18 | 61.59 | 71.42 |
Results after fine-tuning:

| Vision model | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-B/32 | 62.22 | 85.62 | 91.66 | 46.94 | 74.88 | 83.56 |
| ViT-B/16 | 68.76 | 88.66 | 93.94 | 52.45 | 78.66 | 86.66 |
- Mixed-precision training
- Training and evaluation code for Flickr30K