leolee99/CLIP_ITM

CLIP for Image-text Matching

A simple PyTorch implementation of a CLIP-based baseline for image-text matching.

This project provides a CLIP-based training and evaluation framework for image-text matching on the MS-COCO dataset.
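At its core, CLIP-style image-text matching embeds images and captions into a shared space and ranks candidates by cosine similarity. A minimal sketch of the ranking step, using toy vectors in place of real CLIP features (the function names here are illustrative, not from this repo):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_captions(image_emb, caption_embs):
    """Return caption indices sorted by similarity to the image, best first."""
    scores = [cosine(image_emb, c) for c in caption_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy 3-d embeddings standing in for CLIP features.
image = [1.0, 0.0, 0.0]
captions = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_captions(image, captions))  # -> [1, 2, 0]: caption 1 matches best
```

In the real pipeline the embeddings come from CLIP's image and text encoders and are L2-normalized, so the dot product and cosine similarity coincide.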

Requirements

We recommend the following dependencies.

pip install -r requirements.txt

Dataset Preparation

COCO Caption

We follow the same split provided by VSE++.

Dataset images can be found here or here. Dataset splits and annotations can be found here.

The final data directory tree should be:

${DATAPATH}/
├── annotations/
│   ├── captions_train2014.json
│   ├── captions_val2014.json
│   ├── coco_train_ids.npy
│   ├── coco_dev_ids.npy
│   ├── coco_test_ids.npy
│   ├── coco_restval_ids.npy
│   └── ...
│
└── images/ # all images of MS-COCO
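A quick sanity check of the layout above can save a failed training run. A hypothetical helper (not part of this repo) that reports which required annotation files are missing under `${DATAPATH}`:

```python
import os

# Annotation files expected under ${DATAPATH}/annotations/ (from the tree above).
REQUIRED = [
    "captions_train2014.json",
    "captions_val2014.json",
    "coco_train_ids.npy",
    "coco_dev_ids.npy",
    "coco_test_ids.npy",
    "coco_restval_ids.npy",
]

def missing_annotations(datapath):
    """Return the required annotation files not present under datapath."""
    ann = os.path.join(datapath, "annotations")
    return [f for f in REQUIRED if not os.path.isfile(os.path.join(ann, f))]
```

Run it on your `${DATAPATH}` before launching `main.py`; an empty list means the annotation directory is complete.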

Training

You can fine-tune the model by running:

ViT-B/32:

python main.py --batch_size 256 --epochs 5 --lr 1e-5 --warmup 500 --vision_model ViT-B/32 --dataset_root ${DATAPATH}

ViT-B/16:

python main.py --batch_size 128 --epochs 5 --lr 1e-5 --warmup 500 --vision_model ViT-B/16 --dataset_root ${DATAPATH}
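CLIP fine-tuning optimizes a symmetric contrastive (InfoNCE) loss over the in-batch image-text similarity matrix: each image's matching caption sits on the diagonal, and cross-entropy is applied along both rows (image-to-text) and columns (text-to-image). A minimal numeric sketch in pure Python, assuming the actual training code uses the equivalent PyTorch ops:

```python
import math

def clip_loss(sim):
    """Symmetric InfoNCE loss over an NxN image-text similarity matrix.

    sim[i][j] is the (temperature-scaled) similarity of image i and
    text j; the matching pairs sit on the diagonal.
    """
    n = len(sim)

    def xent_rows(m):
        # Cross-entropy of each row's softmax against the diagonal target.
        total = 0.0
        for i in range(n):
            logsumexp = math.log(sum(math.exp(x) for x in m[i]))
            total += logsumexp - m[i][i]
        return total / n

    # Average the image-to-text (rows) and text-to-image (columns) losses.
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (xent_rows(sim) + xent_rows(transposed))

# A batch where the diagonal dominates yields a near-zero loss.
print(clip_loss([[5.0, 0.0], [0.0, 5.0]]))
```

When the similarity matrix carries no signal (all zeros), the loss equals `log(N)`, the entropy of a uniform guess over the batch.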

Evaluation

You can evaluate the model by running:

python main.py --eval --resume ${MODELPATH} --vision_model ${VISIONMODEL}
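The R@K numbers reported below measure the fraction of queries whose ground-truth match appears in the top K retrieved candidates. A minimal sketch of how recall@K can be computed from a similarity matrix, assuming each query's ground truth shares its index (function name is illustrative, not from this repo):

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth item is in the top-k.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        topk = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        if i in topk:
            hits += 1
    return 100.0 * hits / len(sim)

sim = [
    [0.9, 0.2, 0.1],  # query 0: correct item ranked 1st
    [0.8, 0.3, 0.1],  # query 1: correct item ranked 2nd
    [0.7, 0.6, 0.2],  # query 2: correct item ranked 3rd
]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))  # -> 33.33... 66.66...
```

Note that MS-COCO has 5 captions per image, so the repo's actual text-to-image evaluation presumably maps the 5 caption indices of each image back to one image index; the sketch above shows only the one-to-one case.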

Zero-shot Results on MS-COCO

| Vision model | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|
| RN50 | 49.10 | 73.04 | 82.02 | 28.56 | 53.00 | 64.54 |
| RN50x4 | 53.12 | 76.90 | 84.82 | 33.42 | 58.10 | 68.36 |
| RN50x16 | 55.24 | 78.68 | 86.60 | 35.45 | 60.05 | 70.12 |
| RN50x64 | 58.60 | 80.70 | 87.60 | 35.45 | 59.92 | 70.20 |
| RN101 | 49.56 | 74.48 | 82.38 | 30.65 | 55.47 | 66.06 |
| ViT-B/32 | 50.16 | 75.02 | 83.58 | 30.42 | 56.04 | 66.88 |
| ViT-B/16 | 52.38 | 76.86 | 84.76 | 33.05 | 58.49 | 69.16 |
| ViT-L/14 | 56.36 | 79.50 | 86.66 | 36.54 | 60.97 | 71.16 |
| ViT-L/14 (336px) | 58.06 | 81.12 | 87.92 | 37.18 | 61.59 | 71.42 |

Fine-tuned Results on MS-COCO 5K

| Vision model | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|
| ViT-B/32 | 62.22 | 85.62 | 91.66 | 46.94 | 74.88 | 83.56 |
| ViT-B/16 | 68.76 | 88.66 | 93.94 | 52.45 | 78.66 | 86.66 |

Planning

  • Mixed-precision training
  • Training and evaluation code for Flickr30K
