A simple PyTorch implementation of a CLIP-based baseline for image-text matching.
This project provides a CLIP-based training and evaluation framework for image-text matching on the MS-COCO dataset.
We recommend the following dependencies.
- Python 3.8
- PyTorch 1.7.1
- NumPy (>1.19.5)
- TensorBoard
```
pip install -r requirements.txt
```
We follow the same split provided by VSE++.
Dataset images can be found here or here. Dataset splits and annotations can be found here.
The final data directory tree should be:
```
${DATAPATH}/
├── annotations/
│   ├── captions_train2014.json
│   ├── captions_val2014.json
│   ├── coco_train_ids.npy
│   ├── coco_dev_ids.npy
│   ├── coco_test_ids.npy
│   ├── coco_restval_ids.npy
│   └── ...
│
└── images/  # all images of MS-COCO
```
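As a quick sanity check of the layout, the split files can be loaded directly. This is only a sketch: it assumes each `*_ids.npy` file holds COCO caption (annotation) ids, one per caption, which appears to be the convention of the VSE++ splits.

```python
import json
import os

import numpy as np

DATAPATH = os.environ.get("DATAPATH", "/path/to/data")

# Assumption: each *_ids.npy file lists COCO caption (annotation) ids.
train_ids = np.load(os.path.join(DATAPATH, "annotations", "coco_train_ids.npy"))
print(f"{len(train_ids)} training captions")

# Map annotation id -> caption record from the official annotation file.
with open(os.path.join(DATAPATH, "annotations", "captions_train2014.json")) as f:
    anns = {a["id"]: a for a in json.load(f)["annotations"]}

first = anns[int(train_ids[0])]
print(first["image_id"], first["caption"])
```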
You can fine-tune the model by running:
ViT-B/32:
```
python main.py --batch_size 256 --epochs 5 --lr 1e-5 --warmup 500 --vision_model ViT-B/32 --dataset_root ${DATAPATH}
```
ViT-B/16:
```
python main.py --batch_size 128 --epochs 5 --lr 1e-5 --warmup 500 --vision_model ViT-B/16 --dataset_root ${DATAPATH}
```
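Under the hood, fine-tuning CLIP for retrieval typically minimizes a symmetric cross-entropy (InfoNCE) loss over the image-text pairs in each batch. The following is a minimal sketch of one such training step using the openai `clip` package; it illustrates the objective, not the exact contents of `main.py`:

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
# jit=False returns a regular nn.Module; cast to fp32 since CLIP ships
# fp16 weights on GPU, which keeps this sketch simple.
model, _preprocess = clip.load("ViT-B/32", device=device, jit=False)
model = model.float()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(images, texts):
    """images: preprocessed image batch (N, 3, 224, 224); texts: N raw strings."""
    image_features = F.normalize(model.encode_image(images.to(device)), dim=-1)
    text_features = F.normalize(
        model.encode_text(clip.tokenize(texts).to(device)), dim=-1)

    # Cosine similarity matrix, scaled by CLIP's learned temperature.
    logits = model.logit_scale.exp() * image_features @ text_features.t()

    # The i-th image matches the i-th caption: targets are the diagonal,
    # and we contrast in both directions (image-to-text and text-to-image).
    labels = torch.arange(images.size(0), device=device)
    loss = (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```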
You can evaluate the model by running:
```
python main.py --eval --resume ${MODELPATH} --vision_model ${VISIONMODEL}
```
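Evaluation reports Recall@K (R@K): the percentage of queries whose correct match is ranked within the top K retrieved items. Below is a minimal sketch of the metric, assuming a precomputed image-text similarity matrix whose ground-truth pairs lie on the diagonal (MS-COCO actually has five captions per image, which the real evaluation must account for):

```python
import torch

def recall_at_k(similarity, ks=(1, 5, 10)):
    """similarity[i, j]: score of image i vs. text j; text i matches image i."""
    ranking = similarity.argsort(dim=1, descending=True)
    # Position of the ground-truth text in each image's ranked list.
    targets = torch.arange(similarity.size(0)).unsqueeze(1)
    ranks = (ranking == targets).int().argmax(dim=1)
    return {f"R@{k}": 100.0 * (ranks < k).float().mean().item() for k in ks}

# Image-to-text uses the matrix as-is; text-to-image uses its transpose:
# print(recall_at_k(sims), recall_at_k(sims.t()))
```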
Results with the pretrained CLIP models (without fine-tuning):

| Vision model | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| RN50 | 49.10 | 73.04 | 82.02 | 28.56 | 53.00 | 64.54 |
| RN50x4 | 53.12 | 76.90 | 84.82 | 33.42 | 58.10 | 68.36 |
| RN50x16 | 55.24 | 78.68 | 86.60 | 35.45 | 60.05 | 70.12 |
| RN50x64 | 58.60 | 80.70 | 87.60 | 35.45 | 59.92 | 70.20 |
| RN101 | 49.56 | 74.48 | 82.38 | 30.65 | 55.47 | 66.06 |
| ViT-B/32 | 50.16 | 75.02 | 83.58 | 30.42 | 56.04 | 66.88 |
| ViT-B/16 | 52.38 | 76.86 | 84.76 | 33.05 | 58.49 | 69.16 |
| ViT-L/14 | 56.36 | 79.50 | 86.66 | 36.54 | 60.97 | 71.16 |
| ViT-L/14 (336px) | 58.06 | 81.12 | 87.92 | 37.18 | 61.59 | 71.42 |
Results after fine-tuning:

| Vision model | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-B/32 | 62.22 | 85.62 | 91.66 | 46.94 | 74.88 | 83.56 |
| ViT-B/16 | 68.76 | 88.66 | 93.94 | 52.45 | 78.66 | 86.66 |
- Mixed-precision training
- Training and evaluation code for Flickr30K