This is the code implementation for the NAACL 2024 paper: "ComCLIP: Training-Free Compositional Image and Text Matching" [Arxiv] [Project Website]
Please follow the instructions below to prepare the datasets.
- Winoground: Download the images and store them as `datasets/winoground_images` (see the download sketch after this list). The code includes the download of the csv file.
- Compositional Visual Genome (ComVG): Download the images and store them as `datasets/comvg_images`. The test csv file is at `datasets/ComVG.csv`.
- SVO-Probes: Download the dataset and store the images as `datasets/SVO-Probes`. Store the csv as `datasets/svo-probes.csv`.
- Flickr30k: Download the images and store them as `datasets/flickr30k_image` (please only select images that are in the test set). The test pickle file is `datasets/flickr30k_test.pkl`.
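Winoground is distributed through the HuggingFace Hub, so the images can be fetched with the `datasets` library. Below is a minimal sketch (not one of the repo scripts); the output filenames are illustrative, so check what the evaluation scripts expect before reusing them.

```python
# Minimal sketch: pull Winoground from the HuggingFace Hub and dump its images.
# Field names follow the public facebook/winoground dataset card; the output
# filenames are illustrative and may differ from what the repo scripts expect.
import os
from datasets import load_dataset

HUGGINGFACE_TOKEN = "hf_..."  # Winoground is gated, so an access token is required
out_dir = "datasets/winoground_images"
os.makedirs(out_dir, exist_ok=True)

# older versions of `datasets` take use_auth_token= instead of token=
winoground = load_dataset("facebook/winoground", split="test", token=HUGGINGFACE_TOKEN)
for example in winoground:
    # each example carries two PIL images and two captions
    example["image_0"].save(os.path.join(out_dir, f"ex_{example['id']}_img_0.png"))
    example["image_1"].save(os.path.join(out_dir, f"ex_{example['id']}_img_1.png"))
```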
Please follow the GRiT and detectron2 setup and the CLIP setup first.
Download `grit_b_densecap_objectdet.pth` and store it in `GRiT/models`.
Please follow the SLIP setup and download the ViT-L weights into `SLIP/MODEL_PATH`.
Create the environment and install the dependencies:

```bash
conda create --name comclip python=3.10
conda activate comclip
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
```
Run the baselines and the ComCLIP/ComBLIP/ComSLIP variants on Winoground:

```bash
### clip baseline
python winoground/clip_baseline.py --huggingface_token HUGGINGFACE_TOKEN
### blip baseline
python winoground/blip_baseline.py --huggingface_token HUGGINGFACE_TOKEN
### slip baseline
python winoground/slip_baseline.py --huggingface_token HUGGINGFACE_TOKEN
### comclip
winoground/comclip.sh datasets/winoground_images DENSE_CAPTION_PATH PARSE_TEXT_PATH GRiT_MODEL HUGGINGFACE_KEY OPENAI_KEY
### comblip
winoground/comblip.sh datasets/winoground_images DENSE_CAPTION_PATH PARSE_TEXT_PATH GRiT_MODEL HUGGINGFACE_KEY OPENAI_KEY
### comslip
winoground/comslip.sh datasets/winoground_images DENSE_CAPTION_PATH PARSE_TEXT_PATH GRiT_MODEL HUGGINGFACE_KEY OPENAI_KEY
```
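For reference, the per-example scoring behind the Winoground baseline looks roughly like the sketch below: CLIP similarities between both captions and both images, reduced to the standard Winoground text, image, and group scores. This is a hedged sketch, not the repo's `clip_baseline.py`; the `pair_scores` helper is hypothetical.

```python
# Sketch of Winoground-style scoring with the openai/CLIP package.
# The text/image/group score definitions follow the Winoground benchmark.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def pair_scores(image_0, image_1, caption_0, caption_1):
    """image_0/image_1 are PIL images, caption_0/caption_1 are strings."""
    images = torch.stack([preprocess(image_0), preprocess(image_1)]).to(device)
    texts = clip.tokenize([caption_0, caption_1]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(images)
        txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = txt_feat @ img_feat.T  # sim[c, i] = cosine similarity(caption_c, image_i)
    text_score = bool((sim[0, 0] > sim[1, 0]) and (sim[1, 1] > sim[0, 1]))
    image_score = bool((sim[0, 0] > sim[0, 1]) and (sim[1, 1] > sim[1, 0]))
    return text_score, image_score, text_score and image_score
```

An example counts toward the group score only when both its text and image scores are correct.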
Run the CLIP baseline and ComCLIP on ComVG:

```bash
### clip baseline
python ComVG/clip_baseline.py --model ViT-L/14 --data_path datasets/ComVG.csv --image_path datasets/comvg_images
### comclip
ComVG/comclip.sh datasets/comvg_images DENSE_CAPTION_PATH GRiT_MODEL_PATH datasets/ComVG.csv OPENAI_KEY ViT-L/14
```
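ComVG evaluation is a binary matching task: each sentence is scored against its matching image and a mismatched image that differs in one component (subject, predicate, or object), and accuracy counts how often the matching image wins. The sketch below illustrates that with plain CLIP; the csv column names `sentence`, `pos_image`, and `neg_image` are hypothetical placeholders, so check `datasets/ComVG.csv` and `ComVG/clip_baseline.py` for the actual schema.

```python
# Hedged sketch of the plain-CLIP matching accuracy on ComVG.
# The csv columns used here (sentence, pos_image, neg_image) are hypothetical.
import os
import clip
import pandas as pd
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

df = pd.read_csv("datasets/ComVG.csv")
image_dir = "datasets/comvg_images"
correct = 0
for _, row in df.iterrows():
    text = clip.tokenize([row["sentence"]]).to(device)
    imgs = torch.stack([
        preprocess(Image.open(os.path.join(image_dir, row["pos_image"]))),
        preprocess(Image.open(os.path.join(image_dir, row["neg_image"]))),
    ]).to(device)
    with torch.no_grad():
        txt_feat = model.encode_text(text)
        img_feat = model.encode_image(imgs)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    sim = (txt_feat @ img_feat.T).squeeze(0)  # cosine similarity with both images
    correct += int(sim[0] > sim[1])           # correct when the matching image scores higher
print(f"accuracy: {correct / len(df):.4f}")
```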
Run the CLIP baseline and ComCLIP for text-to-image retrieval on Flickr30k:

```bash
### clip baseline (precomputed in datasets/flickr30k_test.pkl already)
python image_retrieval/clip_baseline.py --model VISION_ENCODER_TYPE --dataset datasets/flickr30k_test.pkl --image_path datasets/flickr30k_image
### comclip
image_retrieval/comclip.sh datasets/flickr30k_image DENSE_CAPTION_FOLDER GRiT_MODEL_PATH datasets/flickr30k_test.pkl OPENAI_KEY VISION_ENCODER_VERSION
```
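The retrieval numbers are text-to-image recalls. Below is a minimal sketch of that metric, assuming precomputed, L2-normalized text and image embeddings; `recall_at_k` and `gt_index` are hypothetical names, and the real caption-image pairing comes from `datasets/flickr30k_test.pkl`.

```python
# Sketch of text-to-image Recall@k given precomputed, L2-normalized embeddings.
import torch

def recall_at_k(text_feats: torch.Tensor, image_feats: torch.Tensor,
                gt_index: torch.Tensor, ks=(1, 5, 10)):
    """gt_index[t] is the index of the ground-truth image for caption t."""
    sim = text_feats @ image_feats.T              # [num_captions, num_images]
    ranks = sim.argsort(dim=-1, descending=True)  # image indices ranked per caption
    out = {}
    for k in ks:
        hits = (ranks[:, :k] == gt_index.unsqueeze(1)).any(dim=1)
        out[f"R@{k}"] = hits.float().mean().item()
    return out
```

Since each Flickr30k image has five captions, several captions map to the same `gt_index` entry.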
This code is mainly built on GRiT and CLIP. We thank the authors for their models and code.
@article{jiang2022comclip,
title={Comclip: Training-free compositional image and text matching},
author={Jiang, Kenan and He, Xuehai and Xu, Ruize and Wang, Xin Eric},
journal={arXiv preprint arXiv:2211.13854},
year={2022}
}