
Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval

PyTorch code of the paper "Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval". It includes six models: VSE++, SCAN, VSRN, SAEM, SGRAF, and CAMERA.

Jun Rao, Fei Wang, Liang Ding, Shuhan Qi, Yibing Zhan, Weifeng Liu, and Dacheng Tao, "Where does the performance improvement come from - a reproducibility concern about image-text retrieval," in SIGIR, 2022.

Contents

  1. Introduction

  2. Code

  3. Performance

  4. Statement

  5. Version

  6. License

  7. Citation

Introduction

This article aims to provide the information retrieval community with some reflections on recent advances in retrieval learning by analyzing the reproducibility of image-text retrieval models. With the growth of multimodal data over the last decade, image-text retrieval has steadily become a major research direction in information retrieval. Numerous researchers train and evaluate image-text retrieval algorithms on benchmark datasets such as MS-COCO and Flickr30K. Past research has mostly focused on performance, with multiple state-of-the-art methods proposed in a variety of ways. These methods claim to provide improved modality interactions and hence more precise multimodal representations. In contrast to previous works, we focus on the reproducibility of these approaches and on examining the elements that lead to improved performance of pretrained and non-pretrained models in retrieving images and text.

To be more specific, we first examine the related reproducibility concerns and explain why our focus is on image-text retrieval tasks. Second, we systematically summarize the current paradigm of image-text retrieval models and the stated contributions of those approaches. Third, we analyze various aspects of reproducing pretrained and non-pretrained retrieval models. To this end, we conducted ablation experiments and identified several influencing factors that affect retrieval recall more than the improvements claimed in the original papers. Finally, we present some reflections and challenges that the retrieval community should consider in the future. Our source code is publicly available at https://github.com/WangFei-2019/Image-text-Retrieval.

Fig: A unified framework of image-text retrieval.

Code

We adapted all sub-project code to PyTorch 1.7 and CUDA 11 and added a random seed for all methods; a sketch of the kind of seeding this involves is shown below. You can use it directly from the code.
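For reference, global seeding of this kind looks roughly like the helper below (a minimal sketch; the actual seeding lives in each method's training code):

import random

import numpy as np
import torch

def set_seed(seed):
    """Fix all relevant RNGs so that repeated runs are comparable."""
    random.seed(seed)                 # Python built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # PyTorch RNGs on all GPUs
    # cuDNN benchmarking can pick nondeterministic kernels; disable it.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False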

Requirement

We recommend the following dependencies: PyTorch 1.7 with CUDA 11 (as noted above), and NLTK with the punkt tokenizer:

import nltk
nltk.download('punkt')

Download data and vocab

We follow the bottom-up attention model and SCAN to obtain image features for a fair comparison. More details about data pre-processing (optional) can be found here. All the data needed to reproduce the experiments in the paper, including image features and vocabularies, can be downloaded from SCAN by using:

wget https://scanproject.blob.core.windows.net/scan-data/data.zip
wget https://scanproject.blob.core.windows.net/scan-data/vocab.zip
# You can also get the data from google drive: https://drive.google.com/drive/u/1/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC.
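After downloading, unpack both archives into one directory and point $DATA_PATH at it (the paths below are illustrative):

unzip data.zip -d data
unzip vocab.zip -d data
export DATA_PATH=$(pwd)/data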

We use bottom-up attention to extract the positions of detected boxes, including coordinates, widths, and heights, which can be downloaded from https://drive.google.com/file/d/1K9LnWJc71dK6lF1BJMPlbkIu_vYmHjVP/view?usp=sharing. You can put the MSCOCO and Flickr30K data in the same directory.

We refer to the path of extracted files as $DATA_PATH.
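As a quick sanity check, the box and size files are plain NumPy arrays and can be inspected directly (file names follow the $DATA_PATH layout shown below):

import numpy as np

boxes = np.load('f30k_precomp/train_boxes.npy')      # detected box positions per image
sizes = np.load('f30k_precomp/train_img_sizes.npy')  # image sizes used to normalize the boxes
print(boxes.shape, sizes.shape)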

An example layout for $DATA_PATH:

+-- $DATA_PATH
    +-- coco_precomp
        |-- dev_boxes.npy
        |-- dev_caps.txt
        |-- dev_ids.txt
        |-- dev_img_sizes.npy
        |-- dev_ims.npy
        |-- testall_boxes.npy
        |-- testall_caps.txt
        |-- testall_ids.txt
        |-- testall_img_sizes.npy
        |-- testall_ims.npy
        |-- test_boxes.npy
        |-- test_caps.txt
        |-- test_ids.txt
        |-- test_img_sizes.npy
        |-- test_ims.npy
        |-- train_boxes.npy
        |-- train_caps.txt
        |-- train_ids.txt
        |-- train_img_sizes.npy
        |-- train_ims.npy
    +-- f30k_precomp
        |-- dev_boxes.npy
        |-- dev_caps.txt
        |-- dev_ids.txt
        |-- dev_img_sizes.npy
        |-- dev_ims.npy
        |-- dev_tags.txt
        |-- test_boxes.npy
        |-- test_caps.txt
        |-- test_ids.txt
        |-- test_img_sizes.npy
        |-- test_ims.npy
        |-- test_tags.txt
        |-- train_boxes.npy
        |-- train_caps.txt
        |-- train_ids.txt
        |-- train_img_sizes.npy
        |-- train_ims.npy
        |-- train_tags.txt
    +-- coco
    +-- f30k
    +-- 10crop_precomp

Pretrained BERT model

We use the BERT code from BERT-pytorch. Please follow the instructions there to convert the Google BERT model to a PyTorch save file at $BERT_PATH.
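If you use the conversion script shipped with that codebase, the call looks roughly like the following (flag names follow the pytorch-pretrained-BERT converter; $BERT_BASE_DIR is our placeholder for the directory of the downloaded Google checkpoint):

python convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_PATH/pytorch_model.bin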

Train

You can see the details of all hyperparameters in config.py. If you want to know more about each method, see the README of each method's project in the "original" branch.

# An example for training.
python train.py with "$METHOD_NAME" data_path="$DATA_PATH" data_name="$DATA_NAME"
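The `with` syntax follows Sacred-style command-line configuration: the method name selects a named config, and each key=value pair overrides an entry from config.py. A minimal sketch of how such an entry point is wired up (illustrative only, not the repository's actual train.py):

from sacred import Experiment

ex = Experiment('image-text-retrieval')

@ex.config
def defaults():
    data_path = None           # overridden on the command line
    data_name = 'f30k_precomp'
    max_violation = False

@ex.named_config
def VSE_PP():
    max_violation = True       # selected via: python train.py with VSE_PP

@ex.automain
def main(data_path, data_name, max_violation):
    # The real train.py dispatches to the selected method here.
    print(data_path, data_name, max_violation)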

VSE++

python train.py with VSE_PP data_path="$DATA_PATH" data_name="$DATA_NAME" max_violation=True

SCAN

t-i LSE

# For MSCOCO
python train.py with SCAN data_path="$DATA_PATH" data_name=coco_precomp max_violation=True bi_gru=True agg_func=LogSumExp cross_attn=t2i lambda_lse=6 lambda_softmax=9
# For Flickr30K
python train.py with SCAN data_path="$DATA_PATH" data_name=f30k_precomp max_violation=True bi_gru=True agg_func=LogSumExp cross_attn=t2i lambda_lse=6 lambda_softmax=9

t-i AVG

# For MSCOCO
python train.py with SCAN data_path="$DATA_PATH" data_name=coco_precomp max_violation=True bi_gru=True agg_func=Mean cross_attn=t2i lambda_lse=6 lambda_softmax=9
# For Flickr30K
python train.py with SCAN data_path="$DATA_PATH" data_name=f30k_precomp max_violation=True bi_gru=True agg_func=Mean cross_attn=t2i lambda_lse=6 lambda_softmax=9

i-t LSE

# For MSCOCO
python train.py with SCAN data_path="$DATA_PATH" data_name=coco_precomp max_violation=True bi_gru=True agg_func=LogSumExp cross_attn=i2t lambda_lse=20 lambda_softmax=4
# For Flickr30K
python train.py with SCAN data_path="$DATA_PATH" data_name=f30k_precomp max_violation=True bi_gru=True agg_func=LogSumExp cross_attn=i2t lambda_lse=5 lambda_softmax=4

i-t AVG

# For MSCOCO
python train.py with SCAN data_path="$DATA_PATH" data_name=coco_precomp max_violation=True bi_gru=True agg_func=Mean cross_attn=i2t lambda_lse=6 lambda_softmax=4
# For Flickr30K
python train.py with SCAN data_path="$DATA_PATH" data_name=f30k_precomp max_violation=True bi_gru=True agg_func=Mean cross_attn=i2t lambda_lse=6 lambda_softmax=4

VSRN

# For MSCOCO
python train.py with VSRN data_path="$DATA_PATH" data_name=coco_precomp max_violation=True lr_update=15
# For Flickr30K
python train.py with VSRN data_path="$DATA_PATH" data_name=f30k_precomp max_violation=True lr_update=10

SAEM

python train.py with SAEM data_path="$DATA_PATH" data_name="$DATA_NAME" max_violation=True bert_path="$BERT_PATH"

SGRAF

SGR

# For MSCOCO
python train.py with SGRAF data_path="$DATA_PATH" data_name=coco_precomp module_name=SGR max_violation=True num_epochs=20 lr_update=10
# For Flickr30K
python train.py with SGRAF data_path="$DATA_PATH" data_name=f30k_precomp module_name=SGR max_violation=True num_epochs=40 lr_update=30

SAF

# For MSCOCO
python train.py with SGRAF data_path="$DATA_PATH" data_name=coco_precomp module_name=SAF max_violation=True num_epochs=20 lr_update=10
# For Flickr30K
python train.py with SGRAF data_path="$DATA_PATH" data_name=f30k_precomp module_name=SAF max_violation=True num_epochs=30 lr_update=20

CAMERA

# For MSCOCO
python train.py with CAMERA data_path="$DATA_PATH" data_name=coco_precomp bert_path="$BERT_PATH" max_violation=True num_epochs=40 lr_update=20
# For Flickr30K
python train.py with CAMERA data_path="$DATA_PATH" data_name=f30k_precomp bert_path="$BERT_PATH" max_violation=True num_epochs=30 lr_update=10

Test

A complete test procedure is provided in test.py.

from itr.metricmodule import evaluation

# Evaluate a single model.
DATA_PATH = None  # Set this only if the test data path differs from the training data path.
MODEL_PATH = '$MODEL_PATH'
# Test on Flickr30K.
evaluation.evalrank_single(model_path=MODEL_PATH, data_path=DATA_PATH, split='test')
# Test on MSCOCO (1K test: fold5=True, averaged over five 1K folds; 5K test: fold5=False).
evaluation.evalrank_single(model_path=MODEL_PATH, data_path=DATA_PATH, split='testall', fold5=True)


# Evaluate an ensemble of two models.
DATA_PATH = None  # Set this only if the test data path differs from the training data path.
MODEL_PATH_1 = '$MODEL_PATH'
MODEL_PATH_2 = '$MODEL_PATH'
# Test on Flickr30K.
evaluation.evalrank_ensemble(model_path=MODEL_PATH_1, model_path2=MODEL_PATH_2, data_path=DATA_PATH, split='test')
# Test on MSCOCO (1K test: fold5=True; 5K test: fold5=False).
evaluation.evalrank_ensemble(model_path=MODEL_PATH_1, model_path2=MODEL_PATH_2, data_path=DATA_PATH, split='testall', fold5=False)
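For context, ensemble evaluation in codebases of this kind typically averages the two models' image-text similarity matrices before ranking (this is, for example, how SGRAF combines SAF and SGR). A minimal sketch of the idea, with a helper name of our own:

import numpy as np

def ensemble_sims(sims_a, sims_b):
    """Average two models' (n_images, n_captions) similarity matrices."""
    assert sims_a.shape == sims_b.shape
    return (sims_a + sims_b) / 2.0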

Performance

The detailed results from the paper are listed below. Each cell reports three slash-separated values as given in the paper; a dash (-) marks a value that was not reported.

Results on the Flickr30K test set.

| Method | Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| VSE++ | 43.7/44.4/31.7 | 71.9/73.1/61.4 | 82.1/83.1/73.1 | 32.3/32.6/26.2 | 60.9/61.2/57.4 | 72.1/79.5/70.1 |
| SCAN t-i LSE | 61.1/58.9/45.4 | 85.4/85.5/78.9 | 91.5/91.5/87.7 | 43.3/41.3/35 | 71.9/69.8/65.1 | 80.9/79.4/76.4 |
| SCAN t-i AVG | 61.8/63.0/47.5 | 87.5/88.3/80.2 | 93.7/93.9/89.0 | 45.8/44.5/35.8 | 74.4/73.9/67.0 | 83.0/81.7/77.8 |
| SCAN i-t LSE | 67.7/66.3/46.5 | 88.9/88.4/77 | 94/93.7/86.2 | 44.0/42.0/33.9 | 74.2/72.2/64.0 | 82.6/81.1/74.8 |
| SCAN i-t AVG | 67.9/67.7/46.0 | 89/88.7/77.5 | 94.4/94.6/87 | 43.9/44.5/34.1 | 74.2/73.5/65.9 | 82.8/82.3/76.4 |
| VSRN | 71.3/67.7/54.7 | 90.6/88.2/82.5 | 96/93.8/90.2 | 54.7/49.1/40.6 | 81.8/75.8/72.0 | 88.2/84.1/81.5 |
| SGRAF SAF | 73.7/73.5/74.5 | 93.3/90.5/92.5 | 96.3/95.5/96.8 | 56.1/53.1/56.9 | 81.5/78.7/82.4 | 88/85.4/89.1 |
| SGRAF SGR | 75.2/74.6/73.4 | 93.3/93.1/93.1 | 96.6/96.5/97.2 | 56.2/56.1/54.9 | 81.0/80.4/81.4 | 86.5/87.3/88.1 |
| SAEM (randomly initialized BERT) | 69.1/50.4/40.3 | 91.0/76.5/70.4 | 95.1/85.7/79.3 | 52.4/34.1/28.5 | 81.1/62.4/57.6 | 88.1/72.3/68.5 |
| SAEM (pretrained BERT) | 69.1/68.1/63.1 | 91.0/90.6/88.9 | 95.1/95.5/94.4 | 52.4/52.6/50.1 | 81.1/80.1/79.7 | 88.1/87.1/87.6 |
| CAMERA (randomly initialized BERT) | 78.0/38.8/50.1 | 95.1/61.6/79.2 | 97.9/71.6/87.8 | 60.3/23.7/35.5 | 85.9/47.9/65.7 | 91.7/56.9/76.4 |
| CAMERA (pretrained BERT) | 78.0/67.0/71.1 | 95.1/90.6/91.7 | 97.9/96.2/95.7 | 60.3/52.0/55.1 | 85.9/84.1/82.9 | 91.7/92.6/89.8 |

Results on the MSCOCO 1K test set.

| Method | Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| VSE++ | -/67.8/35.4 | -/90.9/65.2 | -/96.1/75.9 | -/56.9/25.2 | -/87.6/54.9 | -/93.2/67.4 |
| SCAN t-i LSE | 67.5/64.4/54.4 | 92.9/91.9/86.0 | 97.6/96.9/94.5 | 53/49.5/40.7 | 85.4/83/76.1 | 92.9/91.2/86.7 |
| SCAN t-i AVG | 70.9/69.4/57.1 | 94.5/93.8/88.6 | 97.8/97.5/95.8 | 56.4/54.9/45.4 | 87/85.8/79 | 93.9/93.2/88.5 |
| SCAN i-t LSE | 68.4/65.9/49.0 | 93.9/93.3/83.0 | 98.0/98.0/92.9 | 54.8/53.1/38.1 | 86.1/85.2/75.4 | 93.3/92.7/87.3 |
| SCAN i-t AVG | 69.2/68.1/55.9 | 93.2/93.7/88.8 | 97.5/97.5/95.7 | 54.4/55.5/44.5 | 86/86.3/79.6 | 93.6/93.5/89.5 |
| VSRN | 76.2/71.1/66.7 | 94.8/94.1/92.4 | 98.2/97.6/96.8 | 62.8/58.5/55.4 | 89.7/87.2/86.4 | 95.1/93.5/92.9 |
| SGRAF SAF | 76.1/75.9/75.5 | 95.4/95.5/95.3 | 98.3/98.3/98.1 | 61.8/60.5/60.7 | 89.4/88.5/89.0 | 95.3/94.7/95.1 |
| SGRAF SGR | 78/76.6/74.9 | 95.8/95.8/95.5 | 98.2/98.4/98.2 | 61.4/61.0/60.3 | 89.3/89.2/89.0 | 95.4/95.1/94.9 |
| SAEM (randomly initialized BERT) | 71.2/58.9/56.3 | 94.1/87.2/83.8 | 97.7/93.6/91.5 | 57.8/47.5/41.3 | 88.6/81.2/76.4 | 94.9/90.5/87.7 |
| SAEM (pretrained BERT) | 71.2/73.9/70.2 | 94.1/93.9/92.7 | 97.7/97.7/97.0 | 57.8/59.9/58.1 | 88.6/89.8/88.2 | 94.9/95.4/94.9 |
| CAMERA (randomly initialized BERT) | 77.5/62.9/57.4 | 96.3/89.2/87.0 | 98.8/95.1/93.8 | 63.4/49.7/64.4 | 90.9/82.1/81.4 | 95.8/90.8/90.8 |
| CAMERA (pretrained BERT) | 77.5/75.4/72.1 | 96.3/95.3/94.3 | 98.8/98.6/98.3 | 63.4/62.0/59.7 | 90.9/90.1/89.4 | 95.8/95.0/95.0 |

Results on the MSCOCO 5K test set.

| Method | Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| VSE++ | 49/44.8/21.3 | 79.8/75.2/45.9 | 88.4/84.6/59.1 | 37.1/32.6/16.5 | 72.2/66.7/42.4 | 83.8/79.1/56.7 |
| SCAN t-i LSE | -/39.1/28.5 | -/71/59.0 | -/82.5/73.4 | -/27.3/20.0 | -/56.6/46.9 | -/69.6/61.3 |
| SCAN t-i AVG | -/45.1/30.4 | -/75.7/62.5 | -/86.4/76.5 | -/32.9/23.9 | -/62.2/51.7 | -/74.4/65.1 |
| SCAN i-t LSE | 46.4/41.0/23.3 | 77.4/73.6/54.4 | 87.2/84.5/68.9 | 34.4/30.6/17.2 | 63.7/60.9/44.5 | 75.7/73.5/59.2 |
| SCAN i-t AVG | -/43.4/30.4 | -/75.0/62.7 | -/86.6/76.6 | -/32.9/22.0 | -/62.8/51.7 | -/75/65.7 |
| VSRN | 53/48/41.3 | 81.1/77.6/73.4 | 89.4/87.3/84.4 | 40.5/35.9/31.8 | 70.6/66.4/63.4 | 81.1/77.7/75.7 |
| SGRAF SAF | 53.3/54.8/51.9 | -/82.4/81.6 | 90.1/90.4/89.8 | 39.8/38.8/38.7 | -/67.7/68.1 | 80.2/79/79.3 |
| SGRAF SGR | 56.9/55.1/51.3 | -/82.7/81.2 | 90.5/90.7/89.5 | 40.2/39.1/38.7 | -/68.5/68.1 | 79.8/79.5/79.3 |
| SAEM (randomly initialized BERT) | -/34.6/28.8 | -/63.8/57.6 | -/76.5/70.4 | -/25.0/20.8 | -/53.3/47.0 | -/66.9/60.9 |
| SAEM (pretrained BERT) | -/47.2/43.3 | -/76.8/74.2 | -/87.0/85.1 | -/34.9/32.2 | -/65.7/63.1 | -/77.8/75.9 |
| CAMERA (randomly initialized BERT) | 55.1/38.3/32.0 | 82.9/68.1/32.4 | 91.2/80/75.2 | 40.5/27.5/24.4 | 71.7/57.3/53.4 | 82.5/69.9/66.8 |
| CAMERA (pretrained BERT) | 55.1/52.6/48.2 | 82.9/81.9/78.3 | 91.2/90.0/87.7 | 40.5/39.0/35.9 | 71.7/70.3/67.7 | 82.5/81.4/79.7 |

Fine-tuned 10 times with different random seeds on the Flickr30K test set.


VSE++ on Flickr30K (10 random seeds)

| Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- |
| 43.2 | 71.4 | 81.3 | 32.5 | 61 | 72 |
| 45.1 | 74.3 | 83.2 | 32.7 | 61.8 | 72.3 |
| 41.9 | 71.8 | 82.7 | 32.5 | 61.7 | 72.2 |
| 45.2 | 73 | 83.3 | 32.9 | 61.6 | 72.5 |
| 44 | 71.5 | 83 | 32.4 | 61.5 | 72.7 |
| 43.2 | 73.3 | 83.3 | 32.9 | 61.2 | 72.3 |
| 44.2 | 72.2 | 82.9 | 33 | 61.8 | 72.9 |
| 47 | 72.8 | 82.5 | 32.6 | 61.6 | 72.6 |
| 44.9 | 73.3 | 83.4 | 32.7 | 61.7 | 72.5 |
| 45.3 | 73 | 82.7 | 32.8 | 61.3 | 72.7 |


SCAN i-t AVG on Flickr30K (10 random seeds)

| Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- |
| 66.7 | 88.5 | 94.4 | 44 | 73.1 | 82.1 |
| 65.8 | 88.4 | 93.9 | 44.5 | 72.9 | 82.3 |
| 66.3 | 87.6 | 94.4 | 43.9 | 73.3 | 81.7 |
| 66.3 | 87.6 | 93.7 | 43.1 | 72.6 | 81.2 |
| 65.7 | 88.1 | 94.2 | 42.3 | 72.5 | 81 |
| 66.5 | 87.8 | 93.8 | 43.7 | 72.5 | 81.8 |
| 65.6 | 88.3 | 93.7 | 44.2 | 73.7 | 81.8 |
| 66.8 | 88.6 | 93.9 | 44.7 | 73.2 | 82.2 |
| 66.3 | 88.7 | 94.7 | 44.2 | 73.7 | 82.2 |
| 66.2 | 89.1 | 94 | 44.7 | 73.2 | 82.3 |


VSRN on Flickr30K (10 random seeds)

| Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- |
| 65.3 | 87.7 | 92.8 | 48.7 | 76.3 | 84.1 |
| 66.6 | 89.1 | 93.1 | 49 | 76.4 | 84.5 |
| 67.8 | 89.2 | 93.8 | 51 | 76.9 | 84.5 |
| 66.9 | 87.7 | 93.5 | 50.2 | 76.4 | 84.9 |
| 64.8 | 89.3 | 93.6 | 47.8 | 76.4 | 84.3 |
| 66.5 | 88.5 | 93.2 | 49.8 | 76.2 | 84.1 |
| 65.5 | 87.2 | 93.2 | 48.4 | 75.3 | 83.7 |
| 70.7 | 90.8 | 95 | 52 | 78.5 | 85.6 |
| 68.8 | 89.2 | 93.5 | 48.6 | 76.3 | 84.6 |
| 65.7 | 88.8 | 94.4 | 48.7 | 75.9 | 84.5 |


SAEM on Flickr30K (10 random seeds)

| Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- |
| 69.5 | 91.9 | 95.1 | 53.2 | 80.9 | 88.3 |
| 71.4 | 91.5 | 94.2 | 53.3 | 81.0 | 88.2 |
| 69.3 | 90.1 | 94.6 | 52.5 | 80.2 | 88.1 |
| 68.3 | 90.8 | 95.3 | 52.0 | 80.6 | 88.3 |
| 69.9 | 90.7 | 95.5 | 52.3 | 81.0 | 88.5 |
| 70.1 | 91.6 | 95.5 | 53.0 | 81.6 | 88.5 |
| 71.3 | 90.8 | 95.7 | 52.2 | 80.9 | 88.2 |
| 72.0 | 91.0 | 96.0 | 52.9 | 81.4 | 88.3 |
| 69.2 | 91.6 | 95.3 | 52.8 | 80.7 | 87.8 |
| 67.8 | 90.7 | 95.5 | 52.2 | 80.4 | 87.8 |


SGRAF-SGR on Flickr30K (10 random seeds)

| Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- |
| 75.6 | 92.4 | 96.3 | 56.6 | 81.3 | 87.3 |
| 78.4 | 93.9 | 96.6 | 57.6 | 81.2 | 86.3 |
| 76.4 | 93.9 | 96.4 | 55.9 | 81.5 | 87.4 |
| 75.9 | 93.3 | 96.5 | 56.2 | 80.6 | 85.6 |
| 74.8 | 93.2 | 97.1 | 56.2 | 81.5 | 86.8 |
| 78.4 | 93.9 | 96.6 | 57.6 | 81.2 | 86.3 |
| 75.4 | 93.1 | 96.2 | 56.4 | 81.8 | 87.1 |
| 75 | 91.5 | 96.4 | 56.1 | 81.9 | 87.8 |
| 76.5 | 93.2 | 96.4 | 55.9 | 81.6 | 87.5 |
| 73.5 | 90.5 | 95.5 | 53.1 | 78.7 | 85.4 |


CAMERA on Flickr30K (10 random seeds)

| Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- |
| 77.2 | 93.3 | 97.0 | 59.0 | 84.5 | 90.7 |
| 76.9 | 93.9 | 97.2 | 58.8 | 85.1 | 90.9 |
| 77.6 | 94.3 | 96.8 | 58.8 | 85.0 | 90.8 |
| 76.3 | 92.9 | 96.7 | 58.6 | 84.7 | 90.3 |
| 76.8 | 95.1 | 98.0 | 59.3 | 84.7 | 90.7 |
| 74.7 | 94.8 | 97.2 | 58.1 | 84.4 | 90.5 |
| 76.3 | 94.4 | 96.9 | 58.7 | 84.9 | 90.5 |
| 76.2 | 94.1 | 97.2 | 58.4 | 84.7 | 90.7 |
| 75.3 | 93.9 | 96.8 | 59.0 | 84.9 | 90.6 |
| 76.8 | 94.0 | 97.2 | 58.8 | 84.5 | 90.7 |


SAEM (randomly initialized BERT) on Flickr30K (10 random seeds)

| Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- |
| 50.6 | 77.3 | 86.3 | 34.3 | 63 | 72.6 |
| 51.2 | 78.8 | 86.6 | 34.2 | 62.6 | 73.1 |
| 50.1 | 76.2 | 84.2 | 34.9 | 62.6 | 72 |
| 50.6 | 78.5 | 85.8 | 34.1 | 62.1 | 72.9 |
| 50.7 | 77.2 | 86.2 | 33.1 | 62.3 | 72.5 |
| 49.8 | 79.6 | 86.3 | 33.9 | 62 | 72.5 |
| 50.6 | 79.3 | 86.5 | 33.7 | 63 | 72.7 |
| 51.2 | 78.4 | 86.5 | 33.9 | 62.4 | 72.4 |
| 50.4 | 77.4 | 86.4 | 33.4 | 62 | 72.6 |
| 51.5 | 78.9 | 85.6 | 33.9 | 62.7 | 72.3 |


CAMERA (randomly initialized BERT) on Flickr30K (10 random seeds)

| Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- |
| 51.8 | 76.7 | 83.6 | 33.7 | 61.2 | 70.3 |
| 49.8 | 76.1 | 84.5 | 33.7 | 60.5 | 70.3 |
| 40 | 66.4 | 77.2 | 26.3 | 50.6 | 59.9 |
| 30.3 | 54.9 | 65.9 | 19.9 | 41.1 | 51.1 |
| 46.3 | 72.8 | 80.1 | 29.1 | 55.2 | 64.3 |
| 51.4 | 78.2 | 84.8 | 34.4 | 61 | 70 |
| 49.8 | 75.4 | 82.8 | 33.1 | 58.9 | 67.4 |
| 46.4 | 70.6 | 79.5 | 28.3 | 53.3 | 62.5 |
| 44.1 | 70 | 79 | 28.3 | 54.3 | 63.6 |
| 39.9 | 66.3 | 75.2 | 25.2 | 49.1 | 59.3 |

Statement

In our research, we found that the image-text retrieval community needs a unified codebase for future research. We publish a beta version for image-text retrieval research. Every researcher is welcome to test the code; we appreciate valuable comments and will reply in the issues area.

Version

beta-v0.1

License

The license is CC-BY-NC 4.0.

Citation

Please cite as:

@inproceedings{rao2022reproducibility,
    title = {Where Does the Performance Improvement Come From - A Reproducibility Concern about Image-Text Retrieval},
    author = {Jun Rao and Fei Wang and Liang Ding and Shuhan Qi and Yibing Zhan and Weifeng Liu and Dacheng Tao},
    booktitle = {SIGIR},
    year = {2022}
}

