diff --git a/README.md b/README.md
index cfff5e2..7d8f0d3 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,12 @@
 # SGRAF
-PyTorch implementation for AAAI2021 paper of [**“Similarity Reasoning and Filtration for Image-Text Matching”**](https://drive.google.com/file/d/1tAE_qkAxiw1CajjHix9EXoI7xu2t66iQ/view?usp=sharing).
-It is built on top of the [SCAN](https://github.com/kuanghuei/SCAN) and [Cross-modal_Retrieval_Tutorial](https://github.com/Paranioar/Cross-modal_Retrieval_Tutorial).
+*PyTorch implementation for the AAAI2021 paper [**“Similarity Reasoning and Filtration for Image-Text Matching”**](https://drive.google.com/file/d/1tAE_qkAxiw1CajjHix9EXoI7xu2t66iQ/view?usp=sharing).*
+
+*It is built on top of [SCAN](https://github.com/kuanghuei/SCAN) and [Awesome_Matching](https://github.com/Paranioar/Awesome_Matching_Pretraining_Transfering).*
+
+*We have released two versions of SGRAF: **Branch `main` for Python 2.7**; **Branch `python3.6` for Python 3.6**.*
+
+*If you have any problems, please contact me at r1228240468@gmail.com (r1228240468@mail.dlut.edu.cn is deprecated).*
+
 
 ## Introduction
 
@@ -8,48 +14,52 @@ It is built on top of the [SCAN](https://github.com/kuanghuei/SCAN) and [Cross-m
-**The updated results (Better than the original paper)**
-
+## Requirements
+We recommend the following dependencies for ***Branch `python3.6`***.
+* Python 3.6
+* [PyTorch (>=0.4.1)](http://pytorch.org/)
+* [NumPy (>=1.12.1)](http://www.numpy.org/)
+* [TensorBoard](https://github.com/TeamHG-Memex/tensorboard_logger)
+[Note]: The code applies to ***Python 3.6 + PyTorch 1.7***.
+
+## Acknowledgements
+Thanks to the exploration and discussion with [KevinLight831](https://github.com/KevinLight831), we made the following adjustments:
+**1. Adjust `evaluation.py`**:
+*for i, (k, v) in enumerate(self.meters.iteritems()):*
+***------>** ```for i, (k, v) in enumerate(self.meters.items()):```*
+*for k, v in self.meters.iteritems():*
+***------>** ```for k, v in self.meters.items():```*
+
+**2. Adjust `model.py`**:
+*cap_emb = (cap_emb[:, :, :cap_emb.size(2)/2] + cap_emb[:, :, cap_emb.size(2)/2:])/2*
+***------>** ```cap_emb = (cap_emb[:, :, :cap_emb.size(2)//2] + cap_emb[:, :, cap_emb.size(2)//2:])/2```*
+
+**3. Adjust `data.py`**:
+*img_id = index/self.im_div*
+***------>** ```img_id = index//self.im_div```*
-| Dataset | Module | Sentence retrieval R@1 | R@5 | R@10 | Image retrieval R@1 | R@5 | R@10 |
-| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
-| Flickr30k | SAF | 75.6 | 92.7 | 96.9 | 56.5 | 82.0 | 88.4 |
-| Flickr30k | SGR | 76.6 | 93.7 | 96.6 | 56.1 | 80.9 | 87.0 |
-| Flickr30k | SGRAF | 78.4 | 94.6 | 97.5 | 58.2 | 83.0 | 89.1 |
-| MSCOCO1k | SAF | 78.0 | 95.9 | 98.5 | 62.2 | 89.5 | 95.4 |
-| MSCOCO1k | SGR | 77.3 | 96.0 | 98.6 | 62.1 | 89.6 | 95.3 |
-| MSCOCO1k | SGRAF | 79.2 | 96.5 | 98.6 | 63.5 | 90.2 | 95.8 |
-| MSCOCO5k | SAF | 55.5 | 83.8 | 91.8 | 40.1 | 69.7 | 80.4 |
-| MSCOCO5k | SGR | 57.3 | 83.2 | 90.6 | 40.5 | 69.6 | 80.3 |
-| MSCOCO5k | SGRAF | 58.8 | 84.8 | 92.1 | 41.6 | 70.9 | 81.5 |
+*for line in open(loc+'%s_caps.txt' % data_split, 'rb'):*
+*tokens = nltk.tokenize.word_tokenize(str(caption).lower().decode('utf-8'))*
-## Requirements
-We recommended the following dependencies.
+***------>** ```for line in open(loc+'%s_caps.txt' % data_split, 'rb'):```*
+***------>** ```tokens = nltk.tokenize.word_tokenize(caption.lower().decode('utf-8'))```*
-* Python **(2.7 not 3.\*)**
-* [PyTorch](http://pytorch.org/) **(0.4.1 not 1.\*)**
-* [NumPy](http://www.numpy.org/) **(>1.12.1)**
-* [TensorBoard](https://github.com/TeamHG-Memex/tensorboard_logger)
-* Punkt Sentence Tokenizer:
-```python
-import nltk
-nltk.download()
-> d punkt
-```
+or
+
+***------>** ```for line in open(loc+'%s_caps.txt' % data_split, 'r', encoding='utf-8'):```*
+***------>** ```tokens = nltk.tokenize.word_tokenize(str(caption).lower())```*
 
 ## Download data and vocab
 We follow [SCAN](https://github.com/kuanghuei/SCAN) to obtain image features and vocabularies, which can be downloaded by using:
 ```bash
-wget https://scanproject.blob.core.windows.net/scan-data/data.zip
-wget https://scanproject.blob.core.windows.net/scan-data/vocab.zip
+https://www.kaggle.com/datasets/kuanghueilee/scan-features
+```
+Another download link is available below:
+
+```bash
+https://drive.google.com/drive/u/0/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC
 ```
 
 ## Pre-trained models and evaluation
@@ -82,16 +92,18 @@ For Flickr30K:
 If SGRAF is useful for your research, please cite the following paper:
 
-    @inproceedings{Diao2021SGRAF,
-      title={Similarity Reasoning and Filtration for Image-Text Matching},
-      author={Diao, Haiwen and Zhang, Ying and Ma, Lin and Lu, Huchuan},
-      booktitle={AAAI},
-      year={2021}
-    }
+    @inproceedings{Diao2021SGRAF,
+      title={Similarity reasoning and filtration for image-text matching},
+      author={Diao, Haiwen and Zhang, Ying and Ma, Lin and Lu, Huchuan},
+      booktitle={Proceedings of the AAAI conference on artificial intelligence},
+      volume={35},
+      number={2},
+      pages={1218--1226},
+      year={2021}
+    }
 
 ## License
 [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0).
-If any problems, please contact me at (r1228240468@mail.dlut.edu.cn) or (r1228240468@gmail.com).
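The three adjustments listed in the README's Acknowledgements all stem from Python 2 to Python 3 semantic changes: `dict.iteritems()` no longer exists, `/` is true division rather than floor division, and files opened in binary mode yield `bytes` that must be decoded before tokenization. The snippet below is an illustrative, standalone sketch of those behaviours with toy values; it is not part of the repository.

```python
# Standalone Python 3 sketch (toy values) of the behaviours behind the three adjustments above.

meters = {'loss': 0.51, 'r@1': 78.4}
for i, (k, v) in enumerate(meters.items()):      # .iteritems() was removed in Python 3
    print(i, k, v)

index, im_div = 7, 5
img_id = index // im_div                         # 1; a bare / would return 1.4 (a float) in Python 3
print(img_id)

line = b'Two dogs play in the snow .\n'          # open(..., 'rb') yields bytes
caption = line.strip().decode('utf-8')           # decode before lowercasing/tokenizing,
tokens = caption.lower().split()                 # or open with 'r', encoding='utf-8' to get str directly
print(tokens)                                    # data.py uses nltk.tokenize.word_tokenize instead of split()
```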
diff --git a/data.py b/data.py
index 03c0b9e..2ab256a 100644
--- a/data.py
+++ b/data.py
@@ -20,9 +20,15 @@ def __init__(self, data_path, data_split, vocab):
         # load the raw captions
         self.captions = []
-        with open(loc+'%s_caps.txt' % data_split, 'rb') as f:
-            for line in f:
-                self.captions.append(line.strip())
+
+        # -------- The main difference between python2.7 and python3.6 --------#
+        # The suggestion from Hongguang Zhu (https://github.com/KevinLight831)
+        # ---------------------------------------------------------------------#
+        # for line in open(loc+'%s_caps.txt' % data_split, 'r', encoding='utf-8'):
+        #     self.captions.append(line.strip())
+
+        for line in open(loc+'%s_caps.txt' % data_split, 'rb'):
+            self.captions.append(line.strip())
 
         # load the image features
         self.images = np.load(loc+'%s_ims.npy' % data_split)
@@ -40,14 +46,18 @@ def __init__(self, data_path, data_split, vocab):
     def __getitem__(self, index):
         # handle the image redundancy
-        img_id = index/self.im_div
+        img_id = index//self.im_div
         image = torch.Tensor(self.images[img_id])
         caption = self.captions[index]
         vocab = self.vocab
 
+        # -------- The main difference between python2.7 and python3.6 --------#
+        # The suggestion from Hongguang Zhu (https://github.com/KevinLight831)
+        # ---------------------------------------------------------------------#
+        # tokens = nltk.tokenize.word_tokenize(str(caption).lower())
+
         # convert caption (string) to word ids.
-        tokens = nltk.tokenize.word_tokenize(
-            str(caption).lower().decode('utf-8'))
+        tokens = nltk.tokenize.word_tokenize(caption.lower().decode('utf-8'))
         caption = []
         caption.append(vocab('<start>'))
         caption.extend([vocab(token) for token in tokens])
diff --git a/evaluation.py b/evaluation.py
index 5d300ff..4a48a9b 100644
--- a/evaluation.py
+++ b/evaluation.py
@@ -59,7 +59,7 @@ def __str__(self):
         """Concatenate the meters in one log line
         """
         s = ''
-        for i, (k, v) in enumerate(self.meters.iteritems()):
+        for i, (k, v) in enumerate(self.meters.items()):
             if i > 0:
                 s += ' '
             s += k + ' ' + str(v)
@@ -68,7 +68,7 @@ def __str__(self):
     def tb_log(self, tb_logger, prefix='', step=None):
         """Log using tensorboard
         """
-        for k, v in self.meters.iteritems():
+        for k, v in self.meters.items():
             tb_logger.log_value(prefix + k, v.val, step=step)
 
@@ -125,7 +125,7 @@ def evalrank(model_path, data_path=None, split='dev', fold5=False):
         opt.data_path = data_path
 
     # load vocabulary used by the model
-    vocab = deserialize_vocab(os.path.join(opt.vocab_path, '%s_vocab.json' % opt.data_name))
+    vocab = deserialize_vocab('./vocab/%s_vocab.json' % opt.data_name)
     opt.vocab_size = len(vocab)
 
     # construct model
@@ -295,5 +295,5 @@ def t2i(images, captions, caplens, sims, npts=None, return_ranks=False):
 
 if __name__ == '__main__':
-    evalrank("/apdcephfs/share_1313228/home/haiwendiao/SGRAF-master/runs/SAF_module/checkpoint/model_best.pth.tar",
-             data_path="/apdcephfs/share_1313228/home/haiwendiao", split="test", fold5=False)
+    evalrank("./runs/Flickr30K_SGRAF/f30k_SAF/model_best.pth.tar",
+             data_path='./data', split="test", fold5=False)
diff --git a/model.py b/model.py
index 1b1171b..1985b1e 100644
--- a/model.py
+++ b/model.py
@@ -119,7 +119,7 @@ def forward(self, captions, lengths):
         cap_emb, _ = pad_packed_sequence(out, batch_first=True)
 
         if self.use_bi_gru:
-            cap_emb = (cap_emb[:, :, :cap_emb.size(2)/2] + cap_emb[:, :, cap_emb.size(2)/2:])/2
+            cap_emb = (cap_emb[:, :, :cap_emb.size(2)//2] + cap_emb[:, :, cap_emb.size(2)//2:])/2
 
         # normalization in the joint embedding space
         if not self.no_txtnorm:
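For the `model.py` hunk above: with `bidirectional=True`, PyTorch concatenates the forward and backward GRU states along the last dimension, so the caption embedding is folded back to a single direction by splitting that dimension in half and averaging; the `//` keeps the split index an integer under Python 3. The following is a minimal standalone sketch with made-up sizes, not the repository's actual configuration.

```python
# Minimal sketch of averaging the two directions of a bi-GRU output (made-up sizes).
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=300, hidden_size=1024, batch_first=True, bidirectional=True)
words = torch.randn(2, 7, 300)        # (batch, seq_len, word_dim)
cap_emb, _ = rnn(words)               # (2, 7, 2048): forward and backward states concatenated

half = cap_emb.size(2) // 2           # 1024; cap_emb.size(2)/2 would be a float in Python 3
cap_emb = (cap_emb[:, :, :half] + cap_emb[:, :, half:]) / 2
print(cap_emb.shape)                  # torch.Size([2, 7, 1024])
```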
diff --git a/requirements.txt b/requirements.txt
index 870eadc..dcfe491 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,56 +1,51 @@
-backports.functools-lru-cache==1.6.1
-backports.weakref==1.0.post1
-bleach==1.5.0
-boto3==1.17.8
-botocore==1.20.8
-certifi==2019.11.28
-cffi==1.14.0
+absl-py==0.12.0
+astor==0.8.1
+boto3==1.17.53
+botocore==1.20.53
+cached-property==1.5.2
+certifi==2020.12.5
+cffi==1.14.5
 chardet==4.0.0
 click==7.1.2
-cloudpickle==1.3.0
-cycler==0.10.0
-Cython==0.29.13
-decorator==4.4.2
-enum34==1.1.10
-funcsigs==1.0.2
-futures==3.3.0
-html5lib==0.9999999
+docopt==0.6.2
+gast==0.4.0
+google-pasta==0.2.0
+grpcio==1.37.0
+h5py==3.1.0
 idna==2.10
+importlib-metadata==3.10.1
 jmespath==0.10.0
-joblib==0.14.1
-kiwisolver==1.1.0
-Markdown==3.1.1
-matplotlib==2.2.4
-mock==3.0.5
-networkx==2.2
-nltk==3.4.5
-numpy==1.16.5
+joblib==1.0.1
+Keras-Applications==1.0.8
+Keras-Preprocessing==1.1.2
+Markdown==3.3.4
+mkl-fft==1.3.0
+mkl-random==1.1.1
+mkl-service==2.3.0
+nltk==3.6.1
+numpy==1.16.4
 olefile==0.46
-opencv-python==4.2.0.32
-pandas==0.24.2
-Pillow==6.2.1
-protobuf==3.12.2
-ptflops==0.6.4
-pycocotools==2.0
+Pillow==8.2.0
+pipreqs==0.4.10
+protobuf==3.15.8
 pycparser==2.20
-pyparsing==2.4.7
 python-dateutil==2.8.1
-pytz==2020.1
-PyWavelets==1.0.3
-regex==2020.11.13
+regex==2021.4.4
 requests==2.25.1
-s3transfer==0.3.4
-sacremoses==0.0.43
-scikit-image==0.14.5
-scipy==1.2.3
-singledispatch==3.4.0.3
+s3transfer==0.3.7
+scipy==1.5.4
 six==1.15.0
-subprocess32==3.5.4
+tensorboard==1.14.0
 tensorboard-logger==0.1.0
-tensorflow==1.4.0
-tensorflow-tensorboard==0.4.0
-torch==0.4.1.post2
-torchvision==0.2.0
-tqdm==4.56.2
-urllib3==1.26.3
+tensorflow-estimator==1.14.0
+tensorflow-gpu==1.14.0
+termcolor==1.1.0
+torch==1.1.0
+torchvision==0.3.0
+tqdm==4.60.0
+typing-extensions==3.7.4.3
+urllib3==1.26.4
 Werkzeug==1.0.1
+wrapt==1.12.1
+yarg==0.1.9
+zipp==3.4.1
diff --git a/visualize.py b/visualize.py
new file mode 100644
index 0000000..2bc6577
--- /dev/null
+++ b/visualize.py
@@ -0,0 +1,16 @@
+"""
+# Please refer to https://github.com/Paranioar/RCAR for related visualization code.
+# It now includes visualize_attention_mechanism, visualize_similarity_distribution, visualize_rank_result, etc.
+
+# I will continue to update more related visualization codes when I am free.
+# If you find these codes useful, please cite our papers and star our projects. (We do need it! HaHaHaHa.)
+# Thanks for the interest in our projects.
+"""
+
+
+
+
+
+
+
+
diff --git a/vocab.py b/vocab.py
index c0e5329..c727bb3 100644
--- a/vocab.py
+++ b/vocab.py
@@ -1,11 +1,3 @@
-# -----------------------------------------------------------
-# Stacked Cross Attention Network implementation based on
-# https://arxiv.org/abs/1803.08024.
-# "Stacked Cross Attention for Image-Text Matching"
-# Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He
-#
-# Writen by Kuang-Huei Lee, 2018
-# ---------------------------------------------------------------
 """Vocabulary wrapper"""
 import nltk
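As a usage note, the patched `__main__` block of `evaluation.py` assumes a downloaded checkpoint under `./runs/` and the SCAN-style features under `./data/`; with that layout in place, the same evaluation can be launched from an interactive session or another script:

```python
# Usage sketch only: relies on the checkpoint path and ./data layout assumed by the
# patched __main__ of evaluation.py above (Flickr30K features + an SAF checkpoint).
from evaluation import evalrank

evalrank("./runs/Flickr30K_SGRAF/f30k_SAF/model_best.pth.tar",
         data_path="./data", split="test", fold5=False)
```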