Yuchi Wang* Shuhuai Ren* Rundong Gao Linli Yao Qingyan Guo
Kaikai An Jianhong Bai Xu Sun †
This is the repo for the official implementation of the NAACL 2024 paper: LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-text Generation?
Diffusion models have demonstrated remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation has lagged behind Auto-Regressive (AR) models, raising doubts about their applicability for such tasks. In this study, we revisit diffusion models, emphasizing their unique advantages compared to AR methods. We meticulously design a novel latent diffusion-based architecture, LaDiC, to further amplify the previously untapped potential of diffusion models in image-to-text generation.
An overview of our LaDiC model. It mainly consists of the Image Encoder, Text Encoder, Diffuser, and Text Decoder. The diffusion process is depicted on the left, while the denoising process is depicted on the right. Initially, the caption
Comparison results on the COCO dataset. Our model achieves state-of-the-art performance across various metrics among both diffusion-based and traditional NAR models, and performs comparably to some well-established pretrained auto-regressive frameworks, despite being trained on significantly less data.
Beyond its strong performance, we also observe that our model has the following advantages over AR methods:
- Parallel Token Emission: A diffusion-based model emits all tokens in parallel, which reduces inference latency compared to autoregressive models, particularly as the caption length increases.
- Holistic Context Consideration: A diffusion model takes a more comprehensive context into account, which helps alleviate the error accumulation inherent in autoregressive models.
- Flexible Generation Approach: In contrast to the unidirectional generation of AR models, a diffusion model generates text in a more flexible manner.
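The latency argument in the first point can be illustrated with a rough count of sequential decoder passes. This is only a sketch: the step count of 5 is a hypothetical placeholder rather than a setting taken from this repo, and the count ignores any difference in cost per pass.

```python
def ar_sequential_passes(caption_len: int) -> int:
    """An autoregressive decoder emits one token per pass, so passes grow with length."""
    return caption_len


def diffusion_sequential_passes(caption_len: int, num_steps: int) -> int:
    """A diffusion decoder updates all positions in each pass, so passes depend only on the step count."""
    return num_steps


if __name__ == "__main__":
    NUM_STEPS = 5  # hypothetical number of denoising steps, for illustration only
    for length in (8, 24, 64):
        print(length, ar_sequential_passes(length), diffusion_sequential_passes(length, NUM_STEPS))
```

Under these assumptions the autoregressive count scales with caption length while the diffusion count stays fixed, which is why the gap widens for longer captions.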
Required packages and dependencies are listed in the ladic.yaml file. You can install the environment using Conda with the following command:
conda env create -f ladic.yaml
conda activate ladic
pip install git+https://github.com/openai/CLIP.git
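After installation, a quick check that the key modules can be found may save a failed launch later. The sketch below uses only the standard library. The module list is an assumption: clip is the module installed by the CLIP repository above, while torch, transformers and accelerate are guesses at what ladic.yaml provides.

```python
import importlib.util

# Assumed module names; adjust to match the actual contents of ladic.yaml.
EXPECTED_MODULES = ["torch", "transformers", "accelerate", "clip"]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```

Run it inside the activated ladic environment; it only checks that the modules can be located and does not import them or verify versions.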
We also provide a Docker image:
docker pull wangyuchi/diffcap:python3.8
We use the Accelerate package developed by Hugging Face.
Configure Accelerate by running the following command:
accelerate config
Answer the questions based on your actual setup. You will be prompted to specify the GPU to use, and other configurations can be left as default. For more information, refer to this link.
We test on the COCO dataset. You can download the MSCOCO dataset and place it in the datasets folder.
We follow the Karpathy split, whose annotation files can be found in its original paper. Our code will also download these files automatically, and you can find them in the datasets/ folder.
In our LaDiC model, the Text Encoder and Decoder are initialized from BERT-base-uncased, which can be downloaded from Hugging Face.
For the image encoder, we use the pretrained ViT from BLIP. You may download it from here and put it into the pretrained_ckpt folder. More information can be found in BLIP's official repo.
We provide a version of our pre-trained weights here.
Launch the main.py script using Accelerate with the following command:
accelerate launch main.py [--args]
We list some important optional parameters below. The notes parameter serves both as a note placed at the start of the filename and as the run name for wandb. More hyperparameters and their descriptions can be found in configs/.
parser.add_argument('--notes', type=str, default=None, help='Note to be included in the trial name')
parser.add_argument('--bsz', type=int, default=64, help='batch size')
parser.add_argument('--seqlen', type=int, default=24, help='sequence length')
parser.add_argument('--epoch', type=int, default=60, help='epoch num')
parser.add_argument('--resume_epoch', type=int, default=0, help='start epoch of resume')
parser.add_argument('--resume_ckpt', type=str, default=None, help='resume or not')
parser.add_argument('--logdir', type=str, default='checkpoint', help='logdir')
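To see how these options and their defaults behave, the following self-contained sketch rebuilds just the arguments listed above and parses an example argument list. How main.py actually constructs its parser is not shown here, so build_parser is a hypothetical helper and the value demo_run is made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper that mirrors only the options listed above."""
    parser = argparse.ArgumentParser(description="Sketch of the listed training options")
    parser.add_argument('--notes', type=str, default=None, help='Note to be included in the trial name')
    parser.add_argument('--bsz', type=int, default=64, help='batch size')
    parser.add_argument('--seqlen', type=int, default=24, help='sequence length')
    parser.add_argument('--epoch', type=int, default=60, help='epoch num')
    parser.add_argument('--resume_epoch', type=int, default=0, help='start epoch of resume')
    parser.add_argument('--resume_ckpt', type=str, default=None, help='resume or not')
    parser.add_argument('--logdir', type=str, default='checkpoint', help='logdir')
    return parser


if __name__ == "__main__":
    # Override two options and leave the rest at their defaults.
    args = build_parser().parse_args(["--notes", "demo_run", "--bsz", "32"])
    print(args.notes, args.bsz, args.seqlen, args.epoch, args.logdir)
```

With the real script, the corresponding invocation would take the form accelerate launch main.py --notes demo_run --bsz 32, with the unspecified options falling back to the defaults shown.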
Specify MODEL_NAME and RESULLT_FILE in coco_eval.py, representing the checkpoint to be evaluated and the output path respectively. Then run:
python coco_eval.py
- Provide pretrained checkpoint.
- Provide training and testing code.
- Paper released on arXiv.
If you find our project helpful to your research, please consider citing our paper:
@misc{wang2024ladic,
title={LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?},
author={Yuchi Wang and Shuhuai Ren and Rundong Gao and Linli Yao and Qingyan Guo and Kaikai An and Jianhong Bai and Xu Sun},
year={2024},
eprint={2404.10763},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
For any issues or further discussions, feel free to contact [email protected]
Our code is heavily based on projects such as diffusion-image-captioning, BLIP and Hugging Face transformers. Thanks for their splendid work!