Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback (NeurIPS 2023)
The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, as the complexity of the given text input increases, state-of-the-art diffusion models may still fail to generate images which accurately convey the semantics of the given prompt. Furthermore, it has been observed that such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which, given a complex prompt, decomposes it into a set of disjoint assertions. The alignment of each assertion with the generated image is then measured using a VQA model. Finally, alignment scores for the different assertions are combined a posteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric shows significantly higher correlation with human ratings than traditional CLIP and BLIP scores. Furthermore, we find that the assertion-level alignment scores provide useful feedback which can then be used in a simple iterative procedure to gradually increase the expression of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses the previous state of the art by 8.7% in overall text-to-image alignment accuracy.
We propose a training-free decompositional framework which helps both better evaluate (top) and gradually improve (bottom) text-to-image alignment using iterative VQA feedback.
Official Implementation for our NeurIPS-2023 paper on Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback.
- (09/11/23) Code for both evaluation and improvement of T2I generation is now available as a diffusers pipeline.
- Getting Started
- QuickStart Demo
- DA-Score: Evaluating Text-to-Image Alignment
- Divide, Evaluate, and Refine: Improving Text-to-Image Alignment
- Citation
- Linux or macOS
- NVIDIA GPU + CUDA CuDNN (CPU may be possible with some modifications, but is not inherently supported)
- Python 3
- Tested on Ubuntu 20.04 with an NVIDIA RTX 3090 and CUDA 11.5 (it will likely run on other setups without modification)
- Dependencies:
To set up the environment, please run:
conda env create -f environment/environment.yml
conda activate dascore
We provide a quickstart demo notebook `demo.ipynb`
to help you get started with Divide, Evaluate, and Refine. The notebook walks through a step-by-step analysis, including:
- Using the DA-Score to evaluate text-to-image alignment.
- Using Eval&Refine to gradually improve the quality of generated image outputs.
- Analysing three main ways in which Eval&Refine improves over Attend&Excite.
Evaluation of T2I alignment can be done in just a few lines of code:
import openai
from PIL import Image
from t2i_eval.utils import generate_questions, VQAModel, compute_dascores
openai.api_key = "[Your OpenAI Key]"
# create the VQA model used to answer assertion-level questions
vqa_model = VQAModel()
# prompt to evaluate, and the (previously generated) image to evaluate it against
prompt = 'a penguin wearing a bowtie on a surfboard in a swimming pool'
image = Image.open("path/to/generated_image.png")
# generate a set of disjoint questions from the input prompt using an LLM
questions, parsed_input = generate_questions(prompt)
# compute DA-Score from the generated questions
da_score, assertion_alignment_scores = compute_dascores([image], questions, vqa_model)
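The returned scores can be inspected directly to see which assertions are well expressed in the image. A small illustrative follow-up (the exact return shapes are an assumption here, since a single image is passed in; adjust the indexing if the utilities return per-image lists):
# print the overall DA-Score together with each assertion-level question and its score
# NOTE: the return shapes below are an assumption, not part of the documented API
print("DA-Score:", da_score)
for question, score in zip(questions, assertion_alignment_scores):
    print(f"  {question} -> {score}")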
Assertion/Question-level alignment scores also provide useful feedback on which parts of the input prompt are not being expressed in the final image generation. Eval&Refine uses this knowledge to drive a simple yet effective iterative refinement process which gradually improves the quality of the final images.
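To make the idea concrete, below is a minimal sketch of one possible refinement loop built on the evaluation utilities above. It is illustrative only, not the pipeline's internal implementation: `generate_with_weights` (a callback that renders an image with per-assertion emphasis) and the uniform `prompt_weights` initialization are hypothetical placeholders, and the threshold values simply mirror the defaults mentioned in the notes further below.
import numpy as np
from t2i_eval.utils import compute_dascores

def refine_sketch(parsed_input, questions, vqa_model, generate_with_weights,
                  max_update_steps=5, dascore_threshold=0.85, assertion_threshold=0.75):
    # start with uniform emphasis on every assertion (hypothetical prompt weights)
    prompt_weights = np.ones(len(questions))
    best_image, best_score = None, -1.0
    for _ in range(max_update_steps):
        # render an image with the current per-assertion emphasis (hypothetical helper)
        image = generate_with_weights(parsed_input, prompt_weights)
        # evaluate: overall DA-Score plus per-assertion alignment scores
        da_score, assertion_scores = compute_dascores([image], questions, vqa_model)
        da_score = float(np.mean(da_score))  # collapse to a scalar in case a per-image list is returned
        assertion_scores = np.asarray(assertion_scores, dtype=float).ravel()
        if da_score > best_score:
            best_image, best_score = image, da_score
        # stop once the output is considered good enough
        if da_score >= dascore_threshold or assertion_scores.min() >= assertion_threshold:
            break
        # otherwise, increase the emphasis on the least-expressed assertion and regenerate
        prompt_weights[int(assertion_scores.argmin())] += 0.1
    return best_image, best_score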
Eval&Refine is available as a convenient diffusers pipeline.
- Quick Usage:
import torch
import openai
from t2i_improve.pipeline_evaluate_and_refine import StableDiffusionEvalAndRefinePipeline
from t2i_eval.utils import generate_questions, VQAModel, compute_dascores
# set the OpenAI API key (used by the LLM for prompt decomposition)
openai.api_key = "[Your OpenAI Key]"
# define and load the pipeline
pipe = StableDiffusionEvalAndRefinePipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
# create vqa_model for DA-Score
vqa_model = VQAModel()
# define the prompt and generate parsed_input using an LLM
prompt = 'a penguin wearing a bowtie on a surfboard in a swimming pool'
questions, parsed_input = generate_questions(prompt)
# perform inference using iterative refinement
outputs = pipe.eval_and_refine(parsed_input, vqa_model, seed_list=[77,402], max_update_steps=5, verbose=False)
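The same pipeline and VQA model can be reused directly for other prompts; a brief follow-up example (the prompt below is just an illustration and the seed values are arbitrary):
# evaluate-and-refine a second, hypothetical prompt with the same pipeline
prompt = 'a red car parked next to a blue bicycle under a tree'
questions, parsed_input = generate_questions(prompt)
outputs = pipe.eval_and_refine(parsed_input, vqa_model, seed_list=[0, 1], max_update_steps=5, verbose=False)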
Notes:

- We provide two mechanisms for iterative refinement with Eval&Refine:
  - Prompt-Weighting (PW)
  - Cross-Attention modulation (CA)

  The use of the above can be controlled through the `use_pw` and `use_ca` keywords while calling the pipeline. For example, to use Prompt-Weighting (PW) but not Cross-Attention modulation (CA), please use: `outputs = pipe.eval_and_refine(parsed_input, vqa_model, use_pw=True, use_ca=False, max_update_steps=5)`
- The maximum number of iterative-refinement steps is controlled by the `max_update_steps` parameter. More refinement steps can potentially yield better image outputs. Eval&Refine can adaptively adjust the actual number of refinement steps by monitoring the DA-Score.
- The threshold for what counts as a "good enough output" is controlled by `dascore_threshold=0.85` and `assertion_alignment_threshold=0.75`. The iterative refinement process considers the final output image good enough if the overall DA-Score is greater than `dascore_threshold`, or if the individual alignment scores for all assertions are greater than `assertion_alignment_threshold` (a combined example is shown after this list).
- Reducing the above thresholds can lead to faster convergence at the cost of poorer T2I alignment, and vice versa.
- Finally, we can visualize how the iterative refinement process gradually improves the generated image outputs by setting `verbose=True` while calling the pipeline: `outputs = pipe.eval_and_refine(parsed_input, vqa_model, max_update_steps=5, verbose=True)`
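As a combined illustration of the parameters discussed above (parameter names and default values are taken from the notes; the exact call is a sketch):
outputs = pipe.eval_and_refine(
    parsed_input, vqa_model,
    use_pw=True, use_ca=True,            # enable both refinement mechanisms
    max_update_steps=5,                  # upper bound on the number of refinement iterations
    dascore_threshold=0.85,              # stop once the overall DA-Score exceeds this...
    assertion_alignment_threshold=0.75,  # ...or once every assertion-level score exceeds this
    verbose=True,                        # visualize intermediate refinement outputs
)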
If you find our work useful in your research, please consider citing:
@inproceedings{singh2023divide,
title={Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback},
author={Singh, Jaskirat and Zheng, Liang},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}