What do we expect from LMMs as AIGI evaluators and how do they perform?
T2I models aim to create images that accurately align with the text and showcase high perceptual quality. Therefore, the proposed A-Bench includes two parts to diagnose whether LMMs are masters at evaluating AIGIs: 1) Semantic Understanding, 2) Quality Perception.
- [2024/9/26]🔥 Update the performance of GPT & Gemini with the latest version on A-Bench.
- [2024/8/1]🔥 The A-Bench is released on VLMEvalKit, come and test your LMM with one command.
- [2024/6/17]🔥 The A-Bench has now joined lmms-eval, which makes it easier to test LMM !!
- [2024/6/5] 🔥 We are releasing the A-Bench data and meta information at Huggingface.
- [2024/6/3] 🔥 Github repo for A-Bench is online. Do you want to find out if your LMM is a master at evaluating AI-generated images? Come and test on A-Bench !!
Two key diagnostic subsets are defined: A-Bench-P1 → high-level semantic understanding, and A-Bench-P2 → low-level quality perception. For high-level semantic understanding, A-Bench-P1 targets three critical areas: Basic Recognition, Bag-of-Words Pitfalls Discrimination, and Outside Knowledge Realization, which are designed to progressively test the LMM’s capability in AIGI semantic understanding, moving from simple to complex prompt-related content. For low-level quality perception, A-Bench-P2 concentrates on Technical Quality Perception, Aesthetic Quality Evaluation, and Generative Distortion Assessment, which are designed to cover the common quality issues and AIGI-specific quality problems.
Specifically, a comprehensive dataset of 2,864 AIGIs sourced from various T2I models is compiled, including 1,408 AIGIs for A-Bench-P1 and 1,456 for A-Bench-P2. Each AIGI is paired with a question-answer set annotated by human experts. We are open to submission-based evaluation for A-Bench. The details for submission are in the Evaluate your model on A-Bench Section.
For open-source models, LLaVA-NeXT (Qwen-110B) takes the first place. For closed-source models, GEMINI 1.5 PRO takes the first place.
A Quick Look of the A-Bench Outcomes.
Participant Name | Major↑ | Minor↑ | Attr.↑ | N. Adj.↑ | Comp.↑ | Number↑ | Term↑ | Contra.↑ | Technical↑ | Aesthetic↑ | Generative↑ |
---|---|---|---|---|---|---|---|---|---|---|---|
Gemini 1.5 Pro | 93.82% | 95.18% | 94.35% | 80.27% | 72.14% | 79.35% | 72.88% | 61.56% | 84.70% | 71.22% | 77.61% |
GPT-4v | 92.95% | 96.00% | 87.40% | 82.67% | 64.39% | 68.84% | 77.60% | 66.73% | 83.60% | 67.82% | 68.34% |
GPT-4o | 94.34% | 95.14% | 91.99% | 79.54% | 76.40% | 73.30% | 77.47% | 68.59% | 85.44% | 70.59% | 61.61% |
Qwen-VL-Max | 92.56% | 94.75% | 91.99% | 85.78% | 68.94% | 75.85% | 78.94% | 65.05% | 84.47% | 71.31% | 69.77% |
Human (Worst) | 95.18% | 94.24% | 96.78% | 88.70% | 85.49% | 82.46% | 81.76% | 88.91% | 92.40% | 94.32% | 84.49% |
Human (Best) | 95.40% | 95.21% | 99.42% | 95.17% | 93.34% | 91.73% | 84.29% | 96.05% | 94.02% | 94.69% | 86.01% |
We release the performance of top-tier closed-source LMMs against humans. Two conclusions can be obtained:
-
LMMs excel at basic recognition tasks but tend to be less effective when it comes to nuanced semantic understanding.
-
LMMs are poor quality evaluators.
Use LMMs-eval to automatically evaluate A-Bench:
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
pip install -e .
export NUM_GPUS=8
export MODEL_NAME=idefics2
python3 -m accelerate.commands.launch --num_processes=$NUM_GPUS -m lmms_eval --model $MODEL_NAME --tasks abench_dev --batch_size 1 --log_samples --log_samples_suffix $MODEL_NAME_a_bench --output_path ./logs/
Use VLMEvalKit to automatically evaluate A-Bench:
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
Example, quick test InternVL2-1B on the val and test sets of A-Bench:
python run.py --data A-Bench_VAL A-Bench_TEST --model InternVL2-1B --verbose
The val set has the correct answers and you can directly get the acc results. For test set performance, please submit the results to e-mail
TO evaluate on your custom model, you can use our converted dataset in huggingface datasets
format:
pip install datasets
from datasets import load_dataset
ds = load_dataset("q-future/A-Bench-HF")
ds["dev"][0]
Outputs should be as follows:
{'id': 0,
'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x288>,
'question': 'May I ask where the scene in the picture is located?',
'option0': 'Forest',
'option1': 'Riverside',
'option2': 'Desert',
'option3': 'N/A',
'category': 'part1 -> bag_of_words -> attribute',
'correct_choice': 'B'}
Which can be then evaluated with your own model's format for MCQ.
For example, if your model follows llava's format, it should be as follows:
di = ds["dev"][0]
prompt = di["question"] + "\n"
for i in range(4):
if di[f"option{i}"] != "N/A":
prompt += chr(ord("A")+i) + ". " + di[f"option{i}"] + "\n"
prompt = prompt + "Answer with the option's letter from the given choices directly."
print(prompt)
The prompt for the previous data item should be May I ask where the scene in the picture is located? A. Forest B. Riverside C. Desert Answer with the option's letter from the given choices directly.
First download the dataset and meta information from Huggingface.
The imgs.zip contains all the AI-generated images and Abench.json contains all the meta information including the img_path, questions, answers, and categories. The item of Abench.json is structured like:
"img_path": "part1_0000.png",
"question": "What is the color of the windows in the house in the picture?",
"answers": [
"white",
"yellow",
"blue"
],
"category": "part1 -> basic_recognition -> major"
The "img_path" indicates the path to the image in imgs.zip, the "question" is a string, the "answers" is a list of answer candidates (several false answers and the correct answer).
The correct answers are kept confidential to ensure A-Bench retains its long-term value as a benchmark for assessing AIGI evaluation capabilities.
To test with your LMM, we suggest using the following prompt:
import json
with open("Abench.json", "r") as f:
f = f.read()
data = json.loads(f)
for item in data:
image_file = 'path-to-imgs' + item["img_path"]
message = item["question"] + "\n"
for choice, ans in zip(["A.", "B.", "C.", "D."], item["answers"]):
message += f"{choice} {ans}\n"
message = message + "Answer with the option's letter from the given choices directly."
print(message)
# What is the color of the windows in the house in the picture?
# A.white
# B.yellow
# C.blue
# Answer with the option's letter from the given choices directly.
# do your test here
# response = LMM(image_file,message)
item['response'] = response
with open("results.jsonl", "a") as wf:
json.dump(item, wf)
wf.write("\n")
After finishing validation, you can submit the results via e-mail to get your LMM results on A-Bench !
Please contact any of the first authors of this paper for queries.
- Zicheng Zhang,
[email protected]
, @zzc-1998 - Haoning Wu,
[email protected]
, @teowu
If you find our work interesting, please feel free to cite our paper:
@misc{zhang2024abench,
title={A-Bench: Are LMMs Masters at Evaluating AI-generated Images?},
author={Zicheng Zhang and Haoning Wu and Chunyi Li and Yingjie Zhou and Wei Sun and Xiongkuo Min and Zijian Chen and Xiaohong Liu and Weisi Lin and Guangtao Zhai},
year={2024},
eprint={2406.03070},
archivePrefix={arXiv},
primaryClass={cs.CV}
}