How do multi-modality LLMs perform on low-level computer vision?
Paper | Project Page | Github | Data (LLVisionQA) | Data (LLDescribe) | 质衡 (Chinese-Q-Bench)
The proposed Q-Bench includes three realms for low-level vision: perception (A1), description (A2), and assessment (A3).
- For perception (A1) and description (A2), we collect two benchmark datasets, LLVisionQA and LLDescribe.
- We are open to submission-based evaluation for the two tasks. The submission details are as follows.
- For assessment (A3), as we use public datasets, we provide abstract evaluation code that anyone can adapt to test arbitrary MLLMs.
Our latest experiment suggests that GPT-4V is roughly at entry human level on general low-level perception, marking a new era for low-level visual perception and understanding!
Here is the comparison of GPT-4V and non-expert humans on the test set of Task A1 (Perception).
Participant Name | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
---|---|---|---|---|---|---|---|---|
GPT-4V | 0.7792 | 0.7918 | 0.6268 | 0.7058 | 0.7303 | 0.7466 | 0.7795 | 0.7336 (+0.1142 to best open-source) |
human-1 | 0.8248 | 0.7939 | 0.6029 | 0.7562 | 0.7208 | 0.7637 | 0.7300 | 0.7431 (+0.0095 to GPT-4V) |
human-2-senior | 0.8431 | 0.8894 | 0.7202 | 0.7965 | 0.7947 | 0.8390 | 0.8707 | 0.8174 (+0.0838 to GPT-4V) |
Human-1 is an ordinary person with no training, while human-2-senior is a trained ordinary person who is still not an expert. GPT-4V is on par with human-1, but still has room to improve before it surpasses human-2-senior.
We sincerely hope that one day open-source models can also reach that level (or even better), and we believe that day is coming soon. Try to challenge and beat it!
We now provide two ways to download the datasets (LLVisionQA & LLDescribe):
- via GitHub Release: Please see our release for details.
- via Huggingface Datasets: Please refer to the data release notes to download the images.
It is highly recommended to convert your model into Huggingface format so that it can be tested on these data smoothly. See the example scripts for Huggingface's IDEFICS-9B-Instruct, and modify them to test your custom model.
Please email [email protected] to submit your result in JSON format.
You can also submit your model (either a Huggingface AutoModel or a ModelScope AutoModel) to us, alongside your custom evaluation scripts. Your custom scripts can be modified from the template scripts that work for LLaVA-v1.5 (for A1/A2), and here (for image quality assessment).
Please email [email protected] to submit your model if you are outside mainland China.
Please email [email protected] to submit your model if you are inside mainland China.
A snapshot of the LLVisionQA benchmark dataset for MLLM low-level perception ability is as follows. See the leaderboard here.
We measure the answer accuracy of MLLMs (provided with the question and all choices) as the metric here.
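The accuracy metric above can be sketched in a few lines. This is only an illustration: the question ids, the dict layout, and the case-insensitive string matching are assumptions for the example, not the benchmark's actual result format or matching rule.

```python
def answer_accuracy(predictions, ground_truth):
    """Fraction of questions whose predicted choice matches the correct one.

    Both arguments are dicts mapping a (hypothetical) question id to an
    answer string. Matching ignores case and surrounding whitespace, and a
    question with no prediction counts as wrong.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = 0
    for question_id, answer in ground_truth.items():
        predicted = predictions.get(question_id)
        if predicted is not None and predicted.strip().lower() == answer.strip().lower():
            correct += 1
    return correct / len(ground_truth)


# Hypothetical toy data: four questions, one wrong answer, one missing answer.
ground_truth = {"q1": "Yes", "q2": "Blur", "q3": "Low", "q4": "No"}
predictions = {"q1": "yes", "q2": "Noise", "q3": "low "}
accuracy = answer_accuracy(predictions, ground_truth)
print(accuracy)
```

Counting unanswered questions as wrong keeps the denominator fixed at the full question set, so scores from different models stay comparable.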
A snapshot of the LLDescribe benchmark dataset for MLLM low-level description ability is as follows. See the leaderboard here.
We measure the completeness, precision, and relevance of MLLM descriptions as the metric here.
An exciting ability: MLLMs are able to predict quantitative scores for IQA!
Similarly to the above, as long as a model (based on causal language models) has the following two methods: embed_image_and_text (to allow multi-modality inputs) and forward (for computing logits), Image Quality Assessment (IQA) with the model can be achieved as follows:
from PIL import Image
# my_mllm_model is a placeholder for your own MLLM package
from my_mllm_model import Model, Tokenizer, embed_image_and_text
model, tokenizer = Model(), Tokenizer()
prompt = "##User: Rate the quality of the image.\n" \
         "##Assistant: The quality of the image is" ### This line can be modified based on MLLM's default behaviour.
# Token ids of the two candidate answer words
good_idx, poor_idx = tokenizer(["good","poor"]).tolist()
image = Image.open("image_for_iqa.jpg")
input_embeds = embed_image_and_text(image, prompt)
# Logits for the next token after the prompt
output_logits = model(input_embeds=input_embeds).logits[0,-1]
# Softmax over the "good" and "poor" logits; the probability of "good" is the predicted quality score
q_pred = (output_logits[[good_idx, poor_idx]] / 100).softmax(0)[0]
*Note that you can modify the second line of the prompt based on your model's default format, e.g. for Shikra, "##Assistant: The quality of the image is" is changed to "##Assistant: The answer is". It is okay if your MLLM first answers "Ok, I would like to help! The image quality is"; just use that as the second line of the prompt.
We further provide a full implementation of IDEFICS on IQA. See the example for how to run IQA with this MLLM. Other MLLMs can be modified in the same way for use in IQA.
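To make the final scoring step concrete, here is a minimal, model-free sketch of the two-way softmax. It assumes the two logits have already been extracted as plain floats; the function name and the example logit values are made up for illustration, and the temperature of 100 mirrors the division by 100 in the snippet.

```python
import math


def quality_score(good_logit, poor_logit, temperature=100.0):
    """Softmax probability of "good" over the pair ("good", "poor")."""
    scaled_good = good_logit / temperature
    scaled_poor = poor_logit / temperature
    # Subtract the larger value before exponentiating for numerical stability.
    shift = max(scaled_good, scaled_poor)
    exp_good = math.exp(scaled_good - shift)
    exp_poor = math.exp(scaled_poor - shift)
    return exp_good / (exp_good + exp_poor)


# Hypothetical logits: equal logits give a neutral score, and a larger
# "good" logit pushes the score above 0.5.
equal = quality_score(5.0, 5.0)
better = quality_score(12.0, 3.0)
worse = quality_score(3.0, 12.0)
print(equal, better, worse)
```

With only two candidates, this softmax is the logistic function of the scaled logit difference, so the score always lies strictly between 0 and 1 and the two orderings are symmetric around 0.5.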
We have prepared the human opinion scores (MOS) in JSON format for the seven IQA databases evaluated in our benchmark.
Please see IQA_databases for details.
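Once predicted scores and MOS values are loaded, a common way to compare them in IQA is with rank and linear correlation. The sketch below implements Spearman (SRCC) and Pearson (PLCC) correlation in plain Python. It is an assumption that these are the quantities of interest here; this section does not state the evaluation metric. The numbers are invented toy values, not benchmark data.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy values: predictions order the images exactly as the MOS does.
predicted = [0.52, 0.48, 0.55, 0.50]
mos = [3.1, 1.9, 4.5, 2.7]
srcc = spearman(predicted, mos)
plcc = pearson(predicted, mos)
print(srcc, plcc)
```

Rank correlation is useful here because the softmax-based scores live on a different scale from MOS; it only asks whether the model orders images the same way human raters do.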
Moved to leaderboards. Please click to see details.
Please contact any of the first authors of this paper for queries.
- Haoning Wu, [email protected], @teowu
- Zicheng Zhang, [email protected], @zzc-1998
- Erli Zhang, [email protected], @ZhangErliCarl
If you find our work interesting, please feel free to cite our paper:
@article{wu2023qbench,
title={Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision},
author={Wu, Haoning and Zhang, Zicheng and Zhang, Erli and Chen, Chaofeng and Liao, Liang and Wang, Annan and Li, Chunyi and Sun, Wenxiu and Yan, Qiong and Zhai, Guangtao and Lin, Weisi},
year={2023},
}