This is the official implementation of the CVPR 2024 paper Prompt Highlighter: Interactive Control for Multi-Modal LLMs.
Control text generation by highlighting your prompt! Prompt Highlighter is a training-free inference pipeline that facilitates token-level user interactions for customized generation. Our method is compatible with both LLMs and VLMs.
- 20231130: LLaMA attention modification & LLaVA descriptive task inference.
- 20231130: Test data & mask upload.
- 20231201: LLaVA highlighter benchmark test inference (MMBench & MME).
- 20231201: LLaVA partial highlight inference.
- 20231202: Vicuna (LLM) partial highlight inference.
- 20231202: InstructBLIP partial highlight inference.
- 20231204: Current code release!
- TBD: InternLM-VLComposer benchmark test inference.
Basic environment setup:
conda create -n highlighter python=3.10 -y
conda activate highlighter
pip install -r requirements.txt
Install the latest LLaVA (as of 2023-11-30) in base_models. If you already have LLaVA installed, you can use the existing installation in your own environment.
# You may also use an existing LLaVA installation instead of cloning it again.
cd base_models
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install --upgrade pip # enable PEP 660 support
pip install -e .
Model Download: Please refer to LLaVAv1.5 Model Zoo to get the base pretrained model.
Partial Highlighting task: We provide examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.
python examples/llava_test.py
Descriptive task (highlighting all input contexts): We provide examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.
python examples/llava_descriptions.py
We will also provide a script for descriptive COCO caption generation (TODO here).
If you want to add your own data, please provide a square image in which the highlighted area is marked by a darker region (uint8 value < 128), and add your case to the JSON file.
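For reference, here is a minimal Python sketch (not part of the official tooling) that builds such a mask image with NumPy and Pillow; the file name and region coordinates are placeholders.

```python
# Minimal sketch: create a square mask image where the highlighted region is dark
# (uint8 value < 128) and the rest stays light. Name and coordinates are placeholders.
import numpy as np
from PIL import Image

size = 336                                          # any square resolution works
mask = np.full((size, size), 255, dtype=np.uint8)   # light background = not highlighted
mask[100:220, 80:260] = 0                           # dark rectangle (< 128) marks the highlight

Image.fromarray(mask).save("assets/test_data/my_highlight_mask.png")
```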
Benchmark Test: Please refer to the evaluation data to get the benchmark datasets (MMBench & MME). Benchmark results:
| Method | MME (perception) | MMBench (dev) | MMBench (test) |
|---|---|---|---|
| Baseline (LLaVA-v1.5-13B) | 1531.3 | 67.7 | 67.0 |
| Ours (officially reported) | 1552.5 | 69.7 | 69.5 |
| Ours (this repo) | 1552.5 | 70.1 | 70.7 |
For MMBench, you may adjust the hyper-parameters in the following scripts and run:
bash examples/eval_scripts/mmbench_dev_hl.sh
bash examples/eval_scripts/mmbench_test_hl.sh
For MME:
bash examples/eval_scripts/mme_hl.sh
You can find the evaluation metrics at base_models/LLaVA/playground/data/eval/MME/eval_tool/answers/llava-v1.5-13b-hl-1.3-2.0-0.01/eval.log
We provide a script to test partial highlighting on pure language input. Download the Vicuna model; we use Vicuna-13B-v1.1. You may switch to any LLaMA-based LLM, in which case you will also need to change the conversation prompt template. Please follow the instructions above to install LLaVA in base_models. If you have already installed LLaVA, you can test directly with the script:
python examples/llama_test.py \
--txt "Please write a summary of A Mid-Summer Nights' Dream, make it compact." \
--hl "make it compact."
Here you can change the input prompt and the highlighted segments by passing --txt and --hl, respectively. To pass multiple highlighted segments, separate them with <s>. For example, you can pass --hl "write a summary<s>make it compact." to highlight multiple requirements.
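As a rough illustration (an assumption about the parsing, not a quote of examples/llama_test.py), splitting the --hl string on the <s> separator yields the individual highlighted segments:

```python
# Sketch of how a multi-segment --hl argument can be split into individual
# highlighted spans; the actual parsing lives in examples/llama_test.py.
hl_arg = "write a summary<s>make it compact."
segments = [seg for seg in hl_arg.split("<s>") if seg]
print(segments)  # ['write a summary', 'make it compact.']
```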
Install the latest LAVIS (as of 2023-11-30) in base_models. If you already have LAVIS installed, you can use the existing installation in your own environment.
To run InstructBLIP-Vicuna, you need to set the key llm_model in the configuration file base_models/LAVIS/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml to the LLM path (Vicuna-13B v1.1).
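As a quick sanity check, here is a small sketch (assuming PyYAML is available and that llm_model sits under the top-level model section, as in standard LAVIS configs) to verify the path is set:

```python
# Sketch: confirm that the Vicuna-13B v1.1 path has been filled in for `llm_model`.
import yaml

cfg_path = "base_models/LAVIS/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

print("llm_model ->", cfg["model"]["llm_model"])  # should point to your local Vicuna weights
```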
# Please install with your highlighter env activated.
cd base_models
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
Partial Highlighting task: Run the examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.
Note: Here, we only implement the highlighting mechanism in the Q-Former. We may release a hybrid highlighting (visual & text token) version in the future.
python examples/instructblip_test.py
TBD.
An abstract pipeline of Prompt Highlighter. Users can control the focus of generation by marking out specific image regions or text spans; a token-level mask is then applied to guide the generation toward the highlighted content.
If you find this repo useful for your research, please consider citing the paper:
@inproceedings{zhang2024prompt,
title={Prompt Highlighter: Interactive Control for Multi-Modal LLMs},
author={Zhang, Yuechen and Qian, Shengju and Peng, Bohao and Liu, Shu and Jia, Jiaya},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13215--13224},
year={2024}
}
We would like to thank the following repos for their great work:
- This work utilizes multi-modal LLMs with base models in LLaVA, Vicuna, InstructBLIP, and InternLM-VLComposer.
- This work utilizes the logit processor referenced in CFG-LLM.
- Part of the logo at the top of this page is generated with Bing Image Creator.