360VL is built on the Llama 3 language model and is the industry's first open-source large multimodal model based on Llama3-70B [🤗Meta-Llama-3-70B-Instruct]. In addition to using the Llama 3 language model, 360VL introduces a globally aware multi-branch projector, which gives the model stronger image-understanding capabilities.
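The README does not spell out the projector's internals, so the NumPy sketch below is only an illustration of the general idea, with all names and shapes being assumptions: one branch projects each patch token locally, while a second branch pools a global summary of the whole image and mixes it back into every token, making each visual token "globally aware".

```python
import numpy as np

def multi_branch_projector(patch_feats, w_local, w_global):
    """Hypothetical sketch of a globally aware multi-branch projector:
    project patch features while mixing in a globally pooled summary so
    every visual token carries image-wide context."""
    # Local branch: linear projection of each patch token
    local = patch_feats @ w_local                      # (N, d_out)
    # Global branch: mean-pool over all patches, then project
    global_ctx = patch_feats.mean(axis=0) @ w_global   # (d_out,)
    # Fuse: broadcast the global summary onto every local token
    return local + global_ctx                          # (N, d_out)

rng = np.random.default_rng(0)
feats = rng.standard_normal((576, 1024))   # e.g. 24x24 ViT patch tokens
w_l = rng.standard_normal((1024, 4096))    # assumed LLM hidden size
w_g = rng.standard_normal((1024, 4096))
tokens = multi_branch_projector(feats, w_l, w_g)
print(tokens.shape)  # (576, 4096)
```

The real projector is trained end-to-end and is certainly more elaborate; this only conveys why a global branch helps each token see beyond its own patch.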
- Clone this repository and navigate to the 360VL folder

```shell
git clone https://github.com/360CVGroup/360VL.git
cd 360VL
```

- Install the package

```shell
conda create -n qh360_vl python=3.10 -y
conda activate qh360_vl
bash deploy.sh
```
Model | Checkpoints | MMB (test) | MMB (dev) | MMB-CN (test) | MMB-CN (dev) | MMMU (val) | MMMU (test) | MME |
---|---|---|---|---|---|---|---|---|
QWen-VL-Chat | 🤗LINK | 61.8 | 60.6 | 56.3 | 56.7 | 37 | 32.9 | 1860 |
mPLUG-Owl2 | 🤖LINK | 66.0 | 66.5 | 60.3 | 59.5 | 34.7 | 32.1 | 1786.4 |
CogVLM | 🤗LINK | 65.8 | 63.7 | 55.9 | 53.8 | 37.3 | 30.1 | 1736.6 |
Monkey-Chat | 🤗LINK | 72.4 | 71 | 67.5 | 65.8 | 40.7 | - | 1887.4 |
MM1-7B-Chat | LINK | - | 72.3 | - | - | 37.0 | 35.6 | 1858.2 |
IDEFICS2-8B | 🤗LINK | 75.7 | 75.3 | 68.6 | 67.3 | 43.0 | 37.7 | 1847.6 |
SVIT-v1.5-13B | 🤗LINK | 69.1 | - | 63.1 | - | 38.0 | 33.3 | 1889 |
LLaVA-v1.5-13B | 🤗LINK | 69.2 | 69.2 | 65 | 63.6 | 36.4 | 33.6 | 1826.7 |
LLaVA-v1.6-13B | 🤗LINK | 70 | 70.7 | 68.5 | 64.3 | 36.2 | - | 1901 |
Honeybee | LINK | 73.6 | 74.3 | - | - | 36.2 | - | 1976.5 |
YI-VL-34B | 🤗LINK | 72.4 | 71.1 | 70.7 | 71.4 | 45.1 | 41.6 | 2050.2 |
360VL-8B | 🤗LINK | 75.3 | 73.7 | 71.1 | 68.6 | 39.7 | 37.1 | 1944.6 |
360VL-70B | 🤗LINK | 78.1 | 80.4 | 76.9 | 77.7 | 50.8 | 44.3 | 2012.3 |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/360VL-70B"

# Load the model and tokenizer (trust_remote_code is required for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision tower and move it to the GPU
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token

image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"

# Llama 3 uses <|eot_id|> to mark the end of a turn
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)
input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens, skipping the prompt
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
print(outputs.strip())
```
To run our demo, you need to download the weights of 360VL [🤗LINK] and the weights of CLIP-ViT-336 [🤗LINK].
To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server ONCE.
```shell
python -m qh360_vl.serve.controller --host 0.0.0.0 --port 10000
python -m qh360_vl.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
```
You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.
This is the actual worker that performs the inference on the GPU. Each worker is responsible for a single model specified in `--model-path`.
Note that the 8B model supports single-card inference, but the 70B model requires 8-card inference.
```shell
# 8B model (single GPU)
CUDA_VISIBLE_DEVICES=0 python -m qh360_vl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path qihoo360/360VL-8B

# 70B model (8 GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m qh360_vl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path qihoo360/360VL-70B
```
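Because the 70B model must be sharded across all eight GPUs, `device_map='auto'` (as used in the quick-start snippet) plans the split for you; if you want to cap per-GPU usage, `from_pretrained` also accepts a `max_memory` dict. The helper below is a hypothetical convenience (the name `max_memory_map` and the 40 GiB budget are assumptions, not part of this repo) for building that dict:

```python
def max_memory_map(num_gpus, per_gpu_gib):
    # Build a per-device memory cap, e.g. {0: "40GiB", 1: "40GiB", ...},
    # in the format transformers' device_map planner expects.
    return {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}

print(max_memory_map(8, 40))
```

You would then pass it as `from_pretrained(..., device_map='auto', max_memory=max_memory_map(8, 40))`, leaving headroom on each card for activations and the KV cache.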
Chat about images using 360VL without the need for a Gradio interface.
```shell
INIT_MODEL_PATH="/hbox2dir"
name="360VL-8B"
python -m qh360_vl.eval.infer \
    --model-path $INIT_MODEL_PATH/$name \
```
360VL is built on Llama 3. If needed, please download the Llama 3 weights yourself:
[🤗Meta-Llama-3-8B-Instruct] [🤗Meta-Llama-3-70B-Instruct]
We follow the evaluation data organization of LLaVA-1.5; see Evaluation.md for details.
```shell
bash scripts/eval/mme.sh
bash scripts/eval/mmb_cn.sh
bash scripts/eval/mmb_en.sh
bash scripts/eval/refcoco.sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ./scripts/eval/gqa.sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ./scripts/eval/vqav2.sh
bash scripts/eval/llavabench.sh
bash scripts/eval/mmmu.sh
bash scripts/eval/pope.sh
bash scripts/eval/textvqa.sh
```
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those original licenses. The content of this project itself is licensed under the Apache License 2.0.
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!