[New Model]: Able to run the Phi-3.5-vision-instruct model, but want to run it with int4 quantization #8463

thalapandi opened this issue Sep 13, 2024 · 8 comments
Labels: new model (Requests to new models)

Comments

@thalapandi commented Sep 13, 2024

The model to consider.

from typing import List
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from PIL import Image
from vllm import LLM, SamplingParams
import os
import uvicorn
import time

app = FastAPI()

class InferenceRequest(BaseModel):
    model: str
    question: str
    image_paths: List[str]

# Initialize models once during application startup
models = {}

def load_image_from_path(image_path: str) -> Image.Image:
    """Load a PIL image from a local file path."""
    if not os.path.isfile(image_path):
        raise ValueError(f"File {image_path} does not exist.")
    return Image.open(image_path).convert("RGB")

def load_phi3v():
    """Load Phi3V model and return instance."""
    return LLM(
        model="microsoft/Phi-3.5-vision-instruct",
        trust_remote_code=True,
        max_model_len=4096,
        limit_mm_per_prompt={"image": 1},
    )

def initialize_models():
    """Initialize all models required for inference."""
    global models
    models["phi3_v"] = load_phi3v()

def load_phi3v_prompt(question, image_paths: List[str]):
    placeholders = "\n".join(f"<|image_{i}|>" for i, _ in enumerate(image_paths, start=1))
    prompt = f"<|user|>\n{placeholders}\n{question}<|end|>\n<|assistant|>\n"
    stop_token_ids = None
    return prompt, stop_token_ids

def run_generate(model_name: str, question: str, image_paths: List[str]):
    if model_name not in models:
        raise ValueError(f"Model {model_name} is not loaded.")
    
    llm = models[model_name]
    prompt, stop_token_ids = load_phi3v_prompt(question, image_paths)
    image_data = [load_image_from_path(path) for path in image_paths]
    
    sampling_params = SamplingParams(temperature=0.0, max_tokens=128, stop_token_ids=stop_token_ids)
    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {
                "image": image_data
            },
        },
        sampling_params=sampling_params
    )
    return [o.outputs[0].text for o in outputs]

@app.on_event("startup")
async def startup_event():
    initialize_models()

@app.post("/inference")
async def inference(request: InferenceRequest):
    try:
        start_time = time.time()

        result = run_generate(request.model, request.question, request.image_paths)
        end_time = time.time()
        print("total time taken",end_time-start_time)
        return {"results": result}
    except ValueError as ve:
        raise HTTPException(status_code=400, detail=str(ve))
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8002)
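
For reference, a minimal sketch of exercising this endpoint once the server is running; the image path is a placeholder, and "phi3_v" and port 8002 match the code above:

import requests

# Placeholder request against the /inference endpoint defined above.
payload = {
    "model": "phi3_v",                      # key registered in initialize_models()
    "question": "What is shown in this image?",
    "image_paths": ["/path/to/image.jpg"],  # placeholder path on the server machine
}
response = requests.post("http://localhost:8002/inference", json=payload, timeout=300)
print(response.json())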

The closest model vllm already supports.

The Phi-3.5-vision-instruct model is already supported; I need a reference for running it with quantization.

What's your difficulty of supporting the model you want?

The documentation does not contain any information about quantization for the Phi-3.5-vision-instruct model.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@thalapandi added the new model label on Sep 13, 2024
@DarkLight1337 (Member)

Are you using a custom quantized model? I don't see it on HuggingFace.

@thalapandi (Author)

I am using only the Phi-3.5-vision-instruct model and want to run it in vLLM with 4-bit quantization. One more doubt I have: can I use the following engine configuration for the Phi-3.5 model?

Like this:

"""
Saves each worker's model state dict directly to a checkpoint, which enables a
fast load path for large tensor-parallel models where each worker only needs to
read its own shard rather than the entire checkpoint.

Example usage:

python save_sharded_state.py \
    --model /path/to/load \
    --quantization deepspeedfp \
    --tensor-parallel-size 8 \
    --output /path/to/save

Then, the model can be loaded with

llm = LLM(
    model="/path/to/save",
    load_format="sharded_state",
    quantization="deepspeedfp",
    tensor_parallel_size=8,
)
"""
import dataclasses
import os
import shutil
from pathlib import Path

from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser

parser = FlexibleArgumentParser()
EngineArgs.add_cli_args(parser)
parser.add_argument("--output",
                    "-o",
                    required=True,
                    type=str,
                    help="path to output checkpoint")
parser.add_argument("--file-pattern",
                    type=str,
                    help="string pattern of saved filenames")
parser.add_argument("--max-file-size",
                    type=str,
                    default=5 * 1024**3,
                    help="max size (in bytes) of each safetensors file")


def main(args):
    engine_args = EngineArgs.from_cli_args(args)
    if engine_args.enable_lora:
        raise ValueError("Saving with enable_lora=True is not supported!")
    model_path = engine_args.model
    if not Path(model_path).is_dir():
        raise ValueError("model path must be a local directory")
    # Create LLM instance from arguments
    llm = LLM(**dataclasses.asdict(engine_args))
    # Prepare output directory
    Path(args.output).mkdir(exist_ok=True)
    # Dump worker states to output directory
    model_executor = llm.llm_engine.model_executor
    model_executor.save_sharded_state(path=args.output,
                                      pattern=args.file_pattern,
                                      max_size=args.max_file_size)
    # Copy metadata files to output directory
    for file in os.listdir(model_path):
        if os.path.splitext(file)[1] not in (".bin", ".pt", ".safetensors"):
            if os.path.isdir(os.path.join(model_path, file)):
                shutil.copytree(os.path.join(model_path, file),
                                os.path.join(args.output, file))
            else:
                shutil.copy(os.path.join(model_path, file), args.output)


if __name__ == "__main__":
    args = parser.parse_args()
    main(args)

@DarkLight1337 (Member)

@Isotr0py are you familiar with this?

@Isotr0py (Collaborator) commented Sep 14, 2024

I'm not sure which quantization "int4 quantization" refers to here, because it seems that there is no BNB 4-bit quantized Phi3-V model released on HF. (The code given above uses deepspeedfp quantization, which is fp6/fp8 quantization.)

If "int4 quantization" just means 4-Bit quantization, Phi-3.5-vision-instruct-AWQ with awq quantization should work on VLLM.

@thalapandi (Author) commented Sep 14, 2024

How many GPUs are needed to run the AWQ-quantized model?
In vLLM, is it possible to run TensorRT? If so, is there any documentation for Phi-3.5-vision-instruct?

@Isotr0py (Collaborator) commented Sep 14, 2024

It costs about 4 GB of VRAM to run the 4-bit AWQ-quantized Phi-3.5-vision-instruct.

BTW, the AWQ model I uploaded was calibrated with the default dataset in AutoAWQ, because I just used it to check code consistency. You had better calibrate from the source model with your custom datasets to get better quality.
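
For context, a rough sketch of what calibrating with AutoAWQ on your own text data could look like; this assumes AutoAWQ is installed and can load this model with trust_remote_code, that vision-language models may need extra handling, and that the output path is a placeholder:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "microsoft/Phi-3.5-vision-instruct"
quant_path = "phi-3.5-vision-instruct-awq"  # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# calib_data can be a list of strings drawn from your own target-domain data.
calib_data = ["Example prompt text from your target domain ...", "..."]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)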

I think vLLM can't run TensorRT currently. (FYI, #5134 (comment))

@thalapandi (Author)

ok

@DarkLight1337 (Member)

Does this work for you?
