
How to get text coordinates (bbox) from phi-3 vision #123

Open

ladanisavan opened this issue Aug 2, 2024 · 4 comments
Labels: enhancement (New feature or request), question (Further information is requested)

ladanisavan commented Aug 2, 2024

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Hello,

First, thank you for the incredible work you have shared with the phi community. I am wondering if there is a way to obtain the text coordinates (bounding boxes) from the phi-3 vision generated output for an input image? This feature would be immensely beneficial for various applications that rely on precise text positioning.

Thank you for considering this request.

leestott (Contributor) commented Aug 5, 2024

@ChenRocks thoughts on the above feature?

leestott added the question (Further information is requested) and enhancement (New feature or request) labels on Aug 6, 2024
leestott (Contributor) commented Aug 14, 2024

@ladanisavan

To achieve this, you can use the ONNX Runtime with the Phi-3 vision model.

Here’s a general approach:

  1. Setup: Ensure you have the necessary tools and libraries installed, such as ONNX Runtime and the Phi-3 vision model. You can find the models on platforms like Azure AI Catalog or Hugging Face.

  2. Run the Model: Use the ONNX Runtime to run the Phi-3 vision model on your input image. The model will process the image and generate the output, including text and its coordinates.

  3. Extract Bounding Boxes: The output from the model will include the bounding boxes for the detected text. These boxes are typically represented by the coordinates of the top-left corner (x, y) and the width and height of the box.
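
Before wiring up step 3, it is worth checking what the exported model actually declares as inputs and outputs, since bounding boxes will only be available if the graph exposes such a tensor. A minimal onnxruntime sketch (the model path is a placeholder):

import onnxruntime as ort

# List the declared inputs/outputs of the exported model to see
# which tensors (if any) carry box/detection information.
session = ort.InferenceSession("path_to_phi3_model.onnx")

for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)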

Here is a simplified example of how you might set this up in Python:

import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the exported model (the path is a placeholder)
session = ort.InferenceSession("path_to_phi3_model.onnx")

# Preprocess the image: force RGB, cast to float32, and add a batch dimension.
# Apply whatever resizing/normalization your exported graph expects here.
image = Image.open("path_to_image.jpg").convert("RGB")
input_data = np.expand_dims(np.array(image, dtype=np.float32), axis=0)

# Run the model, feeding the first declared input rather than a hard-coded name
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: input_data})

# Extract bounding boxes from the output
bounding_boxes = outputs[0]  # assuming the first output contains the bounding boxes

for box in bounding_boxes:
    x, y, width, height = box
    print(f"Bounding box: x={x}, y={y}, width={width}, height={height}")

Source code examples & ONNX models:
Phi-3 vision tutorial | onnxruntime
Phi-3 vision ONNX CPU model
Phi-3 vision CUDA ONNX model

ladanisavan (Author) commented

@leestott

Thank you for getting back to me. Have you tested this on your end? It's not working for me.

ChenRocks (Contributor) commented

Thanks @ladanisavan for your inquiry. Unfortunately, BBox support is currently not available in Phi-3.x-vision. We appreciate this feedback and will discuss this feature request for future versions.

In the meantime, I personally recommend Florence-2.
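
For anyone who needs text boxes today, here is a minimal sketch of Florence-2's OCR-with-region task via Hugging Face transformers (the microsoft/Florence-2-large checkpoint, prompt token, and generation settings below follow the public model card; treat them as assumptions and adjust for your setup):

from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load Florence-2 (requires trust_remote_code for its custom model/processor classes)
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The <OCR_WITH_REGION> task returns recognized text together with region boxes
task = "<OCR_WITH_REGION>"
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(text=task, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Post-process into structured output: text labels plus quadrilateral region boxes
parsed = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(parsed[task])  # e.g. {"quad_boxes": [...], "labels": [...]}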
