
How to get text coordinates (bbox) from phi-3 vision #123

Open

ladanisavan opened this issue Aug 2, 2024 · 4 comments
Labels: enhancement (New feature or request), question (Further information is requested)

ladanisavan commented Aug 2, 2024

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Hello,

First, thank you for the incredible work you have shared with the phi community. I am wondering if there is a way to obtain the text coordinates (bounding boxes) from the phi-3 vision generated output for an input image? This feature would be immensely beneficial for various applications that rely on precise text positioning.

Thank you for considering this request.

leestott (Contributor) commented Aug 5, 2024

@ChenRocks thoughts on the above feature?

leestott added the question (Further information is requested) and enhancement (New feature or request) labels on Aug 6, 2024
leestott (Contributor) commented Aug 14, 2024

@ladanisavan

To achieve this, you can use the ONNX Runtime with the Phi-3 vision model.

Here’s a general approach:

  1. Setup: Ensure you have the necessary tools and libraries installed, such as ONNX Runtime and the Phi-3 vision model. You can find the models on platforms like Azure AI Catalog or Hugging Face.

  2. Run the Model: Use the ONNX Runtime to run the Phi-3 vision model on your input image. The model will process the image and generate the output, including text and its coordinates.

  3. Extract Bounding Boxes: The output from the model will include the bounding boxes for the detected text. These boxes are typically represented by the coordinates of the top-left corner (x, y) and the width and height of the box.
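
Before wiring up step 3, it is worth checking what the exported model actually declares as inputs and outputs, since bounding boxes will only be available if the graph exposes such a tensor. A minimal onnxruntime sketch (the model path is a placeholder):

import onnxruntime as ort

# List the declared inputs/outputs of the exported model to see
# which tensors (if any) carry box/detection information.
session = ort.InferenceSession("path_to_phi3_model.onnx")

for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)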

Here is a simplified example of how you might set this up in Python:

import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the exported model (the path is a placeholder)
session = ort.InferenceSession("path_to_phi3_model.onnx")

# Preprocess the image: force RGB, cast to float32, and add a batch dimension.
# Apply whatever resizing/normalization your exported graph expects here.
image = Image.open("path_to_image.jpg").convert("RGB")
input_data = np.expand_dims(np.array(image, dtype=np.float32), axis=0)

# Run the model, feeding the first declared input rather than a hard-coded name
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: input_data})

# Extract bounding boxes from the output
bounding_boxes = outputs[0]  # assuming the first output contains the bounding boxes

for box in bounding_boxes:
    x, y, width, height = box
    print(f"Bounding box: x={x}, y={y}, width={width}, height={height}")

Source code examples & ONNX models:
Phi-3 vision tutorial | onnxruntime
Phi-3 vision ONNX CPU model
Phi-3 vision CUDA ONNX model

ladanisavan (Author) commented

@leestott

Thank you for getting back to me. Have you tested this on your end? It's not working for me.

ChenRocks (Contributor) commented

Thanks @ladanisavan for your inquiry. Unfortunately, BBox support is currently not available in Phi-3.x-vision. We appreciate this feedback and will discuss this feature request for future versions.

In the meantime, I personally recommend Florence-2.
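
For anyone who needs text boxes today, here is a minimal sketch of Florence-2's OCR-with-region task via Hugging Face transformers (the microsoft/Florence-2-large checkpoint, prompt token, and generation settings below follow the public model card; treat them as assumptions and adjust for your setup):

from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load Florence-2 (requires trust_remote_code for its custom model/processor classes)
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The <OCR_WITH_REGION> task returns recognized text together with region boxes
task = "<OCR_WITH_REGION>"
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(text=task, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Post-process into structured output: text labels plus quadrilateral region boxes
parsed = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(parsed[task])  # e.g. {"quad_boxes": [...], "labels": [...]}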
