Research: Add bounding boxes to response #7

Open
tylermaran opened this issue Jul 29, 2024 · 16 comments · May be fixed by #44
Labels: help wanted (Extra attention is needed)

Comments

@tylermaran
Contributor

Generally I would love to have some bounding boxes come back with the text response, primarily for highlighting locations in the original document where the text got pulled. Not sure exactly how I would proceed with this one, but would love to hear some thoughts.

I think the general flow would be:

  1. Parse the document with gpt mini
  2. Split the resulting markdown into semantic sections (i.e. headers, subheaders, tables, etc.)
  3. For each semantic section, use [insert ai tool] to find bounding boxes in the original image
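Purely as an illustration of step 2 (splitting the resulting markdown into semantic sections), a minimal Python sketch. The section types and boundary rules here are made up for the example and are not part of any existing zerox code:

```python
import re


def split_markdown_sections(markdown: str) -> list[dict]:
    """Split markdown into rough semantic sections (headings, tables, paragraphs)."""
    sections = []
    current = {"type": "paragraph", "lines": []}

    def flush():
        if current["lines"]:
            sections.append({
                "type": current["type"],
                "markdown": "\n".join(current["lines"]).strip(),
            })

    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):            # a heading opens its own section
            flush()
            current = {"type": "heading", "lines": [line]}
        elif line.lstrip().startswith("|"):         # consecutive table rows are grouped
            if current["type"] != "table":
                flush()
                current = {"type": "table", "lines": []}
            current["lines"].append(line)
        elif not line.strip():                      # a blank line closes the section
            flush()
            current = {"type": "paragraph", "lines": []}
        else:
            if current["type"] != "paragraph":      # text after a heading/table starts a new paragraph
                flush()
                current = {"type": "paragraph", "lines": []}
            current["lines"].append(line)
    flush()
    return sections
```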
tylermaran added the help wanted label on Jul 29, 2024
@getwithashish

Hey @tylermaran

This seems exciting. Would love to work on it.

I think tweaking the system prompt would do the trick.

@getwithashish

Hey @tylermaran

I played around with the prompts for some time. I was able to get the bounding boxes back, but they are not 100% accurate.
Some boxes are off by 10-20 pixels. Maybe it is due to the image scaling done by GPT.
Looking into whether that can be solved.
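If the offsets come from the model reasoning over a downscaled copy of the page, one sanity check is to map the returned coordinates back to the original resolution. A tiny sketch, assuming we know the scaled size the model effectively saw (the sizes in the example are made up):

```python
def rescale_bbox(bbox, scaled_size, original_size):
    """Map a Pascal VOC box [x_min, y_min, x_max, y_max] returned for a scaled
    image back to the original image's pixel space."""
    sx = original_size[0] / scaled_size[0]
    sy = original_size[1] / scaled_size[1]
    x_min, y_min, x_max, y_max = bbox
    return [round(x_min * sx), round(y_min * sy), round(x_max * sx), round(y_max * sy)]


# e.g. a box reported against a 768x768 copy of a 2550x3300 (300 dpi letter) page
print(rescale_bbox([100, 120, 400, 260], (768, 768), (2550, 3300)))
```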

@getwithashish

[image: image_with_bb]

It is able to identify the sections: heading, paragraph, paragraph, table.
But the bounding boxes become more inaccurate when there is more data on the page.

@getwithashish

Seems like we need to go with a different approach.

This is the flow that I have in mind:

  1. Get the different sections and the corresponding markdown, using GPT
  2. Use an OCR package to extract text and corresponding bounding boxes
  3. Compare it with the obtained markdown, to get the bounding box of a section
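For step 2, pytesseract already exposes word-level boxes through `image_to_data`; a minimal sketch of pulling text plus bounding boxes from a page image (the output shape is just one possible convention, not an existing zerox interface):

```python
import pytesseract
from PIL import Image
from pytesseract import Output


def ocr_words_with_boxes(image_path: str) -> list[dict]:
    """Run Tesseract on a page image and return word-level text with boxes."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=Output.DICT)

    words = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip empty detections
            words.append({
                "text": text,
                "left": data["left"][i],
                "top": data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
                "conf": data["conf"][i],
            })
    return words
```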

@tylermaran
Contributor Author

Hey @getwithashish! This is really promising. Can you share the prompts you were using to get the bounding boxes returned?

@pradhyumna85
Contributor

pradhyumna85 commented Sep 6, 2024

Seems like we need to go with a different approach.

This is the flow that I have in mind:

  1. Get the different sections and the corresponding markdown, using GPT
  2. Use an OCR package to extract text and corresponding bounding boxes
  3. Compare it with the obtained markdown, to get the bounding box of a section

@getwithashish I think the most straightforward way, with a slightly modified workflow, would be:

So, in this approach the biggest concern I have is cost: how economical would it be to call vision model APIs a couple of hundred times for every page, on different bounding box crops?

In the approach you shared, especially in the last step, we would have to do a reverse matching, i.e., compare the text-only (markdown formatting removed) vision-model-extracted text against the Tesseract OCR text via fuzzy search to obtain the mapping for each bounding box. There will also be certain hyperparameters here, like the fuzzy matching threshold, text chunking logic, chunk size, etc. (a rough sketch of this matching is included at the end of this comment).

It could be an interesting milestone for the future, though, to make this comparable and compatible with traditional bounding-box OCR methods.
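As an illustration of that reverse-matching step, a sketch using rapidfuzz for the fuzzy comparison (rapidfuzz is just one option); the markdown stripping, chunking convention, and threshold are placeholder hyperparameters, not tuned values:

```python
import re

from rapidfuzz import fuzz


def strip_markdown(md: str) -> str:
    """Crudely drop markdown formatting so only plain text is compared."""
    text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", md)   # links -> their link text
    text = re.sub(r"[#*_`>|]+", " ", text)          # headings, emphasis, tables, quotes
    return re.sub(r"\s+", " ", text).strip()


def best_matching_chunk(section_md: str, ocr_chunks: list[dict], threshold: int = 70):
    """Fuzzy-match one markdown section against pre-chunked OCR text.

    ocr_chunks: dicts like {"text": ..., "bbox": [x_min, y_min, x_max, y_max]}.
    The chunking logic and threshold are exactly the hyperparameters mentioned
    above; the values here are arbitrary.
    """
    section_text = strip_markdown(section_md)
    scored = [(fuzz.token_sort_ratio(section_text, c["text"]), c) for c in ocr_chunks]
    score, best = max(scored, key=lambda pair: pair[0])
    return {"score": score, "chunk": best} if score >= threshold else None
```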

@tylermaran

@getwithashish

getwithashish commented Sep 10, 2024

Hey @getwithashish! This is really promising. Can you share the prompts you were using to get the bounding boxes returned?

@tylermaran

System Prompt 1:
Convert the following PDF page to markdown.
Return only the markdown with no explanation text.
Do not exclude any content from the page.

System Prompt 2:
Group each semantic sections like header, footer, body, headings, table and so on.
Include the bounding box of the corresponding section in pascal voc format.
Image width is 768px and Image height is 768px.
The response format should be of the following format: """{"type": "semantic section type", "bbox": [x_min, y_min, x_max, y_max], "markdown": "markdown content of the corresponding section"}""".
Make sure to replace semantic section type with the actual type, and [x_min, y_min, x_max, y_max] with the actual bounding box coordinates in Pascal VOC format.
Ensure that the markdown content is accurate and includes all relevant data from the page.
Only return the contents which are in the page.
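For context, here is roughly how System Prompt 2 could be sent along with the page image via the OpenAI chat completions API. The model name, detail level, and the line-by-line JSON parsing are assumptions for illustration, not what zerox does internally:

```python
import base64
import json

from openai import OpenAI

client = OpenAI()

# the exact System Prompt 2 text from above
BBOX_PROMPT = """Group each semantic sections like header, footer, body, headings, table and so on.
Include the bounding box of the corresponding section in pascal voc format.
Image width is 768px and Image height is 768px.
..."""


def sections_with_bboxes(image_path: str) -> list[dict]:
    """Send the resized page image plus the bounding-box prompt and parse the reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed; use whatever model zerox is configured with
        messages=[
            {"role": "system", "content": BBOX_PROMPT},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"},
                    }
                ],
            },
        ],
    )
    # the prompt asks for one JSON object per section; parsing line by line is
    # optimistic and would need guarding against malformed output in practice
    content = response.choices[0].message.content
    return [json.loads(line) for line in content.splitlines() if line.strip().startswith("{")]
```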

I also resized the image according to the docs before sending it:

Image inputs are metered and charged in tokens, just as text inputs are. The token cost of a given image is determined by two factors: its size, and the detail option on each image_url block. All images with detail: low cost 85 tokens each. detail: high images are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. Then, they are scaled such that the shortest side of the image is 768px long. Finally, we count how many 512px squares the image consists of. Each of those squares costs 170 tokens. Another 85 tokens are always added to the final total.

Here are some examples demonstrating the above.

  • A 1024 x 1024 square image in detail: high mode costs 765 tokens
    1024 is less than 2048, so there is no initial resize.
    The shortest side is 1024, so we scale the image down to 768 x 768.
    4 512px square tiles are needed to represent the image, so the final token cost is 170 * 4 + 85 = 765.
  • A 2048 x 4096 image in detail: high mode costs 1105 tokens
    We scale down the image to 1024 x 2048 to fit within the 2048 square.
    The shortest side is 1024, so we further scale down to 768 x 1536.
    6 512px tiles are needed, so the final token cost is 170 * 6 + 85 = 1105.
  • A 4096 x 8192 image in detail: low mode costs 85 tokens
    Regardless of input size, low detail images are a fixed cost.
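The quoted scaling and pricing rules are easy to turn into a small calculator; a sketch that mirrors the detail: high math above (images already smaller than the limits are left out of this sketch):

```python
import math


def high_detail_dimensions(width: int, height: int) -> tuple[int, int]:
    """Apply the two scaling steps quoted above for detail: high images."""
    # 1. fit within a 2048 x 2048 square, keeping aspect ratio
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = round(width * scale), round(height * scale)
    # 2. scale down so the shortest side is 768px
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = round(width * scale), round(height * scale)
    return width, height


def high_detail_token_cost(width: int, height: int) -> int:
    """170 tokens per 512px tile plus a flat 85 tokens."""
    w, h = high_detail_dimensions(width, height)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85


print(high_detail_token_cost(1024, 1024))  # 765
print(high_detail_token_cost(2048, 4096))  # 1105
```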

@getwithashish

Hey @pradhyumna85,

You’re right—there’s no guarantee that the sections identified through OCR will align perfectly with those derived from markdown. Moreover, using vision models for each bounding box crop is not viable. 😥

The ideal solution would indeed be to leverage the bounding boxes directly from the model that generates the markdown. Since visual grounding is not supported by GPT, I suppose we have to go with a workaround of using OCR.

Instead of relying on visual models, I am working on an algorithm to effectively perform a similarity search between the text extracted via OCR and the model's output. I'm on it like a squirrel with a nut🐿️🤓🥜

@pradhyumna85
Contributor

@getwithashish This would be something interesting.
On the similarity part, have a look at this research paper, which interestingly uses DTW for the same: Measuring text similarity with dynamic time warping.

@getwithashish

@getwithashish This would be something interesting. On the similarity part, have a look at this research paper, which interestingly uses DTW for the same: Measuring text similarity with dynamic time warping.

The paper was intriguing, but I've got a few qualms. Converting text into a time series and then using DTW sounds pretty cool, but the tricky part is choosing the right keywords from the text. Since the documents can come from any random domain, picking out the right keywords gets a lot harder.

Sure, we could use TF-IDF to select keywords, but that works best for big datasets. When we're dealing with smaller sections, it’s like trying to pick the ripest fruit in a basket while wearing sunglasses indoors — you might grab something, but there's a good chance it’s not what you were looking for.

That said, it’s definitely an interesting approach.

@getwithashish

[image: cs101_with_bb]

I am currently working on the bounding box for the table. With a little more fine-tuning, we should be good to go.

@tylermaran What are your thoughts?

@pradhyumna85
Contributor

@getwithashish, just trying to understand here: how many types of element bounding boxes are you targeting exactly, and how?

@getwithashish

@tylermaran @pradhyumna85

This is the current flow:

  • Get the segmented markdown from GPT
  • Use pytesseract for OCR
  • Match the sections: each section from the GPT markdown needs to find its twin in the OCR data (using similarity search)
  • Similarity search is performed by calculating the edit distance (Levenshtein distance)
  • To keep things smooth, we use a sliding window approach.
  • Finally, we have the bounding box for the section: (left, top, width, height), normalized as well. Because why not?
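A rough sketch of that matching step, using rapidfuzz's Levenshtein implementation over pytesseract's word-level output; the window sizing and scoring here are illustrative and not the exact logic in the PR:

```python
from rapidfuzz.distance import Levenshtein


def find_section_bbox(section_text: str, ocr_words: list[dict],
                      page_width: int, page_height: int):
    """Slide a window over the OCR words, score it with normalized Levenshtein
    distance against the section text, and return a normalized bounding box.

    ocr_words: pytesseract image_to_data rows with "text", "left", "top",
    "width", "height". The window size (one OCR word per section word) is
    a simplification for the sketch.
    """
    n = max(1, len(section_text.split()))
    best_dist, best_window = float("inf"), None

    for start in range(0, max(1, len(ocr_words) - n + 1)):
        window = ocr_words[start:start + n]
        candidate = " ".join(w["text"] for w in window)
        dist = Levenshtein.normalized_distance(section_text, candidate)
        if dist < best_dist:
            best_dist, best_window = dist, window

    if best_window is None:
        return None

    left = min(w["left"] for w in best_window)
    top = min(w["top"] for w in best_window)
    right = max(w["left"] + w["width"] for w in best_window)
    bottom = max(w["top"] + w["height"] for w in best_window)
    return {
        "distance": best_dist,
        "bbox": {  # normalized (left, top, width, height)
            "left": left / page_width,
            "top": top / page_height,
            "width": (right - left) / page_width,
            "height": (bottom - top) / page_height,
        },
    }
```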

@getwithashish

[image]

We'll now get section-wise normalized bounding boxes along with content.

@getwithashish

getwithashish commented Sep 20, 2024

I will be kicking off the PR today! It’s been a hot minute since I started on this feature, but hey, better late than never. 😄

@getwithashish

Hey @tylermaran,

PR’s up and ready for your review! 🧐
Let me know what you think!
