Research: Add bounding boxes to response #7
Comments
Hey @tylermaran, this seems exciting. Would love to work on it. I think tweaking the system prompt would do the trick.
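For illustration only (this is not the prompt actually used later in the thread), a tweak along these lines would ask the vision model to emit coordinates next to each markdown block. The prompt text, the JSON schema, and the `gpt-4o` model choice below are all assumptions:

```python
# Hypothetical sketch of a system-prompt tweak asking the vision model to
# return bounding boxes alongside the markdown. Not the prompt used in this
# thread; the response schema is an assumption.
import base64

from openai import OpenAI

SYSTEM_PROMPT = (
    "Convert this page image to markdown. For every block you transcribe, also "
    "return its bounding box in pixels as a JSON object: "
    '{"markdown": "...", "bbox": [x, y, width, height]}, one object per block.'
)

def request_markdown_with_boxes(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
                ],
            },
        ],
    )
    return response.choices[0].message.content
```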
Hey @tylermaran, I played around with the prompts for some time. I was able to get the bounding boxes back, but they are not 100% accurate.
Seems like we need to go with a different approach. This is the flow that I have in mind:
Hey @getwithashish! This is really promising. Can you share the prompts you were using to get the bounding boxes returned?
@getwithashish I think the most straightforward way to do this, with a slightly modified workflow, would be:
So, in this approach the biggest concern I have is cost: how economical would it be to call vision model APIs a couple of hundred times per page, once for each bounding-box crop? In the approach you shared, especially in the last step, we have to do a reverse matching, i.e., compare the text-only (markdown formatting removed) vision model output against the Tesseract OCR text via fuzzy search to obtain the mapping for each bounding box. There will also be several hyperparameters here, such as the fuzzy-matching threshold, the text-chunking logic, the chunk size, etc. It could be an interesting milestone for the future, though, to make this comparable and compatible with traditional bbox OCR methods.
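For reference, the Tesseract side of this is straightforward to prototype: word- and line-level boxes can be pulled with pytesseract. This is only a minimal sketch; pytesseract, Pillow, and the line-grouping logic are assumptions, not anything from this thread:

```python
# Sketch: get line-level text + bounding boxes from Tesseract via pytesseract.
# Assumes pytesseract and Pillow are installed and the tesseract binary is on PATH.
from collections import defaultdict

import pytesseract
from PIL import Image

def ocr_line_boxes(image_path: str):
    """Return a list of (text, (left, top, width, height)) per OCR line."""
    data = pytesseract.image_to_data(Image.open(image_path), output_type=pytesseract.Output.DICT)
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue
        # Group words that Tesseract assigns to the same block/paragraph/line.
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(i)
    results = []
    for indices in lines.values():
        text = " ".join(data["text"][i] for i in indices)
        left = min(data["left"][i] for i in indices)
        top = min(data["top"][i] for i in indices)
        right = max(data["left"][i] + data["width"][i] for i in indices)
        bottom = max(data["top"][i] + data["height"][i] for i in indices)
        results.append((text, (left, top, right - left, bottom - top)))
    return results
```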
System Prompt 1:
System Prompt 2:
I also resized the image according to the docs before sending it:
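The prompts and the resizing snippet themselves are not preserved above. As a rough sketch of the resizing step, assuming Pillow and the commonly documented vision-model limits (fit within 2048×2048, then scale the shortest side to about 768 px):

```python
# Sketch: downscale an image before sending it to a vision model.
# The 2048x2048 / 768 px limits are an assumption based on common vision-model
# documentation; adjust to whatever the target model's docs actually specify.
from PIL import Image

def resize_for_vision(image_path: str, out_path: str) -> None:
    img = Image.open(image_path)
    img.thumbnail((2048, 2048))  # fit within 2048x2048, keeping aspect ratio
    shortest = min(img.size)
    if shortest > 768:
        scale = 768 / shortest
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    img.save(out_path)
```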
Hey @pradhyumna85, You’re right—there’s no guarantee that the sections identified through OCR will align perfectly with those derived from markdown. Moreover, using vision models for each bounding box crop is not viable. 😥 The ideal solution would indeed be to leverage the bounding boxes directly from the model that generates the markdown. Since visual grounding is not supported by GPT, I suppose we have to go with a workaround of using OCR. Instead of relying on visual models, I am working on an algorithm to effectively perform a similarity search between the text extracted via OCR and the model's output. I'm on it like a squirrel with a nut🐿️🤓🥜
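As a rough illustration of that kind of similarity search (not the actual implementation being worked on here), each chunk of the model's markdown could be fuzzy-matched against the OCR line boxes. rapidfuzz, the markdown-stripping regex, and the score threshold are all assumptions:

```python
# Sketch: map chunks of the model's markdown output to OCR line boxes by fuzzy
# similarity. rapidfuzz is an assumption; any fuzzy matcher would work.
import re

from rapidfuzz import fuzz

def strip_markdown(text: str) -> str:
    """Very rough markdown-to-plain-text cleanup, just for matching purposes."""
    return re.sub(r"[#*_`>|\[\]()]", " ", text)

def match_chunks_to_boxes(markdown_chunks, ocr_lines, threshold=80):
    """ocr_lines: list of (text, bbox) pairs, e.g. from an OCR step like the one above.
    Returns a list of (chunk, best_bbox or None)."""
    mapping = []
    for chunk in markdown_chunks:
        plain = strip_markdown(chunk)
        best_score, best_bbox = 0, None
        for line_text, bbox in ocr_lines:
            score = fuzz.partial_ratio(plain, line_text)
            if score > best_score:
                best_score, best_bbox = score, bbox
        mapping.append((chunk, best_bbox if best_score >= threshold else None))
    return mapping
```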
@getwithashish This would be something interesting.
The paper was intriguing, but I've got a few qualms. Converting text into a time series and then using DTW sounds pretty cool, but the tricky part is choosing the right keywords from the text. Since the documents can come from any random domain, picking out the right keywords gets a lot harder. Sure, we could use TF-IDF to select keywords, but that works best for big datasets. When we're dealing with smaller sections, it’s like trying to pick the ripest fruit in a basket while wearing sunglasses indoors — you might grab something, but there's a good chance it’s not what you were looking for. That said, it’s definitely an interesting approach.
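For concreteness, one plausible reading of the text-to-time-series idea being critiqued here is sketched below; the hand-picked vocabulary stands in for whatever keyword-selection step (e.g. TF-IDF) would actually be used, which is exactly the part argued to be hard on small sections:

```python
# Sketch: one plausible reading of "text -> keyword time series -> DTW".
# The vocabulary is hand-picked here; choosing it automatically is the hard part.
import numpy as np

def keyword_series(text: str, vocab: list[str]) -> np.ndarray:
    """Encode the text as the sequence of vocabulary indices, in reading order."""
    words = text.lower().split()
    return np.array([vocab.index(w) for w in words if w in vocab], dtype=float)

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```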
I am currently working on the bounding box for the table. With a little more fine-tuning, we should be good to go. @tylermaran What are your thoughts?
@getwithashish, just trying to understand here: how many types of element bounding boxes are you targeting exactly, and how?
This is the current flow:
I will be kicking off the PR today! It’s been a hot minute since I started on this feature, but hey, better late than never. 😄
Hey @tylermaran, PR’s up and ready for your review! 🧐
Generally I would love to have some bounding boxes come back with the text response. Primarily for highlighting locations in the original document where the text got pulled. Not sure exactly how I would proceed with this one, but would love to hear some thoughts.
I think the general flow would be: