Research: Add bounding boxes to response #7
Comments
Hey @tylermaran, this seems exciting. Would love to work on it. I think tweaking the system prompt would do the trick.
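For illustration only (this is not the prompt actually used later in the thread), a tweak along these lines would ask the vision model to emit coordinates next to each markdown block. The prompt text, the JSON schema, and the `gpt-4o` model choice below are all assumptions:

```python
# Hypothetical sketch of a system-prompt tweak asking the vision model to
# return bounding boxes alongside the markdown. Not the prompt used in this
# thread; the response schema is an assumption.
import base64

from openai import OpenAI

SYSTEM_PROMPT = (
    "Convert this page image to markdown. For every block you transcribe, also "
    "return its bounding box in pixels as a JSON object: "
    '{"markdown": "...", "bbox": [x, y, width, height]}, one object per block.'
)

def request_markdown_with_boxes(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
                ],
            },
        ],
    )
    return response.choices[0].message.content
```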
Hey @tylermaran, I played around with the prompts for some time. I was able to get the bounding boxes back, but they are not 100% accurate.
Seems like we need to go with a different approach. This is the flow that I have in mind:
Hey @getwithashish! This is really promising. Can you share the prompts you were using to get the bounding boxes returned?
@getwithashish I think the most straightforward way to do this, with a slightly modified workflow, would be:
So, in this approach the biggest concern I have is cost: how economical would it be to call vision model APIs a couple of hundred times per page, once for each bounding-box crop? In the approach you shared, especially in the last step, we have to do a reverse matching, i.e., compare the text-only (markdown formatting removed) vision model output against the Tesseract OCR text via fuzzy search to obtain the mapping for each bounding box. There will also be several hyperparameters here, such as the fuzzy-matching threshold, the text-chunking logic, the chunk size, etc. It could be an interesting milestone for the future, though, to make this comparable and compatible with traditional bbox OCR methods.
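For reference, the Tesseract side of this is straightforward to prototype: word- and line-level boxes can be pulled with pytesseract. This is only a minimal sketch; pytesseract, Pillow, and the line-grouping logic are assumptions, not anything from this thread:

```python
# Sketch: get line-level text + bounding boxes from Tesseract via pytesseract.
# Assumes pytesseract and Pillow are installed and the tesseract binary is on PATH.
from collections import defaultdict

import pytesseract
from PIL import Image

def ocr_line_boxes(image_path: str):
    """Return a list of (text, (left, top, width, height)) per OCR line."""
    data = pytesseract.image_to_data(Image.open(image_path), output_type=pytesseract.Output.DICT)
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue
        # Group words that Tesseract assigns to the same block/paragraph/line.
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(i)
    results = []
    for indices in lines.values():
        text = " ".join(data["text"][i] for i in indices)
        left = min(data["left"][i] for i in indices)
        top = min(data["top"][i] for i in indices)
        right = max(data["left"][i] + data["width"][i] for i in indices)
        bottom = max(data["top"][i] + data["height"][i] for i in indices)
        results.append((text, (left, top, right - left, bottom - top)))
    return results
```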
System Prompt 1:
System Prompt 2:
I also resized the image according to the docs before sending it:
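The prompts and the resizing snippet themselves are not preserved above. As a rough sketch of the resizing step, assuming Pillow and the commonly documented vision-model limits (fit within 2048×2048, then scale the shortest side to about 768 px):

```python
# Sketch: downscale an image before sending it to a vision model.
# The 2048x2048 / 768 px limits are an assumption based on common vision-model
# documentation; adjust to whatever the target model's docs actually specify.
from PIL import Image

def resize_for_vision(image_path: str, out_path: str) -> None:
    img = Image.open(image_path)
    img.thumbnail((2048, 2048))  # fit within 2048x2048, keeping aspect ratio
    shortest = min(img.size)
    if shortest > 768:
        scale = 768 / shortest
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    img.save(out_path)
```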
Hey @pradhyumna85, You’re right—there’s no guarantee that the sections identified through OCR will align perfectly with those derived from markdown. Moreover, using vision models for each bounding box crop is not viable. 😥 The ideal solution would indeed be to leverage the bounding boxes directly from the model that generates the markdown. Since visual grounding is not supported by GPT, I suppose we have to go with a workaround of using OCR. Instead of relying on visual models, I am working on an algorithm to effectively perform a similarity search between the text extracted via OCR and the model's output. I'm on it like a squirrel with a nut🐿️🤓🥜
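As a rough illustration of that kind of similarity search (not the actual implementation being worked on here), each chunk of the model's markdown could be fuzzy-matched against the OCR line boxes. rapidfuzz, the markdown-stripping regex, and the score threshold are all assumptions:

```python
# Sketch: map chunks of the model's markdown output to OCR line boxes by fuzzy
# similarity. rapidfuzz is an assumption; any fuzzy matcher would work.
import re

from rapidfuzz import fuzz

def strip_markdown(text: str) -> str:
    """Very rough markdown-to-plain-text cleanup, just for matching purposes."""
    return re.sub(r"[#*_`>|\[\]()]", " ", text)

def match_chunks_to_boxes(markdown_chunks, ocr_lines, threshold=80):
    """ocr_lines: list of (text, bbox) pairs, e.g. from an OCR step like the one above.
    Returns a list of (chunk, best_bbox or None)."""
    mapping = []
    for chunk in markdown_chunks:
        plain = strip_markdown(chunk)
        best_score, best_bbox = 0, None
        for line_text, bbox in ocr_lines:
            score = fuzz.partial_ratio(plain, line_text)
            if score > best_score:
                best_score, best_bbox = score, bbox
        mapping.append((chunk, best_bbox if best_score >= threshold else None))
    return mapping
```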
@getwithashish This would be something interesting.
The paper was intriguing, but I've got a few qualms. Converting text into a time series and then using DTW sounds pretty cool, but the tricky part is choosing the right keywords from the text. Since the documents can come from any random domain, picking out the right keywords gets a lot harder. Sure, we could use TF-IDF to select keywords, but that works best for big datasets. When we're dealing with smaller sections, it’s like trying to pick the ripest fruit in a basket while wearing sunglasses indoors — you might grab something, but there's a good chance it’s not what you were looking for. That said, it’s definitely an interesting approach.
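For concreteness, one plausible reading of the text-to-time-series idea being critiqued here is sketched below; the hand-picked vocabulary stands in for whatever keyword-selection step (e.g. TF-IDF) would actually be used, which is exactly the part argued to be hard on small sections:

```python
# Sketch: one plausible reading of "text -> keyword time series -> DTW".
# The vocabulary is hand-picked here; choosing it automatically is the hard part.
import numpy as np

def keyword_series(text: str, vocab: list[str]) -> np.ndarray:
    """Encode the text as the sequence of vocabulary indices, in reading order."""
    words = text.lower().split()
    return np.array([vocab.index(w) for w in words if w in vocab], dtype=float)

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```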
I am currently working on the bounding box for the table. With a little more fine-tuning, we should be good to go. @tylermaran What are your thoughts?
@getwithashish, just trying to understand here: how many types of element bounding boxes are you targeting exactly, and how?
This is the current flow:
I will be kicking off the PR today! It’s been a hot minute since I started on this feature, but hey, better late than never. 😄
Hey @tylermaran, PR’s up and ready for your review! 🧐
Generally I would love to have some bounding boxes come back with the text response. Primarily for highlighting locations in the original document where the text got pulled. Not sure exactly how I would proceed with this one, but would love to hear some thoughts.
I think the general flow would be: