PdfPageTextObject.chars() returns wrong results for text objects with overlapping bounding boxes #98
Comments
Hi @cemerick, thanks for this interesting question. Pdfium doesn't expose any information about the order in which it renders page objects. For the purposes of experimentation, I guess we can start with the assumption that the rendering order of page objects is the same as the iteration order of those page objects. That is just an assumption, however (although I will check to see if there's anything in the PDF standard about this). If that assumption holds, then yes, you should be able to iterate through all page objects in order and assume that later page objects are rendered "on top" of earlier page objects. I don't understand why you are interested in each individual character inside a text object. I would have thought the text object itself is the rendering primitive, not the characters within it. That is, there is no way a path object (for example) could occlude some characters in a text object but itself be occluded by other characters in the same text object. Either all the characters in the text object would be "behind" the path, or they'd all be "in front of" the path. Can you explain a little more why you're interested in each individual character? Do you have some test documents you can share that have interesting examples of occlusion that you're trying to detect? |
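To make the "later objects paint on top" assumption concrete, here is a minimal sketch in plain Rust. It does not call pdfium-render; `Rect` and `possible_occluders` are hypothetical names made up for illustration, and the whole thing only holds if iteration order really does equal paint order, which is unconfirmed.

```rust
/// Axis-aligned rectangle in page coordinates.
/// Hypothetical type for illustration, not a pdfium-render type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    left: f32,
    bottom: f32,
    right: f32,
    top: f32,
}

impl Rect {
    /// True when the two rectangles share some interior area.
    fn intersects(&self, other: &Rect) -> bool {
        self.left < other.right
            && other.left < self.right
            && self.bottom < other.top
            && other.bottom < self.top
    }
}

/// For each object, collect the indices of later objects whose bounds
/// overlap it. ASSUMPTION: slice order equals paint order.
fn possible_occluders(bounds: &[Rect]) -> Vec<Vec<usize>> {
    bounds
        .iter()
        .enumerate()
        .map(|(i, b)| {
            bounds
                .iter()
                .enumerate()
                .skip(i + 1)
                .filter(|(_, later)| later.intersects(b))
                .map(|(j, _)| j)
                .collect()
        })
        .collect()
}
```

An overlap of bounding boxes only marks a candidate occluder; whether the later object actually hides anything depends on its fill, opacity, and clipping, none of which this sketch models.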
I'm still murky on the abstractions that pdfium provides ("page objects" in the PDF spec are specifically related to the page tree, not individuated rendered elements), but insofar as e.g.
For example, say you have a text object consisting of the characters "0123456789", and a filled rect is positioned to overlap chars 5-9, how are you going to determine that programmatically? Character-level bounding boxes are necessary, but as I said, I hope this is clarifying! Although, given what I've read since my first message in the pdfium sources and google group (e.g. https://groups.google.com/g/pdfium/c/qivGc4X2r2E/m/1TWKF1tJBgAJ), I'm not optimistic that what I'm after is a reasonable objective with pdfium, at least not without some enhancements to it. Of course, if you find that I'm being too pessimistic, I'll be all ears. 😃 |
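The "0123456789" case can be sketched as follows, assuming per-character bounding boxes are available from somewhere. This is plain Rust with hypothetical helpers (`Rect`, `fixed_width_char_boxes`, `covered_indices`); the fixed glyph width is an assumption made only to keep the example small, and real boxes would have to come from the PDF library.

```rust
/// Axis-aligned rectangle in page coordinates (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    left: f32,
    bottom: f32,
    right: f32,
    top: f32,
}

impl Rect {
    /// True when the two rectangles share some interior area.
    fn intersects(&self, other: &Rect) -> bool {
        self.left < other.right
            && other.left < self.right
            && self.bottom < other.top
            && other.bottom < self.top
    }
}

/// Lay out one box per character, left to right, all `w` wide and `h` tall.
/// ASSUMPTION: fixed-width glyphs, purely for illustration.
fn fixed_width_char_boxes(text: &str, x: f32, y: f32, w: f32, h: f32) -> Vec<Rect> {
    text.chars()
        .enumerate()
        .map(|(i, _)| Rect {
            left: x + i as f32 * w,
            bottom: y,
            right: x + (i + 1) as f32 * w,
            top: y + h,
        })
        .collect()
}

/// Indices of the characters whose boxes overlap `shape`.
fn covered_indices(char_boxes: &[Rect], shape: &Rect) -> Vec<usize> {
    char_boxes
        .iter()
        .enumerate()
        .filter(|(_, b)| b.intersects(shape))
        .map(|(i, _)| i)
        .collect()
}
```

With ten glyphs 6 points wide starting at x = 0, a rect spanning x = 30 to x = 60 overlaps exactly indices 5 through 9. Without the per-character boxes, only the text object's overall bounds are known and the same rect would flag the whole string.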
Ok, I see what you're getting at. Am I right in thinking that the fundamental problem here is that If I'm right in thinking that's the fundamental problem, then let's pretend for a moment that that problem could be solved, such that The reason I'm asking this is because I think there probably is a way to work around the limitations of (You're right in thinking that Pdfium itself does not expose the exact functionality you ideally need here - just the ability to return all characters within a given bounding box, whether they overlap or not.) |
I mean,
That's a great motto for working with PDFs in general! 🤷 😆 |
Yeah, you're not wrong :) Given that text overlap is the primary problem, we need to remove the possibility of overlap. There are two options that come to mind:
Both have pros and cons. With option 1, you need to do some manual translation and un-translation, which is a bit cumbersome. Option 2 avoids this (the copied object will be at the same position on the new page as it was on the original page), but there are some limitations when copying objects in Pdfium (as detailed in #60). Option 1 is probably also a bit more efficient performance-wise if you're processing thousands or millions of objects. (EDIT: actually, on reflection I'm not sure about this: the I do consider this to be a bug in |
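A toy model of the translation idea (option 1), written in plain Rust rather than against pdfium-render. Everything here is hypothetical: `TextObject`, `chars_in_area`, `chars_of_object`, the fixed glyph size, and the off-page coordinate are stand-ins chosen for illustration. `chars_in_area` imitates a bounding-box text search that returns every character in an area regardless of which object owns it.

```rust
/// Axis-aligned rectangle in page coordinates (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    left: f32,
    bottom: f32,
    right: f32,
    top: f32,
}

impl Rect {
    fn intersects(&self, o: &Rect) -> bool {
        self.left < o.right && o.left < self.right && self.bottom < o.top && o.bottom < self.top
    }
}

// ASSUMPTION: fixed glyph size, only to keep the model small.
const CHAR_W: f32 = 6.0;
const CHAR_H: f32 = 10.0;
// ASSUMPTION: nothing else on the page lives anywhere near this x position.
const OFF_PAGE_X: f32 = 1.0e6;

/// Toy text object: characters laid out left to right from (x, y).
struct TextObject {
    text: String,
    x: f32,
    y: f32,
}

impl TextObject {
    fn bounds(&self) -> Rect {
        let n = self.text.chars().count() as f32;
        Rect { left: self.x, bottom: self.y, right: self.x + n * CHAR_W, top: self.y + CHAR_H }
    }

    fn char_boxes(&self) -> Vec<(char, Rect)> {
        self.text
            .chars()
            .enumerate()
            .map(|(i, c)| {
                let left = self.x + i as f32 * CHAR_W;
                (c, Rect { left, bottom: self.y, right: left + CHAR_W, top: self.y + CHAR_H })
            })
            .collect()
    }
}

/// Stand-in for a bounding-box search: every character on the page whose
/// box intersects `area`, no matter which text object it belongs to.
fn chars_in_area(page: &[TextObject], area: &Rect) -> String {
    page.iter()
        .flat_map(|o| o.char_boxes())
        .filter(|(_, b)| b.intersects(area))
        .map(|(c, _)| c)
        .collect()
}

/// Translation trick: move the object off-page, search its new bounds,
/// then restore the saved position (no float round-trip).
fn chars_of_object(page: &mut [TextObject], index: usize) -> String {
    let original_x = page[index].x;
    page[index].x = OFF_PAGE_X;
    let area = page[index].bounds();
    let found = chars_in_area(page, &area);
    page[index].x = original_x;
    found
}
```

With "AAAAAA" and "BBBBBB" both placed at the origin, searching the first object's bounds in place returns all twelve characters, while `chars_of_object` returns only that object's six. The cost is two mutations per object, which is why performance comes up in the discussion.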
Option 1 there does work, with some caveats:
Thank you very much for the creative pointer re: the translation trick. (I rarely think of using a mutable document model, so I'm slightly embarrassed that I didn't think of it!) I'll continue to tinker with other "creative" options to avoid the performance problems, perhaps translating every text object into deterministic off-page space. Seeing the translation trick basically working, I'm left really confused as to how the link between text objects and (EDIT: I see now that the text objects are the basal representation in pdfium, and that char-level data is a second-order artifact via |
A final (I think?) update from me: Performance has now exceeded my expectations, given:
I can't imagine that this kind of implementation would be a good addition to the library, or I'd suggest a PR; it works for my purposes, but I wouldn't think it a reasonable approach in general. |
It might make for an interesting example, if you felt like sharing... up to you. I will take a more general approach in |
Initially I thought a simple check comparing the length of the text returned for the text object's bounding box against the text returned by calling

use pdfium_render::prelude::*;

fn main() -> Result<(), PdfiumError> {
    let pdfium = Pdfium::new(Pdfium::bind_to_library(
        Pdfium::pdfium_platform_library_name_at_path("../pdfium/"),
    )?);

    // Create a new document with two overlapping text objects.
    let mut document = pdfium.create_new_pdf()?;

    let mut page = document
        .pages_mut()
        .create_page_at_start(PdfPagePaperSize::a4())?;

    let font = document.fonts_mut().times_roman();

    let txt1 = page.objects_mut().create_text_object(
        PdfPoints::ZERO,
        PdfPoints::ZERO,
        "AAAAAA",
        font,
        PdfPoints::new(10.0),
    )?;

    let txt2 = page.objects_mut().create_text_object(
        PdfPoints::ZERO,
        PdfPoints::ZERO,
        "BBBBBB",
        font,
        PdfPoints::new(10.0),
    )?;

    let page_text = page.text()?;

    println!("{}", page_text.all());

    if let Some(txt1) = txt1.as_text_object() {
        println!("{}", txt1.text());
        println!("{}", page_text.for_object(txt1));

        for (index, char) in txt1.chars(&page_text)?.iter().enumerate() {
            println!(
                "{}: {:?} ==? {:?}",
                index,
                txt1.text().chars().nth(index),
                char.unicode_string()
            );
        }
    }

    if let Some(txt2) = txt2.as_text_object() {
        println!("{}", txt2.text());
        println!("{}", page_text.for_object(txt2));

        for (index, char) in txt2.chars(&page_text)?.iter().enumerate() {
            println!(
                "{}: {:?} ==? {:?}",
                index,
                txt2.text().chars().nth(index),
                char.unicode_string()
            );
        }
    }

    Ok(())
}

A general solution to this probably requires always creating a temporary page containing nothing but the text object for which characters are being retrieved. Terrible for performance, obviously. |
Adjusted |
I'd like to use pdfium-render to access all "primitive" elements (characters, paths, images) in the order that they are rendered, so that I can determine visibility for each such element (accounting for occlusion of primitives rendered earlier due to simple obstruction, clipping paths, etc).
I figured that I would be able to do this by iterating through PdfPage.objects(), and within that, iterating through each PdfPageTextObject.chars(). However, the latter doesn't retrieve individual chars specifically associated with a given text object; rather, it grounds out in a bounding-box search: pdfium-render/src/page_text.rs, lines 95 to 101 in c0038a6.
Of course, this doesn't reflect original rendering order at all, and ironically will result in the same character being visited multiple times, in the case of overlapping text objects.
Is there a way to access primitives, down to the character level, in rendered order (or with a render-order property if direct iteration isn't possible)?
(Thanks so much for this library, the work is greatly appreciated. 🙇)