pdfrenderer.cpp: Ignore non-text blocks #3959
Can you fix sw?
I'll check sw issues.
fb90979 to f4c1946
f4c1946 to c196456
Here is the pdf file that Tesseract produces after applying the patch from this PR.
Here is the pdf file that Tesseract produces before applying the patch from this PR.
Doing 'select all', 'copy' and then 'paste' to a text file, and using Still, there is a difference internally between the files. The 'after' pdf is slightly smaller (in bytes) than the 'before' pdf file.
Waiting for feedback from @bleze.
https://www.pdf-online.com/osa/validate.aspx

File 3957-0.pdf
Validating file "3957-0.pdf" for conformance level pdf1.5
The document does conform to the PDF 1.5 standard.

File 3957.pdf
Validating file "3957.pdf" for conformance level pdf1.5
The document does conform to the PDF 1.5 standard.
@stweil, can I merge this PR? It removes unneeded content from the pdf output for documents with non-text blocks and makes the document slightly smaller in bytes.
Looks good, thank you!
I already took the code from the patch and applied it to my copy. I can confirm that the spaces are no longer included in the output. Thank you for fixing this!
Fix #3957.
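
For readers who want the idea in code: the behavior discussed above (skipping non-text blocks so that they no longer contribute stray spaces to the PDF text layer) can be sketched roughly as follows. This is only an illustration, not the actual patch. `BlockKind`, `Block`, `IsTextBlock` and `CollectPdfText` are hypothetical stand-ins and are not part of Tesseract's API; the real renderer works on Tesseract's own result iterator.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a layout block type; not Tesseract's real enum.
enum class BlockKind { Text, Image, Ruling };

// Hypothetical stand-in for one recognized block on a page.
struct Block {
  BlockKind kind;
  std::string text;  // whatever the recognizer produced for this block
};

// Returns true for blocks whose text should be written to the PDF text layer.
bool IsTextBlock(const Block &block) {
  return block.kind == BlockKind::Text;
}

// Builds the text layer from text blocks only, one line per block.
std::string CollectPdfText(const std::vector<Block> &blocks) {
  std::string out;
  for (const Block &block : blocks) {
    if (!IsTextBlock(block)) {
      continue;  // ignore non-text blocks, so they add nothing to the output
    }
    out += block.text;
    out += '\n';
  }
  return out;
}
```

In this sketch an image block carrying a single space is dropped entirely, which mirrors the report above that the spaces are no longer included in the output.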