Use lxml.etree, iterate ocr_line > ocr_word #57
Using lxml.etree, going through the lines looks good.
For going through the words, I have two comments/questions.
except:
    continue
for word in line.xpath('.//*[@class="ocrx_word"]'):
    rawtext = word.xpath('./text()')[0]
This only looks at the first direct text node and ignores more deeply nested structure, e.g.
<span class='ocrx_word' id='word_1_1' title='bbox 118 3884 122 5088; x_wconf 95' lang='eng' dir='ltr'><strong><em> </em></strong></span>
cf. #33 (comment)
I guess that we should go through all text nodes .//text() and concatenate the output. What do you think?
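A minimal sketch of that suggestion, not the code from this PR. It assumes that `itertext()` is an acceptable stand-in for the `.//text()` XPath (both visit the text nodes below the word element), and it falls back to the standard library's `xml.etree.ElementTree` only so the snippet runs without lxml installed. The sample markup and the helper name `word_text` are made up for illustration.

```python
try:
    from lxml import etree  # the library the PR switches to
except ImportError:  # fallback only so this sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical word element with nested markup, modelled on the hOCR span above.
HOCR_WORD = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 118 3884 122 5088; x_wconf 95' lang='eng' dir='ltr'>"
    "<strong><em>foo</em></strong>bar</span>"
)


def word_text(word):
    """Concatenate every text node below `word`, not only the first direct one."""
    return "".join(word.itertext())


word = etree.fromstring(HOCR_WORD)
print(repr(word_text(word)))  # 'foobar'
```

On the span quoted above, whose only child is `<strong>`, there is no direct text node at all, so indexing the result of `./text()` with `[0]` would fail, whereas the concatenation returns the single space.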
Probably should parse as (X)HTML and use text_content.
    continue
for word in line.xpath('.//*[@class="ocrx_word"]'):
    rawtext = word.xpath('./text()')[0]
    # sys.stderr.write("WORD: '%s', type '%s'\n" % (rawtext, type(rawtext)))
I would also suggest checking rawtext afterwards: if it is None or contains only whitespace, we should continue. Or is there any reason to draw a word box containing only spaces in the PDF?
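The check could look roughly like the sketch below. It is an illustration under assumptions, not the commit that ended up in the PR: the helper name `drawable_words` and the sample line are hypothetical, and the standard-library fallback exists only so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # fallback only so this sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical hOCR line: two real words, one whitespace-only word, one empty word.
LINE = (
    "<span class='ocr_line'>"
    "<span class='ocrx_word'>Hello</span> "
    "<span class='ocrx_word'><strong><em> </em></strong></span> "
    "<span class='ocrx_word'/> "
    "<span class='ocrx_word'>world</span>"
    "</span>"
)


def drawable_words(line):
    """Yield (element, text) for words that contain more than whitespace."""
    for word in line.findall(".//*[@class='ocrx_word']"):
        # Joining an empty iterator gives "", so a word without any text
        # node is handled by the same test as a whitespace-only word.
        rawtext = "".join(word.itertext())
        if not rawtext.strip():
            continue
        yield word, rawtext


line = etree.fromstring(LINE)
print([text for _, text in drawable_words(line)])  # ['Hello', 'world']
```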
https://github.com/tmbdev/hocr-tools/pull/57/commits/fb994c30bce0df838506bf1d85c9f7dbf66e3928 should be a sensible solution.
Looks good to me now!
Thank you very much!
[Moved from https://github.com/UB-Mannheim/pull/19 to here.]
@kba writes:
This should not change existing behavior, but additionally allow processing non-XHTML-namespaced, non-span ocr_line elements.
In the long run, integrating the code from jbarlow83/OCRmyPDF would be useful. Making the assumptions hocr-pdf makes about the hOCR explicit would help too.
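One way to read the first point, sketched under assumptions: selecting lines by their class attribute with a `*` wildcard finds an `ocr_line` whatever its tag name or namespace. The two sample documents and the helper name `find_lines` are hypothetical, and the standard-library fallback is only there so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # fallback only so this sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical inputs: an XHTML-namespaced <span> line and a plain <div> line.
NAMESPACED = (
    "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
    "<span class='ocr_line' id='line_1'/>"
    "</body></html>"
)
PLAIN_DIV = "<html><body><div class='ocr_line' id='line_2'/></body></html>"


def find_lines(root):
    """Return all elements whose class attribute is exactly 'ocr_line'."""
    # '*' matches any tag, namespaced or not, so neither the element name
    # nor the XHTML namespace has to be known in advance.
    return root.findall(".//*[@class='ocr_line']")


ids = [
    line.get("id")
    for doc in (NAMESPACED, PLAIN_DIV)
    for line in find_lines(etree.fromstring(doc))
]
print(ids)  # ['line_1', 'line_2']
```

Note that this compares the whole attribute value, so an element with `class='ocr_line foo'` would not be matched; whether hocr-pdf should accept such input is one of the assumptions worth making explicit.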