Use lxml.etree, iterate ocr_line > ocr_word #57

Merged (2 commits) on Sep 17, 2016


Collaborator

@zuphilip zuphilip commented Sep 14, 2016

[Moved here from https://github.com/UB-Mannheim/pull/19.]

@kba writes:

This should not change existing behavior, but additionally allow processing non-XHTML-namespaced non-span ocr_line.

In the long run, integrating the code from jbarlow83/OCRmyPDF would be useful. Making the assumptions that hocr-pdf makes about its hOCR input explicit would help too.

Collaborator Author

@zuphilip zuphilip left a comment

Using lxml.etree, going through lines looks good.

For going through the words I have two comments/questions.

except:
    continue
for word in line.xpath('.//*[@class="ocrx_word"]'):
    rawtext = word.xpath('./text()')[0]
Collaborator Author

This only looks at the first direct text node and ignores nested structure such as:

<span class='ocrx_word' id='word_1_1' title='bbox 118 3884 122 5088; x_wconf 95' lang='eng' dir='ltr'><strong><em> </em></strong></span>

cf. #33 (comment)

I guess that we should go through all text nodes .//text() and concatenate the output. What do you think?
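
To illustrate the difference, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree instead of lxml.etree so that it runs without extra dependencies (lxml elements offer the same .text attribute and itertext() method). The sample markup and the helper names direct_text and all_text are made up for this example and are not part of hocr-pdf.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment: one plain word and one word wrapped in inline markup.
SNIPPET = (
    "<span class='ocr_line'>"
    "<span class='ocrx_word' title='bbox 0 0 10 10'>plain</span>"
    "<span class='ocrx_word' title='bbox 12 0 30 10'>"
    "<strong><em>nested</em></strong></span>"
    "</span>"
)


def direct_text(word):
    """Text directly inside the element, before its first child (may be None).

    This is close to what './text()' followed by [0] looks at.
    """
    return word.text


def all_text(word):
    """Concatenation of every text node below the element, like './/text()'."""
    return "".join(word.itertext())


line = ET.fromstring(SNIPPET)
words = [el for el in line.iter() if el.get("class") == "ocrx_word"]

direct = [direct_text(w) for w in words]  # the nested word yields None
joined = [all_text(w) for w in words]     # the nested word yields its text
```

With the direct lookup the second word has no text at all, whereas joining all descendant text nodes recovers it.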

Contributor

Probably should parse as (X)HTML and use text_content.

    continue
for word in line.xpath('.//*[@class="ocrx_word"]'):
    rawtext = word.xpath('./text()')[0]
    # sys.stderr.write("WORD: '%s', type '%s'\n" % (rawtext, type(rawtext)))
Collaborator Author

I would also suggest checking rawtext afterwards: if it is None or contains only whitespace, we should continue. Or is there any reason to draw a word box containing only spaces in the PDF?
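
A rough sketch of such a check follows. It is an illustration only, not the code from this pull request: it assumes xml.etree.ElementTree from the standard library (lxml.etree elements support the same calls used here), and the names drawable_words and BBOX_RE are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches the four integer coordinates in an hOCR title such as
# "bbox 118 3884 122 5088; x_wconf 95".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def drawable_words(line):
    """Yield (text, bbox) for each ocrx_word that has visible text and a bbox."""
    for word in line.iter():
        if word.get("class") != "ocrx_word":
            continue
        # Join all nested text nodes, then drop surrounding whitespace.
        rawtext = "".join(word.itertext()).strip()
        if not rawtext:
            continue  # nothing visible to draw for this word
        match = BBOX_RE.search(word.get("title", ""))
        if match is None:
            continue  # no usable bounding box
        yield rawtext, tuple(int(v) for v in match.groups())


# Hypothetical line: the first word contains only a space inside nested markup.
LINE = (
    "<span class='ocr_line' title='bbox 100 3800 900 5100'>"
    "<span class='ocrx_word' title='bbox 118 3884 122 5088; x_wconf 95'>"
    "<strong><em> </em></strong></span>"
    "<span class='ocrx_word' title='bbox 130 3884 400 5088; x_wconf 90'>"
    "<strong>Hello</strong></span>"
    "</span>"
)

words = list(drawable_words(ET.fromstring(LINE)))
```

Only the second word survives the filter, so no box would be drawn for the whitespace-only word.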

Contributor

https://github.com/tmbdev/hocr-tools/pull/57/commits/fb994c30bce0df838506bf1d85c9f7dbf66e3928 should be a sensible solution.

Collaborator Author

@zuphilip zuphilip left a comment

Looks good to me now!

@zuphilip zuphilip merged commit b482964 into ocropus:master Sep 17, 2016
@zuphilip
Collaborator Author

Thank you very much!

@zuphilip zuphilip mentioned this pull request Sep 17, 2016