Use lxml.etree, iterate ocr_line > ocr_word #57

zuphilip · 2016-09-14T15:53:32Z

[Moved from https://github.com/UB-Mannheim/pull/19 here.]

@kba writes:

This should not change existing behavior, but additionally allow processing non-XHTML-namespaced non-span ocr_line.

In the long run, integrating the code from jbarlow83/OCRmyPDF would be useful. Making the assumptions hocr-pdf makes on the hocr explicit would help too.

zuphilip

Using lxml.etree, going through lines looks good.

For going through the words I have two comments/questions.

zuphilip · 2016-09-15T05:43:59Z

hocr-pdf

-        except:
-          continue
+    for word in line.xpath('.//*[@class="ocrx_word"]'):
+      rawtext = word.xpath('./text()')[0]


This only looks at the first direct text node and ignores any more deeply nested structure, e.g.

 

cf. #33 (comment)

I guess that we should go through all text nodes .//text() and concatenate the output. What do you think?

Probably should parse as (X)HTML and use text_content.
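
To illustrate the nested-text concern above, here is a minimal sketch, not the code from this PR. The sample markup and the helper name word_text are made up for illustration, and itertext() stands in for joining the result of .//text(). The fallback import is only there so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library this PR switches to
except ImportError:
    # Assumption: the stdlib parser is close enough for this small sketch.
    import xml.etree.ElementTree as etree

# Hypothetical hOCR fragment: the first word wraps part of its text in <strong>.
HOCR_LINE = (
    '<span class="ocr_line" title="bbox 0 0 100 20">'
    '<span class="ocrx_word" title="bbox 0 0 40 20"><strong>Hel</strong>lo</span>'
    '<span class="ocrx_word" title="bbox 50 0 100 20">world</span>'
    '</span>'
)


def word_text(word):
    """Concatenate every text node below the word element, nested ones included."""
    return "".join(word.itertext())


line = etree.fromstring(HOCR_LINE)
words = line.findall('.//*[@class="ocrx_word"]')
texts = [word_text(w) for w in words]
print(texts)
```

Taking only the first direct text node of the first word would yield just the trailing "lo" and lose the part inside the nested element; concatenating all descendant text nodes keeps the whole word.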

zuphilip · 2016-09-15T05:47:56Z

hocr-pdf

-          continue
+    for word in line.xpath('.//*[@class="ocrx_word"]'):
+      rawtext = word.xpath('./text()')[0]
+      #  sys.stderr.write("WORD: '%s', type '%s'\n" % (rawtext, type(rawtext)))


I would also suggest checking rawtext afterwards: if it is None or empty apart from some spaces, we should continue. Or is there any reason to draw a word box containing only spaces in the PDF?

https://github.com/tmbdev/hocr-tools/pull/57/commits/fb994c30bce0df838506bf1d85c9f7dbf66e3928 should be a sensible solution.
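
The check suggested above could look roughly like the following. This is a hedged sketch, not the code from the linked commit; the helper name has_visible_text and the sample values are hypothetical.

```python
def has_visible_text(rawtext):
    """Return True only if rawtext is not None and not whitespace-only."""
    return rawtext is not None and rawtext.strip() != ""


# Hypothetical word texts as they might come out of an hOCR file.
candidates = ["Hello", "   ", None, "", "world"]

# In a word loop this would be: if not has_visible_text(rawtext): continue
kept = [text for text in candidates if has_visible_text(text)]
print(kept)
```

Words that fail the check would simply be skipped, so no box is drawn for them in the PDF.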

zuphilip

Looks good to me now!

zuphilip · 2016-09-17T12:22:36Z

Thank you very much!

Use lxml.etree, iterate ocr_line > ocr_word

64f3399

hocr-pdf: Parse as XHTML, recursive text, content, skip space-only words

fb994c3

zuphilip merged commit b482964 into ocropus:master Sep 17, 2016

zuphilip mentioned this pull request Sep 17, 2016

Release v1.0.1 #65

Closed
