Method WordFontAttributes does not work #1074

zikcheng · 2017-08-11T07:38:14Z

Environment

Tesseract Version: tesseract 4.00.00alpha
Commit Number: 8e55e52
Platform: Ubuntu 16.04.1

Current Behavior:

Method WordFontAttributes returns null when tesseract 4.00.00alpha is used with 4.00 tessdata, but it returns the font name when tesseract 4.00.00alpha is used with 3.04.00 tessdata. The test image is eurotext.tif.
I first ran into this problem while using tesserocr (sirfz/tesserocr#68).

Expected Behavior:

With method WordFontAttributes we can get correct font attributes of recognized words.

amitdo · 2017-08-11T08:40:25Z

The new LSTM engine does not support this feature and probably won't support it any time soon.

phildrip · 2017-08-31T16:32:24Z

Is there an alternative way to get font sizing etc? Do you mean that just this method won't be supported, or the feature in general?

amitdo · 2017-08-31T17:53:42Z

Is there an alternative way to get font sizing etc?

You can still use --oem 0 with traineddata from here: https://github.com/tesseract-ocr/tessdata.
Note that the traineddata in the 'best' folder won't work with --oem 0.

amitdo · 2017-08-31T18:30:55Z

Do you mean that just this method won't be supported, or the feature in general?

I have reasons to believe that the new LSTM engine is unlikely to have a feature that includes font identification (name and properties like is_bold) in the near future.

Important note: I'm a contributor from the community, and the main developer does not always share all his plans for upcoming release(s) with the community.

phildrip · 2017-09-01T09:28:41Z

Thanks for the reply! It looks like the old ocr engine is going to be removed, though (issue #707)... And does using OcrEngineMode 0 mean the behaviour is the same as v3?

What I'm getting to is:

I need to be able to extract font size information (font names aren't so useful) - is there any way at all of doing so with LSTM/v4?
If I use OcrEngineMode 0 to be able to get this info, will that be removed from v4 at a later date?
Is there any advantage to using v4 with OcrEngineMode 0 vs v3.05?

Thanks again for the help!

amitdo · 2017-09-01T11:29:19Z

It looks like the old ocr engine is going to be removed, though (issue #707)...

It's not known when exactly it will be removed. Until then you can still use it.

And does using OcrEngineMode 0 mean the behaviour is the same as v3?

It's basically the same as 3.05.01.

I need to be able to extract font size information (font names aren't so useful) - is there any way at all of doing so with LSTM/v4?

There is no method in the API to get font sizes for the lstm engine.

If I use OcrEngineMode 0 to be able to get this info, will that be removed from v4 at a later date?

Probably yes.

Is there any advantage to using v4 with OcrEngineMode 0 vs v3.05?

The accuracy should be the same.

amitdo · 2017-09-01T11:59:37Z

The relative font size for a textline can be estimated by calculating the xheight of the line and comparing it
to the median xheight of the other textlines in the page.
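
A minimal sketch of that estimate, assuming the x-height of each textline (in pixels) has already been collected into a vector. The function names are hypothetical and not part of the Tesseract API, and for simplicity the median is taken over all lines on the page, including the line being compared.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of a list of x-heights (pixels). Returns 0 for an empty list.
float MedianXHeight(std::vector<float> xheights) {
  if (xheights.empty()) return 0.0f;
  std::sort(xheights.begin(), xheights.end());
  const std::size_t n = xheights.size();
  return n % 2 == 1 ? xheights[n / 2]
                    : (xheights[n / 2 - 1] + xheights[n / 2]) / 2.0f;
}

// Ratio of each line's x-height to the page median.
// A ratio near 1 suggests body text; a clearly larger ratio suggests a heading.
// Returns an empty vector if the median is not positive.
std::vector<float> RelativeLineSizes(const std::vector<float>& xheights) {
  std::vector<float> ratios;
  const float median = MedianXHeight(xheights);
  if (median <= 0.0f) return ratios;
  ratios.reserve(xheights.size());
  for (float h : xheights) ratios.push_back(h / median);
  return ratios;
}
```

This only gives a relative size. Converting to points additionally needs the image resolution.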

phildrip · 2017-09-01T12:58:39Z

Ok, thanks for the info 👍

amitdo · 2017-09-01T15:08:31Z

@phildrip,

I looked at the relevant code again, and I think the font size functionality (but not font name and properties like is_bold) can be restored when using the lstm engine.

I will provide further details (and probably send a PR) in the upcoming days.

phildrip · 2017-09-01T15:20:47Z

That's great news, thanks!

amitdo · 2017-09-02T18:49:51Z

// Returns the font attributes of the current word. If iterating at a higher
// level object than words, eg textlines, then this will return the
// attributes of the first word in that textline.
// The actual return value is a string representing a font name. It points
// to an internal table and SHOULD NOT BE DELETED. Lifespan is the same as
// the iterator itself, ie rendered invalid by various members of
// TessBaseAPI, including Init, SetImage, End or deleting the TessBaseAPI.
// Pointsize is returned in printers points (1/72 inch.)
const char* LTRResultIterator::WordFontAttributes(bool* is_bold,
                                                  bool* is_italic,
                                                  bool* is_underlined,
                                                  bool* is_monospace,
                                                  bool* is_serif,
                                                  bool* is_smallcaps,
                                                  int* pointsize,
                                                  int* font_id) const {
  if (it_->word() == NULL) return NULL;  // Already at the end!
  if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
  }
  const FontInfo& font_info = *it_->word()->fontinfo;
  *font_id = font_info.universal_id;
  *is_bold = font_info.is_bold();
  *is_italic = font_info.is_italic();
  *is_underlined = false;  // TODO(rays) fix this!
  *is_monospace = font_info.is_fixed_pitch();
  *is_serif = font_info.is_serif();
  *is_smallcaps = it_->word()->small_caps;
  float row_height = it_->row()->row->x_height() +
      it_->row()->row->ascenders() - it_->row()->row->descenders();
  // Convert from pixels to printers points.
  *pointsize = scaled_yres_ > 0
      ? static_cast<int>(row_height * kPointsPerInch / scaled_yres_ + 0.5)
      : 0;

  return font_info.name;
}

The problem:

if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
}

With the LSTM engine, it_->word()->fontinfo is always NULL, so the function returns early
and pointsize is never calculated.

pointsize is calculated based on row (=line) height. pointsize is the font size in points of the line, so it should not be in WordFontAttributes().

There is another function where you can get row height.

void LTRResultIterator::RowAttributes(float* row_height, float* descenders,
                                      float* ascenders) const {
  *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() -
                it_->row()->row->descenders();
  *descenders = it_->row()->row->descenders();
  *ascenders = it_->row()->row->ascenders();
}

I think pointsize calculation should be moved into this function.
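
To illustrate, here is a standalone sketch of the point size conversion that depends only on row metrics and resolution, not on fontinfo. RowMetrics and PointSizeFromRow are hypothetical names used for illustration, not Tesseract types, and the sketch assumes descenders is stored as a negative offset below the baseline.

```cpp
// Hypothetical stand-in for the row metrics used above (all in pixels).
struct RowMetrics {
  float x_height;
  float ascenders;
  float descenders;  // assumed negative: offset below the baseline
};

constexpr int kPointsPerInch = 72;

// Converts a row height in pixels to printer's points (1/72 inch),
// rounded to the nearest integer. Returns 0 if the resolution is unknown.
int PointSizeFromRow(const RowMetrics& row, int yres) {
  if (yres <= 0) return 0;
  const float row_height = row.x_height + row.ascenders - row.descenders;
  return static_cast<int>(row_height * kPointsPerInch / yres + 0.5f);
}
```

Because nothing here touches fontinfo, the same calculation could run before the NULL check or in a row-level function.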

amitdo · 2017-09-03T07:58:42Z

@zdenop, @stweil
Do you have any comment?

zdenop · 2017-09-05T18:39:22Z

At the moment I have limited internet access. If you make a pull request I can merge it ;-)

stweil · 2017-09-06T15:24:10Z

Although my current main focus is getting the text from images, there are also important use cases where text attributes are important as well. As I understand your comments, currently the new LSTM recognizer does not support the method WordFontAttributes, so it is not possible to get text attributes with that recognizer. Adding support for the font size recognition with LSTM seems to be feasible, but other text attributes like for example bold or italic are desirable, too.

theraysmith · 2017-09-07T14:20:25Z

It would be feasible to add bold and italic attributes by making them a separate output from the model.
Underline would also be possible.
All these attributes would require changes to the rendering pipeline, and datapath for the ground truth.
Fixed-pitch (monospace), serif and smallcaps would be much more difficult, due to lack of reliable data available for the fonts. It could be possible to re-use the existing fontinfo table for that.
I wouldn't rule it out as impossible, but I will add this request to my list of stoppers for obsoleting the old engine.
I have a bunch of updates to push, which I didn't quite get to before my office move...

stweil · 2017-09-07T14:30:42Z

Thank you for this clarification, Ray.

amitdo · 2017-09-07T14:41:39Z

Thank you for this clarification, Ray.

+1

Ray,
In the meantime, can I fix the font size issue?
#1074 (comment)

theraysmith · 2017-09-07T14:55:19Z

Yes of course. Just re-order the code in WordFontAttributes.

amitdo · 2017-09-07T15:16:39Z

Yes of course. Just re-order the code in WordFontAttributes.

That was my first thought, but it seems to give you font size at the line level, while the name of the method implies otherwise (WordFontAttributes), so I suggested moving pointsize to the RowAttributes() method.

Shreeshrii · 2017-09-07T15:38:07Z

It would be feasible to add bold and italic attributes by making them a separate output from the model. Underline would also be possible.

You could also take bold/italic into account when people use multiple languages for recognition, because words in the additional language are often emphasized with bold or italics.

For an example, see the image in tesseract-ocr/langdata#4 (comment) where Roman transliteration of Hindi is italicized with English text.

Shreeshrii · 2017-09-07T15:45:55Z

it seems to give you font size in the line level

While that would work in most cases, what about the extreme case of text of different sizes on the same line, e.g. http://www.teach-ict.com/programming/html/intro/step17a.jpg

theraysmith · 2017-09-07T16:03:53Z

That has always been a problem.
The old code would often output garbage.
The LSTM engine will split the line at such words and recognize them separately, pasting the results back together. It doesn't give an estimate of the x-height, though. The overall accuracy on such images is better.

Shreeshrii · 2017-09-11T14:53:34Z

@theraysmith Please see related issue #538 regarding recognition problems when an image has many different font sizes in it.

vtigranv · 2017-10-10T08:21:21Z

+1

troplin · 2018-06-28T07:40:43Z

IMO the current state of this method is not very satisfying.
In version 3, it was clear that no information was available if the method returned NULL.

Now in version 4 with LSTM, the method returns NULL, but the font size is still computed. The rest of the properties currently seem to be set to true unconditionally.
It's not possible to find out whether those are actually correct or just garbage.

At the very least, the method should not change the values if the information is not available.

amitdo · 2018-06-28T08:24:10Z

It's not possible to find out, if those are actually correct or just garbage.

What's the value of font_id?

troplin · 2018-06-28T09:09:24Z

font_id is -1.
I realize that I can probably just assume that the font size is always correct and the rest only if the method returns something != NULL or if font_id != -1.

But that's just implicit knowledge and not at all clear from the signature.
And going forward, if e.g. the bold property is correctly recognized too in a future version, there's no way to recognize that.
I'd very much prefer an API where it is inherently clear which properties are meaningful and which aren't, without relying on implicit knowledge.
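
One possible shape for such an API, sketched with std::optional (C++17). WordFontInfo and MakeWordFontInfo are hypothetical names for illustration only and do not exist in Tesseract. The sketch assumes, as discussed above, that a NULL name or a font_id of -1 means no font information was available, and that a positive pointsize is valid either way.

```cpp
#include <optional>
#include <string>

// Hypothetical result type, not part of the Tesseract API.
// An empty optional means "the engine did not provide this value".
struct WordFontInfo {
  std::optional<std::string> font_name;
  std::optional<bool> is_bold;
  std::optional<bool> is_italic;
  std::optional<int> pointsize;
};

// Hypothetical adapter around the raw outputs of a
// WordFontAttributes-style call.
WordFontInfo MakeWordFontInfo(const char* name, int font_id, bool is_bold,
                              bool is_italic, int pointsize) {
  WordFontInfo info;
  // Assumption: pointsize is 0 when it could not be computed.
  if (pointsize > 0) info.pointsize = pointsize;
  // Only trust the style flags when font information was actually found.
  if (name != nullptr && font_id != -1) {
    info.font_name = std::string(name);
    info.is_bold = is_bold;
    info.is_italic = is_italic;
  }
  return info;
}
```

With this shape, a caller checks info.is_bold.has_value() instead of relying on implicit knowledge about font_id, and a future engine that recognizes bold could fill in that field alone.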

amitdo · 2018-06-28T10:13:55Z

See also #1074 (comment)

hoangaeye · 2020-01-20T21:20:46Z

Do we have a solution for this?

amitdo · 2020-01-21T11:03:59Z

As you can see the issue is still open.

It's unknown when font name, bold and italic identification will be supported for the LSTM engine.

hoangaeye · 2020-01-21T17:37:43Z

is there another method or package that can determine font size?

amitdo · 2020-01-21T18:05:22Z

font size is supported:

#1173

amitdo · 2020-01-21T18:17:09Z

https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/ltrresultiterator.cpp#L164

pdiwadkar · 2021-01-08T08:32:44Z

Is this issue still open?

amitdo · 2021-01-08T13:19:42Z

Is this issue still open?

#1074 (comment)

shubham1206agra · 2021-05-20T05:34:18Z

Could you provide a partial solution to this, such as access to the font size only? I think that is supported.
Please

amitdo · 2021-05-20T07:20:49Z

#1074 (comment)

coco2121 · 2021-06-07T17:21:19Z

Hello!
Is this issue still open?
I need to get some font properties from a scanned PDF, such as whether text is bold or underlined.
WordFontAttributes is returning None. Any suggestions on what I can use to get these properties?

Thanks!

kalai2033 · 2021-11-10T11:11:25Z

@coco2121 Hi, did you manage to find a solution? I am trying to solve exactly the same problem.

amitdo · 2021-11-10T12:00:35Z

The LSTM engine does not support font attributes other than point size, and as I said 4 years ago, it won't support these attributes any time soon (It is not planned).

However, the legacy engine is still available in versions 4.x and 5.x and it supports these attributes. You need a model that includes data for the legacy engine and you need to use --oem 0 (It might also work with --oem 3, not sure).

amitdo · 2021-11-10T12:18:57Z

If you still have a question about this topic after reading my previous comment, please use our forum.

I locked this issue because people keep asking the same questions here, and I have already answered them multiple times.

stweil mentioned this issue Sep 7, 2017

RFC: Remove the legacy OCR Engine #707

Closed

amitdo mentioned this issue Sep 13, 2017

Prevent uninitialized integer from being used for x_fsize output #1125

Closed

amitdo mentioned this issue Oct 20, 2017

Make font size estimation work with the lstm engine #1173

Merged

zdenop pushed a commit that referenced this issue Oct 20, 2017

Make font size estimation work with the lstm engine (#1173)

ad5ee18

**Partial** fix for issue #1074

stweil mentioned this issue Feb 19, 2018

Support different help texts for normal and advanced users and restore legacy mode #1325

Merged

amitdo mentioned this issue Feb 23, 2018

HOCR: x_font missing, x_fsize broken in 4.00-alpha #684

Closed

amitdo mentioned this issue Mar 11, 2018

Italic info in hocr output #1371

Closed

Silex mentioned this issue Jul 6, 2018

Docker Image doesn't work with Ubuntu 18.04 openalpr/openalpr#706

Open

nguyenq mentioned this issue Feb 5, 2019

Tess4j not able to extract font info from a scanned document nguyenq/tess4j#140

Closed

maximumspatium mentioned this issue May 11, 2019

Compatibility with tesseract 4 Audiveris/audiveris#273

Closed

stweil added the feature request label Jan 25, 2020

amitdo mentioned this issue May 8, 2020

Add RowAttributes getter to PageIterator #2971

Merged

bozhodimitrov mentioned this issue Oct 16, 2020

Detect font style attributes madmaze/pytesseract#305

Closed

zbw8388 mentioned this issue Jan 17, 2021

Need to extract more features Duke-Chronicle-Project/TessTools#1

Closed

zikcheng closed this as completed Oct 24, 2021

tesseract-ocr locked and limited conversation to collaborators Nov 10, 2021
