Method WordFontAttributes does not work #1074
Comments
The new LSTM engine does not support this feature and probably won't support it any time soon. |
Is there an alternative way to get font sizing etc? Do you mean that just this method won't be supported, or the feature in general? |
You can still use --oem 0 with traineddata from here: https://github.com/tesseract-ocr/tessdata. |
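As a rough illustration of the suggestion above, the sketch below builds a command line that selects the legacy engine with `--oem 0`. The helper name `legacy_tesseract_command` and its parameters are hypothetical; only the `tesseract` CLI options `--oem`, `-l` and `--tessdata-dir` are real, and the traineddata directory is assumed to contain legacy-capable models such as those from the tessdata repository.

```python
from typing import List, Optional


def legacy_tesseract_command(
    image_path: str,
    output_base: str = "stdout",
    tessdata_dir: Optional[str] = None,
    lang: str = "eng",
) -> List[str]:
    """Build an argv list that runs tesseract with the legacy engine only.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    # --oem 0 selects the legacy (non-LSTM) engine.
    cmd = ["tesseract", image_path, output_base, "--oem", "0", "-l", lang]
    if tessdata_dir is not None:
        # Point tesseract at a directory holding legacy-capable traineddata.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    print(" ".join(legacy_tesseract_command("eurotext.tif")))
```

The resulting list can be passed to `subprocess.run` on a machine where tesseract and suitable traineddata are installed.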
I have reasons to believe that the new LSTM engine is unlikely to have a feature that includes font identification (name and properties like is_bold) in the near future. Important note: I'm a contributor from the community, and the main developer does not always share all his plans for upcoming release(s) with the community. |
Thanks for the reply! It looks like the old ocr engine is going to be removed, though (issue #707)... And does using What I'm getting to is:
Thanks again for the help! |
It's not known when exactly it will be removed. Until then you can still use it.
It's basically the same as 3.05.01.
There is no method in the API to get font sizes for the lstm engine.
Probably yes.
The accuracy should be the same. |
The relative font size for a textline can be estimated by calculating the xheight of the line and comparing it |
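A minimal sketch of the x-height idea, under assumptions: the comment does not say what the x-height is compared against, so this example compares each line with the median x-height of the page. The function name and the choice of the median are illustrative only, and the x-heights are assumed to have been measured already (in pixels).

```python
from statistics import median
from typing import List


def relative_font_sizes(x_heights: List[float]) -> List[float]:
    """Return each line's x-height divided by the page's median x-height.

    Assumption: the median x-height stands in for the "normal" body text size.
    """
    if not x_heights:
        return []
    reference = median(x_heights)
    if reference <= 0:
        raise ValueError("x-heights must be positive")
    # A ratio above 1.0 suggests larger-than-typical text, below 1.0 smaller.
    return [h / reference for h in x_heights]


if __name__ == "__main__":
    print(relative_font_sizes([10, 20, 40]))
```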
Ok, thanks for the info 👍 |
I looked at the relevant code again, and I think the font size functionality (but not font name and properties like is_bold) can be restored when using the lstm engine. I will provide further details (and probably send a PR) in the upcoming days. |
That's great news, thanks! |
The problem:
With the LSTM engine the pointsize is calculated based on row (=line) height. pointsize is the font size in points of the line, so it should not be in WordFontAttributes(). There is another function where you can get row height.
I think pointsize calculation should be moved into this function. |
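To make the row-height approach concrete, here is a hedged sketch of converting a row height in pixels to a point size. It assumes the standard 72 points per inch and a known vertical resolution; the function name, the rounding, and the zero fallback for an unknown resolution are illustrative assumptions, not a copy of the Tesseract implementation.

```python
POINTS_PER_INCH = 72  # standard printer's points per inch


def row_height_to_pointsize(row_height_px: float, dpi: int) -> int:
    """Convert a text row height in pixels to an approximate point size.

    Returns 0 when the resolution is unknown (dpi <= 0), an assumed convention.
    """
    if dpi <= 0:
        return 0
    # pixels / (pixels per inch) gives inches; times 72 gives points.
    return int(row_height_px * POINTS_PER_INCH / dpi + 0.5)


if __name__ == "__main__":
    print(row_height_to_pointsize(50, 300))
```

Because the input is a row height, every word on the line gets the same value, which is why the comment argues the result belongs with row-level rather than word-level attributes.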
At the moment I have limited internet access. If you make a pull request I can merge it ;-) |
Although my current main focus is getting the text from images, there are also important use cases where text attributes are important as well. As I understand your comments, currently the new LSTM recognizer does not support the method |
It would be feasible to add bold and italic attributes by making them a separate output from the model. |
Thank you for this clarification, Ray. |
+1 Ray, |
Yes of course. Just re-order the code in WordFontAttributes. |
That was my first thought, but it seems to give you the font size at the line level, while the name of the method implies otherwise (WordFontAttributes), so I suggested moving pointsize to the RowAttributes() method. |
You could also take bold/italic into account when people use multiple languages for recognition, because many times the words in the additional language may be emphasized with bold or italics. For an example, see the image in tesseract-ocr/langdata#4 (comment) where Roman transliteration of Hindi is italicized with English text. |
While that would work in most cases, what about the extreme case of text of different sizes being on the same line, e.g. http://www.teach-ict.com/programming/html/intro/step17a.jpg |
That has always been a problem. |
@theraysmith Please see related issue #538 regarding recognition problems when an image has many different font sizes in it. |
+1 |
IMO the current state of this method is not very satisfying. Now in version 4 with LSTM, the method returns At least the method should not change the values, if the information is not available. |
What's the value of |
But that's just implicit knowledge and not at all clear from the signature. |
See also #1074 (comment) |
Do we have a solution for this? |
As you can see the issue is still open. It's unknown when font name, bold and italic identification will be supported for the LSTM engine. |
Is there another method or package that can determine font size? |
font size is supported: |
Is this issue still open? |
Can you provide a partial solution to this, such as access to font size only? I think that is supported. |
Hello! Thanks! |
@coco2121 Hi, did you manage to find any solutions? I am also trying to solve exactly the same problem as yours. |
The LSTM engine does not support font attributes other than point size, and as I said 4 years ago, it won't support these attributes any time soon (It is not planned). However, the legacy engine is still available in versions 4.x and 5.x and it supports these attributes. You need a model that includes data for the legacy engine and you need to use |
If you still have a question about this topic after reading my previous comment, please use our forum. I locked this issue because people keep asking the same questions here, and I have already answered them multiple times. |
Environment
Current Behavior:
Method WordFontAttributes returns null if using tesseract 4.00.00alpha with 4.00 tessdata, but it returns font name if using tesseract 4.00.00alpha with 3.04.00 tessdata. The test image link is eurotext.tif
I first encountered this problem when using tesserocr (sirfz/tesserocr#68).
Expected Behavior:
With method WordFontAttributes we can get correct font attributes of recognized words.