Support different help texts for normal and advanced users and restore legacy mode #1325

stweil · 2018-02-18T20:14:15Z

This series partially reverts commit 173ad2b.

The legacy engine is still needed to get text attributes which are unsupported by the LSTM engine, and it also has better recognition rates for some texts. Signed-off-by: Stefan Weil <[email protected]>

Signed-off-by: Stefan Weil <[email protected]>

The old option --help now shows a very basic help text. The new option --help-extra shows the full help information, which now also includes a hint that Tesseract supports lists of images. Also fix the indentation in the PSM help and use a more neutral text in the OEM help. Signed-off-by: Stefan Weil <[email protected]>
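For illustration only, the two-tier help scheme described above could be modelled roughly as follows. This is a minimal Python sketch, not the code in tesseractmain; the names `HELP_OPTIONS` and `select_help` are hypothetical, and the sketch assumes that a help option is only honoured when it is the sole argument.

```python
# Hypothetical sketch of the help dispatch; none of these names come from
# the Tesseract sources. The option spellings are taken from the help text.
HELP_OPTIONS = {
    "-h": "minimal",
    "--help": "minimal",
    "--help-extra": "extra",
    "--help-psm": "psm",
    "--help-oem": "oem",
}


def select_help(argv):
    """Return the kind of help requested by argv, or None if none is requested.

    Assumption: help options are "single options", so they are only
    recognised when they are the only argument on the command line.
    """
    if len(argv) == 1:
        return HELP_OPTIONS.get(argv[0])
    return None


if __name__ == "__main__":
    print(select_help(["--help"]))
    print(select_help(["--help-extra"]))
    print(select_help(["image.png", "out"]))
```

Keeping the mapping in one table makes it cheap to add further help variants later without touching the dispatch logic.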

stweil · 2018-02-18T20:16:17Z

New help texts:

$ tesseract --help
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.


$ tesseract --help-extra
Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

OCR Engine modes:
  0    Legacy Tesseract only.
  1    Neural nets LSTM only.
  2    Legacy + LSTM Tesseract.
  3    Default, based on what is available.

Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.

Signed-off-by: Stefan Weil <[email protected]>
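The usage line and the note that options must occur before any configfile can be made concrete with a small wrapper script. The following Python sketch only builds the argument list and does not run Tesseract; the function name `build_command` and its parameters are hypothetical, and the valid ranges (0 to 13 for `--psm`, 0 to 3 for `--oem`) are taken from the help text above.

```python
def build_command(image, outputbase, langs=("eng",), psm=None, oem=None,
                  variables=None, configfiles=()):
    """Build a tesseract argument list with all options before the configfiles.

    Hypothetical helper for illustration; it validates the mode numbers
    against the ranges listed in the --help-extra output.
    """
    if psm is not None and not 0 <= psm <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    if oem is not None and not 0 <= oem <= 3:
        raise ValueError("OCR engine mode must be between 0 and 3")

    cmd = ["tesseract", image, outputbase, "-l", "+".join(langs)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    # Multiple -c arguments are allowed, one per config variable.
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    # Config files always go last, after every option.
    cmd += list(configfiles)
    return cmd


if __name__ == "__main__":
    # Example values only: two languages, PSM 6, LSTM engine, one configfile.
    print(build_command("page.png", "out", langs=("eng", "deu"),
                        psm=6, oem=1, configfiles=("hocr",)))
```

The resulting list could be passed to `subprocess.run` on a system where Tesseract and the requested language data are installed.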

zdenop · 2018-02-18T21:25:24Z

My notes:

The user can still easily run the legacy engine, which will cause a crash.
The legacy engine is not supported anymore and this is not indicated by the help => users will continue to open issues...

stweil · 2018-02-19T05:49:16Z

When I run the legacy engine, I don't get a crash. Which crash scenario are you referring to? You could add a tag "legacy" to issues related to the legacy engine.
Doesn't "legacy" already say that it is old? And I don't think that it is completely unsupported. As I already said, I'll try to fix issues which are reported. It should be easy to add some help text which clarifies the situation. What about this text:

...
OCR Engine modes:
0 Legacy Tesseract only.
1 Neural nets LSTM only.
2 Legacy + LSTM Tesseract.
3 Default, based on what is available.

Please note that the legacy engine is no longer supported by Google!
...

stweil · 2018-02-19T06:12:47Z

See issue #707 and also #1074 (comment) on the role of the legacy engine.

Shreeshrii · 2018-02-19T07:22:27Z

@stweil The crash scenario arises because the traineddata files from tessdata_fast do not have legacy models in them (at least for some languages). For some languages such as Hindi, Sanskrit, etc., this is intentional, as the accuracy is very much improved with the LSTM engine and model. However, for Latin-script-based languages, 'tesseract' may provide better results (as you have mentioned). I will test and post some specific crash scenarios for you. I am sure the crash can be avoided by checking for available models in the traineddata files. I will also add links to issues where I have commented about the same.
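The idea of checking for available models before running an engine could look roughly like the following. This is a Python sketch under stated assumptions, not Tesseract's implementation: `EngineMode` and `resolve_engine_mode` are hypothetical names, the two booleans stand in for whatever inspection of the traineddata file is actually done, and the preference for LSTM in the default mode is an assumption made only for this example.

```python
from enum import IntEnum


class EngineMode(IntEnum):
    """OCR engine modes, numbered as in the --oem help text."""
    LEGACY_ONLY = 0
    LSTM_ONLY = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


def resolve_engine_mode(requested, has_legacy, has_lstm):
    """Return the mode to run, or raise ValueError instead of failing later.

    has_legacy / has_lstm are hypothetical flags saying which models the
    traineddata file contains.
    """
    requested = EngineMode(requested)
    if requested is EngineMode.DEFAULT:
        # Assumption for this sketch: prefer LSTM when it is available.
        if has_lstm:
            return EngineMode.LSTM_ONLY
        if has_legacy:
            return EngineMode.LEGACY_ONLY
        raise ValueError("traineddata contains no usable model")

    needs_legacy = requested in (EngineMode.LEGACY_ONLY, EngineMode.LEGACY_AND_LSTM)
    needs_lstm = requested in (EngineMode.LSTM_ONLY, EngineMode.LEGACY_AND_LSTM)
    if needs_legacy and not has_legacy:
        raise ValueError("legacy engine requested, but no legacy model is available")
    if needs_lstm and not has_lstm:
        raise ValueError("LSTM engine requested, but no LSTM model is available")
    return requested
```

With such a check, requesting `--oem 0` against a file that only carries an LSTM model would produce a clear error message rather than undefined behaviour further down the pipeline.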

amitdo · 2018-02-19T08:01:15Z

0 Legacy Tesseract only.

2 Legacy + LSTM Tesseract.

Tesseract is (also) the name of the legacy engine.

What about:

OCR Engine modes:
0 Legacy engine only
1 Neural nets LSTM engine only
2 Legacy + LSTM engines
3 Default, based on what is available

?
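To compare wording variants quickly, the proposed table can be rendered from data. The Python sketch below is purely illustrative: `OEM_DESCRIPTIONS` and `format_oem_help` are hypothetical names, the descriptions are the ones proposed above, and the two-space indent with four spaces after the mode number mirrors the layout of the existing help output.

```python
# Descriptions as proposed in this comment; the names are hypothetical.
OEM_DESCRIPTIONS = {
    0: "Legacy engine only",
    1: "Neural nets LSTM engine only",
    2: "Legacy + LSTM engines",
    3: "Default, based on what is available",
}


def format_oem_help(trailing_period=True):
    """Render the OEM help block, optionally ending each entry with a period."""
    suffix = "." if trailing_period else ""
    lines = ["OCR Engine modes:"]
    for mode, text in OEM_DESCRIPTIONS.items():
        lines.append(f"  {mode}    {text}{suffix}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_oem_help())
    print(format_oem_help(trailing_period=False))
```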

stweil · 2018-02-19T08:05:58Z

Fine for me. Adding "engine" in the descriptions looks indeed better. Do you want to send a pull request, or should I prepare one?

amitdo · 2018-02-19T08:13:06Z

You.

amitdo · 2018-02-19T08:16:59Z

I think the periods are unnecessary.


stweil · 2018-02-19T08:28:55Z

PrintHelpForPSM also has periods, and so do the other help texts. I'd keep them for now, but yes, it would also be possible to have a good help text without those periods.


Shreeshrii · 2018-02-19T16:24:14Z

The crash scenario is because traineddata from tessdata_fast do not have
legacy models in them (at least for some languages).

@stweil

I checked and my earlier comment seems not to be correct. It is NOT a crash/assert. I will open a new issue with what I noticed.

edit: Added issue #1327

amitdo · 2018-03-13T18:05:53Z

'extra' => 'advanced'

stweil added 3 commits February 18, 2018 20:27

Restore support for the legacy engine

78c46f4

It is still needed to get text attributes which are unsupported by the LSTM engine, and it also has better recognition rates for some texts. Signed-off-by: Stefan Weil <[email protected]>

tesseractmain: Add missing 'static' attributes

25526b9

Signed-off-by: Stefan Weil <[email protected]>

Add missing line feed in error message

bec9e4f

Signed-off-by: Stefan Weil <[email protected]>

stweil mentioned this pull request Feb 19, 2018

RFC: Remove the legacy OCR Engine #707

Closed

zdenop merged commit 349de8b into tesseract-ocr:master Feb 19, 2018

stweil deleted the legacy branch February 19, 2018 07:36

stweil added a commit to stweil/tesseract that referenced this pull request Feb 19, 2018

Improve help text for OCR engine mode

522c571

The new text was suggested by Amit Dovev, see tesseract-ocr#1325 (comment). Signed-off-by: Stefan Weil <[email protected]>

stweil mentioned this pull request Feb 19, 2018

Improve help text for OCR engine mode #1326

Merged

zdenop pushed a commit that referenced this pull request Feb 19, 2018

Improve help text for OCR engine mode (#1326)

a50ff52

The new text was suggested by Amit Dovev, see #1325 (comment). Signed-off-by: Stefan Weil <[email protected]>
