Support different help texts for normal and advanced users and restore legacy mode #1325

Merged 4 commits into tesseract-ocr:master on Feb 19, 2018

stweil (Contributor) commented Feb 18, 2018

This series partially reverts commit 173ad2b.

The legacy engine is still needed to get text attributes that the
LSTM engine does not support, and it also has better recognition rates for some texts.

Signed-off-by: Stefan Weil <[email protected]>
The old option --help now shows a very basic help text.
The new option --help-extra shows the full help information.
It now also includes a hint that Tesseract supports lists of images.

Also fix the indentation in the PSM help and
use more neutral wording in the OEM help.

Signed-off-by: Stefan Weil <[email protected]>
stweil (Contributor, Author) commented Feb 18, 2018

New help texts:

$ tesseract --help
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.


$ tesseract --help-extra
Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

OCR Engine modes:
  0    Legacy Tesseract only.
  1    Neural nets LSTM only.
  2    Legacy + LSTM Tesseract.
  3    Default, based on what is available.

Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.
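
For scripts that call the command line, the NOTE above (options must occur before any configfile) can be sketched in Python as follows. This is only an illustration under stated assumptions: `build_tesseract_argv` is a hypothetical helper, not part of Tesseract, and it uses only the flags and the PSM/OEM ranges listed in the help text above.

```python
from typing import List, Mapping, Optional, Sequence


def build_tesseract_argv(
    image: str,
    output_base: str,
    langs: Sequence[str] = (),
    psm: Optional[int] = None,
    oem: Optional[int] = None,
    config_vars: Optional[Mapping[str, str]] = None,
    config_files: Sequence[str] = (),
) -> List[str]:
    """Hypothetical helper: assemble arguments in the order of the usage line.

    Order: tesseract imagename outputbase [options...] [configfile...]
    """
    # Ranges taken from the --help-extra listing (PSM 0..13, OEM 0..3).
    if psm is not None and not 0 <= psm <= 13:
        raise ValueError(f"unknown page segmentation mode: {psm}")
    if oem is not None and not 0 <= oem <= 3:
        raise ValueError(f"unknown OCR engine mode: {oem}")

    argv = ["tesseract", image, output_base]
    if langs:
        # -l LANG[+LANG]: several languages are joined with '+'.
        argv += ["-l", "+".join(langs)]
    if psm is not None:
        argv += ["--psm", str(psm)]
    if oem is not None:
        argv += ["--oem", str(oem)]
    # Multiple -c arguments are allowed, one per VAR=VALUE pair.
    for name, value in (config_vars or {}).items():
        argv += ["-c", f"{name}={value}"]
    # Config files come last, after all options.
    argv += list(config_files)
    return argv
```

The resulting list could be passed to `subprocess.run`; the file names and config file name a caller supplies are up to that caller and are not checked here.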

zdenop (Contributor) commented Feb 18, 2018

My notes:

  1. Users can still easily run the legacy engine, which will cause a crash.
  2. The legacy engine is no longer supported, and the help does not indicate this, so users will continue to open issues...

stweil (Contributor, Author) commented Feb 19, 2018

  1. When I run the legacy engine, I don't get a crash. Which crash scenario are you referring to? You could add a tag "legacy" to issues related to the legacy engine.

  2. Doesn't "legacy" already say that it is old? And I don't think that it is completely unsupported. As I already said, I'll try to fix issues which are reported. It should be easy to add some help text which clarifies the situation. What about this text:

    ...
    OCR Engine modes:
    0 Legacy Tesseract only.
    1 Neural nets LSTM only.
    2 Legacy + LSTM Tesseract.
    3 Default, based on what is available.

    Please note that the legacy engine is no longer supported by Google!
    ...

stweil (Contributor, Author) commented Feb 19, 2018

See issue #707 and also #1074 (comment) on the role of the legacy engine.

@zdenop zdenop merged commit 349de8b into tesseract-ocr:master Feb 19, 2018
Shreeshrii (Collaborator) commented Feb 19, 2018 via email

@stweil stweil deleted the legacy branch February 19, 2018 07:36
amitdo (Collaborator) commented Feb 19, 2018

0 Legacy Tesseract only.

2 Legacy + LSTM Tesseract.

Tesseract is (also) the name of the legacy engine.

What about:

OCR Engine modes:
0 Legacy engine only
1 Neural nets LSTM engine only
2 Legacy + LSTM engines
3 Default, based on what is available

?

stweil (Contributor, Author) commented Feb 19, 2018

Fine for me. Adding "engine" in the descriptions looks indeed better. Do you want to send a pull request, or should I prepare one?

amitdo (Collaborator) commented Feb 19, 2018

You.

amitdo (Collaborator) commented Feb 19, 2018

I think the periods are unnecessary.

stweil added a commit to stweil/tesseract that referenced this pull request Feb 19, 2018
The new text was suggested by Amit Dovev, see
tesseract-ocr#1325 (comment).

Signed-off-by: Stefan Weil <[email protected]>
stweil (Contributor, Author) commented Feb 19, 2018

PrintHelpForPSM also has periods, and so do the other help texts. I'd keep them for now, but yes, it would also be possible to have a good help text without those periods.
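
To make the two variants discussed here easy to compare, the following Python sketch renders the OEM list with or without trailing periods. It is an illustration only: Tesseract prints its help from C++, and the names `OcrEngineMode`, `OEM_DESCRIPTIONS`, and `format_oem_help` are hypothetical. The mode numbers and wording follow the suggestion quoted earlier in this thread.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Hypothetical names for the four --oem values from the help text."""
    LEGACY_ONLY = 0
    LSTM_ONLY = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


# Wording as suggested in this thread (without trailing periods).
OEM_DESCRIPTIONS = {
    OcrEngineMode.LEGACY_ONLY: "Legacy engine only",
    OcrEngineMode.LSTM_ONLY: "Neural nets LSTM engine only",
    OcrEngineMode.LEGACY_AND_LSTM: "Legacy + LSTM engines",
    OcrEngineMode.DEFAULT: "Default, based on what is available",
}


def format_oem_help(with_periods: bool = True) -> str:
    """Return the OEM help block, optionally ending each entry with a period."""
    suffix = "." if with_periods else ""
    lines = ["OCR Engine modes:"]
    for mode in OcrEngineMode:  # iterates in definition order: 0, 1, 2, 3
        lines.append(f"  {mode.value}    {OEM_DESCRIPTIONS[mode]}{suffix}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_oem_help(with_periods=True))
```

Calling `format_oem_help(with_periods=False)` gives the period-free variant, which keeps the punctuation decision in one place if the other help texts are changed later.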

zdenop pushed a commit that referenced this pull request Feb 19, 2018
The new text was suggested by Amit Dovev, see
#1325 (comment).

Signed-off-by: Stefan Weil <[email protected]>
Shreeshrii (Collaborator) commented Feb 19, 2018

The crash scenario arises because the traineddata files from tessdata_fast do not
contain legacy models (at least for some languages).

@stweil

I checked and my earlier comment seems not to be correct. It is NOT a crash/assert. I will open a new issue with what I noticed.

edit: Added issue #1327

amitdo (Collaborator) commented Mar 13, 2018

'extra' => 'advanced'
