Migrate Python code to a dedicated package #309

stefan6419846 · 2022-05-24T08:03:17Z

This is my attempt to migrate the existing code for working with artificial training data to a dedicated Python package, as proposed in #308 and #307. This includes some additional refactoring to the module structure to better encapsulate specific functionality.

I have used version number 0.1 for now, although I am open to changing this.

When migrating, I came across two parameters which I am not sure about:

overwrite, defaulting to False, does not seem to be used at all.
It is not clear to me what extract_font_properties really means, so it currently lacks documentation. text2image --help did not really help me here either:

--only_extract_font_properties Assumes that the input file contains a list of ngrams. Renders each ngram, extracts spacing properties and records them in output_base/[font_name].fontinfo file. (type:bool default:false)

What would be an appropriate documentation of the parameter?
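For context, here is a minimal sketch of how such a call might be assembled from Python. Only `--only_extract_font_properties` is quoted from the help text above; the helper name `build_fontinfo_command`, the example file names, and the combination with the `--text`, `--outputbase`, and `--font` options are assumptions for illustration. The command is only built, not executed.

```python
from typing import List


def build_fontinfo_command(ngram_file: str, output_base: str, font_name: str) -> List[str]:
    """Build (but do not run) a text2image call that only extracts font properties.

    Hypothetical helper: the name and the argument layout are illustrative only.
    """
    return [
        "text2image",
        f"--text={ngram_file}",        # input file, assumed to contain a list of ngrams
        f"--outputbase={output_base}",  # base path for the generated output
        f"--font={font_name}",          # font whose spacing properties are recorded
        "--only_extract_font_properties",
    ]


# Example values are placeholders, not paths from the repository.
command = build_fontinfo_command("eng.ngrams.txt", "out/eng", "Arial")
print(" ".join(command))
```

According to the help text, a call like this would record the spacing properties in a `.fontinfo` file named after the font, which may be a useful starting point for the parameter documentation.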

If there is anything unclear or you want to see anything changed about this, feel free to ask or report.

Shreeshrii · 2022-05-24T08:42:39Z

Thanks. A dedicated package for training from fonts is a good idea.

You may want to look at the original bash scripts which these Python scripts were meant to replicate. They are available in older versions, e.g. https://github.com/tesseract-ocr/tesseract/tree/4.0/src/training

If I recall correctly, 'overwrite' was used for the legacy training offered in the bash scripts.

stefan6419846 · 2022-05-24T09:39:35Z

linedata = False is a legacy-only parameter which is unsupported, so we might be able to drop it in this process as well.
extract_font_properties has never had any documentation, see the usage in https://github.com/tesseract-ocr/tesseract/blob/4.1/src/training/tesstrain.sh#L18-L51 for example.
overwrite has been used in make__traineddata only (see https://github.com/tesseract-ocr/tesseract/blob/4.1/src/training/tesstrain_utils.sh#L622-L624). As this method is not available any more, we can probably drop it.

stweil · 2022-05-24T09:48:22Z

src/setup.py

+        'Topic :: Scientific/Engineering :: Image Recognition',
+        'License :: OSI Approved :: Apache Software License',
+        'Programming Language :: Python :: 3',
+        'Programming Language :: Python :: 3.6',


Suggested change

'Programming Language :: Python :: 3.6',

stweil · 2022-05-24T09:48:50Z

src/setup.py

+    ],
+    keywords='Tesseract,tesseract-ocr,OCR,optical character recognition',
+
+    python_requires='>=3.6',


Suggested change

python_requires='>=3.6',

python_requires='>=3.7',
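The two suggestions belong together: the trove classifiers and `python_requires` should agree on the minimum version. The following is a minimal sketch of such a consistency check, not the actual `setup.py` of this pull request; `MIN_PYTHON`, `SETUP_KWARGS`, and `outdated_classifiers` are hypothetical names, and the classifier list is shortened for illustration.

```python
import re
from typing import List, Tuple

MIN_PYTHON = (3, 7)  # assumption: the minimum version suggested in the review

# Illustrative subset of the keyword arguments that would be passed to setup().
SETUP_KWARGS = {
    "python_requires": ">=" + ".".join(map(str, MIN_PYTHON)),
    "classifiers": [
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.7",
    ],
}


def outdated_classifiers(classifiers: List[str], minimum: Tuple[int, int]) -> List[str]:
    """Return classifiers naming a Python minor version below the minimum."""
    pattern = re.compile(r"Programming Language :: Python :: (\d+)\.(\d+)$")
    outdated = []
    for classifier in classifiers:
        match = pattern.match(classifier)
        if match and (int(match.group(1)), int(match.group(2))) < minimum:
            outdated.append(classifier)
    return outdated
```

With a check like this, a leftover `3.6` classifier would be reported as soon as the minimum is raised to 3.7.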

stweil · 2022-05-24T09:59:14Z

Font properties (bold, italic, ...) are also from the legacy training and still unsupported with the LSTM recognizer. That's one of the reasons why there remains a certain need for legacy models. The old tesstrain.sh supported training of legacy models, and I think that it would be good to support it in the Python code, too. That should be done separately, not in this pull request here, but maybe you can keep the corresponding parameters with appropriate TODO comments.

stefan6419846 · 2022-05-24T10:05:25Z

@stweil Do you mean that we should keep the existing parameters for now when you are talking about the legacy support? Or does this refer to the linedata parameter only?

stweil · 2022-05-24T10:26:19Z

> @stweil Do you mean that we should keep the existing parameters for now when you are talking about the legacy support? Or does this refer to the linedata parameter only?

I'd keep all existing parameters for now (with comments).
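Keeping the parameters with comments could look roughly like the sketch below. It is an illustration only, not the signature of `tesstrain.wrapper.run()`: the function name `run_sketch`, the `fonts_dir` and `lang_code` arguments, the default chosen for `extract_font_properties`, and the returned dictionary are all assumptions.

```python
import warnings
from typing import Any, Dict


def run_sketch(
    fonts_dir: str,
    lang_code: str,
    linedata: bool = False,
    extract_font_properties: bool = False,  # default chosen for illustration only
    overwrite: bool = False,
) -> Dict[str, Any]:
    """Hypothetical entry point that keeps the legacy parameters for now."""
    # TODO: `linedata` is a legacy-only parameter; keep it until legacy training is handled.
    # TODO: `extract_font_properties` belongs to legacy training; document it properly later.
    # TODO: `overwrite` was only used when building the traineddata in the old shell scripts.
    if overwrite:
        # Make it visible to callers that the value currently has no effect.
        warnings.warn("'overwrite' is kept for legacy support and currently ignored", stacklevel=2)
    return {
        "fonts_dir": fonts_dir,
        "lang_code": lang_code,
        "linedata": linedata,
        "extract_font_properties": extract_font_properties,
        "overwrite": overwrite,
    }
```

Accepting the parameters while warning about ignored ones keeps existing callers working and leaves a clear marker for the separate legacy-training work.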

stefan6419846 · 2022-05-24T11:13:28Z

I have updated the requirements inside the README and fixed the parameters for tesstrain.wrapper.run().

stale · 2022-07-10T11:55:04Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stefan6419846 · 2022-07-11T06:34:32Z

Is there anything to be changed here to get this PR merged?

stweil · 2022-07-20T20:17:48Z

src/.gitignore

@@ -0,0 +1,160 @@
+# Byte-compiled / optimized / DLL files


This whole file looks like a copy from somewhere else. Which parts are really required? Why can those parts not be included in the root .gitignore?

kba · 2022-07-21

I think this is the standard GitHub Python .gitignore.

stefan6419846 · 2022-07-26

Yes, this is the default Python .gitignore template which is usually recommended for Python projects.

Keeping only the relevant entries might be an option, mainly by dropping the entries for specific libraries which are currently unused. Nevertheless, I usually prefer to keep the default .gitignore file, as it reflects common practice.

I did not add this to the root .gitignore file to avoid unintended side effects for now, as the whole repository consists of two more or less independent parts which are not clearly separated yet (see #307).

stefan6419846 · 2022-08-17

@stweil I have reduced the .gitignore to the entries which actually make sense in this use case (at least from my point of view).

Do you still want me to migrate the corresponding entries to the root .gitignore file?

stale · 2022-09-21T06:28:42Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stefan6419846 · 2022-09-21T07:42:59Z

Any further update on this?

stale · 2022-11-02T00:24:17Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

zdenop · 2023-01-05T12:36:06Z

Can this be merged?

stefan6419846 · 2023-01-08T15:49:23Z

At least from my side, yes. @stweil had some objections about the .gitignore file, but I have not yet heard back after my latest question about whether further changes are required.

stweil · 2023-01-08T16:03:50Z

I'd remove any reference to Python 3.6 (see my two comments). .gitignore still contains lots of entries which are not strictly necessary, but that is not critical for merging.

stefan6419846 · 2023-01-08T16:39:03Z

I have removed the references to Python 3.6 and updated the README to make clear that Tesseract version 5 is supported as well.

stweil

Let's try the new code. Thank you @stefan6419846 for your patience.

zdenop · 2023-01-08T17:46:31Z

@stweil: What about tagging the previous code/commit as version 1.0?
That way we could reorganize the code further without breaking somebody's workflow.
I am interested in making training work on Windows using only Python. If Python is required anyway, then auxiliaries are really not needed.

migrate Python code to a dedicated package

96f31a2

stweil reviewed May 24, 2022 (src/README.md)

stweil reviewed May 24, 2022 (src/README.md)

document extract_font_properties parameter; update supported Python

ab60ad8

fix cleanup when used without log file

0ac2bda

stale bot added the stale label Jul 10, 2022

stweil removed the stale label Jul 11, 2022

stweil reviewed Jul 20, 2022 (.gitignore)

stweil reviewed Jul 20, 2022

stefan6419846 added 3 commits July 26, 2022 19:06

remove IDE entry from .gitignore file

01c9fb9

improve trove classifiers

d65fd3d

remove clearly not required entries from .gitignore file

c4d7b86

stale bot added the stale label Nov 2, 2022

stweil removed the stale label Nov 2, 2022

stefan6419846 added 2 commits January 8, 2023 17:28

update supported Python versions

e91a31d

update supported Python versions

345379f

stweil approved these changes Jan 8, 2023

stweil merged commit f4103bd into tesseract-ocr:main Jan 8, 2023

stefan6419846 deleted the pip_package branch January 8, 2023 19:40

zdenop mentioned this pull request Jan 9, 2023

Feat/generate trainingsets #205

Open