Migrating from tesstrain.sh #307

stefan6419846 · 2022-05-19T13:47:07Z

I have used the tesstrain.sh approach (including tesstrain_utils.sh and language-specific.sh) in the past to fine-tune an existing model for a specific font. As this approach is deprecated and the corresponding Bash scripts have been removed from the tesseract repository, I wanted to try the new approach, which redirects to this repository.

Looking at this repository, it seems to provide four different types of training support:

1. A Makefile-based approach with short documentation inside the README file.
2. Some Python scripts in the root directory, which seem to be partially used by the Makefile.
3. Some plotting functionality in the plot directory.
4. A Python-based replacement for the old Bash scripts in src/training.

Given this situation, how do the four different "types" interact with each other? What is the correct approach to use for training, given that I want to avoid another deprecation after a short time?

Background: As a Python developer, I considered using the approach from src/training, which would also require less migration effort. But as it does not seem to be documented in the README, I am not sure whether this makes sense.

As an additional question: The module in src/training does not seem to be available as a regular Python package on PyPI, although it seems like it could be. Are there any plans to convert it to a library (leaving tesstrain as an entry point for standalone execution), making it easier to use in one's own code without maintaining local copies?


stweil · 2022-05-20T09:18:44Z

Initially this repository contained the Makefile and a few Python scripts which were used by the Makefile. Its main purpose was training from scanned text lines with transcriptions ("ground truth"), either from scratch or finetuning of existing models.

The tesseract repository contained a shell script (plus helper scripts) for training with generated images. All standard models for Tesseract were trained with such artificial data. Initially that shell script supported training of new models for the "legacy" (Tesseract 3) recognizer. Later it was enhanced to support training for the LSTM (Tesseract 4) recognizer. And even later it was replaced by Python code which provided the same command line interface, but never implemented the "legacy" training. The shell scripts are removed in newer releases, and the Python code was moved to tesstrain.

Making a Python package which is published on PyPI is a good idea. It only has to be done ...

stefan6419846 · 2022-05-20T09:45:25Z

Thanks for explaining that the Makefile (and the files inside the root directory) is intended for "real-life" training, while src/tesstrain keeps the artificial approach alive.

In my opinion (as mostly a Tesseract user rather than a Tesseract developer), these aspects should be reflected in the directory structure as well. This might be achieved by moving all the sources for "real-life" training into their own subdirectory and maybe improving the name of the directory for the artificial approach. Then each directory could have a dedicated README, while the global README provides some basic explanations.

Regarding the Python package I am going to open a new issue.

stale · 2022-08-13T09:33:18Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stefan6419846 · 2022-08-15T06:11:09Z

Is there any interest in actually cleaning up the directory structure and improving the corresponding documentation? If yes, does it make sense to track it in this issue, or should this rather be a new one?

stale · 2022-11-02T01:24:27Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stefan6419846 mentioned this issue May 24, 2022

Migrate Python code to a dedicated package #309

Merged

stale bot added the "stale" label (Issues which require input by the reporter which is not provided) Aug 13, 2022

stale bot removed the "stale" label (Issues which require input by the reporter which is not provided) Aug 15, 2022

stale bot added the "stale" label (Issues which require input by the reporter which is not provided) Nov 2, 2022

stale bot closed this as completed Jan 8, 2023

stefan6419846 mentioned this issue Jan 9, 2023

Add --vertical_fontlist option to tesstrain.py #249

Closed
