Migrating from tesstrain.sh #307

Closed
stefan6419846 opened this issue May 19, 2022 · 5 comments
Labels
stale: Issues which require input by the reporter which is not provided

Comments

@stefan6419846
Contributor

In the past, I have used the tesstrain.sh approach (including tesstrain_utils.sh and language-specific.sh) to fine-tune an existing model for a specific font. Since this approach is deprecated and the corresponding Bash scripts have been removed from the tesseract repository, I wanted to try the new approach, which redirects to this repository.

Looking at this repository, it seems to provide four different types of training support:

  • A Makefile-based approach with short documentation inside the README file.
  • Some Python scripts in the root directory, which seem to be partially used by the Makefile.
  • Some plotting functionality in the plot directory.
  • A Python-based replacement for the old Bash scripts in src/training.

Given this situation, how do the four different "types" interact with each other? What is the correct approach to use for training, given that I want to avoid another deprecation after a short time?

Background: As a Python developer, I considered using the approach from src/training, which would also require less migration effort. But since it does not seem to be documented in the README, I am not sure whether this makes sense.

As an additional question: The module in src/training does not seem to be available as a regular Python package on PyPI, although it seems like it could be. Are there any plans to convert it into a library (leaving tesstrain as an entry point for standalone execution), making it easier to use in one's own code without maintaining local copies?

@stweil
Collaborator

stweil commented May 20, 2022

Initially, this repository contained the Makefile and a few Python scripts which were used by the Makefile. Its main purpose was training from scanned text lines with transcriptions ("ground truth"), either from scratch or by fine-tuning existing models.

The tesseract repository contained a shell script (plus helper scripts) for training with generated images. All standard models for Tesseract were trained with such artificial data. Initially, that shell script supported training of new models for the "legacy" (Tesseract 3) recognizer. Later it was enhanced to support training for the LSTM (Tesseract 4) recognizer. Later still, it was replaced by Python code which provided the same command line interface but never implemented the "legacy" training. The shell scripts were removed in newer releases, and the Python code was moved to tesstrain.

Making a Python package which is published on PyPI is a good idea. It only has to be done ...

@stefan6419846
Contributor Author

Thanks for explaining that the Makefile (and the files inside the root directory) is intended for "real-life" training, while src/tesstrain keeps the artificial approach alive.

In my opinion (as mostly a Tesseract user rather than a Tesseract developer), these aspects should be reflected in the directory structure as well. This might be achieved by moving all the sources for "real-life" training into their own subdirectory and perhaps giving the directory for the artificial approach a clearer name. Each directory could then have a dedicated README, while the global README provides some basic explanations.

Regarding the Python package, I am going to open a new issue.

@stale

stale bot commented Aug 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 13, 2022
@stefan6419846
Contributor Author

Is there any interest in actually cleaning up the directory structure and improving the corresponding documentation? If yes, does it make sense to track it in this issue, or should this rather be a new one?

@stale stale bot removed the stale label Aug 15, 2022
@stale

stale bot commented Nov 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2022
@stale stale bot closed this as completed Jan 8, 2023