Migrating from tesstrain.sh #307
Comments
Initially this repository contained the Makefile and a few Python scripts which were used by the Makefile. Its main purpose was training from scanned text lines with transcriptions ("ground truth"), either from scratch or by fine-tuning existing models. [...] Making a Python package which is published on PyPI is a good idea. It only has to be done ...
Thanks for the explanations about the Makefile (and the files inside the root directory) being intended for "real-life" training, while src/tesstrain keeps the artificial approach alive. In my opinion (mostly being a Tesseract user rather than a Tesseract developer), these aspects should be reflected in the directory structure as well. This might be achieved by moving all the sources for "real-life" training into their own subdirectory and maybe improving the name of the directory for the artificial approach. Then each directory could have a dedicated README, while the global README provides some basic explanations. Regarding the Python package, I am going to open a new issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there any interest in actually cleaning up the directory structure and improving the corresponding documentation? If yes, does it make sense to track it in this issue, or should this rather be a new one?
I have used the tesstrain.sh approach (including tesstrain_utils.sh and language-specific.sh) for fine-tuning an existing model for a specific font in the past. As this is deprecated, with the corresponding Bash scripts having been removed from the tesseract repository, I wanted to try the new approach, which redirects to this repository.
Looking at this repository, it seems to provide four different types of training support:
Given this situation, how do the four different "types" interact with each other? What is the correct approach to use for training, given that I want to avoid another deprecation after a short time?
Background: As a Python developer I considered using the approach from src/training, requiring less migration effort as well. But as this does not seem to be documented in the README, I am not sure whether this makes sense.
As an additional question: The module in src/training does not seem to be available as a regular Python package on PyPI, although it seems like it could be. Are there any plans to convert this to a library (leaving tesstrain as an entry point for standalone execution), making it easier to use in one's own code without maintaining local copies?