Migrate Python code to a dedicated package #309
Thanks. A dedicated package for training from fonts is a good idea. You may want to look at the original bash scripts that these Python scripts were meant to replicate; they are still available in older versions (e.g. https://github.com/tesseract-ocr/tesseract/tree/4.0/src/training). If I recall correctly, 'overwrite' was used for the legacy training offered in the bash scripts.
src/setup.py
Outdated
'Topic :: Scientific/Engineering :: Image Recognition',
'License :: OSI Approved :: Apache Software License',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.6',
src/setup.py
Outdated
],
keywords='Tesseract,tesseract-ocr,OCR,optical character recognition',

python_requires='>=3.6',
python_requires='>=3.6',
python_requires='>=3.7',
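The two review suggestions on src/setup.py belong together: once `python_requires` moves to 3.7, the 3.6 classifier should go as well. A minimal sketch of a consistency check follows. The reduced `METADATA` dict, the 3.7 classifier, and the helper names `minimum_version` and `stale_classifiers` are assumptions for illustration and not part of this PR; the parser only understands a plain `>=X.Y` specifier.

```python
import re

# Hypothetical, reduced excerpt of the metadata passed to setup().
# The 3.7 classifier is an assumption; the PR diff only shows '3' and '3.6'.
METADATA = {
    'classifiers': [
        'Topic :: Scientific/Engineering :: Image Recognition',
        'License :: OSI Approved :: Apache Software License',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.7',
    ],
    'python_requires': '>=3.7',
}


def minimum_version(python_requires):
    """Parse a simple '>=X.Y' specifier into an (X, Y) tuple."""
    match = re.fullmatch(r'\s*>=\s*(\d+)\.(\d+)\s*', python_requires)
    if match is None:
        raise ValueError(f'unsupported specifier: {python_requires!r}')
    return int(match.group(1)), int(match.group(2))


def stale_classifiers(metadata):
    """Return version classifiers older than python_requires allows."""
    minimum = minimum_version(metadata['python_requires'])
    prefix = 'Programming Language :: Python :: '
    stale = []
    for classifier in metadata['classifiers']:
        if not classifier.startswith(prefix):
            continue
        parts = classifier[len(prefix):].split('.')
        # Major-only entries such as '3' are skipped on purpose.
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            if (int(parts[0]), int(parts[1])) < minimum:
                stale.append(classifier)
    return stale
```

Calling `stale_classifiers(METADATA)` on the metadata above returns an empty list; adding the 3.6 classifier back makes it show up in the result.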
Font properties (bold, italic, ...) are also from the legacy training and still unsupported with the LSTM recognizer. That's one of the reasons why there remains a certain need for legacy models. The old
@stweil Do you mean that we should keep the existing parameters for now when you are talking about the legacy support? Or does this refer to the
I'd keep all existing parameters for now (with comments).
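One way to keep the existing parameters with comments could look like the sketch below. The class name `LegacyOptions` and the helper `describe` are hypothetical, the package may organise its arguments quite differently, and the default chosen for `extract_font_properties` is only a placeholder.

```python
from dataclasses import dataclass, fields


@dataclass
class LegacyOptions:
    """Hypothetical container for parameters kept for legacy training."""

    # Carried over from the bash training scripts, where it was reportedly
    # used for legacy training. Currently not read by the Python code.
    overwrite: bool = False

    # Font properties (bold, italic, ...) belong to legacy training and are
    # not supported by the LSTM recognizer. Placeholder default (assumption).
    extract_font_properties: bool = False


def describe(options):
    """Return 'name=value' strings so retained parameters stay visible in logs."""
    return [f'{f.name}={getattr(options, f.name)}' for f in fields(options)]
```

Keeping the fields in one place makes it easy to drop or document them later without touching every call site.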
I have updated the requirements inside the README and fixed the parameters for
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there anything to be changed here to get this PR merged?
@@ -0,0 +1,160 @@
# Byte-compiled / optimized / DLL files |
This whole file looks like a copy from somewhere else. Which parts are really required? Why can those parts not be included in the root .gitignore?
I think this is the standard GitHub Python .gitignore
Yes, this is the default Python .gitignore template, which is usually recommended for Python projects.
Keeping only the relevant entries might be an option, mainly by dropping entries for specific libraries which are currently unused. Nevertheless, I usually prefer to keep the default .gitignore file, as it reflects common practice.
I did not add this to the root .gitignore file, to avoid unintended side effects for now: the repository consists of two more or less independent parts which are not yet clearly separated (see #307).
@stweil I have reduced the .gitignore to the entries which actually make sense in this use case (at least from my point of view).
Do you still want me to migrate the corresponding entries to the root .gitignore file?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any further update on this?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Can this be merged?
At least from my side, yes. @stweil had some objections about the
I'd remove any reference to Python 3.6 (see my two comments).
I have removed the references to Python 3.6 and updated the README to make clear that Tesseract version 5 is supported as well.
Let's try the new code. Thank you @stefan6419846 for your patience.
@stweil: what about tagging the previous code/commit as version 1.0?
This is my attempt to migrate the existing code for working with artificial training data to a dedicated Python package, as proposed in #308 and #307. This includes some additional refactoring to the module structure to better encapsulate specific functionality.
I have used version number 0.1 for now, although I am open to changing this.
When migrating, I came across two parameters which I am not sure about:
- overwrite, defaulting to False, does not seem to be used at all.
- It has not been clear to me what extract_font_properties really means, which is why it still lacks documentation. text2image --help did not really help me in this case either:
What would be an appropriate documentation of the parameter?
If there is anything unclear or you want to see anything changed about this, feel free to ask or report.