Current Behavior:
With current git master version:
$ ~/bin/tesseract -v
tesseract 4.0.0-86-gbee8
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
$ ~/bin/tesseract --psm 3 --oem 0 -c tessedit_create_hocr=1 -c hocr_font_info=1 -l deu page.tif page-4.0.0-86-gbee8
In page-4.0.0-86-gbee8.hocr, "x_fsize" is 0 and "x_font" is missing completely:
<span class='ocrx_word' id='word_1_1' title='bbox 253 248 365 292; x_wconf 89; x_fsize 0'>rung</span>
Expected Behavior:
With the current Ubuntu version 4.0.0-beta.3-249-g607e:
$ tesseract -v
tesseract 4.0.0-beta.3-249-g607e
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
$ tesseract --psm 3 --oem 0 -c tessedit_create_hocr=1 -c hocr_font_info=1 -l deu page.tif page-4.0.0-beta.3-249-g607e
In page-4.0.0-beta.3-249-g607e.hocr, "x_fsize" and "x_font" have sensible values:
<span class='ocrx_word' id='word_1_1' title='bbox 253 248 365 292; x_wconf 89; x_font Times_New_Roman; x_fsize 56'>rung</span>
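For anyone who wants to check a build for this regression without reading the hOCR by eye, here is a minimal Python sketch. It assumes the word spans look like the two samples in this report (single-quoted attributes, `class='ocrx_word'` first, properties separated by `;`). The function names are made up for illustration and are not part of tesseract.

```python
import re

# Captures the title attribute of an hOCR word span.
# Assumes single-quoted attributes, as in the samples in this report.
WORD_TITLE_RE = re.compile(r"<span class='ocrx_word'[^>]*\btitle='([^']*)'")


def parse_title(title):
    """Split an hOCR title attribute into a {property: value} dict."""
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        # The first token is the property name, the rest is its value.
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props


def has_font_info(props):
    """True if x_font is present and x_fsize is a positive integer."""
    fsize = props.get("x_fsize", "0")
    return "x_font" in props and fsize.isdigit() and int(fsize) > 0


def words_missing_font_info(hocr_text):
    """Return the title strings of all word spans lacking usable font info."""
    return [
        title
        for title in WORD_TITLE_RE.findall(hocr_text)
        if not has_font_info(parse_title(title))
    ]
```

Reading page-4.0.0-86-gbee8.hocr into a string and passing it to `words_missing_font_info` should return a non-empty list on an affected build and an empty list on a good one, provided the run used `-c hocr_font_info=1`.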
Suggested Fix:
Review c9e85ab or ad40131; one of these commits fixed too much ;)
Testcase: page.zip
In c9e85ab:
if (it_->word()) {
should be:
if (it_->word() == nullptr) {
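To illustrate why the inverted check would produce exactly the symptom reported, here is a toy Python model of the guard. It is not tesseract's code: `Word`, both function names, and the early-return value `(None, 0)` are assumptions chosen to match the observed output (no `x_font`, `x_fsize 0`).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Word:
    """Hypothetical stand-in for a recognized word carrying font data."""
    font_name: str
    point_size: int


def font_attributes_buggy(word: Optional[Word]) -> Tuple[Optional[str], int]:
    # Inverted guard: takes the "nothing here" exit whenever a word IS present.
    if word:
        return None, 0
    # Only reached when word is None, where it would fail
    # (the analogue of a null dereference in the C++ code).
    return word.font_name, word.point_size


def font_attributes_fixed(word: Optional[Word]) -> Tuple[Optional[str], int]:
    # Correct guard: bail out only when there is no word.
    if word is None:
        return None, 0
    return word.font_name, word.point_size
```

With a valid word, the buggy version reports no font name and size 0, while the fixed version returns the real values.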
@hnesk and @amitdo, thank you for your reports and sorry for the regression. It is fixed now with commit 2c044df.
x_fsize is not written by default because the hocr config file sets hocr_font_info 0, so I assume most users won't notice the bug.
Great! Works as expected with 2c044df