-
Notifications
You must be signed in to change notification settings - Fork 741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/9 font attributes #342
Conversation
forty2
commented
Apr 20, 2017
- Add a FontInfo class that mirrors Tesseract's own FontInfo struct. This contains the font attributes that are cacheable by ID.
- Add a FontAttributes class to serve as the return value of GetWordFontAttributes()
- Add the necessary glue to actually get the word font attributes from the native side
- Wrap a few other "low-hanging fruit" API calls (WordRecognitionLanguage, WordIsFromDictionary, etc)
… This contains the font attributes that are cacheable by ID. * Add a FontAttributes class to serve as the return value of GetWordFontAttributes() * Add the necessary glue to actually get the word font attributes from the native side * Throw in a wrapper for WordRecognitionLanguage() as well since it seems like it might be useful.
* Take a swing at thread safety (may still need changes) * Push creation of FontInfo objects a little deeper so we can skip marshaling the font name string if we don't need to * Remove the FontProperties enum -- it's not really necessary and it complicates the code * Wrap a few other "low-hanging fruit" API calls (WordIsFromDictionary, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looking good but I don't think we can go ahead with a static cache given the reasons outlined in the review.
src/Tesseract/FontInfo.cs
Outdated
if (_cache.ContainsKey(id)) { | ||
return _cache[id]; | ||
} | ||
lock (_cacheLock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your correct in that this solves the threading safety issue however using a static dictionary still has some other issues in this case the main one would be it's possible, perhaps even likely, that the Id's may not be unique across processing runs or engines. For instance if you process two seperate documents it's possible that the same id could be returned with different font information. Looking at the doco for LTRResultIterator::WordFontAttributes confirms this "Lifespan is the same as the iterator itself, ie rendered invalid by various members of TessBaseAPI, including Init, SetImage, End or deleting the TessBaseAPI."
I think the best solution would be to move it as a cache in ResultIterator as previously mentioned, this also wouldn't require any thread synchronisation\locking since ResultIterator can only be used by one thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's possible, perhaps even likely, that the Id's may not be unique across processing runs or engines. For instance if you process two seperate documents it's possible that the same id could be returned with different font information. Looking at the doco for LTRResultIterator::WordFontAttributes confirms this
Aha! This is the part that wasn't quite "clicking" for me, but I get it now. Thanks for being patient 😃 I'll go ahead and fix it.
src/Tesseract/ResultIterator.cs
Outdated
|
||
return new FontAttributes(fontInfo, is_underlined, is_smallcaps, pointsize); | ||
public bool GetWordIsFromDictionary() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these look good :)
src/Tesseract/ResultIterator.cs
Outdated
@@ -34,41 +34,67 @@ public string GetText(PageIteratorLevel level) | |||
return null; | |||
} | |||
|
|||
bool is_bold, is_italic, is_underlined, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that the cache is moved to Result Iterator maybe something like (in pesudo code):
IntPtr nameHandle = Interop.TessApi.ResultIteratorWordFontAttributes(handle, ...attributes, out fontId);
FontInfo fontInfo;
if(!_cache.TryGetValue(fontId, out fontInfo)) {
string fontName = MarshalHelper.PtrToString(nameHandle, TextEncoding.UTF8);
fontInfo = new FontInfo(fontName, isItalic, isBold, isMonospace, isSerif);
_cache.Add(fontId, fontInfo);
}
return new FontAttributes(fontInfo, isUnderlined, isSmallcaps, pointsize);
src/Tesseract/Interop/BaseApi.cs
Outdated
{ | ||
bool is_bold, is_italic, is_underlined, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency use camelCase for variable naming (e.g. isBold).
Looks good, I've merged the changes into develop. If you could add a regresion test to the test suite that would be great :) |
No problem, I should have time to work on that tomorrow or Monday. |
I put some tests for the new functionality in forty2/tesseract@14a9dc4. Is that the sort of thing you had in mind? |
Yes that's exactly what I had in mind. Thanks 🙂
…On Tue, 25 Apr 2017, 14:32 Zach Bean, ***@***.***> wrote:
I put some tests for the new functionality in ***@***.***
<forty2@14a9dc4>. Is that the sort of
thing you had in mind?
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#342 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPzyApZkAKPu4MZXipXFnbWSRzSIHPLks5rzXdkgaJpZM4NChDt>
.
|