Feature/9 font attributes #342

Merged
merged 4 commits into from
Apr 22, 2017
30 changes: 30 additions & 0 deletions src/Tesseract/FontAttributes.cs
@@ -0,0 +1,30 @@
using System;

namespace Tesseract
{
    // This class is the return type of
    // ResultIterator.GetWordFontAttributes(). We can't
    // use FontInfo directly because there are properties
    // here that are not accounted for in FontInfo
    // (smallcaps, underline, etc.) Because of the caching
    // scheme we're using for FontInfo objects, we can't simply
    // augment that class since these extra properties are not
    // accounted for by the FontInfo's unique ID.
    public class FontAttributes
    {
        public FontInfo FontInfo { get; private set; }

        public bool IsUnderlined { get; private set; }
        public bool IsSmallCaps { get; private set; }
        public int PointSize { get; private set; }

        public FontAttributes(
            FontInfo fontInfo, bool isUnderlined, bool isSmallCaps, int pointSize)
        {
            FontInfo = fontInfo;
            IsUnderlined = isUnderlined;
            IsSmallCaps = isSmallCaps;
            PointSize = pointSize;
        }
    }
}
65 changes: 65 additions & 0 deletions src/Tesseract/FontInfo.cs
@@ -0,0 +1,65 @@
using System;
using System.Collections.Generic;

namespace Tesseract
{
    // The .NET equivalent of the ccstruct/fontinfo.h
    // FontInfo struct. It's missing spacing info
    // since we don't have any way of getting it (and
    // it's probably not all that useful anyway)
    public class FontInfo
    {
        private FontInfo(
            string name, int id,
            bool isItalic, bool isBold, bool isFixedPitch,
            bool isSerif, bool isFraktur = false
        )
        {
            Name = name;
            Id = id;

            IsItalic = isItalic;
            IsBold = isBold;
            IsFixedPitch = isFixedPitch;
            IsSerif = isSerif;
            IsFraktur = isFraktur;
        }

        public string Name { get; private set; }

        public int Id { get; private set; }
        public bool IsItalic { get; private set; }
        public bool IsBold { get; private set; }
        public bool IsFixedPitch { get; private set; }
        public bool IsSerif { get; private set; }
        public bool IsFraktur { get; private set; }

        private static Dictionary<int, FontInfo> _cache = new Dictionary<int, FontInfo>();

        public static FontInfo GetById(int id) {
            if (_cache.ContainsKey(id)) {
                return _cache[id];
            }
            return null;
        }

        public static object _cacheLock = new object();
        public static FontInfo GetOrCreate(
            string name, int id,
            bool isItalic, bool isBold, bool isFixedPitch,
            bool isSerif, bool isFraktur = false
        )
        {
            lock (_cacheLock) {
Owner

You're correct that this solves the thread-safety issue. However, using a static dictionary still has other problems. The main one is that it's possible, perhaps even likely, that the IDs are not unique across processing runs or engines. For instance, if you process two separate documents, the same ID could be returned with different font information. The documentation for LTRResultIterator::WordFontAttributes confirms this: "Lifespan is the same as the iterator itself, ie rendered invalid by various members of TessBaseAPI, including Init, SetImage, End or deleting the TessBaseAPI."

I think the best solution would be to move the cache into ResultIterator, as previously mentioned. This also wouldn't require any thread synchronisation or locking, since a ResultIterator can only be used by one thread.
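
For illustration only, here is a minimal sketch of a per-iterator cache along those lines. `FontInfoSketch`, `IteratorSketch`, and `GetOrCreateFontInfo` are hypothetical stand-ins invented for this example, not the classes or methods in this PR, and the sketch assumes each iterator is only touched by one thread:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for FontInfo, reduced to two fields.
public sealed class FontInfoSketch
{
    public FontInfoSketch(string name, int id) { Name = name; Id = id; }
    public string Name { get; private set; }
    public int Id { get; private set; }
}

// Hypothetical stand-in for ResultIterator. The cache is an instance
// field, so its lifetime matches the iterator's and font IDs from one
// iterator can never collide with IDs from another. No lock is used,
// on the assumption that a single thread owns the iterator.
public sealed class IteratorSketch
{
    private readonly Dictionary<int, FontInfoSketch> _fontInfoCache =
        new Dictionary<int, FontInfoSketch>();

    public FontInfoSketch GetOrCreateFontInfo(int id, string name)
    {
        FontInfoSketch cached;
        if (_fontInfoCache.TryGetValue(id, out cached))
        {
            return cached;
        }

        var created = new FontInfoSketch(name, id);
        _fontInfoCache.Add(id, created);
        return created;
    }
}

public static class CacheDemo
{
    public static void Main()
    {
        var first = new IteratorSketch();
        var second = new IteratorSketch();

        var a = first.GetOrCreateFontInfo(3, "Arial");
        var b = first.GetOrCreateFontInfo(3, "Arial");
        var c = second.GetOrCreateFontInfo(3, "Times");

        // Same iterator, same ID: the cached instance is reused.
        Console.WriteLine(ReferenceEquals(a, b)); // prints True
        // Different iterator, same ID: an independent entry.
        Console.WriteLine(c.Name);                // prints Times
    }
}
```

With a static dictionary, the third lookup would have returned the "Arial" entry for ID 3 even though it came from a different iterator.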

Contributor Author

> it's possible, perhaps even likely, that the IDs may not be unique across processing runs or engines. For instance if you process two separate documents it's possible that the same ID could be returned with different font information. Looking at the documentation for LTRResultIterator::WordFontAttributes confirms this

Aha! This is the part that wasn't quite "clicking" for me, but I get it now. Thanks for being patient 😃 I'll go ahead and fix it.

                if (_cache.ContainsKey(id)) {
                    return _cache[id];
                }

                var newFont = new FontInfo(name, id, isItalic, isBold, isFixedPitch, isSerif, isFraktur);
                _cache.Add(id, newFont);

                return newFont;
            }
        }
    }
}
67 changes: 67 additions & 0 deletions src/Tesseract/Interop/BaseApi.cs
@@ -181,6 +181,27 @@ int BaseApiInit(HandleRef handle, string datapath, string language, int mode,
        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorConfidence")]
        float ResultIteratorGetConfidence(HandleRef handle, PageIteratorLevel level);

        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorWordFontAttributes")]
        IntPtr ResultIteratorWordFontAttributesInternal(HandleRef handle, out bool is_bold, out bool is_italic, out bool is_underlined, out bool is_monospace, out bool is_serif, out bool is_smallcaps, out int pointsize, out int font_id);

        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorWordIsFromDictionary")]
        bool ResultIteratorWordIsFromDictionary(HandleRef handle);

        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorWordIsNumeric")]
        bool ResultIteratorWordIsNumeric(HandleRef handle);

        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorWordRecognitionLanguage")]
        IntPtr ResultIteratorWordRecognitionLanguageInternal(HandleRef handle);

        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorSymbolIsSuperscript")]
        bool ResultIteratorSymbolIsSuperscript(HandleRef handle);

        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorSymbolIsSubscript")]
        bool ResultIteratorSymbolIsSubscript(HandleRef handle);

        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorSymbolIsDropcap")]
        bool ResultIteratorSymbolIsDropcap(HandleRef handle);

        [RuntimeDllImport(Constants.TesseractDllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessResultIteratorGetPageIterator")]
        IntPtr ResultIteratorGetPageIterator(HandleRef handle);

@@ -440,6 +461,52 @@ public static void Initialize()
            }
        }

        public static FontAttributes ResultIteratorWordFontAttributes(HandleRef handle)
        {
            bool is_bold, is_italic, is_underlined,
Owner

For consistency, use camelCase for variable names (e.g. isBold).

                is_monospace, is_serif, is_smallcaps;

            int pointsize, font_id;

            // per docs (ltrresultiterator.h:104 as of 4897796 in github:tesseract-ocr/tesseract)
            // this return value points to an internal table and should not be deleted.
            IntPtr txtHandle =
                Native.ResultIteratorWordFontAttributesInternal(
                    handle,
                    out is_bold, out is_italic, out is_underlined,
                    out is_monospace, out is_serif, out is_smallcaps,
                    out pointsize, out font_id
                );

            // this can happen in certain error conditions.
            if (txtHandle == IntPtr.Zero) {
                return null;
            }

            var fontInfo =
                FontInfo.GetById(font_id)
                ?? FontInfo.GetOrCreate(
                    MarshalHelper.PtrToString(txtHandle, Encoding.UTF8),
                    font_id,
                    is_italic, is_bold,
                    is_monospace, is_serif
                );

            return new FontAttributes(fontInfo, is_underlined, is_smallcaps, pointsize);
        }

        public static string ResultIteratorWordRecognitionLanguage(HandleRef handle)
        {
            // per docs (ltrresultiterator.h:118 as of 4897796 in github:tesseract-ocr/tesseract)
            // this return value should *NOT* be deleted.
            IntPtr txtHandle =
                Native.ResultIteratorWordRecognitionLanguageInternal(handle);

            return txtHandle != IntPtr.Zero
                ? MarshalHelper.PtrToString(txtHandle, Encoding.UTF8)
                : null;
        }

        public static string ResultIteratorGetUTF8Text(HandleRef handle, PageIteratorLevel level)
        {
            IntPtr txtHandle = Native.ResultIteratorGetUTF8TextInternal(handle, level);
69 changes: 69 additions & 0 deletions src/Tesseract/ResultIterator.cs
@@ -28,6 +28,75 @@ public string GetText(PageIteratorLevel level)
            return Interop.TessApi.ResultIteratorGetUTF8Text(handle, level);
        }

        public FontAttributes GetWordFontAttributes() {
            VerifyNotDisposed();
            if (handle.Handle == IntPtr.Zero) {
                return null;
            }

            return Interop.TessApi.ResultIteratorWordFontAttributes(handle);
        }

        public string GetWordRecognitionLanguage()
        {
            VerifyNotDisposed();
            if (handle.Handle == IntPtr.Zero) {
                return null;
            }

            return Interop.TessApi.ResultIteratorWordRecognitionLanguage(handle);
        }

        public bool GetWordIsFromDictionary()
Owner

All these look good :)

        {
            VerifyNotDisposed();
            if (handle.Handle == IntPtr.Zero) {
                return false;
            }

            return Interop.TessApi.Native.ResultIteratorWordIsFromDictionary(handle);
        }

        public bool GetWordIsNumeric()
        {
            VerifyNotDisposed();
            if (handle.Handle == IntPtr.Zero) {
                return false;
            }

            return Interop.TessApi.Native.ResultIteratorWordIsNumeric(handle);
        }

        public bool GetSymbolIsSuperscript()
        {
            VerifyNotDisposed();
            if (handle.Handle == IntPtr.Zero) {
                return false;
            }

            return Interop.TessApi.Native.ResultIteratorSymbolIsSuperscript(handle);
        }

        public bool GetSymbolIsSubscript()
        {
            VerifyNotDisposed();
            if (handle.Handle == IntPtr.Zero) {
                return false;
            }

            return Interop.TessApi.Native.ResultIteratorSymbolIsSubscript(handle);
        }

        public bool GetSymbolIsDropcap()
        {
            VerifyNotDisposed();
            if (handle.Handle == IntPtr.Zero) {
                return false;
            }

            return Interop.TessApi.Native.ResultIteratorSymbolIsDropcap(handle);
        }

        /// <summary>
        /// Gets an instance of a choice iterator using the current symbol of interest. The ChoiceIterator allows a one-shot iteration over the
        /// choices for this symbol and after that is is useless.
2 changes: 2 additions & 0 deletions src/Tesseract/Tesseract.csproj
@@ -102,6 +102,8 @@
    <Compile Include="PolyBlockType.cs" />
    <Compile Include="Properties\AssemblyInfo.cs" />
    <Compile Include="Rect.cs" />
    <Compile Include="FontAttributes.cs" />
    <Compile Include="FontInfo.cs" />
    <Compile Include="ResultIterator.cs" />
    <Compile Include="TesseractEnviornment.cs">
      <SubType>Code</SubType>