Handle rotated text #15

Open
jbaiter opened this issue Jul 23, 2020 · 0 comments
Labels: bug, rendering

jbaiter commented Jul 23, 2020

Currently we don't include any heuristics for detecting text that is rotated, such as in this example from a Wellcome volume:

[Image: rotated]

Due to this, we stretch the text content to fill the available width, which results in this garbled output (which also takes a long time to render for some reason):

[Image: garbled_000]

For this case it probably suffices to include a simple heuristic to detect 90-degree rotated text. Since the OCR does not include any indication of the direction of the rotation, the heuristic should probably fall back to counter-clockwise, given the reading direction.

This should only apply to texts in western LTR scripts. Asian scripts that are written top-down would be rendered badly with this approach.
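A minimal sketch of what such a heuristic could look like, in Python for illustration only. The names `LineBox` and `guess_rotation`, the thresholds, and the `ltr_script` flag are all assumptions made for this example, not part of the project:

```python
from dataclasses import dataclass


@dataclass
class LineBox:
    """Hypothetical container for one OCR line: its text and box size in pixels."""
    text: str
    width: float
    height: float


def guess_rotation(line: LineBox, ltr_script: bool = True,
                   min_chars: int = 3, min_ratio: float = 2.0) -> int:
    """Guess the rotation of a line in degrees counter-clockwise (0 or 90).

    The thresholds are illustrative guesses, not tuned values.
    """
    # Top-down scripts legitimately produce tall boxes, so leave them alone.
    if not ltr_script:
        return 0
    # Very short lines (e.g. a single "I") can be taller than wide
    # without being rotated.
    if len(line.text.strip()) < min_chars:
        return 0
    # A multi-character LTR line that is much taller than wide is most
    # likely rotated by 90 degrees. The direction is unknown, so fall
    # back to counter-clockwise.
    if line.width > 0 and line.height >= min_ratio * line.width:
        return 90
    return 0
```

With the dimensions of the first line in the sample markup, `guess_rotation(LineBox("Vestry.—33 Meetings.", 31, 316))` would flag the line as rotated, while an ordinary wide line is left at 0.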

I don't think we can "fix" this without help from the actual OCR markup. Even in the western script case, a line might be rotated 90 or 270 degrees, and there's no way of telling from this markup, since it includes neither baseline nor rotation information:

<TextBlock ID="P9_TB00006" HEIGHT="316" WIDTH="57" HPOS="649" VPOS="651" STYLEREFS="TXT_43 PAR_CENTER">
  <TextLine ID="P9_TL00007" HEIGHT="316" WIDTH="31" HPOS="649" VPOS="651">
    <String ID="P9_ST00034" CONTENT="Vestry.—33" HEIGHT="167" WIDTH="31" HPOS="649" VPOS="800" WC="0.93" CC="0700000000"/>
    <SP ID="P9_SP00028" WIDTH="0" HPOS="674" VPOS="763"/>
    <String ID="P9_ST00035" CONTENT="Meetings." HEIGHT="133" WIDTH="30" HPOS="650" VPOS="651" WC="1" CC="000000000"/>
  </TextLine>
  <TextLine ID="P9_TL00008" HEIGHT="287" WIDTH="25" HPOS="681" VPOS="666">
    <String ID="P9_ST00036" CONTENT="No." HEIGHT="49" WIDTH="23" HPOS="681" VPOS="904" WC="1" CC="000"/>
    <SP ID="P9_SP00029" WIDTH="0" HPOS="706" VPOS="874"/>
    <String ID="P9_ST00037" CONTENT="of" HEIGHT="27" WIDTH="15" HPOS="690" VPOS="861" WC="1" CC="00"/>
    <SP ID="P9_SP00030" WIDTH="0" HPOS="706" VPOS="828"/>
    <String ID="P9_ST00038" CONTENT="Attendances." HEIGHT="184" WIDTH="22" HPOS="682" VPOS="666" WC="1" CC="000000000000"/>
  </TextLine>
</TextBlock>
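To illustrate what can be read from this markup, here is a rough Python sketch that pulls the line boxes out of a trimmed copy of the fragment above and lists the lines that are much taller than wide. The function name `tall_lines` and the ratio threshold are assumptions for this example, and a full ALTO file would normally need its XML namespace handled, which is skipped here:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fragment above (only the attributes used here).
SAMPLE = """
<TextBlock ID="P9_TB00006" HEIGHT="316" WIDTH="57">
  <TextLine ID="P9_TL00007" HEIGHT="316" WIDTH="31">
    <String CONTENT="Vestry.—33" HEIGHT="167" WIDTH="31"/>
    <String CONTENT="Meetings." HEIGHT="133" WIDTH="30"/>
  </TextLine>
  <TextLine ID="P9_TL00008" HEIGHT="287" WIDTH="25">
    <String CONTENT="No." HEIGHT="49" WIDTH="23"/>
    <String CONTENT="of" HEIGHT="27" WIDTH="15"/>
    <String CONTENT="Attendances." HEIGHT="184" WIDTH="22"/>
  </TextLine>
</TextBlock>
"""


def tall_lines(alto_fragment: str, min_ratio: float = 2.0):
    """Return (line id, text, height/width) for lines much taller than wide."""
    root = ET.fromstring(alto_fragment)
    found = []
    for line in root.iter("TextLine"):
        width = float(line.get("WIDTH"))
        height = float(line.get("HEIGHT"))
        # Join the words of the line in document order.
        text = " ".join(s.get("CONTENT", "") for s in line.iter("String"))
        if width > 0 and height >= min_ratio * width:
            found.append((line.get("ID"), text, height / width))
    return found


for line_id, text, ratio in tall_lines(SAMPLE):
    print(f"{line_id}: {text!r} is {ratio:.1f}x taller than wide")
```

Both lines in the sample are flagged, which is enough to detect that they are rotated but, as noted, not in which direction.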
jbaiter added the bug and rendering labels on Jul 23, 2020