Incorrect reading order when recognizing multi-column layout documents #909

Open
guoguo0646 opened this issue Nov 8, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@guoguo0646

Description of the bug

With version 0.9.0, the reading order is incorrect when recognizing documents with multi-column layouts.

How to reproduce the bug

The source PDF is attached:
14-美国“马赛克战”作战概念解析_雷子欣.pdf
Screenshots showing the incorrect layout reading order:
bdb8fd0fd49940fdbae97364999172f3
image

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

guoguo0646 added the bug label Nov 8, 2024
@myhloli
Collaborator

myhloli commented Nov 8, 2024

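Switching to the layout-order view:
image
image
These are the two problematic pages. The main cause is that the sorting model is purely visual and uses no semantic information, so when an image occupies the upper half of the page and there is text to its right, the model looks for the text block on the right first.
In addition, the span boxes in this document are considerably wider than the actual text, which also makes it easy for the sorting model to misjudge.
After enabling forced OCR, the ordering improves somewhat, as shown below: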
image
image
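
For anyone who wants to try the forced-OCR route, here is a minimal sketch. It assumes that "forced OCR" corresponds to the `-m ocr` method option of the `magic-pdf` command line, and that the `-p`/`-o`/`-m` flags match your installed version. The helper names and the file paths are hypothetical.

```python
import subprocess


def build_ocr_command(pdf_path: str, output_dir: str) -> list[str]:
    """Build a magic-pdf invocation that selects the OCR parse method.

    Assumption: the CLI accepts -p (input PDF), -o (output directory)
    and -m (parse method, here "ocr").
    """
    return ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", "ocr"]


def run_forced_ocr(pdf_path: str, output_dir: str) -> None:
    """Run the command; raises CalledProcessError if magic-pdf exits non-zero."""
    subprocess.run(build_ocr_command(pdf_path, output_dir), check=True)
```

Calling `run_forced_ocr("input.pdf", "output")` would then re-parse the file with OCR so the ordering can be compared against the default run.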

@guoguo0646
Author

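Thanks. Parsing with the OCR method does improve the result, but the order is still scrambled: the end of column 1 is joined to column 3. There is another problem: "效果网" in the original document is parsed as "效 果网", with an extra space. Also, regarding your remark that "the span boxes in this document are considerably wider than the actual text, which also makes it easy for the sorting model to misjudge": how can I check that the span boxes are much wider than the original text?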
image
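
Until the extra space is fixed upstream, one possible workaround is to post-process the extracted text. The sketch below is not something the maintainers suggested. It removes whitespace that sits between two CJK ideographs, and the function name `remove_cjk_inner_spaces` is hypothetical. It only covers the basic CJK Unified Ideographs range, so it would miss rarer characters and would also delete spaces that were intentional.

```python
import re

# Whitespace with a CJK ideograph (U+4E00 to U+9FFF) on both sides.
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])\s+(?=[\u4e00-\u9fff])")


def remove_cjk_inner_spaces(text: str) -> str:
    """Drop whitespace between two CJK ideographs; leave other spacing alone."""
    return _CJK_GAP.sub("", text)
```

Spaces between Latin words, or between a CJK character and a Latin word, are left untouched.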

@myhloli
Collaborator

myhloli commented Nov 14, 2024

image
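Normally the red span boxes fit tightly around the text. In this visualization the red boxes are roughly three times as wide as the text itself.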
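
The same check can be approximated numerically instead of by eye. The sketch below is a rough heuristic under stated assumptions: each span is a dict with a `bbox` of `[x0, y0, x1, y1]` and a `content` string (a hypothetical input shape, adapt it to your parser output), a full-width character is about as wide as the line is tall, and other characters are about half that. The threshold of 2.0 is arbitrary.

```python
import unicodedata


def estimated_text_width(text: str, line_height: float) -> float:
    """Rough width: wide/full-width chars ~1 line height, others ~0.5."""
    width = 0.0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += line_height
        else:
            width += 0.5 * line_height
    return width


def span_width_ratio(span: dict) -> float:
    """Ratio of the span box width to the estimated text width."""
    x0, y0, x1, y1 = span["bbox"]
    expected = estimated_text_width(span["content"], y1 - y0)
    if expected <= 0:
        return 0.0  # empty or degenerate span: nothing to compare against
    return (x1 - x0) / expected


def find_wide_spans(spans: list, threshold: float = 2.0) -> list:
    """Return the spans whose box is much wider than their text suggests."""
    return [s for s in spans if span_width_ratio(s) > threshold]
```

A ratio near 1 means the box fits the text. A ratio near 3 would match the "three times as wide" case described above.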
