Header recognition is incomplete, table with border in PDF #1177
stevenwu2017
started this conversation in
Ask for help with specific PDFs
-
It seems that there are two issues:
Filtering out those extra graphical elements seems to work:

    def check(obj):
        # Keep only `chars` and objects with a black fill
        return obj.get("text") or obj.get("non_stroking_color") == (0, 0, 0)

    page.filter(check).to_image().debug_tablefinder()
    page.filter(check).extract_tables()
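
To make the effect of the predicate easier to see, here is a minimal sketch that applies the same `check` function to a few hand-written object dictionaries instead of a real page. The sample dictionaries are made up for illustration and carry only the keys the predicate looks at; real pdfplumber objects have many more keys, and the fill color in a given PDF may be stored in a different color space (for example a single grayscale value), in which case the comparison against `(0, 0, 0)` would need adjusting.

```python
def check(obj):
    # Keep only `chars` (they carry a "text" key) and objects with a black fill
    return obj.get("text") or obj.get("non_stroking_color") == (0, 0, 0)


# Hypothetical stand-ins for page objects; not taken from 02498_ZH.pdf
sample_objects = [
    {"object_type": "char", "text": "A", "non_stroking_color": (0.5, 0.5, 0.5)},
    {"object_type": "rect", "non_stroking_color": (0, 0, 0)},        # black fill: kept
    {"object_type": "rect", "non_stroking_color": (0.9, 0.9, 0.9)},  # grey fill: dropped
    {"object_type": "line"},                                         # no fill, no text: dropped
]

# Mimic what a filter does: keep objects for which the predicate is truthy
kept = [obj for obj in sample_objects if check(obj)]
print([obj["object_type"] for obj in kept])
```

Note that `check` returns the character's text rather than `True` for chars; any truthy value is enough for the object to be kept.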
-
I have a problem with incomplete header recognition for a bordered table in a PDF,
but I don't know what causes it. The PDF file and the debug_tablefinder result are as follows:
02498_ZH.pdf
Python code:

    import pdfplumber

    pdf_files = '02498_ZH.pdf'
    with pdfplumber.open(pdf_files) as pdf:
        # Iterate over each page
        for idx, page in enumerate(pdf.pages):
            # if idx==0:
            # Note: deduped_page is not used below; tables are found on the original page
            deduped_page = page.dedupe_chars()
            tables = page.find_tables()
            for table in tables:
                content = table.extract()
                print(content)
                # Crop to the detected table and render the table finder's debug overlay
                sub_page = page.crop(table.bbox)
                im = sub_page.to_image(resolution=200)
                im.debug_tablefinder().show()