Text extraction in combination with tables #1026
Replies: 4 comments 9 replies
-
Hi @lawrencenika, and thanks for your query. Re. this:
If you find a solution that works for you, I'd welcome a PR that adds a notebook demonstrating it, something that's been discussed here on a similar topic: #1012 (reply in thread)
-
As mentioned in #1019, this particular PDF is extremely awkward because the "table lines" are not actual single lines. To save having to re-read the linked discussion, here is an attempted summary:

Identify table

We search for the bolded "Table N" text followed by a line that contains all bolded words (the column names).

Code

    def find_horizontal_lines():
        """
        Needs more logic to make sure the lines are "contiguous"
        """
        # Group edge fragments that share (almost) the same vertical position
        line_groups = cluster_objects(
            page.horizontal_edges,
            itemgetter("top"),
            tolerance=1
        )
        for lines in line_groups:
            left = min(lines, key=itemgetter("x0"))["x0"]
            right = max(lines, key=itemgetter("x1"))["x1"]
            width = right - left
            # multiple line fragments and "almost" full width of page
            if len(lines) > 2 and page.width / width <= 1.5:
                yield {
                    "top": lines[0]["top"],
                    "left": left,
                    "right": right
                }

    def find_table_headings():
        TABLE_HEADER_RE = r"(?<=\n)(Table \d.*)\n(.+)"
        is_bold_font = lambda chars: (
            chars and all("bold" in char["fontname"].casefold() for char in chars)
        )
        for lines in page.search(TABLE_HEADER_RE):
            if is_bold_font(lines["chars"][:5]):  # Is "Table" in bold font?
                table_name = lines["groups"][0]
                next_line = re.escape(lines["groups"][1])
                next_line = page.search(next_line)[0]["chars"]  # `.search` to get font info
                if is_bold_font(next_line):
                    yield {
                        "name": table_name,
                        "top": lines["top"],
                        "page_number": page.page_number
                    }

Filter table area from page

If we take the first table as an example and ignore the fact that it spans multiple pages, we define the table area as the "3 lines" that follow the "table heading/name".

    import bisect
    import pdfplumber
    import re
    from operator import itemgetter
    from pdfplumber.utils import cluster_objects, get_bbox_overlap, obj_to_bbox

    pdf = pdfplumber.open("Downloads/Life_Cycle_Assessment_of_Cow_Tanned_Leather_Produc.pdf")
    page = pdf.pages[3]

    table = next(find_table_headings())
    horizontal_lines = sorted(find_horizontal_lines(), key=itemgetter("top"))
    filtered_page = page

    # First horizontal line at or below the heading (`key=` needs Python 3.10+)
    idx = bisect.bisect_left(horizontal_lines, table["top"], key=itemgetter("top"))
    table_lines = horizontal_lines[idx: idx + 3]
    del horizontal_lines[idx: idx + 3]

    # The table area runs from the first of the 3 lines down to the last
    line = table_lines[0]
    bottom = table_lines[-1]['top']
    bbox = line['left'], line['top'], line['right'], bottom

    # Keep only the objects that do not overlap the table area
    filtered_page = filtered_page.filter(lambda obj:
        get_bbox_overlap(obj_to_bbox(obj), bbox) is None
    )

One way to define the next step could be: Can we

Perhaps @jsvine has some ideas on the matter.

[Update]: So it looks like the text is cached by pdfplumber (line 572 in d9561d1). If I remove this, I can successfully

It then appears in the output of

Is this possibly a reliable approach?
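The "3 lines that follow the heading" step can be tried without the PDF. This is only a minimal sketch: `lines_after` and `bbox_from_lines` are hypothetical helper names, the coordinates are made up, and the lines are plain dicts shaped like the ones `find_horizontal_lines()` yields. It bisects on a separate list of `top` values, so it does not rely on the `key=` argument of `bisect_left`.

```python
import bisect
from operator import itemgetter


def lines_after(heading_top, lines, count=3):
    """Return the first `count` lines at or below `heading_top`."""
    ordered = sorted(lines, key=itemgetter("top"))
    tops = [line["top"] for line in ordered]
    idx = bisect.bisect_left(tops, heading_top)
    return ordered[idx: idx + count]


def bbox_from_lines(table_lines):
    """Build (x0, top, x1, bottom) from the first and last line of a table."""
    first, last = table_lines[0], table_lines[-1]
    return first["left"], first["top"], first["right"], last["top"]


# Made-up coordinates, not taken from the PDF in this thread.
rules = [
    {"top": 50, "left": 40, "right": 550},   # a rule above the heading
    {"top": 120, "left": 40, "right": 550},  # top rule of the table
    {"top": 140, "left": 40, "right": 550},  # rule under the column names
    {"top": 300, "left": 40, "right": 550},  # bottom rule of the table
    {"top": 420, "left": 40, "right": 550},  # belongs to a later table
]

table_lines = lines_after(heading_top=100, lines=rules)
bbox = bbox_from_lines(table_lines)
```

The resulting `bbox` has the same shape as the tuple passed to `get_bbox_overlap` above. The sketch assumes every table has exactly three rules, which is the same assumption the comment makes.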
-
I think my previous replies may have derailed the discussion somewhat, @jsvine. This is a more basic version of the task: basic-2-tables.pdf

We can attempt to crop out the table and inject the new formatted text in its place:

    import pdfplumber
    import pandas as pd
    from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox

    pdf = pdfplumber.open("basic-2-tables.pdf")
    page = pdf.pages[0]

    filtered_page = page
    chars = filtered_page.chars

    for table in page.find_tables():
        # Remember where the table starts so the replacement text can be anchored there
        first_table_char = page.crop(table.bbox).chars[0]

        # Drop every object that overlaps the table area
        filtered_page = filtered_page.filter(lambda obj:
            get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
        )
        chars = filtered_page.chars

        # Use the first row as the header and render the rest as markdown
        # (`to_markdown` needs the `tabulate` package)
        df = pd.DataFrame(table.extract())
        df.columns = df.iloc[0]
        markdown = df.drop(0).to_markdown(index=False)

        # Inject the markdown as one "char" at the table's first position (dict `|` needs Python 3.9+)
        chars.append(first_table_char | {"text": markdown})

    print(extract_text(chars, layout=True))

It sort of works, but the new text loses alignment after the first line. (Not sure if we can re-align properly without messing up the surrounding layout?)
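One idea for the alignment problem, offered as an untested sketch rather than a fix: pad every continuation line of the markdown with spaces so that it starts in the same text column as the first line. `indent_continuation_lines` is a hypothetical helper, and the default `x_density=7.25` (points per character column in layout mode) is an assumption that should be checked against the pdfplumber version in use.

```python
def indent_continuation_lines(block, x0, x_density=7.25):
    """Pad every line after the first so it starts at the column of `x0`.

    `x0` is the left coordinate of the anchor char (in points) and
    `x_density` is the assumed number of points per text column.
    """
    column = int(x0 / x_density)
    first, *rest = block.split("\n")
    pad = " " * column
    return "\n".join([first] + [pad + line for line in rest])


# Made-up markdown table anchored at x0 = 14.5 points, i.e. column 2.
markdown = "| a | b |\n|---|---|\n| 1 | 2 |"
aligned = indent_continuation_lines(markdown, x0=14.5)
```

In the snippet above this would be applied to `markdown` using `first_table_char["x0"]` before appending it to `chars`. Whether the layout pass then keeps the lines aligned depends on how it treats newlines embedded in a single char, which has not been verified here.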
-
Thank you both, these are really awesome solutions. However, as I try them back on the
-
I have this PDF https://drive.google.com/file/d/167Y6KKW5cv0-7r8FV830iofWvYZWP4b6/view?usp=sharing that contains tables and text.
Earlier I tried using the default
page.find_tables()
according to #242, but it still included my table 1 in the response. So I have this crazy query: can pdfplumber read the text and the tables in sequential order, i.e. read the text line by line, then, when there is a table, read the table column by column, and then continue on to the next section? There might be tables that span across pages, but I would still want to read them column by column consistently.
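To make the requested reading order concrete, here is a minimal sketch that works on plain dicts rather than on pdfplumber objects. The `page`, `top`, `bottom`, `text` and `rows` keys and the `reading_order` function are hypothetical stand-ins for whatever the extraction step produces. Text lines that fall inside a table's vertical range are dropped, the remaining lines and the tables are sorted by page and position, and each table is transposed so it is read column by column.

```python
def reading_order(text_lines, tables):
    """Merge text lines and tables into one top-to-bottom sequence."""

    def inside_table(line):
        # A line belongs to a table if it sits within the table's vertical range
        return any(
            line["page"] == t["page"] and t["top"] <= line["top"] < t["bottom"]
            for t in tables
        )

    blocks = [
        (line["page"], line["top"], "text", line["text"])
        for line in text_lines
        if not inside_table(line)
    ]
    for t in tables:
        # zip(*rows) transposes rows into columns
        columns = [list(col) for col in zip(*t["rows"])]
        blocks.append((t["page"], t["top"], "table", columns))

    blocks.sort(key=lambda block: (block[0], block[1]))
    return [(kind, payload) for _, _, kind, payload in blocks]


# Made-up page content for illustration.
text_lines = [
    {"page": 1, "top": 10, "text": "Intro"},
    {"page": 1, "top": 60, "text": "a b"},  # lies inside the table below
    {"page": 1, "top": 120, "text": "Outro"},
]
tables = [
    {"page": 1, "top": 50, "bottom": 100, "rows": [["h1", "h2"], ["a", "b"]]},
]
sequence = reading_order(text_lines, tables)
```

Tables that continue on the next page are not merged here. They would come out as two separate table blocks, so joining them would need an extra rule, such as matching column counts.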