Text extraction in combination with tables #1026

lawrencenika · 2023-10-26T04:47:03Z

lawrencenika
Oct 26, 2023

I have this PDF https://drive.google.com/file/d/167Y6KKW5cv0-7r8FV830iofWvYZWP4b6/view?usp=sharing that contains tables and text.
Earlier I tried using the default page.find_tables() according to #242 but it still included my table 1 in response.

So I have this crazy query: can pdfplumber read the text and the tables in sequential order, i.e. read text line by line, then, when there is a table, read the table column by column, and then continue on to the next section? There might be tables that span across pages, but I would still want to read them column by column consistently.

jsvine · 2023-10-26T13:43:53Z

jsvine
Oct 26, 2023
Maintainer

Hi @lawrencenika, and thanks for your query. Re. this:

So I have this crazy query: can pdfplumber read the text and the tables in sequential order, i.e. read text line by line, then, when there is a table, read the table column by column, and then continue on to the next section? There might be tables that span across pages, but I would still want to read them column by column consistently.

pdfplumber does not currently provide this functionality because the layout of pages can differ so substantially between PDFs, and each user may have different expectations for how that layout is rendered. That said, you could probably get much of the way there using .find_tables(...), .crop(...), and .extract_text_lines(...), plus a little bit of custom logic.
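As a rough illustration of the "custom logic" part, here is a minimal sketch that merges text lines and tables by vertical position and transposes each table so it can be read column by column. The helper names `inside` and `in_reading_order` are hypothetical (not part of pdfplumber); the sketch only assumes that text lines are dicts with `text`, `x0`, `x1`, `top` and `bottom` keys, as returned by `.extract_text_lines(...)`, and that each table is passed as a `(bbox, rows)` pair.

```python
from operator import itemgetter


def inside(line, bbox):
    """True if the centre of `line` lies inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    cx = (line["x0"] + line["x1"]) / 2
    cy = (line["top"] + line["bottom"]) / 2
    return x0 <= cx <= x1 and top <= cy <= bottom


def in_reading_order(text_lines, tables):
    """Interleave text lines and tables, sorted by their top coordinate.

    text_lines: dicts shaped like the output of page.extract_text_lines()
    tables:     (bbox, rows) pairs, e.g. (t.bbox, t.extract())
    """
    # Keep only the text lines that are not part of any table
    items = [
        {"kind": "text", "top": line["top"], "value": line["text"]}
        for line in text_lines
        if not any(inside(line, bbox) for bbox, _ in tables)
    ]
    for bbox, rows in tables:
        # Transpose rows -> columns so the table reads column by column
        columns = [list(col) for col in zip(*rows)]
        items.append({"kind": "table", "top": bbox[1], "value": columns})
    return sorted(items, key=itemgetter("top"))


# Possible usage with pdfplumber (file name is a placeholder):
# with pdfplumber.open("example.pdf") as pdf:
#     page = pdf.pages[0]
#     tables = [(t.bbox, t.extract()) for t in page.find_tables()]
#     for item in in_reading_order(page.extract_text_lines(), tables):
#         print(item["kind"], item["value"])
```

This handles a single page only; tables that continue across pages would still need to be stitched together separately.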

If you find a solution that works for you, I'd welcome a PR that adds a notebook demonstrating it — something that's been discussed here on a similar topic: #1012 (reply in thread)

4 replies

cmdlineluser Oct 27, 2023

Have there been any suggestions for adding "directional" cropping helpers to Table objects, @jsvine?

There could be one for each direction and for table.bbox itself.

table.crop.above()
table.crop.below()
table.crop.left()
table.crop.right()
table.crop.self()

If we use the "find last line before table" example:

table = page.find_table()

# pseudo-code: page.crop((0, 0, page.width, table.bbox[1]))
title = table.crop.above().extract_text_lines()[-1]

Basically just a way to avoid having to manually unpack the x0, top, bbox information.
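Until something like that exists, the bbox arithmetic can be wrapped in a small helper. This is only a sketch: `relative_bbox` is a hypothetical name, not a pdfplumber function, and it assumes bboxes are `(x0, top, x1, bottom)` tuples as used by `page.bbox` and `table.bbox`.

```python
def relative_bbox(page_bbox, table_bbox, side):
    """Return the region of the page on the given side of a table.

    Both bboxes are (x0, top, x1, bottom) tuples.
    side is one of "above", "below", "left", "right", "self".
    """
    px0, ptop, px1, pbottom = page_bbox
    tx0, ttop, tx1, tbottom = table_bbox
    regions = {
        "above": (px0, ptop, px1, ttop),
        "below": (px0, tbottom, px1, pbottom),
        "left": (px0, ttop, tx0, tbottom),
        "right": (tx1, ttop, px1, tbottom),
        "self": table_bbox,
    }
    return regions[side]


# Possible usage with pdfplumber ("find last line before table"):
# table = page.find_table()
# above = page.crop(relative_bbox(page.bbox, table.bbox, "above"))
# title = above.extract_text_lines()[-1]
```

Note that `page.crop(...)` may raise an error if the resulting region is empty (for example, a table that starts at the very top of the page), so real code would want to guard against that.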

Another idea which I'm less sure about was perhaps some way to "register" table objects with a page.

page.register_table(table)
# page.objects['table'].append(table)
# page.filter(lambda obj: obj.bbox not in table.bbox)

# some general extract method that gives text_line objects + tables objects in "order"
page.extract()

The idea being we can just let pdfplumber sort the objects instead of extracting lines / tables manually and sorting them e.g. sorted(list_of_objects, key=itemgetter('top')) (like in #1005)

jsvine Oct 27, 2023
Maintainer

Really interesting @cmdlineluser; thanks for flagging. What if we did something like:

tables = page.find_tables()
page.anticrop(tables[0].bbox)
# and/or
page.anticrop_multiple([ t.bbox for t in tables ])

Or is it important, from the use-cases you've seen, to be able to specify left/right/above/below independently?
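For reference, something close to the proposed `anticrop_multiple` can already be approximated with the existing `page.filter(...)` method. The sketch below builds the predicate in plain Python; `overlaps` and `make_anticrop_filter` are hypothetical helper names, and the only assumption is that every object dict carries `x0`, `top`, `x1` and `bottom` keys.

```python
def overlaps(a, b):
    """True if bboxes a and b, each (x0, top, x1, bottom), share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def make_anticrop_filter(bboxes):
    """Build a predicate that keeps objects overlapping none of `bboxes`."""
    bboxes = [tuple(b) for b in bboxes]

    def keep(obj):
        obj_bbox = (obj["x0"], obj["top"], obj["x1"], obj["bottom"])
        return not any(overlaps(obj_bbox, b) for b in bboxes)

    return keep


# Possible usage with pdfplumber:
# tables = page.find_tables()
# without_tables = page.filter(make_anticrop_filter([t.bbox for t in tables]))
# print(without_tables.extract_text())
```

Because the bboxes are captured once when the predicate is built, the same pattern is safe to use inside a loop over tables.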

cmdlineluser Oct 27, 2023

Yeah, it wasn't really from any use-cases in particular, sorry.

It was just that these table/text relative position topics got me wondering in general if perhaps the directional methods could allow for things like:

# find first `foo` below the first table
(page.find_table()
     .crop
     .below()
     .search('foo')
)

Or if for example there was also a Text object, it could have the same "api"

# first table below first `foo`
(page.search("foo")[0] # .first()?
     .crop
     .below()
     .find_table()
)

But perhaps I've gone too far off on a tangent.

jsvine Nov 2, 2023
Maintainer

Really interesting. I'm reluctant to add a bunch of methods to the Table objects (or the results of .search(...), etc.), but I wonder if there's another way to enable this sort of interaction. Maybe something like:

(
  page.pipe()  # <- A proposed helper, which would return a Pipe object
  .search("foo") # <- Returns a Pipe-wrapped version of the standard `.search(...)` method
  .get(0) # <- Returns a Pipe-wrapped version of the first search result
  .crop()
)

Probably a bit complicated to implement, but could be a powerful way to interact with a page.

cmdlineluser · 2023-10-29T18:19:39Z

cmdlineluser
Oct 29, 2023

As mentioned in #1019 - this particular PDF is extremely awkward due to the "table lines" not being actual single lines.

To save having to re-read the linked discussion, here is an attempted summary:

Identify table

We search for the bolded "Table N" text followed by a line that contains all bolded words (the column names).

Code

def find_horizontal_lines():
    """
    Needs more logic to make sure the lines are "contiguous"
    """
    line_groups = cluster_objects(
        page.horizontal_edges,
        itemgetter("top"),
        tolerance=1
    )

    for lines in line_groups:
        left = min(lines, key=itemgetter("x0"))["x0"]
        right = max(lines, key=itemgetter("x1"))["x1"]
        width = right - left

        # multiple line fragments and "almost" full width of page
        if len(lines) > 2 and page.width / width <= 1.5:
            yield {
                "top": lines[0]["top"],
                "left": left,
                "right": right
            }

def find_table_headings():
    TABLE_HEADER_RE = r"(?<=\n)(Table \d.*)\n(.+)"

    is_bold_font = lambda chars: (
        chars and all("bold" in char["fontname"].casefold() for char in chars)
    )

    for lines in page.search(TABLE_HEADER_RE):
        if is_bold_font(lines["chars"][:5]): # Is "Table" in bold font?
            table_name = lines["groups"][0]

            next_line = re.escape(lines["groups"][1])
            next_line = page.search(next_line)[0]["chars"] # `.search` to get font info

            if is_bold_font(next_line):
                yield {
                    "name": table_name, 
                    "top": lines["top"],
                    "page_number": page.page_number
                }
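To make the heading pattern easier to follow, here is a small standalone sketch that applies the same regular expression to plain text. The sample string is made up for illustration (it is not taken from the PDF in this thread), and `find_heading_pairs` is a hypothetical helper name. The bold-font checks from the code above are not reproduced here, since they need the char-level data that `page.search(...)` returns.

```python
import re

# Same pattern as above: a line starting with "Table <digit>", then the next line
TABLE_HEADER_RE = r"(?<=\n)(Table \d.*)\n(.+)"


def find_heading_pairs(text):
    """Return (table_name, following_line) pairs found in plain text."""
    return [(m.group(1), m.group(2)) for m in re.finditer(TABLE_HEADER_RE, text)]


# Hypothetical page text for illustration only
sample = (
    "Intro text\n"
    "Table 1 Inventory data\n"
    "Input Unit Amount\n"
    "Water kg 10\n"
)
pairs = find_heading_pairs(sample)
```

One thing worth noticing: because of the `(?<=\n)` lookbehind, a "Table N" heading on the very first line of the text has no preceding newline and is not matched.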

Filter table area from page

If we take the first table as an example (and ignore the fact that it spans multiple pages), we can define the table area as the "3 lines" that follow the "table heading/name".

import bisect
import pdfplumber
import re

from operator import itemgetter
from pdfplumber.utils import cluster_objects, get_bbox_overlap, obj_to_bbox

pdf = pdfplumber.open("Downloads/Life_Cycle_Assessment_of_Cow_Tanned_Leather_Produc.pdf")

page = pdf.pages[3]

table = next(find_table_headings())
horizontal_lines = sorted(find_horizontal_lines(), key=itemgetter("top"))

filtered_page = page 

# Note: the `key` argument to bisect_left requires Python 3.10+
idx = bisect.bisect_left(horizontal_lines, table["top"], key=itemgetter("top"))

table_lines = horizontal_lines[idx: idx + 3]
del horizontal_lines[idx: idx + 3]

line = table_lines[0]
bottom = table_lines[-1]['top']

bbox = line['left'], line['top'], line['right'], bottom

filtered_page = filtered_page.filter(lambda obj: 
    get_bbox_overlap(obj_to_bbox(obj), bbox) is None
)
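The "3 lines after the heading" step can also be isolated into a small function, which makes it easier to test without a PDF. This is a sketch under the same assumptions as the code above (each line is a dict with `top`, `left` and `right` keys); `table_bbox_from_lines` is a hypothetical name. It bisects over a precomputed list of `top` values, so it does not rely on the `key` argument that `bisect.bisect_left` only gained in Python 3.10.

```python
import bisect
from operator import itemgetter


def table_bbox_from_lines(horizontal_lines, heading_top, n_lines=3):
    """Return (x0, top, x1, bottom) spanned by the first `n_lines`
    horizontal lines at or below `heading_top`, or None if there are too few."""
    lines = sorted(horizontal_lines, key=itemgetter("top"))
    tops = [line["top"] for line in lines]

    # Index of the first line whose top is >= the heading's top
    idx = bisect.bisect_left(tops, heading_top)
    table_lines = lines[idx: idx + n_lines]
    if len(table_lines) < n_lines:
        return None

    first, last = table_lines[0], table_lines[-1]
    return (first["left"], first["top"], first["right"], last["top"])
```

Returning `None` when fewer than `n_lines` lines are found is one way to signal that the table probably continues on the next page.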

One way to define the next step could be:

Can we call filtered_page.extract_text(layout=True) but somehow insert/overlay our own custom reformatted table output in place of the original table area?

Perhaps @jsvine has some ideas on the matter.

[Update]: So it looks like the text is cached by pdfplumber

pdfplumber/pdfplumber/page.py

Line 572 in d9561d1

self.get_textmap = lru_cache()(self._get_textmap)

If I remove this, I can successfully filtered_page.chars.append() by copying the first char from the table area and modifying 'text': markdown_text

It then appears in the output of .extract_text(layout=True) in the "correct spot" (but something needs to be done for the newlines/indentation)

Is this possibly a reliable approach?

2 replies

jsvine Oct 30, 2023
Maintainer

[Update]: So it looks like the text is cached by pdfplumber

If I remove this, I can successfully filtered_page.chars.append() by copying the first char from the table area and modifying 'text': markdown_text

If you want to bypass the cache (or just generally run text extraction on an arbitrary set of chars), you can do this:

text = pdfplumber.utils.extract_text(my_list_of_chars)

You can also pass any of the standard arguments, e.g.,:

text = pdfplumber.utils.extract_text(my_list_of_chars, layout=True)

Does that help, or have I misunderstood the goal here?

cmdlineluser Oct 30, 2023

My bad, I didn't realize it was available in pdfplumber.utils - thank you.

Using a simpler PDF example: table.pdf

I suppose the underlying question was can we crop out the table and inject the new text:

import pdfplumber
import pandas as pd

from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox

pdf = pdfplumber.open("table.pdf")
page = pdf.pages[0]

table = page.find_table()

first_table_char = page.crop(table.bbox).chars[0]

filtered_page = page.filter(lambda obj: 
    get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
)

df = pd.DataFrame(table.extract())
df.columns = df.iloc[0]

markdown = df.drop(0).to_markdown(index=False)

chars = filtered_page.chars
chars.append(first_table_char | {"text": markdown})

print(extract_text(chars, layout=True))

It sort of works, but the indentation/alignment for lines 2-onwards needs fixing.


    Hello
             world!



    | First name   | Last name   |   Age | City                     |
|:-------------|:------------|------:|:-------------------------|
| Jules        | Smith       |    34 | San Juan                 |
| Mary         | Ramos       |    45 | Orlando                  |
| Carlson      | Banks       |    19 | Los Angeles              |
| Lucas        | Cimon       |    31 | Saint-Mahturin-sur-Loire |











    Some  more
    text
           over here.

cmdlineluser · 2023-11-01T20:17:38Z

cmdlineluser
Nov 1, 2023

I think my previous replies may have derailed the discussion somewhat @jsvine.

This is a more basic version of the task: basic-2-tables.pdf

We can attempt to crop out the table and inject the new formatted text in its place:

import pdfplumber
import pandas as pd

from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox

pdf = pdfplumber.open("basic-2-tables.pdf")

page = pdf.pages[0]

filtered_page = page
chars = filtered_page.chars

for table in page.find_tables():
    first_table_char = page.crop(table.bbox).chars[0]

    filtered_page = filtered_page.filter(lambda obj: 
        get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
    )

    chars = filtered_page.chars

    df = pd.DataFrame(table.extract())
    df.columns = df.iloc[0]

    markdown = df.drop(0).to_markdown(index=False)

    chars.append(first_table_char | {"text": markdown})


print(extract_text(chars, layout=True))

It sort of works but the new text loses alignment after the first line.

(Not sure if we can re-align properly without messing up the surrounding layout?)



    Hello
             world!



    | First name   | Last name   |   Age | City        |
|:-------------|:------------|------:|:------------|
| Foo          | Smith       |    34 | San Juan    |
| Bar          | Ramos       |    45 | Orlando     |
| Baz          | Banks       |    19 | Los Angeles |









    Some  more
    text
           over here.



    | AAA   | BBBB   | C   | DD   |
|:------|:-------|:----|:-----|
| A1    | B1     | C1  | D1   |
| A2    | B2     | C2  | D2   |
| A3    | B3     | C3  | D3   |











                                 Footer.

2 replies

jsvine Nov 2, 2023
Maintainer

Wow, that's a very clever approach. Re. the text alignment, what if you try replacing this:

chars.append(first_table_char | {"text": markdown})

... with this:

for i, markdown_line in enumerate(markdown.split("\n")):
    new_attrs = {
        "text": markdown_line,
        "doctop": first_table_char["doctop"] + first_table_char["height"] * i,
    }
    chars.append(first_table_char | new_attrs)

The idea here is to split up the table into its lines, and to append those lines individually, incrementing the doctop for each line.

This, however, might cause some problems if, e.g., the final doctop exceeds the next chunk of non-table text. I can see a few ways around this (maybe by only incrementing by y_tolerance at a time, or by moving all the other non-table chars down after each pass?) ... but I imagine you may have other thoughts based on the above.
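One way to detect that situation before injecting anything is to compare the bottom of the injected block with the nearest text below it. A rough sketch, assuming every injected line advances by a fixed `line_height`; `injected_block_overflow` is a hypothetical helper, not something pdfplumber provides.

```python
def injected_block_overflow(first_top, line_height, n_lines, other_tops):
    """Return how far (in page units) the injected lines would run past
    the nearest existing text below them, or 0.0 if they fit.

    first_top:  top of the first injected line
    other_tops: top values of the remaining (non-table) chars on the page
    """
    block_bottom = first_top + line_height * n_lines
    below = [top for top in other_tops if top >= first_top]
    if not below:
        return 0.0
    return max(0.0, block_bottom - min(below))
```

A positive result could then drive either of the workarounds mentioned above, e.g. shrinking the per-line increment or shifting the later chars down by that amount.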

cmdlineluser Nov 2, 2023

Ah, very interesting!

Incrementing doctop and top does work for the given example.
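For anyone following along, a variant of the snippet above that shifts top (and bottom) together with doctop might look like the sketch below. `markdown_to_line_chars` is a hypothetical helper name, and whether bottom also needs shifting is an assumption on my part; it is adjusted here only so that each pseudo-char keeps a consistent height.

```python
def markdown_to_line_chars(template_char, markdown, line_height=None):
    """Turn each line of `markdown` into a copy of `template_char`,
    shifted down by one line height per line."""
    if line_height is None:
        line_height = template_char["height"]

    line_chars = []
    for i, line in enumerate(markdown.split("\n")):
        offset = line_height * i
        line_chars.append({
            **template_char,
            "text": line,
            "top": template_char["top"] + offset,
            "bottom": template_char["bottom"] + offset,
            "doctop": template_char["doctop"] + offset,
        })
    return line_chars


# Possible usage, replacing the single chars.append(...) call:
# chars.extend(markdown_to_line_chars(first_table_char, markdown))
```

The horizontal position (`x0`, `x1`) is copied unchanged from the template char, which is what keeps every line starting at the table's left edge.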

This, however, might cause some problems

Yeah, very good point - I was also thinking if a table has text on either side of it, that would probably get messed up too.

I'll mess around with it some more - thanks for the pointers.

    Hello
             world!



    | First name   | Last name   |   Age | City        |
    |:-------------|:------------|------:|:------------|
    | Foo          | Smith       |    34 | San Juan    |
    | Bar          | Ramos       |    45 | Orlando     |
    | Baz          | Banks       |    19 | Los Angeles |





    Some  more
    text
           over here.



    | AAA   | BBBB   | C   | DD   |
    |:------|:-------|:----|:-----|
    | A1    | B1     | C1  | D1   |
    | A2    | B2     | C2  | D2   |
    | A3    | B3     | C3  | D3   |







                                 Footer.

lawrencenika · 2023-11-05T15:58:05Z

lawrencenika
Nov 5, 2023
Author

Thank you both, these are really awesome solutions. However, when I try them on
Life_Cycle_Assessment_of_Cow_Tanned_Leather_Produc.pdf
page.find_tables() simply doesn't find any tables without tweaking the table_settings. And even if the settings can be customized for this PDF, the customization may not work for other PDFs. Do you have any recommendations for identifying all types of tables while also combining them with the text outside the tables?
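No single set of table_settings is likely to work for every PDF, but one pragmatic option is to try a short list of settings in order and keep the first that finds something. The sketch below assumes only that the page object exposes `find_tables(table_settings=...)`, as pdfplumber pages do; the candidate list and the name `find_tables_with_fallback` are illustrative, not a recommendation from the maintainers.

```python
# Candidate settings, tried in order; the list itself is just an example
CANDIDATE_SETTINGS = [
    {},  # pdfplumber defaults (line-based detection)
    {"vertical_strategy": "text", "horizontal_strategy": "lines"},
    {"vertical_strategy": "text", "horizontal_strategy": "text"},
]


def find_tables_with_fallback(page, candidates=CANDIDATE_SETTINGS):
    """Return (settings, tables) for the first candidate that finds any
    table on `page`, or (None, []) if none of them do."""
    for settings in candidates:
        tables = page.find_tables(table_settings=settings)
        if tables:
            return settings, tables
    return None, []


# Possible usage with pdfplumber:
# settings, tables = find_tables_with_fallback(pdf.pages[3])
```

The text-based strategies can produce false positives on ordinary paragraphs, so the results would still need a sanity check (for example, a minimum number of rows and columns).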

1 reply

gksucs Sep 9, 2024

Hey, I was just wondering: did you find an answer?
