Extracting tables spanning multiple pages #1188

hvina · 2024-08-20T04:34:34Z

I need to extract tables that span multiple pages. I am stuck on getting the page numbers from the annotations on each page so that I can define a region.

Approach 1:

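import pdfplumber

def extract_tables(pdf_path, password=None):
    tables = []
    with pdfplumber.open(pdf_path, password=password) as pdf:
        for page in pdf.pages:
            # extract_table() returns the largest table on the page, or None if there is none
            table = page.extract_table()
            if table is not None:
                tables.append(table)
    return tables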

I then identified the table I need by matching specific text in its rows, like this:

for table in tables:
    for row in table:
        # Skip rows whose first cell is empty
        if len(row) >= 6 and row[0] is None:
            continue
        # A row starting with one of the known headings marks the table of interest
        if len(row) >= 6 and row[0] in (head3, head1, head2, head2_1):
            ...

This worked perfectly for tables contained in a single page, but it did not work for tables that span pages.
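One possible way to handle the spanning case in Approach 1 is to stitch the per-page tables together after extraction. The sketch below is only an illustration: `stitch_tables` and `header_labels` are hypothetical names, and it assumes that a continuation page yields a table with the same number of columns whose first cell is not one of the known headings.

```python
def stitch_tables(page_tables, header_labels):
    """Merge per-page tables (in page order) into logical tables.

    page_tables: list of tables, each a list of rows as returned by
        page.extract_table(); None or empty entries are skipped.
    header_labels: set of first-cell values that mark the start of a
        new table (hypothetical, e.g. {head1, head2, head2_1, head3}).
    """
    merged = []
    for table in page_tables:
        if not table:
            continue  # page had no table
        rows = [list(row) for row in table]
        first_cell = rows[0][0] if rows[0] else None
        starts_new = first_cell in header_labels
        # Only treat the table as a continuation if the column count matches
        same_width = bool(merged) and len(merged[-1][0]) == len(rows[0])
        if merged and not starts_new and same_width:
            merged[-1].extend(rows)
        else:
            merged.append(rows)
    return merged
```

If the header row is repeated on every page, this heuristic would start a new table each time, so the repeated header would need to be detected and dropped instead.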

Approach 2:
I looked at issue #864 and wanted to follow the same approach. I found that the PDFs contain annotations, so my idea was to obtain the destination object from each annotation and define a bbox per page to get the table. I am stuck on getting the page number from the PDFStream / PDFObjRef.

Is there a way to get the page number from the PDFStream and mark the bbox?
This is the code I used to inspect the annotations:

from pdfplumber.utils.pdfinternals import resolve, resolve_and_decode

for annotation in page.annots:
    data = annotation.get("data")
    if data and data.get("A") and data["A"].get("D"):
        # The first element of the destination array references the target page object
        d1 = resolve(data["A"]["D"][0])
        if d1 and d1.get("Contents"):
            d2 = resolve_and_decode(d1["Contents"])
            print(d2)
        print(d1)

    print(annotation)

This printed:

<PDFStream(53): raw=3472, {'Length': 3472, 'Filter': /'FlateDecode'}>

{'Parent': <PDFObjRef:40>, 'Type': /'Page', 'Contents': <PDFObjRef:53>, 'Resources': {'ProcSet': [/'PDF', /'Text', /'ImageB', /'ImageC', /'ImageI'], 'XObject': {'img3': <PDFObjRef:54>, 'img5': <PDFObjRef:55>, 'img4': <PDFObjRef:56>, 'img2': <PDFObjRef:19>, 'img1': <PDFObjRef:18>}, 'Font': {'F1': <PDFObjRef:14>, 'F2': <PDFObjRef:15>, 'F3': <PDFObjRef:16>, 'F4': <PDFObjRef:49>, 'F5': <PDFObjRef:51>, 'F6': <PDFObjRef:57>, 'F8': <PDFObjRef:58>, 'F7': <PDFObjRef:59>}}, 'MediaBox': [0, 0, 595, 842], 'Annots': [<PDFObjRef:60>, <PDFObjRef:61>, <PDFObjRef:62>, <PDFObjRef:63>, <PDFObjRef:64>, <PDFObjRef:65>, <PDFObjRef:66>, <PDFObjRef:67>, <PDFObjRef:68>, <PDFObjRef:69>]}

{'page_number': 1, 'object_type': 'annot', 'x0': 30.5, 'y0': 765, 'x1': 88.5, 'y1': 777, 'doctop': 65, 'top': 65, 'bottom': 77, 'width': 58.0, 'height': 12, 'uri': None, 'title': None, 'contents': None, 'data': {'C': [0, 0, 1], 'Border': [0, 0, 0], 'Subtype': /'Link', 'A': {'D': [<PDFObjRef:21>, /'XYZ', 270.16, 570, 0], 'S': /'GoTo'}, 'Rect': [30.5, 765, 88.5, 777]}}
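In the output above, the first element of `D` is a reference (`<PDFObjRef:21>`) to a page object, so one way to get a page number may be to compare that reference's object id with the object id of each page. The following is a sketch under assumptions, not a confirmed pdfplumber recipe: it assumes each pdfplumber page exposes the underlying pdfminer page as `page.page_obj` with a `pageid` attribute, and that the reference has an `objid` attribute. The helper names are hypothetical, and named destinations are not handled.

```python
def build_page_index(pages):
    """Map each page's PDF object id to its 1-based page number.

    Assumes page.page_obj.pageid holds the object id of the page.
    """
    return {page.page_obj.pageid: page.page_number for page in pages}


def destination_page_number(dest, page_index):
    """Return the page number an explicit destination array points to, or None."""
    if not isinstance(dest, (list, tuple)) or not dest:
        return None  # e.g. a named destination, not handled in this sketch
    objid = getattr(dest[0], "objid", None)
    return page_index.get(objid)


def link_targets(pdf_path, password=None):
    """Collect (source page number, target page number, destination) per link."""
    import pdfplumber
    from pdfplumber.utils.pdfinternals import resolve

    targets = []
    with pdfplumber.open(pdf_path, password=password) as pdf:
        page_index = build_page_index(pdf.pages)
        for page in pdf.pages:
            for annot in page.annots:
                data = annot.get("data") or {}
                action = resolve(data.get("A")) or {}
                dest = resolve(action.get("D"))
                number = destination_page_number(dest, page_index)
                if number is not None:
                    targets.append((page.page_number, number, dest))
    return targets
```

The lookup compares object ids instead of resolving the reference, because resolving only yields the page dictionary (as printed above), which does not carry a page number.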

I also tried the pdfminer.six approach mentioned in pdfminer:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1
from pdfminer.utils import open_filename

with open_filename(pdf_path, "rb") as pdf:
    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    for page in PDFPage.get_pages(pdf, password=password):
        for annotation_ref in resolve1(page.annots):
            annotation = resolve1(annotation_ref)
            if annotation["Subtype"].name == "Link":
                action = resolve1(annotation["A"])
                print(action["D"])
                page_details = resolve1(action["D"])
                print(page_details)
                p2 = resolve1(page_details)
                print(p2)

and I got the same kind of result:

<PDFObjRef:20>
[<PDFObjRef:21>, /'XYZ', 270.16, 570, 0]
[<PDFObjRef:21>, /'XYZ', 270.16, 570, 0]
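For the "mark the bbox" part, the destination array `[<PDFObjRef:21>, /'XYZ', 270.16, 570, 0]` appears to follow the PDF `/XYZ` form: page reference, left, top, zoom, with coordinates measured from the bottom-left of the page. A rough sketch of turning that into a pdfplumber-style box, measured from the top-left, is below. `crop_box_from_xyz` is a hypothetical helper, and it assumes the MediaBox starts at (0, 0) and that the table runs from the destination point to the bottom of the target page.

```python
def crop_box_from_xyz(dest, page_width, page_height):
    """Convert an /XYZ destination into an (x0, top, x1, bottom) box.

    dest: [page_ref, /XYZ, left, top, zoom]; top may be None (null).
    The box spans the full page width, from the destination's top
    coordinate down to the bottom of the page.
    """
    if len(dest) < 4:
        raise ValueError("expected an /XYZ destination with left and top values")
    dest_top = dest[3]
    if dest_top is None:
        top = 0  # no vertical position given, so start at the top of the page
    else:
        # PDF y grows upward from the bottom; pdfplumber's top grows downward
        top = page_height - float(dest_top)
    top = min(max(top, 0), page_height)  # clamp to the page
    return (0, top, page_width, page_height)
```

With the values printed above and a 595 x 842 page, this gives `(0, 272.0, 595, 842)`, which could then be tried with `page.crop(box).extract_table()` on the target page.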

jsvine (Maintainer) · 2024-08-28T13:20:45Z

Hi @hvina, and thanks for your interest in pdfplumber. It sounds like the approaches you are describing are very specific to your particular PDF. Are you able to share it here, along with a pointer to the specific pages that were causing you problems? I cannot promise I'll be able to examine them closely soon (perhaps someone else in the community can), but the PDF and those details will be a helpful start.
