Extracting tables spanning multiple pages #1188
hvina
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
-
Hi @hvina, and thanks for your interest in |
-
My requirement is to extract tables that span multiple pages. I am stuck at extracting the page numbers from the annotations on the page so that I can define a region.
Approach 1:
I identified the table by specific text in its rows. Like
This worked perfectly for tables contained on a single page, but it did not work for tables that spanned pages.
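For reference, one way to handle a table that continues across pages is to extract each page's piece separately and join the rows afterwards. The sketch below assumes the page range is already known and that the first row of the first piece is the header, possibly repeated on later pages. `stitch_tables` and `extract_spanning_table` are hypothetical helper names; only `pdfplumber.open`, `page.crop`, and `page.extract_table` are pdfplumber calls.

```python
def stitch_tables(page_tables):
    """Merge per-page tables into one list of rows.

    Assumption: the first row of the first non-empty table is the header,
    and a later table that starts with the same row is repeating it.
    """
    merged = []
    header = None
    for table in page_tables:
        if not table:  # extract_table() returns None when nothing is found
            continue
        rows = list(table)
        if header is None:
            header = rows[0]
        elif rows[0] == header:
            rows = rows[1:]  # drop the repeated header
        merged.extend(rows)
    return merged


def extract_spanning_table(path, first_page, last_page, bbox=None):
    """Extract one table from pages first_page..last_page (1-based, inclusive).

    bbox is an optional (x0, top, x1, bottom) region applied to every page.
    """
    import pdfplumber  # imported here so stitch_tables can be used without it

    with pdfplumber.open(path) as pdf:
        tables = []
        for page in pdf.pages[first_page - 1:last_page]:
            region = page.crop(bbox) if bbox else page
            tables.append(region.extract_table())
    return stitch_tables(tables)
```

A single shared `bbox` is a simplification: if the table starts partway down the first page and ends partway down the last, a separate region per page would likely be needed.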
Approach 2:
I looked at issue #864 and wanted to follow the same approach. I found that the PDFs have annotations, and my idea was to obtain the destination object from each annotation and then use a bbox per page to get the table. I am stuck on getting the page number from the PDFStream / PDFObjRef.
Is there a way to get the page number from the PDFStream and mark the bbox?
The annotations were like the following
printed
I also tried the pdfminer.six approach mentioned in pdfminer
and I get the same kind of results
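A possible way to map an annotation's destination to a page number is sketched below, under several assumptions. It assumes the annotations are links with an explicit destination, meaning an array whose first element is a reference to the target page, found either under `Dest` or under `D` inside the `A` action dictionary. It also assumes that pdfplumber exposes the raw annotation dictionary under the `"data"` key of each entry in `page.annots`, and that `page.page_obj.pageid` holds the object id of the page. The helper names are hypothetical. Named destinations (a string or bytes instead of an array) would need a separate lookup that this sketch does not attempt.

```python
def _resolve(obj):
    """Follow indirect references (objects exposing .resolve()) to their value."""
    while hasattr(obj, "resolve"):
        obj = obj.resolve()
    return obj


def dest_page_objid(annot_data):
    """Return the object id of the page an annotation points to, or None."""
    dest = _resolve(annot_data.get("Dest"))
    if dest is None:
        action = _resolve(annot_data.get("A"))
        if isinstance(action, dict):
            dest = _resolve(action.get("D"))
    # Explicit destination: [page_ref, /XYZ, left, top, zoom] or similar.
    # The page reference is deliberately not resolved; only its id is needed.
    if isinstance(dest, (list, tuple)) and dest:
        return getattr(dest[0], "objid", None)
    return None  # named destination or no destination


def page_index(pages):
    """Map each page's object id to its 1-based page number."""
    return {p.page_obj.pageid: p.page_number for p in pages}


def annotation_targets(path):
    """Yield (source_page_number, target_page_number or None) per annotation."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        index = page_index(pdf.pages)
        for page in pdf.pages:
            for annot in page.annots:
                objid = dest_page_objid(annot["data"])
                yield page.page_number, index.get(objid)
```

Once a target page number `n` is known, a region on that page could be selected with `pdf.pages[n - 1].crop((x0, top, x1, bottom))` before calling `extract_table()`, where the coordinates are whatever bounds the table on that page.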