Text capturing without tabular #242
Hi @ibrahimshuail, I’m not sure whether I’m understanding your question, since there is not much detail or any example, but I think you’re looking for |
In the extracted text, we also get the tabular data, converted to plain text. I don't want that tabular data. I want to extract everything from the PDF except the table data. |
Hi @ibrahimshuail, this is not natively supported, but there are a few ways to achieve the desired result. For example, one of them could be to run |
Then the reported issues should be moved to enhancement... It shouldn't be closed... Correct me if I'm wrong |
Thanks, @samkit-jain, I think that's a good solution! @ibrahimshuail: Thank you for your interest in pdfplumber and its features. I closed the issue because it lacked a clear description and contained no specific examples. As best I could ascertain, however, the problem you are solving does not seem to be a very common one. (Typically, people are extracting tables precisely for the tabular data.) If you'd like to propose an enhancement, feel free to open a new issue with a fuller explanation, a specific example, and some explanation of the motivation for such a feature. (Per the issue template for feature requests: "Please describe, in as much detail as possible, your proposal and how it would improve your experience with pdfplumber.") |
@jsvine, just a correction to your proposed solution: it would return the text in the tabular region. To get the text not in the tabular region, you would have to run text extraction on the full page and then do a replace ( |
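The replace approach described above can be sketched roughly as follows. This is only an illustration, not code from the thread: `strip_table_text` is a hypothetical helper, and it assumes the page text comes from `page.extract_text()` and the tables from `page.extract_tables()` (a list of tables, each a list of rows of cell strings). It is fragile, because a short cell value can also match text outside the table.

```python
def strip_table_text(page_text, tables):
    """Remove each table cell's text from page_text (first match only).

    Hypothetical helper. `tables` is assumed to have the shape returned by
    pdfplumber's page.extract_tables(): tables -> rows -> cell strings.
    """
    for table in tables:
        for row in table:
            for cell in row:
                if not cell:  # skip None and empty cells
                    continue
                # A cell may span several lines; remove them one by one.
                for line in cell.splitlines():
                    page_text = page_text.replace(line, "", 1)
    # Drop the lines that are now empty.
    stripped = (ln.strip() for ln in page_text.splitlines())
    return "\n".join(ln for ln in stripped if ln)


# Synthetic data standing in for extract_text() / extract_tables() output.
sample_text = "Intro paragraph\nName Age\nAnn 31\nClosing note"
sample_tables = [[["Name", "Age"], ["Ann", "31"]]]
remaining = strip_table_text(sample_text, sample_tables)
print(remaining)
```

Because the match is purely textual, filtering objects by position is usually more reliable when the table's bounding box is known.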
Ah yes! My apologies, I misunderstood the request. I think your original solution is the most direct and useful. |
@samkit-jain, are there any working examples? I tried, but I'm not able to achieve this because the alignment of the text changes completely, so the tabular data doesn't get cropped out entirely. |
@ibrahimshuail there's not much I can do without the PDF. If you can share, I can try and help. |
@samkit-jain please find the pdf (https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf). I want all text other than the tabular data |
Hi @ibrahimshuail, you can play around with this piece of code I wrote:

import pdfplumber


def curves_to_edges(cs):
    """See https://github.com/jsvine/pdfplumber/issues/127"""
    edges = []
    for c in cs:
        edges += pdfplumber.utils.rect_to_edges(c)
    return edges


# Import the PDF.
pdf = pdfplumber.open("file.pdf")

# Load the first page.
p = pdf.pages[0]

# Table settings.
ts = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": curves_to_edges(p.curves + p.edges),
    "explicit_horizontal_lines": curves_to_edges(p.curves + p.edges),
    "intersection_y_tolerance": 10,
}

# Get the bounding boxes of the tables on the page.
bboxes = [table.bbox for table in p.find_tables(table_settings=ts)]


def not_within_bboxes(obj):
    """Return True if the object lies outside every table's bbox."""

    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)

    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)


print("Text outside the tables:")
print(p.filter(not_within_bboxes).extract_text())

Result on page 1:
Result on page 2:
Using the |
Thanks a lot, @samkit-jain! One of the best solutions! |
Thanks for this solution @samkit-jain ! I think this use case may be more common than expected - for example, if a user wants to use a different table extractor other than pdfplumber but still wants to use the library's text extraction features, having an easy way to extract the text without the tables would be convenient. Regardless, this seems to work for me as well. |
Updating after pdfplumber changes (post 0.6.0)

@samkit-jain, brilliant solution! When I wanted to reuse it yesterday, I ran into a KeyError inside of

# Table settings.
ts = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": curves_to_edges(p.curves + p.rect_edges),
    "explicit_horizontal_lines": curves_to_edges(p.curves + p.rect_edges),
    "intersection_y_tolerance": 10,
}

However, in my case these table settings actually undermine the quality of the results, i.e. text that is not part of the table is also mistakenly filtered out. Instead, what works well for me is simply relying on the defaults under pdfplumber 0.10.0:

from functools import partial

import pdfplumber


def filter_tables(page: pdfplumber.page.Page) -> pdfplumber.page.Page:
    tables = page.find_tables()
    if tables:
        # Get the bounding boxes of the tables on the page.
        # Adapted from
        # https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246
        bboxes = [table.bbox for table in tables]
        bbox_not_within_bboxes = partial(not_within_bboxes, bboxes=bboxes)

        # Filter out the tables from the page.
        page = page.filter(bbox_not_within_bboxes)
    return page


def not_within_bboxes(obj, bboxes):
    """Return True if the object lies outside every table's bbox."""

    def obj_in_bbox(_bbox):
        """Check whether the object's midpoint falls inside the bbox.

        See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404
        """
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)

    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

If you have any feedback or comments, they would be greatly appreciated. And thank you for this fantastic package! |
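To make the midpoint rule used by `not_within_bboxes` concrete, here is a small sketch on hand-made objects. The bounding box and the three character dicts are invented for illustration; only the keys `x0`, `x1`, `top`, and `bottom` mirror what pdfplumber objects carry. Note how an object that straddles the table's right edge is kept, because its horizontal midpoint falls outside the box.

```python
def midpoint_in_bbox(obj, bbox):
    """Return True if the object's midpoint lies inside bbox (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


# Hypothetical table bounding box and objects (not taken from a real PDF).
table_bbox = (50, 100, 300, 200)
inside = {"x0": 60, "x1": 66, "top": 110, "bottom": 120}
straddling = {"x0": 296, "x1": 306, "top": 110, "bottom": 120}  # h_mid is 301
above = {"x0": 60, "x1": 66, "top": 80, "bottom": 90}

# Keep only the objects whose midpoint is outside the table.
kept = [o for o in (inside, straddling, above) if not midpoint_in_bbox(o, table_bbox)]
print(len(kept))
```

If text near a table border is being dropped or kept unexpectedly, padding or shrinking the bounding box before the test is one possible adjustment.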
What if the PDF has a table with no marking lines, such as the one I have at https://drive.google.com/file/d/167Y6KKW5cv0-7r8FV830iofWvYZWP4b6/view?usp=sharing? I tried running your code, but it errors out with But since you mentioned that the default ts should work under pdfplumber 0.10.0, I tried it on this PDF, and it did not filter out the Table 1 content. |
@lawrencenika For the PDF you shared, it is not going to be a simple affair. What you can try is to create your own logic to find where a table starts and ends. The horizontal lines at the top and bottom of a table will be useful in determining this. |
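One way to sketch that suggestion, under strong assumptions: suppose every table on the page is delimited by exactly one horizontal rule at its top and one at its bottom, and the page has no other horizontal rules. Then the rule positions can be paired into vertical bands, and any object whose vertical midpoint falls inside a band is treated as table content. Both helper names below are hypothetical, and the data is synthetic.

```python
def bands_from_rules(rule_tops):
    """Pair sorted rule positions into (top, bottom) bands.

    Assumes rules come in top/bottom pairs, one pair per table.
    An unpaired trailing rule is ignored.
    """
    ys = sorted(rule_tops)
    return [(ys[i], ys[i + 1]) for i in range(0, len(ys) - 1, 2)]


def outside_bands(obj, bands):
    """Return True if the object's vertical midpoint lies outside every band."""
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return not any(top <= v_mid <= bottom for top, bottom in bands)


# Synthetic rule positions and text objects (not taken from the shared PDF).
bands = bands_from_rules([400, 120, 300, 520])
objs = [
    {"text": "Heading", "top": 100, "bottom": 110},
    {"text": "cell", "top": 150, "bottom": 160},
    {"text": "Paragraph", "top": 350, "bottom": 360},
    {"text": "cell", "top": 450, "bottom": 460},
]
outside = [o["text"] for o in objs if outside_bands(o, bands)]
print(outside)

# With pdfplumber, the same predicate could plausibly be applied as
#   bands = bands_from_rules(line["top"] for line in page.lines)
#   text = page.filter(lambda o: outside_bands(o, bands)).extract_text()
# assuming the rules are exposed through page.lines for the PDF in question.
```

Real documents often break the pairing assumption (header rules, nested rules, footers), so the band detection would likely need per-document tuning.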
Is there any way I can extract only the text information, without the tabular data?