Text capturing without tabular #242

Closed
ibrahimshuail opened this issue Aug 2, 2020 · 16 comments

Comments

@ibrahimshuail

Is there any way I can extract only the text information, without the tabular data?

@ibrahimshuail ibrahimshuail added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Aug 2, 2020
@jsvine
Owner

jsvine commented Aug 2, 2020

Hi @ibrahimshuail, I’m not sure whether I’m understanding your question, since there is not much detail or any example, but I think you’re looking for page.extract_text(...) and/or page.extract_words(...). See here: https://github.com/jsvine/pdfplumber#the-pdfplumberpage-class

@jsvine jsvine closed this as completed Aug 2, 2020
@jsvine jsvine removed the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Aug 2, 2020
@ibrahimshuail
Author

ibrahimshuail commented Aug 2, 2020

With extract_text, we also get the tabular data, converted to plain text. I don't want that tabular data. I want to extract the rest of the information from the PDF, excluding the table data.

@samkit-jain
Collaborator

Hi @ibrahimshuail, this is not natively supported, but there are a few ways to achieve the desired result. For example, you could run page.find_tables() and store the coordinates of the identified tables. Then run page.extract_words() on the full page and discard all the words that fall within a tabular region.
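
A minimal sketch of that approach, for illustration only. The helper names (in_bbox, words_outside_tables, non_table_text) are hypothetical, and treating a word as "inside" a table when its midpoint lies in the bounding box is an assumption, one of several possible containment rules. The filtering helpers work on plain dicts shaped like the entries returned by page.extract_words(), so they can be tried without a PDF:

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def in_bbox(word: Dict, bbox: BBox) -> bool:
    """Return True if the word's midpoint falls inside bbox (assumed rule)."""
    x0, top, x1, bottom = bbox
    h_mid = (word["x0"] + word["x1"]) / 2
    v_mid = (word["top"] + word["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def words_outside_tables(
    words: Sequence[Dict], table_bboxes: Sequence[BBox]
) -> List[Dict]:
    """Keep only the words that lie outside every table bounding box."""
    return [w for w in words if not any(in_bbox(w, b) for b in table_bboxes)]


def non_table_text(path: str, page_number: int = 0) -> str:
    """Hypothetical wrapper; needs pdfplumber installed and a real PDF."""
    import pdfplumber  # imported lazily so the helpers above run without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        bboxes = [table.bbox for table in page.find_tables()]
        kept = words_outside_tables(page.extract_words(), bboxes)
    # Joining with spaces loses the line structure; this is a simplification.
    return " ".join(w["text"] for w in kept)
```

Whether find_tables() detects the right regions with its default settings depends on the PDF, so the table settings may need tuning.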

@ibrahimshuail
Author

Then the reported issue should be moved to enhancement. It shouldn't be closed. Correct me if I'm wrong.

@jsvine
Owner

jsvine commented Aug 2, 2020

Thanks, @samkit-jain, I think that's a good solution! Another approach, depending on the particular example: After selecting the table through page.find_tables(...), you could crop the page to that bounding box, with cropped = page.crop(...), and then run cropped.extract_text(...) or cropped.extract_words(...).

@ibrahimshuail: Thank you for your interest in pdfplumber and its features. I closed the issue because it lacked a clear description and contained no specific examples. As best I could ascertain, however, the problem you are solving does not seem to be a very common one. (Typically, people are extracting tables precisely for the tabular data.) If you'd like to propose an enhancement, feel free to open a new issue with a fuller explanation, a specific example, and some explanation of the motivation for such a feature. (Per the issue template for feature requests: "Please describe, in as much detail as possible, your proposal and how it would improve your experience with pdfplumber.")

@samkit-jain
Collaborator

@jsvine, just a correction to your proposed solution: it would return the text in the tabular region. To get the text not in the tabular region, you would have to run text extraction on the full page and then do a replace (full_page_text.replace(cropped_page_text, "")). This assumes that there is no text to the right or left of the table.
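
To illustrate why that assumption matters, here is a small sketch on plain strings. The sample text is made up, and strip_table_text is a hypothetical helper name. In practice the two inputs would come from extract_text() on the full page and on the cropped page:

```python
def strip_table_text(full_page_text: str, cropped_page_text: str) -> str:
    """Remove the table text from the page text via a plain substring replace."""
    return full_page_text.replace(cropped_page_text, "")


# Case 1: the table occupies whole lines, so its text is a contiguous substring.
table = "A 1\nB 2"
full = "Intro line\nA 1\nB 2\nClosing line"
clean = strip_table_text(full, table)

# Case 2: text sits beside the table, so each extracted line mixes table and
# non-table words. The table text is no longer a substring and nothing is removed.
full_side = "Intro line\nA 1 note one\nB 2 note two\nClosing line"
unchanged = strip_table_text(full_side, table)

print(repr(clean))
print(unchanged == full_side)
```

In the second case the replace silently does nothing, which is why filtering by coordinates is more robust than matching strings.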

@jsvine
Owner

jsvine commented Aug 2, 2020

Ah yes! My apologies, I misunderstood the request. I think your original solution is the most direct and useful.

@ibrahimshuail
Author

@samkit-jain are there any working examples? I tried, but I'm not able to achieve this because the alignment of the text changes completely, so the complete tabular data doesn't get cropped.

@samkit-jain
Collaborator

@ibrahimshuail there's not much I can do without the PDF. If you can share, I can try and help.

@ibrahimshuail
Author

@samkit-jain please find the pdf (https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf). I want all text other than the tabular data

@samkit-jain
Collaborator

samkit-jain commented Aug 4, 2020

Hi @ibrahimshuail You can play around with this piece of code I wrote:

import pdfplumber

def curves_to_edges(cs):
    """See https://github.com/jsvine/pdfplumber/issues/127"""
    edges = []
    for c in cs:
        edges += pdfplumber.utils.rect_to_edges(c)
    return edges

# Import the PDF.
pdf = pdfplumber.open("file.pdf")

# Load the first page.
p = pdf.pages[0]

# Table settings.
ts = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": curves_to_edges(p.curves + p.edges),
    "explicit_horizontal_lines": curves_to_edges(p.curves + p.edges),
    "intersection_y_tolerance": 10,
}

# Get the bounding boxes of the tables on the page.
bboxes = [table.bbox for table in p.find_tables(table_settings=ts)]

def not_within_bboxes(obj):
    """Check if the object is in any of the table's bbox."""
    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

print("Text outside the tables:")
print(p.filter(not_within_bboxes).extract_text())

Result on page 1:

NATIONAL PARTNERSHIP FOR QUALITY AFTERSCHOOL LEARNING
www.sedl.org/afterschool/toolkits
����������� �������� �������
Tutoring to Enhance Science Skills
Tutoring Two: Learning to Make Data Tables
..............................................................................................
Sample Data for Data Tables
Use these data to create data tables following the Guidelines for Making a Data Table and 
Checklist for a Data Table.
Example 1: Pet Survey (GR 2–3)
Ms. Hubert’s afterschool students took a survey of the 600 students at Morales Elementary 
School. Students were asked to select their favorite pet from a list of eight animals. Here 
are the results. 
Lizard 25, Dog 250, Cat 115, Bird 50, Guinea pig 30, Hamster 45, Fish 75, 
Ferret 10 
Example 2: Electromagnets—Increasing Coils (GR 3–5)
The following data were collected using an electromagnet with a 1.5 volt battery, a switch, 
a piece of #20 insulated wire, and a nail. Three trials were run. Safety precautions in 
repeating this experiment include using safety goggles or safety spectacles and avoiding 
short circuits.  
 
 
       
Example 3: pH of Substances (GR 5–10)
The following are pH values of common household substances taken by three different 
teams using pH probes. Safety precautions in repeating this experiment include hooded 
ventilation, chemical-splash safety goggles, gloves, and apron. Do not use bleach, 
ammonia, or strong acids with children.
Lemon juice 2.4, 2.0, 2.2; Baking soda (1 Tbsp) in Water (1 cup) 8.4, 8.3, 8.7; 
Orange juice 3.5, 4.0, 3.4; Battery acid 1.0, 0.7, 0.5; Apples 3.0, 3.2, 3.5; 
Tomatoes 4.5, 4.2, 4.0; Bottled water 6.7, 7.0, 7.2; Milk of magnesia 10.5, 10.3, 
10.6; Liquid hand soap 9.0, 10.0, 9.5; Vinegar 2.2, 2.9, 3.0; Household bleach 
12.5, 12.5, 12.7; Milk 6.6, 6.5, 6.4; Household ammonia 11.5, 11.0, 11.5;
Lye 13.0, 13.5, 13.4; and Sodium hydroxide 14.0, 14.0, 13.9; Anti-freeze 10.1, 
10.9, 9.7; Windex 9.9. 10.2, 9.5; Liquid detergent 10.5, 10.0, 10.3; and 
Cola 3.0, 2.5, 3.2
Teaching tip: The pH scale is from 0 to 14. Have students make two data tables, one 
with the data as given and one with the pH scale 0 to 14 with the substances’ average 
pH in rank order on the scale (Battery acid at the lower end and Sodium hydroxide at 
the upper end) or create a pH graphic organizer.
1

Result on page 2:

Example 4: Automobile Land Speed Records (GR 5-10)
In the first recorded automobile race in 1898, Count Gaston de Chasseloup-Laubat of 
Paris, France, drove 1 kilometer in 57 seconds for an average speed of 39.2 miles per hour 
(mph) or 63.1 kilometers per hour (kph). In 1904, Henry Ford drove his Ford Arrow across 
frozen Lake St. Clair, MI, at an average speed of 91.4 mph. Now, the North American 
Eagle is trying to break a land speed record of 800 mph. The Federation International de 
L’Automobile (FIA), the world’s governing body for motor sport and land speed records, 
recorded the following land speed records. (Retrieved on February 5, 2006, from 
http://www.landspeed.com/lsrinfo.asp.)
Example 5: Distance and Time (GR 8-10)
The following data were collected using a car with a water clock set to release a drop in 
a unit of time and a meter stick. The car rolled down an inclined plane. Three trials were 
run. Create a data table with an average distance column and an average velocity column, 
create an average distance-time graph, and draw the best-fit line or curve. Estimate the 
car’s distance traveled and velocity at six drops of water. Describe the motion of the car. Is 
it going at a constant speed, accelerating, or decelerating? How do you know?
 
   
         
© 2006 WGBH Educational Foundation. All rights reserved.
2

This uses the .filter() method provided by pdfplumber to create a filtered version of the page, dropping any objects that fall inside the bounding box of any of the tables.

@ibrahimshuail
Author

Thanks a lot @samkit-jain, one of the best solutions!

@thefirebanks

Thanks for this solution @samkit-jain !

I think this use case may be more common than expected - for example, if a user wants to use a different table extractor other than pdfplumber but still wants to use the library's text extraction features, having an easy way to extract the text without the tables would be convenient. Regardless, this seems to work for me as well.

@pcschreiber1

pcschreiber1 commented Aug 21, 2023

Updating after pdfplumber changes (post 0.6.0)

@samkit-jain brilliant solution!

When I tried to reuse this solution yesterday, I ran into a KeyError inside rect_to_edges from pdfplumber.utils.geometry, which is used in the curves_to_edges function above. From my understanding, the reason is that page.edges now also contains curve_edge objects, which do not have information for the y-axis. The objects under the rect_edges attribute conform to the required form. Hence, the solution runs when the arguments passed to curves_to_edges are adapted as follows:

# Table settings.
ts = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": curves_to_edges(p.curves + p.rect_edges),
    "explicit_horizontal_lines": curves_to_edges(p.curves + p.rect_edges),
    "intersection_y_tolerance": 10,
}
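
As a more defensive variant, the objects could be pre-filtered by the keys they carry before they are converted to edges. This is only a sketch: rect_like is a hypothetical helper, and the default key list is an assumption about what an edge-building function needs, not something taken from the pdfplumber documentation.

```python
from typing import Dict, Iterable, List, Sequence

# Assumption: the coordinate keys an edge-building helper would read.
RECT_KEYS = ("x0", "x1", "y0", "y1", "top", "bottom")


def rect_like(objs: Iterable[Dict], required: Sequence[str] = RECT_KEYS) -> List[Dict]:
    """Keep only the objects that carry every required coordinate key."""
    return [obj for obj in objs if all(key in obj for key in required)]


# Made-up objects: the second lacks "y0"/"y1" and is dropped.
sample = [
    {"x0": 0, "x1": 10, "y0": 5, "y1": 15, "top": 85, "bottom": 95},
    {"x0": 0, "x1": 10, "top": 85, "bottom": 95},
]
print(len(rect_like(sample)))
```

With such a helper, the settings could pass curves_to_edges(rect_like(p.curves + p.edges)). Dropping objects this way avoids the KeyError but may also discard lines that mark real table borders, so the detected tables should be checked.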

However, in my case this specification of table settings actually undermines the quality of the results, i.e. text that is not part of the table is also mistakenly filtered out. Instead, what works well for me is simply relying on the defaults under pdfplumber 0.10.0:

from functools import partial

import pdfplumber


def filter_tables(page: pdfplumber.page.Page) -> pdfplumber.page.Page:
    """Return the page with objects inside detected tables filtered out."""
    # Get the bounding boxes of the tables on the page.
    # Adapted from
    # https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246
    bboxes = [table.bbox for table in page.find_tables()]
    if bboxes:
        # not_within_bboxes is defined below; bind the bboxes for page.filter().
        bbox_not_within_bboxes = partial(not_within_bboxes, bboxes=bboxes)

        # Filter out the table objects from the page.
        page = page.filter(bbox_not_within_bboxes)

    return page


def not_within_bboxes(obj, bboxes):
    """Check if the object is in any of the table's bbox."""

    def obj_in_bbox(_bbox):
        """Define objects in box.

        See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404
        """
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)

    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

Any feedback or comments would be greatly appreciated. And thank you for this fantastic package!

@lawrencenika

lawrencenika commented Oct 25, 2023

What if the PDF has a table with no ruling lines, such as the one I have at https://drive.google.com/file/d/167Y6KKW5cv0-7r8FV830iofWvYZWP4b6/view?usp=sharing? I tried running your code, but it errors out with KeyError: 'y1'. @samkit-jain @pcschreiber1

Since you mentioned that the default table settings should work under pdfplumber 0.10.0, I tried them on this PDF, but they did not filter out the content of Table 1.

@samkit-jain
Collaborator

@lawrencenika For the PDF you shared, it is not going to be a simple affair. What you can try is writing your own logic to find where a table starts and ends. The horizontal lines at the top and bottom of a table will be useful for determining that.
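
One possible shape for such custom logic, as a rough sketch under strong assumptions: the page holds a single table, its top and bottom are marked by wide horizontal rules, and the rule and word dicts look like those in page.lines and page.extract_words(). The helper names (table_band, words_outside_band) and the min_width threshold are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def table_band(
    h_lines: Sequence[Dict], min_width: float = 0.0
) -> Optional[Tuple[float, float]]:
    """Return (top, bottom) spanned by the wide horizontal rules, if any."""
    tops = [ln["top"] for ln in h_lines if ln["x1"] - ln["x0"] >= min_width]
    if len(tops) < 2:
        return None  # not enough rules to delimit a table
    return min(tops), max(tops)


def words_outside_band(
    words: Sequence[Dict], band: Optional[Tuple[float, float]]
) -> List[Dict]:
    """Drop the words whose vertical midpoint lies between the two rules."""
    if band is None:
        return list(words)
    top, bottom = band
    return [w for w in words if not top <= (w["top"] + w["bottom"]) / 2 <= bottom]


# Made-up data: two wide rules at y=100 and y=300, plus one short decorative rule.
rules = [
    {"x0": 50, "x1": 500, "top": 100},
    {"x0": 50, "x1": 500, "top": 300},
    {"x0": 50, "x1": 60, "top": 700},
]
words = [
    {"text": "Heading", "top": 50, "bottom": 60},
    {"text": "cell", "top": 150, "bottom": 160},
    {"text": "Footer", "top": 400, "bottom": 410},
]
band = table_band(rules, min_width=100)
print([w["text"] for w in words_outside_band(words, band)])
```

Pages with several tables, or with rules used for decoration, would need the rules grouped into pairs first. That part is left out here.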
