Extract tables from images #279

Open
turicas opened this issue May 28, 2018 · 3 comments

Owner

turicas commented May 28, 2018

We can generalize the table extraction algorithm inside the PDF plugin so it receives objects from an OCR engine, which would let us extract tables from images!
The tasks related to this extraction would be:

  • Find the table and then align/rotate/stretch it (scikit-image?)
  • Extract text and (if possible) rectangle objects and their positions (tesseract?)
  • Convert objects to the algorithm's interface
  • Call the algorithm

This code can help. cc @danilobellini
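
To make the last two tasks a bit more concrete, here is a minimal sketch of how positioned text objects could become table rows. This is not the algorithm from the PDF plugin, only a naive clustering by vertical position; `Box`, `group_rows`, `to_table` and the `tolerance` value are hypothetical names chosen for illustration.

```python
from collections import namedtuple

# Hypothetical minimal text box: the text plus an axis-aligned bounding box.
Box = namedtuple("Box", ["text", "x0", "y0", "x1", "y1"])


def group_rows(boxes, tolerance=5):
    """Cluster boxes into rows by comparing their vertical centers.

    A box joins the current row when its center is within `tolerance`
    pixels of the center of the first box placed in that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: (b.y0 + b.y1) / 2):
        center = (box.y0 + box.y1) / 2
        if rows and abs(center - rows[-1]["center"]) <= tolerance:
            rows[-1]["boxes"].append(box)
        else:
            rows.append({"center": center, "boxes": [box]})
    # Inside each row, read the cells from left to right.
    return [sorted(row["boxes"], key=lambda b: b.x0) for row in rows]


def to_table(boxes, tolerance=5):
    """Return the cell texts as a list of rows."""
    return [[box.text for box in row] for row in group_rows(boxes, tolerance)]


if __name__ == "__main__":
    # Made-up sample boxes, deliberately out of reading order.
    sample = [
        Box("Age", 60, 11, 90, 21),
        Box("Name", 10, 10, 50, 20),
        Box("Ana", 10, 40, 40, 50),
        Box("31", 60, 42, 80, 52),
    ]
    print(to_table(sample))  # [['Name', 'Age'], ['Ana', '31']]
```

This only works once the image is already aligned, and it does not handle empty cells or cells with several words, which is why reusing the PDF plugin's algorithm is the better option.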

Owner Author

turicas commented May 28, 2018

Examples:
photo4907138477032843266
photo4907138477032843265 1

Owner Author

turicas commented May 28, 2018

If we take the JSON response from Google Vision, we can use the following code to feed the algorithm (WARNING: the align/rotate/stretch transformation is still missing):

import json


class Object:

    def __init__(self, text, points):
        self.text = text
        # Google Vision may omit 'x' or 'y' from a vertex when its value is 0
        self.points = [(item.get('x', 0), item.get('y', 0)) for item in points]
        x_ordered = sorted(self.points, key=lambda point: point[0])
        y_ordered = sorted(self.points, key=lambda point: point[1])
        self.x0 = x_ordered[0][0]
        self.x1 = x_ordered[-1][0]
        self.y0 = y_ordered[0][1]
        self.y1 = y_ordered[-1][1]
        self.width = self.x1 - self.x0
        self.height = self.y1 - self.y0
        self.bbox = (self.x0, self.y0, self.x1, self.y1)

    def get_text(self):
        return self.text

    def __str__(self):
        return f'<TextObject {self.bbox} {repr(self.text)}>'

    def __eq__(self, other):
        return self.bbox == other.bbox

    def __gt__(self, other):
        return self.bbox > other.bbox

    def __lt__(self, other):
        return self.bbox < other.bbox


def extract_objects(filename):
    with open(filename) as fobj:
        data = json.load(fobj)
    # TODO: align/rotate/stretch before doing this
    return sorted([Object(obj['description'], obj['boundingPoly']['vertices'])
                   for obj in data['textAnnotations']])


if __name__ == '__main__':
    filename = 'photo4907138477032843265_gvision.json'
    objs = extract_objects(filename)
    for obj in objs:
        print(obj)
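
For the missing transformation, a rough sketch of the rotation part only (no stretch or perspective correction) could look like the code below. It assumes each polygon is a list of four `(x, y)` tuples, like `Object.points` above, and that the first two vertices are the top-left and top-right corners of the text, which is my understanding of the Google Vision vertex order and should be checked. `estimate_skew`, `rotate_points` and `deskew` are hypothetical helper names.

```python
import math
from statistics import median


def estimate_skew(polygons):
    """Median angle (in radians) of the top edge of each polygon.

    Assumes points[0] and points[1] are the top-left and top-right corners.
    """
    angles = []
    for points in polygons:
        (x0, y0), (x1, y1) = points[0], points[1]
        angles.append(math.atan2(y1 - y0, x1 - x0))
    return median(angles)


def rotate_points(points, angle, origin=(0, 0)):
    """Rotate (x, y) points by `angle` radians around `origin`."""
    ox, oy = origin
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        (ox + (x - ox) * cos_a - (y - oy) * sin_a,
         oy + (x - ox) * sin_a + (y - oy) * cos_a)
        for x, y in points
    ]


def deskew(polygons):
    """Rotate every polygon so the typical top edge becomes horizontal."""
    angle = estimate_skew(polygons)
    return [rotate_points(points, -angle) for points in polygons]
```

The rotated points would have to replace `self.points` before `x0`, `y0`, `x1` and `y1` are computed. Rotating around `(0, 0)` can produce negative coordinates, which should not matter as long as only relative positions are used.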

turicas self-assigned this Apr 12, 2019

Owner Author

turicas commented Apr 12, 2019

Work in progress (using pytesseract) on the feature/plugin-ocr branch.
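
As an illustration of what the conversion step might look like with pytesseract (this is not necessarily what the branch does), the sketch below turns the dictionary returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` into text boxes. `boxes_from_tesseract` and `min_conf` are hypothetical names, and the sample data is made up so the snippet runs without Tesseract installed.

```python
def boxes_from_tesseract(data, min_conf=0):
    """Convert an `image_to_data` dict into (text, x0, y0, x1, y1) tuples."""
    boxes = []
    for text, left, top, width, height, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    ):
        # Layout-only entries (page, block, line) have empty text and conf -1.
        # `conf` may be an int, a float or a string depending on the version.
        if not text.strip() or float(conf) < min_conf:
            continue
        boxes.append((text, left, top, left + width, top + height))
    return boxes


if __name__ == "__main__":
    # Made-up data in the shape produced by Output.DICT (only the keys used here).
    sample = {
        "text": ["", "Name", "Age", " "],
        "left": [0, 10, 60, 0],
        "top": [0, 5, 5, 0],
        "width": [100, 40, 30, 0],
        "height": [50, 12, 12, 0],
        "conf": ["-1", "96", "91.5", "-1"],
    }
    print(boxes_from_tesseract(sample))
    # [('Name', 10, 5, 50, 17), ('Age', 60, 5, 90, 17)]
```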
