Extract tables from images #279

turicas · 2018-05-28T04:08:10Z

We can generalize the algorithm inside the PDF plugin so it accepts objects produced by an OCR engine, which would let us extract tables from images!
The tasks related to this extraction would be:

Find the table and then align/rotate/stretch it (scikit-image?)
Extract text and (if possible) rectangle objects and their positions (tesseract?)
Convert objects to the algorithm's interface
Call the algorithm

This code can help. cc @danilobellini

turicas · 2018-05-28T04:14:25Z

Examples:

turicas · 2018-05-28T04:16:38Z

If we take the JSON response from Google Vision, we can use the following code (WARNING: the align/rotate/stretch transformation is still missing) to feed the algorithm:

import json


class Object:

    def __init__(self, text, points):
        self.text = text
        # Assumption: the Vision JSON may omit a coordinate whose value is 0
        self.points = [(item.get('x', 0), item.get('y', 0)) for item in points]
        x_ordered = sorted(self.points, key=lambda point: point[0])
        y_ordered = sorted(self.points, key=lambda point: point[1])
        self.x0 = x_ordered[0][0]
        self.x1 = x_ordered[-1][0]
        self.y0 = y_ordered[0][1]
        self.y1 = y_ordered[-1][1]
        self.width = self.x1 - self.x0
        self.height = self.y1 - self.y0
        self.bbox = (self.x0, self.y0, self.x1, self.y1)

    def get_text(self):
        return self.text

    def __str__(self):
        return f'<TextObject {self.bbox} {repr(self.text)}>'

    def __eq__(self, other):
        return self.bbox == other.bbox

    def __gt__(self, other):
        return self.bbox > other.bbox

    def __lt__(self, other):
        return self.bbox < other.bbox


def extract_objects(filename):
    # Use a context manager so the file handle is closed after parsing
    with open(filename, encoding='utf-8') as fobj:
        data = json.load(fobj)
    # TODO: align/rotate/stretch before doing this
    return sorted(Object(obj['description'], obj['boundingPoly']['vertices'])
                  for obj in data['textAnnotations'])


if __name__ == '__main__':
    filename = 'photo4907138477032843265_gvision.json'
    objs = extract_objects(filename)
    for obj in objs:
        print(obj)
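
A minimal sketch of the rotation part of the missing transformation, using only the standard library. It assumes each `boundingPoly` lists its vertices starting at the top-left corner followed by the top-right one, so the first edge gives the skew of the text. `skew_angle` and `rotate_points` are hypothetical helper names, not part of the PDF plugin, and stretching/perspective correction is not covered here.

```python
import math


def skew_angle(vertices):
    """Return the angle (in radians) of the edge from vertex 0 to vertex 1.

    Assumption: vertex 0 is the top-left corner and vertex 1 the top-right.
    """
    (x0, y0), (x1, y1) = vertices[0], vertices[1]
    return math.atan2(y1 - y0, x1 - x0)


def rotate_points(points, angle, center=(0, 0)):
    """Rotate points by -angle around center, undoing a skew of +angle."""
    cx, cy = center
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return rotated
```

One possible use: estimate the angle once from a polygon that spans the whole page of text, then pass every object's points through `rotate_points` with that angle before computing `x0`, `y0`, `x1` and `y1`.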

turicas · 2019-04-12T04:26:43Z

Work in progress (using pytesseract) on the feature/plugin-ocr branch.
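
For illustration only (this is not necessarily what the branch does): a sketch of how the dictionary returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` could be converted into objects with the same attributes as the Google Vision example earlier in this thread. `Box`, `boxes_from_tesseract` and `min_conf` are hypothetical names. The function works on the plain dictionary, so it does not need Tesseract installed to be tried out.

```python
class Box:
    """Minimal text object exposing text, x0/y0/x1/y1 and bbox."""

    def __init__(self, text, left, top, width, height):
        self.text = text
        self.x0, self.y0 = left, top
        self.x1, self.y1 = left + width, top + height
        self.bbox = (self.x0, self.y0, self.x1, self.y1)

    def get_text(self):
        return self.text


def boxes_from_tesseract(data, min_conf=0):
    """Build Box objects from the parallel lists in an image_to_data dict."""
    boxes = []
    rows = zip(data['text'], data['left'], data['top'],
               data['width'], data['height'], data['conf'])
    for text, left, top, width, height, conf in rows:
        # Skip rows without text (structural rows such as page or block)
        # and rows below the confidence threshold; conf may be str or number.
        if not text.strip() or float(conf) < min_conf:
            continue
        boxes.append(Box(text, left, top, width, height))
    return sorted(boxes, key=lambda box: box.bbox)
```

Sorting by `bbox` mirrors the ordering used in the Google Vision snippet, so both sources would feed the algorithm in the same order.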

turicas added the help wanted and plugin labels May 28, 2018

turicas self-assigned this Apr 12, 2019

turicas removed the help wanted label Apr 12, 2019

jsbueno mentioned this issue Apr 16, 2019

Adds code to merge contiguous rectangular areas #324

Open
