Extract tables from images #279
If we take the JSON response from Google Vision, we can use the following code (WARNING: the transformation step is missing) to feed the algorithm:

    import json


    class Object:
        """A piece of text plus the axis-aligned bounding box of its polygon."""

        def __init__(self, text, points):
            self.text = text
            self.points = [(item['x'], item['y']) for item in points]
            xs = [x for x, _ in self.points]
            ys = [y for _, y in self.points]
            self.x0, self.x1 = min(xs), max(xs)
            self.y0, self.y1 = min(ys), max(ys)
            self.width = self.x1 - self.x0
            self.height = self.y1 - self.y0
            self.bbox = (self.x0, self.y0, self.x1, self.y1)

        def get_text(self):
            return self.text

        def __str__(self):
            return f'<TextObject {self.bbox} {repr(self.text)}>'

        # Comparisons use the bbox tuple, so sorting orders objects by
        # (x0, y0, x1, y1). The equality hook is `__eq__`, not `__equal__`.
        def __eq__(self, other):
            return self.bbox == other.bbox

        def __gt__(self, other):
            return self.bbox > other.bbox

        def __lt__(self, other):
            return self.bbox < other.bbox


    def extract_objects(filename):
        with open(filename) as fobj:
            data = json.load(fobj)
        # TODO: align/rotate/stretch before doing this
        return sorted(
            Object(obj['description'], obj['boundingPoly']['vertices'])
            for obj in data['textAnnotations']
        )


    if __name__ == '__main__':
        filename = 'photo4907138477032843265_gvision.json'
        objs = extract_objects(filename)
        for obj in objs:
            print(obj)
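For the missing transformation (the align/rotate TODO), a minimal sketch of one possible approach could look like the code below. It is only an illustration, not what the library does: it assumes each polygon's first two vertices form the top edge of the word in reading order, and that the skew is small enough for a median of the edge angles to be meaningful (it will not handle upside-down text). The names `vertices_to_points`, `estimate_skew`, `rotate_points` and `deskew` are hypothetical. The `.get(..., 0)` default is there on the assumption that a zero-valued coordinate may be absent from the JSON.

```python
import math
from statistics import median


def vertices_to_points(vertices):
    """Convert [{'x': ..., 'y': ...}, ...] into (x, y) tuples.

    Assumption: a missing 'x' or 'y' key means the coordinate is 0.
    """
    return [(v.get('x', 0), v.get('y', 0)) for v in vertices]


def estimate_skew(polygons):
    """Median angle (in radians) of each polygon's edge from vertex 0 to vertex 1."""
    angles = []
    for points in polygons:
        (x0, y0), (x1, y1) = points[0], points[1]
        if (x0, y0) != (x1, y1):  # skip degenerate edges
            angles.append(math.atan2(y1 - y0, x1 - x0))
    return median(angles) if angles else 0.0


def rotate_points(points, angle, origin=(0.0, 0.0)):
    """Rotate points by -angle around origin, undoing a skew of `angle`."""
    ox, oy = origin
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    return [
        (ox + (x - ox) * cos_a - (y - oy) * sin_a,
         oy + (x - ox) * sin_a + (y - oy) * cos_a)
        for x, y in points
    ]


def deskew(polygons):
    """Rotate every polygon so that the estimated text direction is horizontal."""
    angle = estimate_skew(polygons)
    return [rotate_points(points, angle) for points in polygons]
```

The rotated points could then be passed on to build the bounding boxes, so that words on the same printed line end up with similar `y0`/`y1` values. Stretching (perspective correction) is not covered here.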
Work in progress (using pytesseract) on the feature/plugin-ocr branch.
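As a rough idea of how pytesseract output could be turned into the same kind of text-plus-bbox objects, here is a sketch. It is not the code from the branch. It assumes the input dict has the shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, with parallel `text`, `conf`, `left`, `top`, `width` and `height` lists. The function name `words_from_tesseract` and the `min_conf` parameter are hypothetical.

```python
def words_from_tesseract(data, min_conf=0.0):
    """Convert an image_to_data-style dict into (text, (x0, y0, x1, y1)) tuples.

    Entries with empty text or with confidence below `min_conf` are skipped;
    Tesseract reports a confidence of -1 for non-word layout entries.
    """
    words = []
    for text, conf, left, top, width, height in zip(
        data['text'], data['conf'], data['left'],
        data['top'], data['width'], data['height'],
    ):
        # `conf` may arrive as a string or a number depending on the version
        if not text.strip() or float(conf) < min_conf:
            continue
        words.append((text, (left, top, left + width, top + height)))
    return words


# Possible usage (requires pytesseract and Pillow; 'table.png' is a placeholder):
#
#   import pytesseract
#   from PIL import Image
#   data = pytesseract.image_to_data(
#       Image.open('table.png'), output_type=pytesseract.Output.DICT)
#   words = words_from_tesseract(data, min_conf=50)
```

Keeping the conversion separate from the OCR call means the same table-extraction code can be tried with either pytesseract or Google Vision output.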
We can generalize the algorithm inside the PDF plugin to receive objects from an OCR and then extract tables from images!
The tasks related to this extraction would be:
This code can help. cc @danilobellini
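To make the idea of "receiving objects from an OCR" more concrete, here is a small sketch of grouping text boxes into table rows by vertical position. This is not the algorithm used inside the PDF plugin. It is a simplified stand-in that assumes the boxes are already deskewed and that any object exposing `text`, `x0`, `y0`, `x1` and `y1` would do. `Box` and `group_rows` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for whatever object the OCR step produces
Box = namedtuple('Box', 'text x0 y0 x1 y1')


def group_rows(boxes):
    """Group boxes into rows and return the texts, left to right, per row.

    A box joins the current row when its vertical center falls inside the
    row's vertical extent; otherwise it starts a new row.
    """
    rows = []  # each entry: [row_y0, row_y1, [boxes]]
    for box in sorted(boxes, key=lambda b: (b.y0, b.x0)):
        center = (box.y0 + box.y1) / 2
        if rows and rows[-1][0] <= center <= rows[-1][1]:
            row = rows[-1]
            row[0] = min(row[0], box.y0)
            row[1] = max(row[1], box.y1)
            row[2].append(box)
        else:
            rows.append([box.y0, box.y1, [box]])
    return [
        [b.text for b in sorted(row[2], key=lambda b: b.x0)]
        for row in rows
    ]
```

Column detection (for example, clustering on `x0`/`x1`) would still be needed to align cells across rows; this only covers the row side.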