-
Notifications
You must be signed in to change notification settings - Fork 521
How to extract text in natural reading order (up2down, left2right)
Aaron Taylor edited this page Jun 11, 2023
·
2 revisions
First of all, use SortedCollection.
from operator import itemgetter
from itertools import groupby
import fitz
doc = fitz.open( 'mydocument.pdf' )
for page in doc:
text_words = page.get_text_words()
# The words should be ordered by y1 and x0
sorted_words = SortedCollection( key = itemgetter( 3, 0 ) )
for word in text_words:
sorted_words.insert( word )
# At this point you already have an ordered list. If you need to
# group the content by lines, use groupby with y1 as a key
lines = groupby( sorted_words, key = itemgetter( 3 ) )
# Enjoy!
HOWTO Button annots with JavaScript
HOWTO work with PDF embedded files
HOWTO extract text from inside rectangles
HOWTO extract text in natural reading order
HOWTO create or extract graphics
HOWTO create your own PDF Drawing
Rectangle inclusion & intersection
Metadata & bookmark maintenance