
How to extract text in natural reading order (top to bottom, left to right)

Aaron Taylor edited this page Jun 11, 2023 · 2 revisions

Easiest way

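First, collect the words of each page in a SortedCollection, a container that keeps its items ordered by a key function. SortedCollection is not part of the Python standard library or of PyMuPDF, so it has to be supplied separately.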

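    from operator import itemgetter
    from itertools import groupby
    import fitz

    # NOTE: SortedCollection is not part of the standard library or PyMuPDF;
    # it must be defined or imported separately before this code runs.

    doc = fitz.open('mydocument.pdf')
    for page in doc:
        # Each word is a tuple beginning with (x0, y0, x1, y1, text, ...)
        text_words = page.get_text_words()

        # The words should be ordered by y1 and x0
        sorted_words = SortedCollection(key=itemgetter(3, 0))
        for word in text_words:
            sorted_words.insert(word)

        # At this point you already have an ordered list. If you need to
        # group the content by lines, use groupby with y1 as a key.
        # groupby returns a lazy iterator, so consume it inside this loop.
        lines = groupby(sorted_words, key=itemgetter(3))

    # Enjoy!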
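If SortedCollection is not at hand, the same ordering can be sketched with the built-in `sorted()` instead. This is an alternative to the approach above, not a replacement for it. The helper name `words_to_lines` and its `ndigits` parameter are hypothetical, introduced only for illustration, and rounding y1 before grouping is an assumption added here so that words whose bottom edges differ by a fraction of a point still end up on the same line.

```python
from itertools import groupby


def words_to_lines(words, ndigits=0):
    """Group word tuples into lines of text, top to bottom, left to right.

    Each word is assumed to be a tuple beginning with
    (x0, y0, x1, y1, text), as in the example above.
    `ndigits` (hypothetical parameter) controls how y1 is rounded.
    """
    def line_key(word):
        # Rounded y1 identifies the line a word belongs to.
        return round(word[3], ndigits)

    # Sort by (rounded y1, x0): top to bottom, then left to right.
    ordered = sorted(words, key=lambda w: (line_key(w), w[0]))

    # groupby only merges adjacent items, so it must run on sorted input.
    return [
        " ".join(w[4] for w in group)
        for _, group in groupby(ordered, key=line_key)
    ]
```

Inside the page loop, the result of `page.get_text_words()` could be passed directly as `words`. Rounding is a crude tolerance: two y1 values that fall on either side of a rounding boundary are still split into separate lines, so documents with uneven baselines may need a different grouping rule.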
