Tesseract
As of 2020, the best available open source OCR software is Tesseract 4 (beta) with its new LSTM neural network OCR model. Its OCR performance is much better than that of the previous OCR model used in version 3.
Example (produce a PDF file output.pdf with a text layer for a scanned German document):
$ printf '%s\n' page-*.png > input.list
$ tesseract --oem 1 -l deu input.list output pdf
(--oem 1 enables the LSTM engine; the list file must contain one image name per line)
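The two commands above can be wrapped into a small helper. This is only a sketch: the function names make_list and ocr_to_pdf and the fixed list file name input.list are my own choices for illustration, not anything Tesseract provides.

```shell
#!/bin/sh
# make_list LISTFILE IMAGE...
# Write one image name per line, which is the layout tesseract
# expects when its input is a text file instead of an image.
make_list() {
    list=$1
    shift
    printf '%s\n' "$@" > "$list"
}

# ocr_to_pdf LANG OUTPUT_BASE IMAGE...
# Hypothetical wrapper: builds input.list and writes OUTPUT_BASE.pdf.
ocr_to_pdf() {
    lang=$1
    out=$2
    shift 2
    make_list input.list "$@"
    # --oem 1 selects the LSTM engine; "pdf" requests PDF output
    tesseract --oem 1 -l "$lang" input.list "$out" pdf
}
```

Usage would be something like: ocr_to_pdf deu output page-*.png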
Print the recognized text to stdout:
$ tesseract --oem 1 -l deu page-0001.png stdout
List installed languages:
$ tesseract --list-langs
Support for many languages and scripts is available in the form of downloadable trained data sets; there is even a data set for Fraktur.
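Before starting a long OCR run it can be useful to check whether a particular data set is installed. A minimal sketch, assuming that the language list contains one language code per line (plus possibly a header line); the function name has_lang is hypothetical:

```shell
#!/bin/sh
# has_lang CODE
# Read a language list on stdin and succeed only if CODE occurs as a
# complete line. -F: fixed string, -x: whole line, -q: no output.
has_lang() {
    grep -Fxq -- "$1"
}

# Example (needs tesseract; 2>&1 is there in case a version prints
# the list to stderr, frk is assumed to be the Fraktur data set):
#   tesseract --list-langs 2>&1 | has_lang frk && echo "Fraktur available"
```

Matching whole lines avoids false positives, e.g. "de" does not match "deu".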
With the new LSTM model, Tesseract takes some inspiration from the OCRopus research project.
Tesseract version 3 performs relatively poorly even on good-quality input images: it often falsely detects single characters in dust pixels (outside of any textual context) and easily introduces single-character errors in well-known words.
Cuneiform
Cuneiform's OCR performance isn't that bad, but it isn't actively maintained (last release in 2011, version 1.1), crashes easily and has some other issues:
- segmentation faults with various packages and releases
- its layout algorithm is simply broken, i.e. in one-column documents paragraphs are often randomly shuffled around
- it does not error out on unknown options
You can disable the layout algorithm like this:
$ cuneiform --singlecolumn -l ger -f text -o foo.txt image-0001
(-l specifies the language of the source document)
Ocrad
Ocrad example call:
$ ocrad -F utf8 image-0001
Text is printed by default to stdout.
In a business document, it missed an underlined word that Cuneiform, Tesseract and GOCR all recognized.
The Ocrad manual contains a section on the algorithms used, e.g.:
5) Detect characters and group them in lines.
6) Recognize characters (very ad hoc; one algorithm per character).
7) Correct some ambiguities (transform l.OOO into 1.000, etc).
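To illustrate what such an ambiguity correction (step 7) can look like, here is a small post-processing sketch in awk. It is not Ocrad's actual implementation; the function name fix_numbers and the rule (only touch tokens that look like numbers with a . or , separator) are my own assumptions.

```shell
#!/bin/sh
# fix_numbers: read text on stdin, write corrected text to stdout.
# A whitespace-separated token is treated as a number if it consists
# only of digits, "l" and "O", in groups joined by "." or ",".
# In such tokens "l" becomes "1" and "O" becomes "0".
# Note: awk rebuilds a changed line with single spaces between tokens.
fix_numbers() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9lO]+([.,][0-9lO]+)+$/) {
                gsub(/l/, "1", $i)
                gsub(/O/, "0", $i)
            }
        }
        print
    }'
}
```

Ordinary words are left alone because they contain characters outside the number pattern or lack a separator.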
GOCR
GOCR example call:
$ gocr image-0001
Text is printed by default to stdout.
The GOCR documentation doesn't include many details on which models or methods are used for OCR.
Hardware
Sane has very good support for some automatic document feeder (ADF) scanners, e.g. for the Avision and Fujitsu ones.
Included with Sane is the scanimage command line program which you can use to build scripted scan pipelines (cf. e.g. my adf2pdf.py script).
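A scripted pipeline could look roughly like the following sketch. It is not taken from adf2pdf.py. The function names run and scan_to_pdf and the DRY_RUN switch are hypothetical, and the source name "ADF" and the resolution of 300 dpi are assumptions: both options are backend-specific, so check what your scanner reports with scanimage --help. PNG output also needs a reasonably recent scanimage.

```shell
#!/bin/sh
# Sketch of a scan-then-OCR pipeline built from scanimage and tesseract.

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# scan_to_pdf LANG OUTPUT_BASE
scan_to_pdf() {
    lang=$1
    out=$2
    # Assumed values: source name "ADF" and 300 dpi depend on the backend.
    # --batch writes one file per sheet: page-0001.png, page-0002.png, ...
    run scanimage --source ADF --resolution 300 --format=png \
        --batch=page-%04d.png
    if [ "${DRY_RUN:-0}" != 1 ]; then
        # One image name per line, as tesseract expects for list input.
        printf '%s\n' page-*.png > input.list
    fi
    run tesseract --oem 1 -l "$lang" input.list "$out" pdf
}
```

With DRY_RUN=1 set, scan_to_pdf deu output only prints the two commands, which is handy for checking the options before feeding real paper.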