update links, update tesseract details
maxschlepzig

Tesseract

As of 2020, the best available open-source OCR software is Tesseract 4 with its new LSTM neural network OCR model. Its OCR performance is much better than that of the previous OCR model used in version 3.

Example (produce a PDF file output.pdf with a text layer for a scanned German document):

$ printf '%s\n' page-*.png > input.list
$ tesseract --oem 1 -l deu input.list output pdf

(--oem 1 enables the LSTM engine)
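
The two steps above can be wrapped in a small script. This is only a sketch: the function names are made up for illustration, and it assumes that Tesseract reads the list file as one image file name per line.

```shell
#!/bin/sh
# Sketch only: write_page_list and ocr_pages_to_pdf are illustrative names.

# Write one image file name per line to the list file given as $1.
write_page_list() {
    list=$1
    shift
    printf '%s\n' "$@" > "$list"
}

# OCR all pages named in list file $1 into $2.pdf, using language $3 (default: deu).
ocr_pages_to_pdf() {
    tesseract --oem 1 -l "${3:-deu}" "$1" "$2" pdf
}

# Usage (needs tesseract and the deu data set):
#   write_page_list input.list page-*.png
#   ocr_pages_to_pdf input.list output
```

Passing the file names as arguments lets the shell expand the glob, so the sort order of the pages is the order of the glob expansion.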

Print the recognized text to stdout:

$ tesseract --oem 1 -l deu page-0001.png stdout

List installed languages:

$ tesseract --list-langs 
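
In a script it can be handy to check for a data set before starting a long OCR run. A minimal sketch, assuming the listing is printed to stdout with one language code per line (older 3.x releases may print it to stderr instead); lang_installed is a hypothetical helper name:

```shell
# Sketch only: lang_installed is an illustrative helper name.
# Reads a language listing on stdin and succeeds if the code given
# as $1 appears on a line of its own.
lang_installed() {
    grep -qxF -- "$1"
}

# Usage (needs tesseract):
#   tesseract --list-langs | lang_installed deu || echo "deu data set missing" >&2
```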

Support for quite a few languages and scripts is available in the form of downloadable trained data sets, e.g. there is even a data set for Fraktur.

With the new LSTM model, Tesseract takes some inspiration from the OCRopus research project.

Tesseract version 3 performs relatively badly even on good-quality input images, e.g. it often falsely detects single characters in dust pixels (outside of any textual context) and easily introduces single-character errors in well-known words.

Cuneiform

Cuneiform OCR performance isn't that bad, but it isn't actively maintained (last release in 2011, version 1.1), crashes easily, and has some other issues:

You can disable the layout algorithm like this:

$ cuneiform --singlecolumn -l ger -f text -o foo.txt image-0001 

(-l specifies the language of the source document)

ocrad

Ocrad example call:

$ ocrad -F utf8 image-0001 

Text is printed by default to stdout.

In a business document, it missed an underlined word, where cuneiform/tesseract/gocr didn't.

The Ocrad manual contains a section on the algorithms used, e.g.:

  5) Detect characters and group them in lines.
  6) Recognize characters (very ad hoc; one algorithm per character).
  7) Correct some ambiguities (transform l.OOO into 1.000, etc).
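
To illustrate what the ambiguity-correction step means, here is a toy post-processing filter. It is not Ocrad's implementation, just a sketch of the idea under a simple assumption: a token that consists only of digits, the letters l and O, and decimal or thousands separators was probably meant to be a number. The function name is made up.

```shell
# Sketch only: fix_numeric_ambiguities is an illustrative name and this
# is not how Ocrad does it internally.
fix_numeric_ambiguities() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # Digits, l and O joined by "." or "," are treated as a number.
            if ($i ~ /^[0-9lO]+([.,][0-9lO]+)+$/) {
                gsub(/l/, "1", $i)
                gsub(/O/, "0", $i)
            }
        }
        # Note: a line with a changed field is re-joined with single spaces.
        print
    }'
}

# Usage:
#   ocrad -F utf8 image-0001 | fix_numeric_ambiguities
```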

GOCR

GOCR example call:

$ gocr image-0001 

Text is printed by default to stdout.

The GOCR documentation doesn't include many details on which models or methods are used for OCR.

Hardware

Sane has very good support for some automated document feed (ADF) scanners, e.g. for the Avision and Fujitsu ones.

Included with Sane is the scanimage command line program which you can use to build scripted scan pipelines (cf. e.g. my adf2pdf.py script).
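
A scripted pipeline could look roughly like the following. This is a sketch under assumptions, not the adf2pdf.py script: scan_to_pdf and the DRY_RUN switch are made up for illustration, and --source and --resolution are backend-specific options whose names and values depend on the scanner (check scanimage --help for your device).

```shell
# Sketch only: scan_to_pdf and DRY_RUN are illustrative; the scanner
# options are backend-specific and may differ for your device.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Scan all sheets in the feeder, then OCR them into $1.pdf (default: output.pdf).
scan_to_pdf() {
    out=${1:-output}
    run scanimage --batch=page-%04d.tif --format=tiff --source ADF --resolution 300
    if [ "${DRY_RUN:-0}" != 1 ]; then
        # One file name per line for the Tesseract list file.
        printf '%s\n' page-*.tif > input.list
    fi
    run tesseract --oem 1 -l deu input.list "$out" pdf
}

# Usage:
#   DRY_RUN=1; scan_to_pdf letter   # only print the commands
```

The dry-run switch makes it possible to review the commands before anything is fed through the scanner.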

update with notes about tesseract 4

Tesseract

As of 2018, the best available open-source OCR software is Tesseract 4 (beta) with its new LSTM neural network OCR model. Its OCR performance is much better than that of the previous OCR model used in version 3.

Example (produce a PDF file output.pdf with a text layer for a scanned German document):

$ printf '%s\n' page-*.png > input.list
$ tesseract --oem 1 -l deu input.list output pdf

Print the recognized text to stdout:

$ tesseract --oem 1 -l deu page-0001.png stdout

List installed languages:

$ tesseract --list-langs 

Support for quite a few languages and scripts is available in the form of downloadable trained data sets, e.g. there is even a data set for Fraktur.

With the new LSTM model, Tesseract takes some inspiration from the OCRopus research project.

Tesseract version 3 performs relatively badly even on good-quality input images, e.g. it often falsely detects single characters in dust pixels (outside of any textual context) and easily introduces single-character errors in well-known words.

Cuneiform

Cuneiform OCR performance isn't that bad, but it isn't actively maintained (last release in 2011, version 1.1), crashes easily, and has some other issues:

ocrad

Included with Sane is the scanimage command line program which you can use to build scripted scan pipelines (cf. e.g. my adf2pdf.py script).

add tesseract/ocrad/gocr

Well, depends what you mean by 'business documents'.

Cuneiform

I tested cuneiform with some business letters and I was quite astonished by its low error rate (regarding mis-recognized characters or words).

For my use case I only need the raw text for indexing purposes - I am not interested in text to layout element mapping or something like that.

And the documents are just one column.

Unfortunately, cuneiform currently (as of 1.1) has some problems:

You can disable the layout algorithm like this:

$ cuneiform --singlecolumn -l ger -f text -o foo.txt imgfile 

(-l specifies the language of the source document)

Tesseract

Can't really test it because it directly segfaults (using Ubuntu 11.10 packages - i.e. Tesseract 2.04):

$ tesseract image-0001.tif foo -l deu
Tesseract Open Source OCR Engine
index < len:Error:Assert failed:in file ../ccstruct/rejctmap.h, line 240
zsh: segmentation fault

Hardware

Sane has very good support for a lot of automated document feed (ADF) scanners, e.g. for Avision and Fujitsu ones.

Included with Sane is the scanadf command line program which you can use to build scripted scan pipelines.

Well, depends what you mean by 'business documents'.

Cuneiform

I tested cuneiform with some business letters and I was quite astonished by its low error rate (regarding mis-recognized characters or words).

For my use case I only need the raw text for indexing purposes - I am not interested in text to layout element mapping or something like that.

And the documents are just one column.

Unfortunately, cuneiform currently (as of 1.1) has some problems:

You can disable the layout algorithm like this:

$ cuneiform --singlecolumn -l ger -f text -o foo.txt image-0001 

(-l specifies the language of the source document)

Tesseract

Convert an image to text and print to stdout:

$ tesseract image-0001 stdout 

Display the supported languages:

$ tesseract --list-langs 

It has issues with umlauts (when English is given as the language).

Also supports 'Orientation and script detection' (OSD).

ocrad

$ ocrad -F utf8 image-0001 

Text is printed by default to stdout.

In a business document, it missed an underlined word, where cuneiform/tesseract/gocr didn't.

gocr

$ gocr image-0001 

Text is printed by default to stdout.

Hardware

Sane has very good support for a lot of automated document feed (ADF) scanners, e.g. for the Avision and Fujitsu ones.

Included with Sane is the scanadf command line program which you can use to build scripted scan pipelines.

version number
Tesseract