Revisions to OCR on Linux systems [closed]

update links, update tesseract details

edited May 10, 2020 at 7:14

Tesseract

As of 2020, the best available open-source OCR software is Tesseract 4 with its new LSTM neural network OCR model. Its OCR performance is much better than that of the previous OCR model used in version 3.

Example (produce a PDF file output.pdf with a text layer for a scanned German document):

$ printf '%s\n' page-*.png > input.list
$ tesseract --oem 1 -l deu input.list output pdf

(--oem 1 enables the LSTM engine)
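
The two steps above can be wrapped in a small shell function. This is only a sketch: scans_to_pdf is a hypothetical helper name, and it assumes that Tesseract 4 and the requested language data are installed. It relies on Tesseract reading a list file with one image path per line.

```shell
# Hypothetical helper: OCR several page images into one searchable PDF.
#   $1: Tesseract language code (e.g. deu)
#   $2: output base name (Tesseract appends .pdf)
#   remaining arguments: page images, in reading order
scans_to_pdf() {
    lang=$1
    out=$2
    shift 2

    # Temporary list file with one image path per line.
    list=$(mktemp) || return 1
    printf '%s\n' "$@" > "$list"

    tesseract --oem 1 -l "$lang" "$list" "$out" pdf
    status=$?

    rm -f "$list"
    return "$status"
}
```

Usage would be something like: scans_to_pdf deu output page-*.png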

Print the recognized text to stdout:

$ tesseract --oem 1 -l deu page-0001.png stdout

List installed languages:

$ tesseract --list-langs
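
In a script, it can be useful to check for a language before starting a long OCR run. A minimal sketch follows; lang_installed is a hypothetical helper name, and it assumes that the listing contains one language code per line after a header line.

```shell
# Hypothetical helper: reads the output of "tesseract --list-langs" on
# stdin and succeeds if the language code given as $1 appears as a line
# of its own (the header line never matches a bare language code).
lang_installed() {
    grep -qxF -- "$1"
}

# Usage (2>&1 in case the installed version prints the list to stderr):
#   tesseract --list-langs 2>&1 | lang_installed deu \
#       || echo 'deu language data is not installed' >&2
```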

Support for quite a few languages/scripts is available in the form of downloadable trained data sets; there is even a data set for Fraktur.

With the new LSTM model, Tesseract takes some inspiration from the OCRopus research project.

Tesseract version 3 performs relatively poorly even on good-quality input images: it often falsely detects single characters in dust pixels (outside of any textual context) and easily introduces single-character errors in well-known words.

Cuneiform

Cuneiform OCR performance isn't that bad, but it isn't actively maintained (the last release, version 1.1, dates from 2011), it crashes easily, and it has some other issues:

Segmentation faults with various packages and releases
its layout algorithm is simply broken, i.e. in one-column documents paragraphs are often randomly shuffled around
it does not error out on unknown options

You can disable the layout algorithm like this:

$ cuneiform --singlecolumn -l ger -f text -o foo.txt image-0001

(-l specifies the language of the source document)

ocrad

Ocrad example call:

$ ocrad -F utf8 image-0001

Text is printed by default to stdout.

In a business document, it missed an underlined word, where cuneiform/tesseract/gocr didn't.

The Ocrad manual contains a section on the algorithms used, e.g.:

5) Detect characters and group them in lines.
6) Recognize characters (very ad hoc; one algorithm per character).
7) Correct some ambiguities (transform l.OOO into 1.000, etc).
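
To illustrate what such an ambiguity correction can look like, here is a small sketch. It is not Ocrad's implementation; fix_numeric_tokens is a hypothetical helper, and the rule (only touch tokens that consist of digits, l, I and O grouped around . or , separators) is an assumption made for the example.

```shell
# Illustration only, not Ocrad's code: in tokens that look like numbers
# with separators (e.g. l.OOO), replace l/I with 1 and O with 0.
# Note: awk rebuilds changed lines with single spaces between fields.
fix_numeric_tokens() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9lIO]+([.,][0-9lIO]+)+$/) {
                gsub(/[lI]/, "1", $i)
                gsub(/O/, "0", $i)
            }
        }
        print
    }'
}

# Usage:
#   ocrad -F utf8 image-0001 | fix_numeric_tokens
```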

GOCR

GOCR example call:

$ gocr image-0001

Text is printed by default to stdout.

The GOCR documentation doesn't include many details on which models/methods are used for OCR.

Hardware

Sane has very good support for some automated document feed (ADF) scanners, e.g. for the Avision and Fujitsu ones.

Included with Sane is the scanimage command line program which you can use to build scripted scan pipelines (cf. e.g. my adf2pdf.py script).
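
A scripted pipeline along those lines could look like the following sketch. It is not the adf2pdf.py script; scan_to_pdf is a hypothetical helper, and scanner-specific options (device, source, resolution) are left out because they depend on the backend. It assumes that scanimage and Tesseract 4 are installed and that the current directory contains no older page-*.tif files.

```shell
# Hypothetical helper: scan a batch of pages and OCR them into one PDF.
#   $1: output base name (e.g. "letter" for letter.pdf)
#   $2: Tesseract language code (e.g. deu)
scan_to_pdf() {
    out=$1
    lang=$2

    # Scan all pages in batch mode into numbered TIFF files.
    scanimage --format=tiff --batch=page-%04d.tif

    # Stop if the scanner did not produce any page.
    set -- page-*.tif
    [ -e "$1" ] || return 1

    # One image path per line, then OCR everything into a single PDF.
    printf '%s\n' "$@" > pages.list
    tesseract --oem 1 -l "$lang" pages.list "$out" pdf
}
```

Usage would be something like: scan_to_pdf letter deu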

update with notes about tesseract 4

edited Apr 30, 2018 at 9:01

maxschlepzig

Tesseract

As of 2018, the best available open-source OCR software is Tesseract 4 (beta) with its new LSTM neural network OCR model. Its OCR performance is much better than that of the previous OCR model used in version 3.

Example (produce a PDF file output.pdf with a text layer for a scanned German document):

$ printf '%s\n' page-*.png > input.list
$ tesseract --oem 1 -l deu input.list output pdf

Print the recognized text to stdout:

$ tesseract --oem 1 -l deu page-0001.png stdout

List installed languages:

$ tesseract --list-langs

Support for quite a few languages/scripts is available in the form of downloadable trained data sets; there is even a data set for Fraktur.

With the new LSTM model, Tesseract takes some inspiration from the OCRopus research project.

Tesseract version 3 performs relatively poorly even on good-quality input images: it often falsely detects single characters in dust pixels (outside of any textual context) and easily introduces single-character errors in well-known words.

Cuneiform

Cuneiform OCR performance isn't that bad, but it isn't actively maintained (the last release, version 1.1, dates from 2011), it crashes easily, and it has some other issues:

Segmentation faults with various packages and releases
its layout algorithm is simply broken, i.e. in one-column documents paragraphs are often randomly shuffled around
it does not error out on unknown options

ocrad

Included with Sane is the scanimage command line program which you can use to build scripted scan pipelines (cf. e.g. my adf2pdf.py script).

add tesseract/ocrad/gocr

edited May 27, 2015 at 20:26

maxschlepzig

Well, depends what you mean by 'business documents'.

Cuneiform

I tested cuneiform with some business letters and I was quite astonished by its low error rate (regarding mis-recognized characters or words).

For my use case I only need the raw text for indexing purposes - I am not interested in text to layout element mapping or something like that.

And the documents are just one column.

Unfortunately, cuneiform currently (as of 1.1) has some problems:

Segmentation faults with various packages and releases
its layout algorithm is simply broken, i.e. in one-column documents paragraphs are often randomly shuffled around
it does not error out on unknown options

You can disable the layout algorithm like this:

$ cuneiform --singlecolumn -l ger -f text -o foo.txt imgfile

(-l specifies the language of the source document)

Tesseract

Can't really test it because it segfaults immediately (using the Ubuntu 11.10 packages, i.e. Tesseract 2.04):

$ tesseract image-0001.tif foo -l deu
Tesseract Open Source OCR Engine
index < len:Error:Assert failed:in file ../ccstruct/rejctmap.h, line 240
zsh: segmentation fault

Hardware

Sane has very good support for a lot of automated document feed (ADF) scanners, e.g. for Avision and Fujitsu ones.

Included with Sane is the scanadf command line program which you can use to build scripted scan pipelines.

Well, depends what you mean by 'business documents'.

Cuneiform

I tested cuneiform with some business letters and I was quite astonished by its low error rate (regarding mis-recognized characters or words).

For my use case I only need the raw text for indexing purposes - I am not interested in text to layout element mapping or something like that.

And the documents are just one column.

Unfortunately, cuneiform currently (as of 1.1) has some problems:

Segmentation faults with various packages and releases
its layout algorithm is simply broken, i.e. in one-column documents paragraphs are often randomly shuffled around
it does not error out on unknown options

You can disable the layout algorithm like this:

$ cuneiform --singlecolumn -l ger -f text -o foo.txt image-0001

(-l specifies the language of the source document)

Tesseract

Convert an image to text and print to stdout:

$ tesseract image-0001 stdout

Display the supported languages:

$ tesseract --list-langs

It has issues with umlauts when English is the specified language.

Also supports 'Orientation and script detection' (OSD).

ocrad

$ ocrad -F utf8 image-0001

Text is printed by default to stdout.

In a business document, it missed an underlined word, where cuneiform/tesseract/gocr didn't.

gocr

$ gocr image-0001

Text is printed by default to stdout.

Hardware

Sane has very good support for a lot of automated document feed (ADF) scanners, e.g. for the Avision and Fujitsu ones.

Included with Sane is the scanadf command line program which you can use to build scripted scan pipelines.

version number

edited Dec 18, 2011 at 22:45

maxschlepzig

Tesseract

edited Dec 18, 2011 at 22:39

maxschlepzig

answered Dec 15, 2011 at 20:44

maxschlepzig
