Finding known text in an image (guided OCR)

Question

I'm looking for a way to locate known text within an image.

Specifically, I'm trying to create a tool to convert a set of scanned pages into PDFs that support searching and copy+paste. I understand how this is usually done: OCR the page, retaining the position of the text, and then add the text as an invisible layer to the PDF. Acrobat has this functionality built in, and tesseract can output hOCR files (containing the recognized text along with its location), which can be used by hocr2pdf to generate a text layer.
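For reference, here is a minimal sketch of reading word positions out of an hOCR file. It assumes the common layout in which each word is a span with class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1". The sample fragment and the names WordBoxParser and parse_hocr_words are made up for illustration and are not part of tesseract or hocr2pdf.

```python
from html.parser import HTMLParser
import re

# Hypothetical hOCR fragment, shaped like typical OCR output (not real data).
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>pretty</span>
  <span class='ocrx_word' title='bbox 110 92 170 116; x_wconf 61'>poor</span>
</span>
"""


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())

    def handle_data(self, data):
        # Only text that sits inside an open word span is recorded.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, bbox) pairs found in the hOCR text."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    return parser.words


for word, bbox in parse_hocr_words(SAMPLE_HOCR):
    print(word, bbox)
```

Each bounding box is in pixel coordinates of the scanned image, which is the information a text layer needs for every word.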

Unfortunately, my source images are rather low quality (at most 150 DPI, with plenty of JPEG artifacts, and non-solid backgrounds behind some of the text), leading to pretty poor OCR results. However, I do have a copy of the text (sans pictures and layout) that appears on each page.

Matching already known text to its location on a scanned page seems like it would be much easier to do accurately, but I haven't found any software with this capability built in. How can I leverage existing software to do this?

Edit: The text varies in size and font, though passages of it are consistent.

Are you able to post a sample or two? Do you know the details of the font style and size? And are these constant throughout the document?
– Mark Setchell, Commented Feb 23, 2015 at 23:02
Are you asking for a tool/software that can do this for you?
– kkuilla, Commented Feb 24, 2015 at 9:49
@MarkSetchell I can't post the exact documents, but I'll see if I can generate a comparable-quality sample.
– rkjnsn, Commented Feb 25, 2015 at 4:13
@kkuilla, I'm developing a utility for processing these documents. I'm either looking for an existing tool or library that can locate known text in an image out of the box, or suggestions on how to implement it myself using some lower-level api. I only need to figure out how to get the location of each character given the image and the text. I know how to do the rest.
– rkjnsn, Commented Feb 25, 2015 at 4:24
I would then say that this question is probably off topic either because it is "too broad" or "Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it." Please see How to ask
– kkuilla, Commented Feb 25, 2015 at 9:03

Mark Setchell · Accepted Answer · 2015-02-26 09:53:23Z

The first thing that springs to mind is a cross-correlation. I would take the list of words that you know occur on the page and render them, one at a time, onto a canvas to create a picture of each word. You would need to use a font and size similar to those of the words in the document, which is why I asked about them in my comment. Then I would run a normalised cross-correlation of the picture of the word against the scanned image to see where it occurs. I would do all that with ImageMagick, which is available for Windows and OS X (use homebrew on OS X) and is included in most Linux distros.
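For readers who want the definition, the textbook form of the normalised cross-correlation score is sketched below. The symbols are assumptions introduced here: f is the scanned image, t is the rendered word, (u,v) is the offset being tested, \bar t is the mean of the word picture, and \bar f_{u,v} is the mean of the image region under it. The sums run over the pixels covered by the word picture. This is the standard definition, not necessarily the exact computation any particular tool performs.

```latex
\gamma(u,v) =
\frac{\sum_{x,y}\bigl[f(x,y)-\bar f_{u,v}\bigr]\bigl[t(x-u,\,y-v)-\bar t\bigr]}
     {\sqrt{\sum_{x,y}\bigl[f(x,y)-\bar f_{u,v}\bigr]^{2}\;
            \sum_{x,y}\bigl[t(x-u,\,y-v)-\bar t\bigr]^{2}}}
```

The score lies between -1 and 1, and a value close to 1 at some offset means the word picture matches the scan well there. Subtracting the means and dividing by the norms is what makes the score tolerant of uniform brightness and contrast differences between the rendering and the scan.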

So, let's take a screengrab of the second paragraph of your question and look for the word pretty - where you mention pretty poor OCR.

First, you need to render the word pretty onto a white background. The command will be something like this:

convert -background white -fill black -font Times -pointsize 14 label:pretty word.png

Result: [image: the word "pretty" rendered in black on a white background]

Then perform a normalised cross-correlation using Fred Weinhaus's normcrosscorr script, like this:

normcrosscorr -p word.png scan.png correlation-result.png
Match Coords: (504,30) And Score In Range 0 to 1: (0.999803)

and you can see the coordinates of the match are 504,30.

Result: [image: correlation result]
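To make the matching step concrete, here is a rough pure-Python sketch of a brute-force normalised cross-correlation over small grayscale arrays. It is not how the normcrosscorr script works internally. The function name normalized_cross_correlation and the toy data are invented for illustration, and a real page would need an FFT-based or library implementation to run in reasonable time.

```python
import math


def normalized_cross_correlation(image, template):
    """Return (score, (x, y)) for the best match of template inside image.

    Both arguments are 2D lists of grayscale values (a list of rows).
    (x, y) is the top-left corner of the best-matching window.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])

    # Zero-mean version of the template, flattened row by row.
    tpl_values = [v for row in template for v in row]
    tpl_mean = sum(tpl_values) / len(tpl_values)
    tpl_dev = [v - tpl_mean for v in tpl_values]
    tpl_norm = math.sqrt(sum(d * d for d in tpl_dev))

    best_score, best_pos = -1.0, (0, 0)
    for y in range(img_h - tpl_h + 1):
        for x in range(img_w - tpl_w + 1):
            # Flatten the window in the same row-major order as the template.
            window = [image[y + j][x + i] for j in range(tpl_h) for i in range(tpl_w)]
            win_mean = sum(window) / len(window)
            win_dev = [v - win_mean for v in window]
            win_norm = math.sqrt(sum(d * d for d in win_dev))
            if tpl_norm == 0 or win_norm == 0:
                continue  # flat patch: the correlation is undefined, skip it
            score = sum(a * b for a, b in zip(win_dev, tpl_dev)) / (win_norm * tpl_norm)
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_score, best_pos


# Toy data: a white 6x5 "scan" containing a 2x2 dark pattern at x=3, y=2.
scan = [
    [255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
    [255, 255, 255, 0, 255, 255],
    [255, 255, 255, 255, 0, 255],
    [255, 255, 255, 255, 255, 255],
]
word = [
    [0, 255],
    [255, 0],
]

score, position = normalized_cross_correlation(scan, word)
print(position, round(score, 6))
```

With real scans the best score will fall short of a perfect match because of JPEG artifacts and font differences, so you would probably want a threshold below which a word is treated as not found.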

Another Idea

Another idea might be to take Google's Tesseract-OCR and replace the standard dictionary with the text file containing the words on the page you are processing...
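As a starting point for that idea, here is a small sketch that turns the known page text into a one-word-per-line list. The helper names build_word_list and word_list_text are hypothetical. How the list is handed to Tesseract depends on the version: recent command-line builds list a --user-words option, but whether it replaces or only supplements the built-in dictionary is something you would need to verify for your setup.

```python
import re


def build_word_list(page_text):
    """Return the distinct words of page_text, in first-seen order."""
    seen = set()
    words = []
    # Runs of letters, optionally joined by apostrophes or hyphens.
    for word in re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", page_text):
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def word_list_text(page_text):
    """Return the word list as file content, one word per line."""
    return "\n".join(build_word_list(page_text)) + "\n"


# Example with a made-up snippet of page text.
sample = "leading to pretty poor OCR results. Pretty poor, copy-paste"
print(word_list_text(sample), end="")
```

The list is case-sensitive on purpose, so "Pretty" and "pretty" are kept as separate entries.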
