Can tesseract correctly recognise underscores in images?

Question

I have pictures that look like this:

And I am trying to get the output: "_ _ _ _ _ _ _ _ _ _ c _."

I am working in Python 3.6 and tried to use tesseract for this. What I have so far is the following code:

    import pytesseract
    from PIL import Image

    # set tesseract file path
    pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract-OCR/tesseract.exe"

    # configurations
    config = "--psm 10 --oem 3 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzßäöü0123456789_-"

    image = Image.open("test2.png")
    text = pytesseract.image_to_string(image, config=config)

However, this doesn't work. It just produces "ee" as output. With other pictures, it sometimes recognizes the correct letters, but never the underscores. I tried to whitelist them, but that didn't work either. How can this be done better? I would be grateful for any suggestions.

shaman · Accepted Answer · 2021-10-31 09:54:37Z

I am currently having a similar problem.

One possible solution that I think may work (though I suppose it is heavy on performance) is to use the cv2 module to detect horizontal lines and use the detected pixel positions to fill the space in between with underscores.

You also have to find the words adjacent to the start and end pixels of each line, then locate those words in the result string from pytesseract so that the underscores go in the right place in the string.

Here's a nice thread about finding lines in a picture, which may be helpful: Horizontal Line detection with OpenCV

Edit: What I do now may be a bit dirty, but I use the horizontal line detection from the link above and then use cv2.putText to write a string like "QQQQQQQ" at the start position of the line. Then I search for the Qs recognized by OCR and replace them with underscores again.

Armaan Priyadarshan · Accepted Answer · 2022-07-13 17:54:06Z

I had a similar problem, and, as shaman suggested, I looked into solving it with OpenCV rather than an OCR library. I tried horizontal line detection, but it didn't accurately count the number of underscores. It turned out that OpenCV has a LineSegmentDetector (available in 4.6), which worked really well for me.

LineSegmentDetector in Opencv 3 with Python

The length of the returned list of lines, divided by 2, gave me the number of underscores in the image. Additionally, it took a bit of image preprocessing for it to work properly. This included thresholding, upscaling, and dilation, but those parts shouldn't be hard to figure out.
