How to get text from image

Question

I want to read the text from an image.

I use pytesseract in Python.

Here is my code:

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image = Image.open(r'a.jpg')
image.resize((150, 50), Image.ANTIALIAS).save("pic.jpg")
image = Image.open("pic.jpg")
captcha = pytesseract.image_to_string(image).replace(" ", "").replace("-", "").replace("$", "")


However, it returns an empty string.

What is the correct way to do this?

Thanks.

a-sam · Accepted Answer · 2019-07-19 19:42:01Z

I agree with @Jon Betts.

Tesseract is not a very robust OCR engine; it only works well on clean, binarized images with the right settings.
CAPTCHAs are meant to fool OCR!

But if you really need to do it, you have to come up with a manual preprocessing procedure for it.

I wrote the code below specifically for the type of CAPTCHA you posted (it is completely rigid and is not generalized or optimized for other cases).

Pseudocode

Apply a median blur.
Apply a threshold to keep only the blue colors (this stage outputs a binary image).
Apply a morphological opening to remove small white specks from the binary image.
Pass the image to Tesseract with these options:
- a limited whitelist of output characters
- OEM 3: default engine mode, based on what is available
- PSM 8: treat the image as a single word
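To make the threshold step concrete, here is a minimal sketch on a tiny synthetic image. The three pixel values are made up for illustration and are not taken from the CAPTCHA; the only thing assumed from OpenCV is that it loads images in BGR order, so channel 0 is blue and channel 2 is red.

```python
import numpy as np

# Synthetic 1x3 image in BGR order (blue, green, red), standing in for cv2.imread output.
img = np.array([[[200, 30, 20],      # strongly blue pixel
                 [30, 30, 200],      # strongly red pixel
                 [220, 220, 220]]],  # light grey background pixel
               dtype=np.uint8)

# Keep a pixel only if its blue channel is high and its red channel is low.
cond = (img[:, :, 0] >= 100) & (img[:, :, 2] < 100)

# Binary mask: 255 where the condition holds, 0 elsewhere.
mask = np.where(cond, 255, 0).astype(np.uint8)
print(mask.tolist())  # [[255, 0, 0]]
```

The grey pixel also has a high blue value, which is why the red check is needed: without it, the light background would end up in the mask too.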

Code

from PIL import Image
import pytesseract
import numpy as np
import cv2

img = cv2.imread('a.jpg')
img = cv2.medianBlur(img, 3)

# extract blue parts (OpenCV loads images as BGR: channel 0 is blue, channel 2 is red)
img2 = np.zeros((img.shape[0], img.shape[1]), dtype=np.uint8)
cond = np.bitwise_and(img[:, :, 0] >= 100, img[:, :, 2] < 100)
img2[np.where(cond)] = 255
img = img2

# delete the noise with a morphological opening
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
img = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)

# Tesseract 4+ rejects the single-dash -oem/-psm flags, so use --oem/--psm
str1 = pytesseract.image_to_string(
    Image.fromarray(img),
    config='-c tessedit_char_whitelist=abcedfghijklmnopqrtuvwxyz0123456789 --oem 3 --psm 8')
cv2.imwrite("frame.png", img)
print(str1)

Output

f2e4


To see the full list of Tesseract options, run tesseract --help-extra or refer to this link.

I got the error: pytesseract.pytesseract.TesseractError: (1, "Error, unknown command line argument '-psm'")
Comment out the tesseract line and save the output image, then try running tesseract on it from the command line/terminal. If you still get the error, there is a problem with your Tesseract setup. (Tesseract 4 and later only accept the double-dash forms --oem and --psm, which is what triggers this error.)
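For anyone hitting the same '-psm' error, one way to avoid flag typos is to build the config string in one place. This is only a sketch: build_tesseract_config is a hypothetical helper, not part of pytesseract, and it assumes Tesseract 4 or later, where OEM values run from 0 to 3 and PSM values from 0 to 13.

```python
def build_tesseract_config(whitelist, oem=3, psm=8):
    """Build a config string for pytesseract.image_to_string (hypothetical helper).

    Uses the double-dash --oem/--psm flags that Tesseract 4+ expects.
    """
    if oem not in range(0, 4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(0, 14):
        raise ValueError("psm must be between 0 and 13")
    return f"-c tessedit_char_whitelist={whitelist} --oem {oem} --psm {psm}"


config = build_tesseract_config("abcdefghijklmnopqrstuvwxyz0123456789")
print(config)
```

The resulting string can then be passed as the config argument, e.g. pytesseract.image_to_string(image, config=config).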

Jon Betts · Accepted Answer · 2019-07-19 19:05:46Z

Tesseract is intended for performing OCR on text documents. In my experience it's good but a bit patchy even with very clean data.

In this case it appears you are trying to solve a CAPTCHA which is specifically designed to defeat OCR software. It's very likely you cannot use Tesseract to solve this issue, because:

It's not really designed for that
The scenario is adversarial:
- The example is specifically designed to prevent what you are trying to do
- If you could get it to work, the other party would likely change it to break again

If you want to proceed I would suggest:

Working on cleaning up the image before attempting to process it (can you get a nice readable black and white image?)
Training your own recognition network using a lot of examples
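As a rough illustration of the first suggestion, the sketch below turns a greyscale array into a black and white one using Otsu's method. Otsu's method is just one common choice that I am swapping in here, not something this answer prescribes, and the input is a synthetic stand-in rather than the actual CAPTCHA. The function names are made up for the example.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the grey level t that best separates pixels <= t from pixels > t."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w_low = np.cumsum(hist)             # number of pixels with value <= t
    w_high = hist.sum() - w_low         # number of pixels with value > t
    s_low = np.cumsum(hist * levels)    # sum of grey values <= t
    s_high = s_low[-1] - s_low          # sum of grey values > t
    valid = (w_low > 0) & (w_high > 0)  # both classes must be non-empty
    mean_low = s_low[valid] / w_low[valid]
    mean_high = s_high[valid] / w_high[valid]
    # Score proportional to the between-class variance; 0 where a class is empty.
    between = np.zeros(256)
    between[valid] = w_low[valid] * w_high[valid] * (mean_low - mean_high) ** 2
    return int(np.argmax(between))


def binarize(gray):
    """Map pixels above the Otsu threshold to 255 and all others to 0."""
    t = otsu_threshold(gray)
    return np.where(gray > t, 255, 0).astype(np.uint8)


# Synthetic greyscale image: dark "text" (30) on a light background (200).
gray = np.full((4, 6), 200, dtype=np.uint8)
gray[1:3, 1:5] = 30
binary = binarize(gray)
print(binary)
```

On a real CAPTCHA with colored noise, a single global threshold like this may well not be enough, which is part of why the cleanup step is the hard part.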
