
I want to read the text from an image.

I use pytesseract in Python.

Here is my code:

    import pytesseract
    from PIL import Image

    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

    image = Image.open(r'a.jpg')
    image.resize((150, 50), Image.ANTIALIAS).save("pic.jpg")
    image = Image.open("pic.jpg")

    captcha = pytesseract.image_to_string(image).replace(" ", "").replace("-", "").replace("$", "")


However, it returns an empty string.

What is the correct way to do this?

Thanks.

2 Answers


I agree with @Jon Betts:

  1. Tesseract is not a very robust OCR engine; it works well only on clean binary (black-and-white) images with the right settings.
  2. CAPTCHAs are meant to fool OCR software!

But if you really need to do it, you need to come up with a manual procedure for it.

I created the code below specifically for the type of CAPTCHA that you gave (but it's completely rigid and is not generalized or optimized for all cases).

Pseudocode

  1. apply a median blur
  2. apply a threshold to keep the blue colors only (this stage outputs a binary image)
  3. apply a morphological opening to remove small white specks from the binary image
  4. give the image to Tesseract with these options:
    • a limited whitelist of output characters
    • OEM 3: the default engine mode, based on what is available
    • PSM 8: treat the image as a single word

Code

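    import cv2
    import numpy as np
    import pytesseract
    from PIL import Image

    img = cv2.imread('a.jpg')

    # Step 1: median blur to suppress speckle noise
    img = cv2.medianBlur(img, 3)

    # Step 2: extract the blue parts (OpenCV loads images as BGR,
    # so channel 0 is blue and channel 2 is red)
    img2 = np.zeros((img.shape[0], img.shape[1]), dtype=np.uint8)
    cond = np.bitwise_and(img[:, :, 0] >= 100, img[:, :, 2] < 100)
    img2[np.where(cond)] = 255
    img = img2

    # Step 3: delete the noise with a morphological opening
    kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
    img = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)

    # Step 4: run Tesseract. Tesseract 4+ only accepts the double-dash forms
    # --oem/--psm; single-dash -psm raises "unknown command line argument".
    str1 = pytesseract.image_to_string(
        Image.fromarray(img),
        config='-c tessedit_char_whitelist=abcedfghijklmnopqrtuvwxyz0123456789 --oem 3 --psm 8')
    cv2.imwrite("frame.png", img)
    print(str1)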

Output

f2e4 
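
The blue-extraction step depends on OpenCV's BGR channel order, so it can help to check the thresholds on a tiny synthetic image before running them on a real CAPTCHA. This is only a minimal sketch: blue_mask and its blue_min/red_max parameters are hypothetical names for illustration, and the value 100 is simply the cutoff used in the code above, not a value that will suit every image.

```python
import numpy as np

def blue_mask(bgr, blue_min=100, red_max=100):
    """Return a uint8 mask: 255 where a pixel looks blue, 0 elsewhere.

    Expects an H x W x 3 array in BGR order, as returned by cv2.imread.
    """
    blue = bgr[:, :, 0]
    red = bgr[:, :, 2]
    cond = (blue >= blue_min) & (red < red_max)
    return np.where(cond, 255, 0).astype(np.uint8)

# Synthetic 1 x 3 image in BGR order: a blue, a red and a white pixel.
sample = np.array([[[200, 30, 30], [30, 30, 200], [255, 255, 255]]], dtype=np.uint8)
mask = blue_mask(sample)
print(mask.tolist())  # [[255, 0, 0]]
```

Only the blue pixel survives; the white pixel is rejected because its red channel is high, which is what keeps a light background out of the mask.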


To see the full list of Tesseract options, run the command tesseract --help-extra or refer to this link.


2 Comments

I got the error: pytesseract.pytesseract.TesseractError: (1, "Error, unknown command line argument '-psm'")
Comment out the Tesseract line and check the image output, then try Tesseract from the command line/terminal. If you're still getting the error, the Tesseract setup has a problem.
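
A likely cause of the error in the first comment is the flag syntax rather than the installation: Tesseract 4 and later accept --psm and --oem but reject the single-dash forms. A minimal sketch of building the config string with double-dash flags follows; build_tesseract_config is a hypothetical helper, not part of pytesseract, and the full a-z whitelist is an assumption for illustration.

```python
def build_tesseract_config(whitelist, oem=3, psm=8):
    """Build a config string using the double-dash flags Tesseract 4+ expects."""
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={whitelist}"

config = build_tesseract_config("abcdefghijklmnopqrstuvwxyz0123456789")
print(config)
# The string would then be passed along as:
#   pytesseract.image_to_string(image, config=config)
```

Running tesseract --version in a terminal shows which major version is installed.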

Tesseract is intended for performing OCR on text documents. In my experience it's good but a bit patchy even with very clean data.

In this case it appears you are trying to solve a CAPTCHA which is specifically designed to defeat OCR software. It's very likely you cannot use Tesseract to solve this issue, because:

  • It's not really designed for that
  • The scenario is adversarial:
    • The example is specifically designed to prevent what you are trying to do
    • If you could get it to work, the other party would likely change it to break again

If you want to proceed I would suggest:

  • Working on cleaning up the image before attempting to process it (can you get a nice, readable black and white image?)
  • Training your own recognition network on a large number of examples
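
As one way to approach the first suggestion, here is a minimal sketch that turns a grayscale image into pure black and white. It uses Otsu's method to pick the threshold, which is my choice for illustration and not something this answer prescribes; otsu_threshold and binarize are hypothetical helper names, and the input is assumed to be a uint8 NumPy array (for example from cv2.imread with cv2.IMREAD_GRAYSCALE).

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold (0-255) that maximises between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    weight_bg = np.cumsum(hist)             # number of pixels with value <= t
    weight_fg = hist.sum() - weight_bg      # number of pixels with value > t
    cum_intensity = np.cumsum(hist * levels)
    # np.maximum(..., 1) guards against division by zero for empty classes.
    mean_bg = cum_intensity / np.maximum(weight_bg, 1)
    mean_fg = (cum_intensity[-1] - cum_intensity) / np.maximum(weight_fg, 1)
    between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
    return int(np.argmax(between_var))

def binarize(gray):
    """Map a uint8 grayscale array to pure black (0) and white (255)."""
    t = otsu_threshold(gray)
    return np.where(gray > t, 255, 0).astype(np.uint8)

# Synthetic example: dark "ink" (value 50) on a light background (value 200).
gray = np.full((4, 6), 200, dtype=np.uint8)
gray[1:3, 2:4] = 50
binary = binarize(gray)
print(binary)
```

A global threshold like this only helps when the text and the background differ clearly in brightness; a CAPTCHA with coloured noise may need a colour-based mask instead.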
