
I have been working on a project which involves extracting text from an image. My research suggests that Tesseract is one of the best OCR libraries available, so I decided to use it along with OpenCV, which is needed for the image manipulation.

I have been experimenting a lot with the Tesseract engine, but it does not seem to give me the expected results. I have attached the image as a reference. The output I got is:

1] =501 [

Instead, the expected output is

TM10-50%L

What I have done so far:

  • Removing noise
  • Applying an adaptive threshold
  • Sending it to the Tesseract OCR engine

Are there any other suggestions to improve the algorithm?

Thanks in advance.

Snippet of the code:

import cv2
import sys
import pytesseract
import numpy as np
from PIL import Image

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ocr_simple.py image.jpg')
        sys.exit(1)

    # Read image path from command line
    imPath = sys.argv[1]
    gray = cv2.imread(imPath, 0)

    # Blur
    blur = cv2.GaussianBlur(gray, (9,9), 0)

    # Binarizing (note: the variable must be named consistently;
    # the original snippet assigned to `thres` but passed `thresh`,
    # which raises a NameError)
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 5, 3)

    text = pytesseract.image_to_string(thresh)
    print(text)

Images attached. The first image is the original image.

The second image is what has been fed to Tesseract.

    Please attach the sample image! Commented May 10, 2022 at 13:41
  • You need to preprocess the image before throwing it into OCR, with the goal that the text to extract is in black in the foreground and the background is in white. Commented May 11, 2022 at 1:31
  • @Markus, Images attached. Thanks for pointing it out. Commented May 11, 2022 at 6:53
  • @nathancy Hi man, many thanks for your suggestions. I will definitely go over them. Meanwhile, I have attached a couple of images, the original and the one which is being fed to Tesseract. Could you share your suggestions on this as well? Commented May 11, 2022 at 6:54
  • @nathancy We are already preprocessing the image before feeding it to Tesseract. I attached the image for reference. Do you see any problem in that as well? Commented May 11, 2022 at 6:55

1 Answer

Before performing OCR on an image, it's important to preprocess the image. The idea is to obtain a processed image where the text to extract is in black with the background in white. For this specific image, we need to obtain the ROI before we can OCR.

To do this, we can convert to grayscale, apply a slight Gaussian blur, then an adaptive threshold to obtain a binary image. From here, we can apply morphological closing to merge the individual letters together. Next we find contours, filter using contour area filtering, and then extract the ROI. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look here for more options.


Detected ROI


Extracted ROI


Result from Pytesseract OCR

TM10=50%L 

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Grayscale, Gaussian blur, adaptive threshold
image = cv2.imread('1.jpg')
original = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY_INV, 5, 5)

# Perform morph close to merge letters together
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)

# Find contours, filter by contour area, extract ROI
cnts, _ = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
for c in cnts:
    area = cv2.contourArea(c)
    if area > 1800 and area < 2500:
        x,y,w,h = cv2.boundingRect(c)
        ROI = original[y:y+h, x:x+w]
        cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 3)

# Perform text extraction
ROI = cv2.GaussianBlur(ROI, (3,3), 0)
data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
print(data)

cv2.imshow('ROI', ROI)
cv2.imshow('close', close)
cv2.imshow('image', image)
cv2.waitKey()

2 Comments

It sounds like a great answer. I am wondering if the area filter would break in case there are many such texts in the picture. Something like: TM 10-30%L somewhere at the top, TM 10-40%L somewhere in the middle, TM 10-50%L at the end. I can clarify my question if it is not self-explanatory.
@HemantBhargava it may, the answer was designed for this specific image. To make it more robust, you could add in aspect ratio filtering as well. There's no single solution that would work for all cases using simple image processing techniques. You would have to train your own custom deep/machine learning model to handle all cases
