Extract specific contents from text using python and Tesseract OCR

Question

I am using Tesseract OCR to extract text from an image file.

Below is the sample text I got from my Image:

Certificate No. Certificate Issued Date Acoount Reference Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA SUBIN-KAKAKSFCL0858710154264833O

How can I extract Certificate No. from this? Any hint or solution will help me here.

Ronny Efronny · Accepted Answer · 2019-09-16 07:50:17Z

If the certificate number always has the structure given here (2 letters, a hyphen, then 17 characters), you can use regex:

    import re

    # I took the entire sequence originally, but this is just an example
    sequence = 'Reference IN-KA047969602415880 18-Feb-2016 01:39'
    re.search(r'[A-Z]{2}-.{17}', sequence).group()
    # 'IN-KA047969602415880'

.search looks for the pattern you specify and returns the first match, and .group() returns the matched text (in this case there would be only one match). You can search for anything like this in a given string; I suggest reviewing how regex works.
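Note that `re.search` returns `None` when nothing matches, so calling `.group()` directly on a failed search raises an `AttributeError`. A minimal sketch of a guarded lookup, assuming the pattern from this answer; the helper name `find_certificate_no` is hypothetical and only for illustration:

```python
import re

# Pattern from the answer: 2 uppercase letters, a hyphen, then any 17 characters.
CERT_PATTERN = re.compile(r'[A-Z]{2}-.{17}')

def find_certificate_no(text):
    """Return the first match of CERT_PATTERN in text, or None if there is none."""
    match = CERT_PATTERN.search(text)
    return match.group() if match else None

sequence = 'Reference IN-KA047969602415880 18-Feb-2016 01:39'
print(find_certificate_no(sequence))
print(find_certificate_no('no certificate here'))
```

Returning `None` instead of raising lets the caller decide what to do when the OCR output does not contain a certificate number at all.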


4 Comments

Rajeev Srivastava Over a year ago

Thanks for your answer. I'm trying to make it dynamic, like looking for some solutions other than using Indexing or character Length.

Ronny Efronny Over a year ago

@RajeevSrivastava There is no indexing here; the search uses a regular expression. [A-Z]{2}-.{17} means an expression that contains {2} characters of type [A-Z], a hyphen, then any (.) {17} characters, which should match the general structure of a certificate number. You can make the search as dynamic as you like by using alternative search expressions; regex is capable of plenty.

Rajeev Srivastava Over a year ago

In the example I posted above, along with 'IN-KA047969602415880' it also matches 'KA-BA SUBIN-KAKAKSFC' @Ronny Efronny

Ronny Efronny Over a year ago

@RajeevSrivastava That would be because .{17} refers to any 17 characters at all, including spaces. Try [A-Z]{2}-\S{17} instead, where \S means any non-space character. Again, I emphasize that to use this method you only need to find a structure for the certificate number and search for that, but it has to be specific enough not to accidentally catch other things (as in this case, where a space was one of the 17 characters).
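To see how specific the pattern needs to be, the sketch below runs three patterns over the sample text from the question. The third pattern is an assumption, not something confirmed in this thread: it supposes that a certificate number is always 2 letters, a hyphen, 2 letters, and 15 digits, standing as a separate word. On this sample, the `\S` variant no longer matches across the space, but it still picks up a fragment of the unique document reference, because `SUBIN-` ends in `IN-`.

```python
import re

# OCR output from the question, copied verbatim (including the OCR misspelling).
text = ('Certificate No. Certificate Issued Date Acoount Reference '
        'Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM '
        'NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA SUBIN-KAKAKSFCL0858710154264833O')

# Any 17 characters after the hyphen, spaces included.
loose = re.findall(r'[A-Z]{2}-.{17}', text)

# 17 non-space characters after the hyphen.
no_spaces = re.findall(r'[A-Z]{2}-\S{17}', text)

# Assumed structure: 2 letters, hyphen, 2 letters, 15 digits, on word boundaries.
strict = re.findall(r'\b[A-Z]{2}-[A-Z]{2}\d{15}\b', text)

print(loose)      # two matches, the second one spans a space
print(no_spaces)  # two matches, the second one is part of the doc reference
print(strict)     # only the certificate number
```

The word boundary `\b` is what rejects the `IN-` inside `SUBIN-`, and requiring digits after the second letter pair rejects `KA-BA`. If real certificates deviate from the assumed structure, the strict pattern would need adjusting.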

nathancy · Accepted Answer · 2019-09-17 01:40:00Z

Before throwing the image into Tesseract OCR, it's important to preprocess the image to remove noise and smooth the text. Here's a simple approach using OpenCV:

Convert image to grayscale
Otsu's threshold to obtain binary image
Gaussian blur and invert image

After converting to grayscale, we apply Otsu's threshold to get a binary image

From here we give it a slight blur and invert the image to get our result

Results from Pytesseract

Certificate No. : IN-KA047969602415880

    import cv2
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

    # Load as grayscale, apply Otsu's threshold, then blur and invert
    image = cv2.imread('1.png',0)
    thresh = cv2.threshold(image, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
    blur = cv2.GaussianBlur(thresh, (3,3), 0)
    result = 255 - blur

    data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
    print(data)

    cv2.imshow('thresh', thresh)
    cv2.imshow('result', result)
    cv2.waitKey()
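When the OCR output keeps the label and its value together, as in the result shown above, the value can be pulled out by its label instead of by its shape. A minimal sketch, assuming the output has the form `label : value`; the helper name `extract_field` is hypothetical, and the sample string stands in for the text returned by `pytesseract.image_to_string`:

```python
import re

def extract_field(ocr_text, label):
    """Return the first non-space token after 'label :' in ocr_text, or None."""
    # re.escape keeps the dot in 'Certificate No.' from acting as a wildcard.
    pattern = re.escape(label) + r'\s*:\s*(\S+)'
    match = re.search(pattern, ocr_text)
    return match.group(1) if match else None

# Stand-in for the OCR output shown above.
data = 'Certificate No. : IN-KA047969602415880'
print(extract_field(data, 'Certificate No.'))
```

This does not depend on the length or character makeup of the certificate number, but it does depend on the OCR reading the label correctly and keeping it next to the value.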
