I am using Tesseract OCR to extract text from an image file.

Below is the sample text I got from my Image:

Certificate No. Certificate Issued Date Acoount Reference Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA SUBIN-KAKAKSFCL0858710154264833O 

How can I extract the Certificate No. from this? Any hint or solution would help.

2 Answers

If the certificate number always has the structure shown here (two capital letters, a hyphen, then 17 characters), you can use a regex:

    import re

    # I originally ran this on the entire OCR output; this shorter string is just an example
    sequence = 'Reference IN-KA047969602415880 18-Feb-2016 01:39'
    re.search('[A-Z]{2}-.{17}', sequence).group()
    # 'IN-KA047969602415880'

.search scans the string for the first place where the pattern you specify matches, and .group() returns the matched text (in this case there would be only one match). You can search for almost any structure in a string this way; I suggest reviewing the basics of regex.
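
One practical detail: re.search returns None when nothing matches, so calling .group() directly raises an AttributeError on text without a certificate number. A minimal sketch of a guard around the same pattern follows; the helper name find_certificate_no is hypothetical, not part of any library.

```python
import re

# Same pattern as in the answer: two capitals, a hyphen, then any 17 characters.
CERT_PATTERN = re.compile(r'[A-Z]{2}-.{17}')

def find_certificate_no(text):
    """Return the first substring matching CERT_PATTERN, or None if there is none."""
    match = CERT_PATTERN.search(text)
    return match.group() if match else None

print(find_certificate_no('Reference IN-KA047969602415880 18-Feb-2016 01:39'))
print(find_certificate_no('no certificate here'))
```

The first call prints the certificate number; the second prints None instead of raising.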

4 Comments

Thanks for your answer. I'm trying to make it dynamic, so I'm looking for a solution that doesn't rely on indexing or character length.
@RajeevSrivastava There is no indexing here; the search uses a regular expression. [A-Z]{2}-.{17} means exactly two ({2}) characters from [A-Z], a hyphen, then any (.) 17 ({17}) characters, which should match the general structure of a certificate number. You can make the search as dynamic as you like by using alternative search expressions; regex is capable of plenty.
@Ronny Efronny In the example I posted above, along with 'IN-KA047969602415880' it also matches 'KA-BA SUBIN-KAKAKSFC'.
@RajeevSrivastava That would be because .{17} matches any 17 characters at all, including spaces. Try [A-Z]{2}-\S{17} instead, where \S means any non-space character. Again, I emphasize that to use this method you only need to find a structure for the certificate number and search for that, but it has to be specific enough not to accidentally catch other things (as in this case, where a space was one of the 17 characters).
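
To make the difference concrete, here is a small sketch that runs both patterns from the comments over the full sample text from the question using re.findall, which returns every non-overlapping match rather than just the first. The third pattern is an assumption inferred from the single sample (two capitals, a hyphen, two capitals, 15 digits, bounded by \b on both sides); a real certificate format may differ.

```python
import re

# Sample OCR text from the question, copied as posted.
text = ('Certificate No. Certificate Issued Date Acoount Reference Unique Doc. '
        'Reference IN-KA047969602415880 18-Feb-2016 01:39 PM '
        'NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA '
        'SUBIN-KAKAKSFCL0858710154264833O')

# '.' accepts spaces, so a second, unwanted match appears.
loose = re.findall(r'[A-Z]{2}-.{17}', text)

# '\S' rejects the space, but still matches inside the SUBIN-... reference.
no_space = re.findall(r'[A-Z]{2}-\S{17}', text)

# Assumed stricter format: word boundary, 2 capitals, hyphen, 2 capitals, 15 digits.
strict = re.findall(r'\b[A-Z]{2}-[A-Z]{2}\d{15}\b', text)

print(loose)     # ['IN-KA047969602415880', 'KA-BA SUBIN-KAKAKSFC']
print(no_space)  # ['IN-KA047969602415880', 'IN-KAKAKSFCL08587101']
print(strict)    # ['IN-KA047969602415880']
```

With re.search only the first match is returned, so all three patterns give the certificate number on this sample; the extra matches matter once you collect every match or the field order changes.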

Before passing the image to Tesseract OCR, it's important to preprocess it to remove noise and smooth the text. Here's a simple approach using OpenCV:

  • Convert image to grayscale
  • Otsu's threshold to obtain binary image
  • Gaussian blur and invert image

After converting to grayscale, we apply Otsu's threshold to get a binary image.

From here, we give it a slight blur and invert the image to get our result.

Results from Pytesseract

Certificate No. : IN-KA047969602415880

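    import cv2
    import pytesseract

    # Path to the Tesseract executable (Windows install location)
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

    # Load the image as grayscale (flag 0)
    image = cv2.imread('1.png', 0)

    # Otsu's threshold, inverted so the text becomes white on black
    thresh = cv2.threshold(image, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]

    # Slight blur, then invert back to black text on white
    blur = cv2.GaussianBlur(thresh, (3,3), 0)
    result = 255 - blur

    # --psm 6 treats the image as a single uniform block of text
    data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
    print(data)

    cv2.imshow('thresh', thresh)
    cv2.imshow('result', result)
    cv2.waitKey()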
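
When the OCR output has the form shown above (a label followed by its value), the number can be picked up by anchoring on the label rather than on the number's length. A minimal sketch, assuming the text contains something like "Certificate No. : <value>"; the helper name certificate_from_ocr is hypothetical, and real OCR output may vary in spacing or punctuation.

```python
import re

def certificate_from_ocr(ocr_text):
    """Return the token that follows 'Certificate No.', or None if the label is absent."""
    # Allow optional whitespace and an optional colon between the label and the value.
    match = re.search(r'Certificate No\.\s*:?\s*(\S+)', ocr_text)
    return match.group(1) if match else None

data = 'Certificate No. : IN-KA047969602415880\n'
print(certificate_from_ocr(data))  # IN-KA047969602415880
```

This only suits output where the value directly follows its label. On the flattened text in the question, where "Certificate No." is followed by other labels, it would return the wrong token.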
