8

I want to extract text (mainly numbers) from images like this

enter image description here

I tried this code

import pytesseract from PIL import Image pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' img = Image.open('1.jpg') text = pytesseract.image_to_string(img, lang='eng') print(text) 

but all i get is this (hE PPAR)

1 Answer 1

12

When performing OCR, it is important to preprocess the image so the desired text to detect is in black with the background in white. To do this, here's a simple approach using OpenCV to Otsu's threshold the image which will result in a binary image. Here's the image after preprocessing:

enter image description here

We use the --psm 6 configuration setting to treat the image as a uniform block of text. Here's other configuration options you can try. Result from Pytesseract

01153521976

Code

import cv2 import pytesseract pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe" image = cv2.imread('1.png', 0) thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1] data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6') print(data) cv2.imshow('thresh', thresh) cv2.waitKey() 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.