Remove background noise from image to make text more clear for OCR

Question

I've written an application that segments an image based on the text regions within it, and extracts those regions as I see fit. What I'm attempting to do is clean the image so OCR (Tesseract) gives an accurate result. I have the following image as an example:

Running this through Tesseract gives a wildly inaccurate result. However, cleaning up the image (using Photoshop) to get the image as follows:

Gives exactly the result I would expect. The first image is already being run through the following method to clean it to that point:

    public Mat cleanImage(Mat srcImage) {
        Core.normalize(srcImage, srcImage, 0, 255, Core.NORM_MINMAX);
        Imgproc.threshold(srcImage, srcImage, 0, 255, Imgproc.THRESH_OTSU);
        Imgproc.erode(srcImage, srcImage, new Mat());
        Imgproc.dilate(srcImage, srcImage, new Mat(), new Point(0, 0), 9);
        return srcImage;
    }

What more can I do to clean the first image so it resembles the second image?

Edit: This is the original image before it's run through the cleanImage function.

If you know the text is always roughly in the center of the image, you could remove connected segments of dark pixels that lie entirely within some distance of the edges. If you know the text is always the same size, you could remove connected segments of dark pixels that contain fewer than some threshold number of pixels. If you can align the image and the digits are all the same height, you could estimate a top line and a bottom line and throw out outliers. If there are always 4 digits, you could use that to discard any segments beyond four according to some rule.
– Pace, Commented Nov 24, 2015 at 23:08
You can filter out noise segments (connected components) near the image borders (i.e. connected to the image borders): in your sample, the required text is not connected to the border.
– avtomaton, Commented Nov 26, 2015 at 14:33
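The border idea from the comments above can be sketched in plain Java, without OpenCV. This is only an illustration under assumptions: the image is already binarized into a boolean grid (true = dark pixel), components are 4-connected, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class BorderComponentFilter {

    /**
     * Returns a copy of a binary image (true = dark pixel) in which every
     * 4-connected dark component that touches the image border is cleared.
     */
    static boolean[][] removeBorderComponents(boolean[][] fg) {
        int rows = fg.length, cols = fg[0].length;
        boolean[][] out = new boolean[rows][];
        for (int r = 0; r < rows; r++) {
            out[r] = fg[r].clone();
        }

        // Seed a flood fill with every dark pixel that lies on the border.
        Deque<int[]> stack = new ArrayDeque<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                boolean onBorder = r == 0 || c == 0 || r == rows - 1 || c == cols - 1;
                if (onBorder && out[r][c]) {
                    out[r][c] = false;
                    stack.push(new int[] {r, c});
                }
            }
        }

        // Clear everything reachable from the seeds through dark neighbours.
        int[][] steps = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            for (int[] s : steps) {
                int nr = p[0] + s[0], nc = p[1] + s[1];
                if (nr >= 0 && nr < rows && nc >= 0 && nc < cols && out[nr][nc]) {
                    out[nr][nc] = false;
                    stack.push(new int[] {nr, nc});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        boolean[][] img = {
            {true,  true,  false, false, false},
            {false, true,  false, false, false},
            {false, false, false, true,  false},
            {false, false, false, false, false},
        };
        boolean[][] cleaned = removeBorderComponents(img);
        System.out.println("interior pixel kept: " + cleaned[2][3]);
        System.out.println("border blob kept: " + cleaned[1][1]);
    }
}
```

This only helps when the text itself never touches the border, which the second comment notes is the case for the sample image.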

dhanushka · Accepted Answer · 2015-11-30 13:10:06Z

My answer is based on the following assumptions. It's possible that none of them holds in your case.

It's possible for you to impose a threshold for bounding box heights in the segmented region. Then you should be able to filter out other components.
You know the average stroke widths of the digits. Use this information to minimize the chance that the digits are connected to other regions. You can use distance transform and morphological operations for this.

This is my procedure for extracting the digits:

Apply Otsu threshold to the image
Take the distance transform
Threshold the distance transformed image using the stroke-width ( = 8) constraint
Apply morphological operation to disconnect
Filter bounding box heights and make a guess where the digits are

stroke-width = 8 stroke-width = 10
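To show why thresholding the distance transform at half the stroke width drops thin structures, here is a minimal plain-Java sketch of steps 2 and 3. It is not the OpenCV implementation used in this answer: it uses a brute-force Euclidean distance transform that is only practical for tiny images, and the class, method names and demo image are hypothetical.

```java
final class StrokeWidthFilter {

    /** Brute-force Euclidean distance from each foreground pixel to the nearest background pixel. */
    static double[][] distanceTransform(boolean[][] fg) {
        int rows = fg.length, cols = fg[0].length;
        double[][] dist = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (!fg[r][c]) {
                    continue; // background pixels keep distance 0
                }
                double best = Double.MAX_VALUE;
                for (int br = 0; br < rows; br++) {
                    for (int bc = 0; bc < cols; bc++) {
                        if (!fg[br][bc]) {
                            best = Math.min(best, Math.hypot(r - br, c - bc));
                        }
                    }
                }
                dist[r][c] = best;
            }
        }
        return dist;
    }

    /** Keeps only pixels whose distance to the background exceeds strokeWidth / 2. */
    static boolean[][] keepThickStrokes(boolean[][] fg, double strokeWidth) {
        double[][] dist = distanceTransform(fg);
        boolean[][] out = new boolean[fg.length][fg[0].length];
        for (int r = 0; r < fg.length; r++) {
            for (int c = 0; c < fg[0].length; c++) {
                out[r][c] = dist[r][c] > strokeWidth / 2.0;
            }
        }
        return out;
    }

    /** Demo image: a 1-pixel-thick line in row 1 and a 5x5 block in rows/cols 3..7. */
    static boolean[][] demoImage() {
        boolean[][] img = new boolean[9][12];
        for (int c = 1; c <= 10; c++) {
            img[1][c] = true;
        }
        for (int r = 3; r <= 7; r++) {
            for (int c = 3; c <= 7; c++) {
                img[r][c] = true;
            }
        }
        return img;
    }

    public static void main(String[] args) {
        boolean[][] kept = keepThickStrokes(demoImage(), 4.0);
        System.out.println("thin line pixel kept: " + kept[1][5]);
        System.out.println("block centre kept: " + kept[5][5]);
    }
}
```

Only the cores of sufficiently thick strokes survive the threshold, so thin bridges between a digit and nearby clutter disappear; the surviving regions are then used to locate the digits.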

EDIT

Prepare a mask using the convexhull of the found digit contours
Copy digits region to a clean image using the mask

stroke-width = 8

stroke-width = 10

My Tesseract knowledge is a bit rusty. As I remember you can get a confidence level for the characters. You may be able to filter out noise using this information if you still happen to detect noisy regions as character bounding boxes.

C++ Code

    Mat im = imread("aRh8C.png", 0);
    // apply Otsu threshold
    Mat bw;
    threshold(im, bw, 0, 255, CV_THRESH_BINARY_INV | CV_THRESH_OTSU);
    // take the distance transform
    Mat dist;
    distanceTransform(bw, dist, CV_DIST_L2, CV_DIST_MASK_PRECISE);
    Mat dibw;
    // threshold the distance transformed image
    double SWTHRESH = 8;    // stroke width threshold
    threshold(dist, dibw, SWTHRESH/2, 255, CV_THRESH_BINARY);
    Mat kernel = getStructuringElement(MORPH_RECT, Size(3, 3));
    // perform opening, in case digits are still connected
    Mat morph;
    morphologyEx(dibw, morph, CV_MOP_OPEN, kernel);
    dibw.convertTo(dibw, CV_8U);
    // find contours and filter
    Mat cont;
    morph.convertTo(cont, CV_8U);

    Mat binary;
    cvtColor(dibw, binary, CV_GRAY2BGR);

    const double HTHRESH = im.rows * .5;    // height threshold
    vector<vector<Point>> contours;
    vector<Vec4i> hierarchy;
    vector<Point> digits; // points corresponding to digit contours

    findContours(cont, contours, hierarchy, CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
    for(int idx = 0; idx >= 0; idx = hierarchy[idx][0])
    {
        Rect rect = boundingRect(contours[idx]);
        if (rect.height > HTHRESH)
        {
            // append the points of this contour to digit points
            digits.insert(digits.end(), contours[idx].begin(), contours[idx].end());

            rectangle(binary,
                Point(rect.x, rect.y), Point(rect.x + rect.width - 1, rect.y + rect.height - 1),
                Scalar(0, 0, 255), 1);
        }
    }

    // take the convexhull of the digit contours
    vector<Point> digitsHull;
    convexHull(digits, digitsHull);
    // prepare a mask
    vector<vector<Point>> digitsRegion;
    digitsRegion.push_back(digitsHull);
    Mat digitsMask = Mat::zeros(im.rows, im.cols, CV_8U);
    drawContours(digitsMask, digitsRegion, 0, Scalar(255, 255, 255), -1);
    // expand the mask to include any information we lost in earlier morphological opening
    morphologyEx(digitsMask, digitsMask, CV_MOP_DILATE, kernel);
    // copy the region to get a cleaned image
    Mat cleaned = Mat::zeros(im.rows, im.cols, CV_8U);
    dibw.copyTo(cleaned, digitsMask);

EDIT

Java Code

    Mat im = Highgui.imread("aRh8C.png", 0);
    // apply Otsu threshold
    Mat bw = new Mat(im.size(), CvType.CV_8U);
    Imgproc.threshold(im, bw, 0, 255, Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);
    // take the distance transform
    Mat dist = new Mat(im.size(), CvType.CV_32F);
    Imgproc.distanceTransform(bw, dist, Imgproc.CV_DIST_L2, Imgproc.CV_DIST_MASK_PRECISE);
    // threshold the distance transform
    Mat dibw32f = new Mat(im.size(), CvType.CV_32F);
    final double SWTHRESH = 8.0;    // stroke width threshold
    Imgproc.threshold(dist, dibw32f, SWTHRESH/2.0, 255, Imgproc.THRESH_BINARY);
    Mat dibw8u = new Mat(im.size(), CvType.CV_8U);
    dibw32f.convertTo(dibw8u, CvType.CV_8U);

    Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    // open to remove connections to stray elements
    Mat cont = new Mat(im.size(), CvType.CV_8U);
    Imgproc.morphologyEx(dibw8u, cont, Imgproc.MORPH_OPEN, kernel);
    // find contours and filter based on bounding-box height
    final double HTHRESH = im.rows() * 0.5; // bounding-box height threshold
    List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
    List<Point> digits = new ArrayList<Point>();    // contours of the possible digits
    Imgproc.findContours(cont, contours, new Mat(), Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_SIMPLE);
    for (int i = 0; i < contours.size(); i++)
    {
        if (Imgproc.boundingRect(contours.get(i)).height > HTHRESH)
        {
            // this contour passed the bounding-box height threshold. add it to digits
            digits.addAll(contours.get(i).toList());
        }
    }
    // find the convexhull of the digit contours
    MatOfInt digitsHullIdx = new MatOfInt();
    MatOfPoint hullPoints = new MatOfPoint();
    hullPoints.fromList(digits);
    Imgproc.convexHull(hullPoints, digitsHullIdx);
    // convert hull index to hull points
    List<Point> digitsHullPointsList = new ArrayList<Point>();
    List<Point> points = hullPoints.toList();
    for (Integer i: digitsHullIdx.toList())
    {
        digitsHullPointsList.add(points.get(i));
    }
    MatOfPoint digitsHullPoints = new MatOfPoint();
    digitsHullPoints.fromList(digitsHullPointsList);
    // create the mask for digits
    List<MatOfPoint> digitRegions = new ArrayList<MatOfPoint>();
    digitRegions.add(digitsHullPoints);
    Mat digitsMask = Mat.zeros(im.size(), CvType.CV_8U);
    Imgproc.drawContours(digitsMask, digitRegions, 0, new Scalar(255, 255, 255), -1);
    // dilate the mask to capture any info we lost in earlier opening
    Imgproc.morphologyEx(digitsMask, digitsMask, Imgproc.MORPH_DILATE, kernel);
    // cleaned image ready for OCR
    Mat cleaned = Mat.zeros(im.size(), CvType.CV_8U);
    dibw8u.copyTo(cleaned, digitsMask);
    // feed cleaned to Tesseract

A few things to consider: it's not only about digits; the minus sign also needs to be detected; and the detected elements need to be merged into one image as the input for Tesseract.
@MarkusAtCvlabDotDe I've updated my answer with modifications needed to get a clean image.
I will present my solution as well later on. It's blob-based and needs less code. +1
Thanks for this. My C++ is not that great. I've implemented this solution in Java, only creating the mask using convexHull doesn't give the same result as you displayed above. I've posted the code here: pastebin.com/KfYFu1vk
@XueQing It should be easy to convert to Python, as the OpenCV calls are similar across C++ and Java. Currently there's no plan to add Python code.

Community · Accepted Answer · 2020-06-20 09:12:55Z

I think you need to work more on the pre-processing, to make the image as clean as you can before calling Tesseract.

My ideas for doing that are the following:

1- Extract the contours from the image.

2- Each contour has a width, a height and an area, so you can filter the contours by those properties. You can also use contour analysis to delete the contours that do not resemble a letter or a digit, for example by matching them against template contours.

3- After filtering the contours, check where the letters and digits are in the image; for this you may need a text detection method.

4- All you need to do now is remove the non-text areas and the rejected contours from the image.

5- Now you can apply your own binarization method, or let Tesseract binarize the image, and then run the OCR on it.

These are not necessarily the best steps; you may use only some of them, and that may be enough for you.

Other ideas:

You can do this in different ways. The best idea is to find a way to detect the digit and character locations, using methods such as template matching or feature-based approaches like HOG.
You can first binarize the image, then apply morphological opening with horizontal and vertical line structuring elements. This helps you detect the edges afterwards, so you can segment the image and then run the OCR.
After detecting all the contours in the image, you can also use the Hough transform to detect lines and parametrized curves. This way you can detect the characters that are lined up, so you can segment the image and run the OCR after that.

Much easier way:

1- Binarize the image

2- Apply some morphological operations to separate the contours

3- Invert the colors in the image (this may be done before step 2)

4- Find all contours in the image

5- Delete all contours whose width is greater than their height, as well as the very small contours, the very large ones, and the non-rectangular ones

Note: you may use text detection methods (or HOG or edge detection) instead of steps 4 and 5

6- Find the large rectangle that contains all the remaining contours in the image

7- You may do some extra pre-processing to enhance the input for Tesseract, and then call the OCR. (I advise you to crop the image and pass only the crop to the OCR, i.e. crop the yellow rectangle rather than using the whole image as input; that will also improve the results.)
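Steps 5 and 6 above could look roughly like the following plain-Java sketch. It assumes the contours have already been reduced to bounding boxes (represented here with java.awt.Rectangle); the class name, method names and the height limits are hypothetical, and the "non-rectangular" test from step 5 is left out.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BoxFilter {

    /** Step 5: drop boxes that are wider than tall, too small, or too large. */
    static List<Rectangle> filterCharacterBoxes(List<Rectangle> boxes, int minHeight, int maxHeight) {
        List<Rectangle> kept = new ArrayList<>();
        for (Rectangle b : boxes) {
            boolean widerThanTall = b.width > b.height;
            boolean tooSmall = b.height < minHeight;
            boolean tooLarge = b.height > maxHeight;
            if (!widerThanTall && !tooSmall && !tooLarge) {
                kept.add(b);
            }
        }
        return kept;
    }

    /** Step 6: smallest rectangle containing all boxes, or null if the list is empty. */
    static Rectangle enclosingBox(List<Rectangle> boxes) {
        Rectangle all = null;
        for (Rectangle b : boxes) {
            all = (all == null) ? new Rectangle(b) : all.union(b);
        }
        return all;
    }

    public static void main(String[] args) {
        List<Rectangle> boxes = Arrays.asList(
            new Rectangle(10, 10, 12, 30),   // plausible digit
            new Rectangle(30, 12, 14, 28),   // plausible digit
            new Rectangle(0, 0, 200, 5),     // wide strip, wider than tall
            new Rectangle(50, 50, 2, 3),     // speck, too small
            new Rectangle(0, 0, 100, 150));  // too large
        List<Rectangle> kept = filterCharacterBoxes(boxes, 10, 80);
        System.out.println(kept.size() + " boxes kept, crop region: " + enclosingBox(kept));
    }
}
```

One caveat with the width-greater-than-height rule: it would also discard a minus sign such as the one in "12-14", so that rule may need relaxing when such characters matter.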

MarkusAtCvlabDotDe · Accepted Answer · 2015-11-26 21:16:57Z

Would that image help you?

The algorithm producing that image would be easy to implement. I am sure that if you tweak some of its parameters, you can get very good results for this kind of image.

I tested all the images with tesseract:

Original image : Nothing detected
Processed image #1 : Nothing detected
Processed image #2 : 12-14 (exact match)
My processed image : y’1'2-14/j

edited Nov 26, 2015 at 21:16

answered Nov 26, 2015 at 20:45


5 Comments

HelloWorld123456789 Over a year ago

Did you try tesseract after removing the connected components at the edges? Since in your processed image, the connected components at the edges are not at all connected to the text, removing those might give better result.

MarkusAtCvlabDotDe Over a year ago

You are right! It will definitely give better results if those connected structures are removed. At the point of posting that image I wasn't aware of that fact. I thought Tesseract was strong enough to do that on its own and that it would be enough to just remove noise and other artifacts in between the digits. I will develop an extension to that algorithm which keeps it simple but gets rid of those border structures. Cheers!

HelloWorld123456789 Over a year ago

Also, can you add your algorithm to the answer?

Zy0n Over a year ago

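Tesseract can be tricky. Try running tesseract yourimage.png stdout -psm 7 digits, which treats the image as a single line of text and loads the digits config so that only digits are recognized (newer Tesseract versions spell the option --psm). Could you please post your method for producing your image above?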

MarkusAtCvlabDotDe Over a year ago

Yes, of course I will post the code. I only have it in theory so far and will implement and post it later. Besides that, it would be interesting to see whether we can solve your problem with bigger structures randomly distributed in the image (not only connected to the borders).

Yannis Douros · Accepted Answer · 2015-11-28 10:18:06Z

Just a little bit of thinking out of the box:

I can see from your original image that it's a rather rigorously preformatted document, looks like a road tax badge or something like that, right?

If the assumption above is correct, then you could implement a less generic solution: The noise you are trying to get rid of is due to features of the specific document template, it occurs in specific and known regions of your image. In fact, so does the text.

In that case, one way to go about it is to define the boundaries of the regions where you know that there is such "noise", and just white them out.

Then, follow the rest of the steps that you are already following: Do the noise reduction that will remove the finest detail (i.e. the background pattern that looks like the safety watermark or hologram in the badge). The result should be clear enough for Tesseract to process without trouble.

Just a thought anyway. Not a generic solution, I acknowledge that, so it depends on what your actual requirements are.
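A minimal plain-Java sketch of the whiting-out step, assuming a grayscale image stored as an int grid (0 = black, 255 = white) and noise regions known in advance from the document template. The class name, method name and the {x, y, width, height} region format are hypothetical.

```java
final class TemplateMask {

    /**
     * Returns a copy of the grayscale image with each known noise region painted white.
     * Each region is {x, y, width, height}; parts outside the image are ignored.
     */
    static int[][] whiteOut(int[][] gray, int[][] regions) {
        int rows = gray.length, cols = gray[0].length;
        int[][] out = new int[rows][];
        for (int r = 0; r < rows; r++) {
            out[r] = gray[r].clone();
        }
        for (int[] reg : regions) {
            // Clip the region to the image bounds.
            int x0 = Math.max(0, reg[0]);
            int y0 = Math.max(0, reg[1]);
            int x1 = Math.min(cols, reg[0] + reg[2]);
            int y1 = Math.min(rows, reg[1] + reg[3]);
            for (int y = y0; y < y1; y++) {
                for (int x = x0; x < x1; x++) {
                    out[y][x] = 255;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] gray = new int[4][6]; // all-black demo image
        int[][] regions = {{1, 1, 2, 2}, {4, 3, 5, 5}};
        int[][] cleaned = whiteOut(gray, regions);
        for (int[] row : cleaned) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The region coordinates would have to be measured once per document template, and the image would need to be aligned to that template first.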

Gowthaman · Accepted Answer · 2016-07-08 12:05:21Z

The font should not be too big or too small; it should be roughly in the range of 10-12 pt (i.e. a character height of approximately more than 20 and less than 80). You can downsample the image and try again with Tesseract. Also, Tesseract is not trained on every font, so problems may arise if the font is not among the trained ones.
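As a rough illustration of the resizing advice, the sketch below computes a scale factor that brings a measured character height into the suggested range. It assumes the 20 and 80 figures are pixel heights, which the answer does not state, and aiming for the midpoint of the range is an arbitrary choice; the class and method names are hypothetical.

```java
final class OcrScale {

    /**
     * Returns the factor by which to resize the image so that characters of the given
     * height land inside (minHeight, maxHeight). Returns 1.0 if no resizing is needed.
     */
    static double scaleFor(double charHeight, double minHeight, double maxHeight) {
        if (charHeight <= 0) {
            throw new IllegalArgumentException("charHeight must be positive");
        }
        if (charHeight > minHeight && charHeight < maxHeight) {
            return 1.0; // already in the suggested range
        }
        double target = (minHeight + maxHeight) / 2.0; // assumption: aim for the midpoint
        return target / charHeight;
    }

    public static void main(String[] args) {
        System.out.println(scaleFor(200, 20, 80)); // characters too tall: shrink
        System.out.println(scaleFor(40, 20, 80));  // in range: leave as is
        System.out.println(scaleFor(10, 20, 80));  // characters too short: enlarge
    }
}
```

The character height itself could be taken from the bounding boxes of the detected digit contours.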
