
Hi, I am trying to remove as much noise as possible from historical documents.

These documents have staining that appears as small dots throughout the page, which is affecting OCR and handwriting recognition. Apart from image denoising in OpenCV, is there a more effective way to clean such images?
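For reference, a common first pass for this kind of isolated-dot noise is a small median filter (roughly what `cv2.medianBlur(img, 3)` does). A minimal pure-NumPy sketch of the idea, with illustrative values, not tuned for any particular document:

```python
import numpy as np

def median3x3(img):
    """Apply a 3x3 median filter to a 2D grayscale image.

    Edges are handled by edge-replication padding. A median filter
    replaces each pixel with the median of its neighbourhood, so an
    isolated dark dot on a light page is wiped out while larger
    structures (text strokes) survive better than under a mean blur.
    """
    padded = np.pad(img, 1, mode="edge")
    # Collect the 9 shifted views of the 3x3 neighbourhood.
    stack = np.stack([padded[y:y + img.shape[0], x:x + img.shape[1]]
                      for y in range(3) for x in range(3)])
    return np.median(stack, axis=0).astype(img.dtype)

# A white page with a single dark "stain" dot: the filter removes it.
page = np.full((5, 5), 255, dtype=np.uint8)
page[2, 2] = 0
clean = median3x3(page)
```

Note the trade-off: a median filter also erodes thin pen strokes, which is why the contour-filtering approaches below are often preferred for text.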

[sample document image]

2 Answers


A potential approach is to adaptively threshold the image, perform some morphological operations, and remove noise using aspect-ratio and contour-area filtering. From here we can bitwise-AND the resulting mask with the input image to get a cleaned image. Here's the result:

[result image]

Since you didn't specify a language, I implemented it in Python:

import cv2
import numpy as np

# Load image, create blank mask, convert to grayscale, Gaussian blur,
# then adaptive threshold to obtain a binary image
image = cv2.imread('1.jpg')
mask = np.zeros(image.shape, dtype=np.uint8)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (7,7), 0)
thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY_INV, 51, 9)

# Create horizontal kernel then dilate to connect text contours
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,2))
dilate = cv2.dilate(thresh, kernel, iterations=2)

# Find contours and filter out noise using contour approximation and area filtering
cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.04 * peri, True)
    x, y, w, h = cv2.boundingRect(c)
    area = w * h
    ar = w / float(h)
    if area > 1200 and area < 50000 and ar < 6:
        cv2.drawContours(mask, [c], -1, (255,255,255), -1)

# Bitwise-and input image and mask to get result
mask = cv2.cvtColor(mask, cv2.COLOR_BGR2GRAY)
result = cv2.bitwise_and(image, image, mask=mask)
result[mask == 0] = (255,255,255)  # Color background white

cv2.imshow('thresh', thresh)
cv2.imshow('mask', mask)
cv2.imshow('result', result)
cv2.waitKey()
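The keep/discard rule inside the loop can be pulled out into a small predicate, which makes the thresholds easier to tune; the default values 1200, 50000, and 6 below come from the script above and will need adjusting per document:

```python
def keep_contour(w, h, min_area=1200, max_area=50000, max_aspect=6.0):
    """Return True if a bounding box of width w and height h looks like
    text rather than a stain: its area must lie strictly between
    min_area and max_area, and it must not be an extremely wide,
    flat blob (aspect ratio below max_aspect)."""
    area = w * h
    aspect = w / float(h)
    return min_area < area < max_area and aspect < max_aspect

keep_contour(10, 10)    # tiny speckle: rejected
keep_contour(200, 40)   # word-sized region: kept
```

This is the key lever of the whole approach: stains are small, so they fall below `min_area`, while connected runs of text (after dilation) fall inside the accepted band.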

1 Comment

These two lines are never used: peri = cv2.arcLength(c, True) and approx = cv2.approxPolyDP(c, 0.04 * peri, True)

I don't know if you are still facing this problem, but there is a recent dataset that can help with this:

ShabbyPages is the first dataset of its kind, launched alongside a new Kaggle competition. This document-image dataset, created using Augraphy, dramatically improves document layout detection, text extraction, and OCR processes that depend on denoising and binarization preprocessing models.

This dataset is well suited to training a model to denoise historical document images like the one above. Let me know if you have any questions.
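Training such a denoising model needs paired (clean, dirty) images, which is what Augraphy generates synthetically. A minimal pure-NumPy sketch of the idea, scattering random dark dots over a clean page to make a "shabby" copy; the noise parameters here are illustrative and much simpler than Augraphy's actual augmentations:

```python
import numpy as np

def add_dot_stains(clean, n_dots=200, rng=None):
    """Return a 'shabby' copy of a clean grayscale page by darkening
    n_dots randomly chosen pixels, mimicking small-dot staining.
    The clean input is left untouched, so (clean, dirty) form a
    supervised training pair for a denoising model."""
    rng = np.random.default_rng(rng)
    dirty = clean.copy()
    ys = rng.integers(0, clean.shape[0], size=n_dots)
    xs = rng.integers(0, clean.shape[1], size=n_dots)
    dirty[ys, xs] = rng.integers(0, 128, size=n_dots)  # dark speckles
    return dirty

clean = np.full((64, 64), 255, dtype=np.uint8)   # blank white "page"
dirty = add_dot_stains(clean, n_dots=50, rng=0)  # synthetic stained copy
```

The model then learns the mapping dirty → clean; at inference time it is applied to real stained scans.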

