Originally asked on the Graphic Design site here (I don't know how to 'move' a question to another site).
- Credit to Python Tutorials for Digital Humanities - I followed a lot of his ideas, but I haven't got it working yet.
I have a bunch of scanned document images and in each image there is a variance in background colour (sometimes a wide variance).
I want to adjust all image pixels such that they are 'normalized' towards the mean pixel colour. To put it another way, I want to adjust all pixel colour (or grayscale) values so that the background (e.g. the 'white page') loses its colour gradients and variation and instead becomes as close to uniform white as possible.
I have been looking into colour gradients, vectorization and various filters, thresholding and blurring techniques, but I can't quite work out how to do what I see in my head.
Take the following example image: (ignore my red annotation)
I want to find the overall mean colour across all pixels, then move each pixel's value towards that mean to create a more 'flat' image. The reason I want to do that is because I believe it will improve the next step, which is to detect contours and edges for the purpose of text detection (and ultimately better OCR results).
So in the above example, the goal is to effectively remove the gray diagonal lines, but leave the text. I think there might be a way to automatically determine a threshold value (or have a dynamic threshold) but I am not sure exactly how to do that.
Here is another example image:
The goal for this second image would be to effectively remove the target logo and most of the other background 'document colour mess' picked up by the scanner, as per the annotation.
The overall objective is to improve text block detection to then improve OCR accuracy.
I know I can play around with filters and image enhancements using sliders in GIMP or some other graphics GUI, but one of the key points is that this process has to be automated, because doing it manually would essentially defeat the purpose and I may as well go back to manual data entry by eyeballing every document (ugh!).
This is why I have been trying to use OpenCV in Python.
The second point is that since every image is different I need to be able to determine the threshold (for final binarization prior to OCR) automatically, which is kind of what I mean by 'normalization' above.

