A Python utility for comparing expected text against OCR-extracted text from images. Perfect for verifying OCR outputs, debugging text recognition, or automatically validating image and text file pairs.
- Image Preprocessing: Automatic grayscale conversion, contrast enhancement, and binarization for improved OCR accuracy
- Tesseract Integration: Leverages Tesseract OCR for robust text extraction
- Smart Comparison: Normalizes, tokenizes, and compares text with multiple similarity metrics
- Detailed Analytics: Computes character ratio, partial ratio, token set ratio, and more
- Word-Level Diff: Shows precise differences and fuzzy correction suggestions
- Watch Mode: Continuously monitors directories for new image/text pairs
- Flexible Output: Exit on mismatch or continue logging for batch processing
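To make the comparison features above more concrete, here is a minimal sketch of text normalization, a character-level similarity score, and a word-level diff. It is not the script's actual implementation: it uses the standard-library `difflib` as a stand-in for rapidfuzz so that it runs without extra dependencies, and the function names `normalize`, `char_ratio`, and `word_diff` are hypothetical.

```python
import difflib
import re
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def char_ratio(expected: str, found: str) -> float:
    """Character-level similarity on a 0-100 scale (difflib, not rapidfuzz)."""
    matcher = difflib.SequenceMatcher(None, normalize(expected), normalize(found))
    return round(matcher.ratio() * 100, 1)


def word_diff(expected: str, found: str) -> List[Tuple[str, str, str]]:
    """Return (operation, expected_words, found_words) for each differing span."""
    exp_tokens = normalize(expected).split()
    got_tokens = normalize(found).split()
    matcher = difflib.SequenceMatcher(None, exp_tokens, got_tokens)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append(
                (op.upper(), " ".join(exp_tokens[i1:i2]), " ".join(got_tokens[j1:j2]))
            )
    return changes


if __name__ == "__main__":
    print(char_ratio("Hello, world!", "hello world"))
    for change in word_diff("hello world", "hello there world"):
        print(change)
```

The scores produced by `difflib` will generally differ from rapidfuzz's metrics, so treat the numbers as illustrative only.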
- Python 3.7+ (tested with Python 3.9–3.13)
- Tesseract OCR Engine
1. Install Python Dependencies

    pip install pillow pytesseract rapidfuzz

2. Install Tesseract OCR
Windows:
- Download from Tesseract-OCR
- Add Tesseract to your system PATH, or configure `pytesseract.pytesseract.tesseract_cmd`
Linux (Ubuntu/Debian):

    sudo apt update
    sudo apt install tesseract-ocr

macOS:

    brew install tesseract

3. Clone Repository (Optional)

    git clone https://github.com/Lavish-code/OCR-Validator.git
    cd OCR-Validator

Validate a single image against expected text:

    python ocr_validation.py --image path/to/image.png --text "Expected Text Here"

Common Options:

| Option | Shorthand | Default | Description |
|---|---|---|---|
| `--threshold` | `-th` | 80 | Minimum similarity score (0-100) |
| `--lang` | | eng | Tesseract language code |
| `--psm` | | 6 | Page segmentation mode |
| `--oem` | | 3 | OCR engine mode |
| `--no-preprocess` | | False | Skip image preprocessing |
| `--debug` | | False | Enable debug output |
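
As a rough illustration of how the options in the table could be declared, the sketch below builds an `argparse` parser with the same flags and defaults. The `-i` and `-t` shorthands are taken from the usage examples in this README. `build_parser` and `tesseract_config` are hypothetical names, and the real script may structure this differently. The `--oem` and `--psm` values are standard Tesseract flags, which pytesseract accepts through the `config` argument of `image_to_string`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the single-image options with the defaults listed in the table."""
    parser = argparse.ArgumentParser(description="Compare expected text with OCR output.")
    parser.add_argument("--image", "-i", help="Path to the image to validate")
    parser.add_argument("--text", "-t", help="Expected text")
    parser.add_argument("--threshold", "-th", type=float, default=80,
                        help="Minimum similarity score (0-100)")
    parser.add_argument("--lang", default="eng", help="Tesseract language code")
    parser.add_argument("--psm", type=int, default=6, help="Page segmentation mode")
    parser.add_argument("--oem", type=int, default=3, help="OCR engine mode")
    parser.add_argument("--no-preprocess", action="store_true",
                        help="Skip image preprocessing")
    parser.add_argument("--debug", action="store_true", help="Enable debug output")
    return parser


def tesseract_config(args: argparse.Namespace) -> str:
    """Build the config string that would be handed to Tesseract."""
    return f"--oem {args.oem} --psm {args.psm}"


if __name__ == "__main__":
    args = build_parser().parse_args(["-i", "image.png", "-t", "Hello", "-th", "90"])
    print(args.threshold, args.lang, tesseract_config(args))
```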
Monitor a directory for new images and automatically validate them:

    python ocr_validation.py --watch --watch-dir path/to/watch_folder

Watch Mode Options:

| Option | Default | Description |
|---|---|---|
| `--image-glob` | `*.png,*.jpg,*.jpeg` | Image file patterns (comma-separated) |
| `--text-exts` | `.txt,.caption,.json` | Sidecar file extensions |
| `--json-key` | `text` | JSON field containing text |
| `--interval` | 1.0 | Polling interval (seconds) |
| `--fail-on-mismatch` | False | Exit immediately on first mismatch |
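
The watch-mode options above suggest a simple pairing scheme: each image is matched with a sidecar file that shares its base name. The following sketch shows one way that pairing and polling could work, using only the standard library. The helper names (`dig`, `read_sidecar`, `find_pairs`, `poll_once`) are hypothetical, and treating `--json-key` as a dotted path such as `data.description` is an assumption based on the nested-JSON example in this README.

```python
import json
from pathlib import Path
from typing import Any, Iterator, List, Optional, Set, Tuple


def dig(data: Any, dotted_key: str) -> Optional[str]:
    """Follow a dotted key such as 'data.description' through nested dicts."""
    for part in dotted_key.split("."):
        if not isinstance(data, dict) or part not in data:
            return None
        data = data[part]
    return data if isinstance(data, str) else None


def read_sidecar(path: Path, json_key: str = "text") -> Optional[str]:
    """Read expected text from a plain-text or JSON sidecar file."""
    content = path.read_text(encoding="utf-8")
    if path.suffix.lower() == ".json":
        return dig(json.loads(content), json_key)
    return content.strip()


def find_pairs(watch_dir, image_globs=("*.png", "*.jpg", "*.jpeg"),
               text_exts=(".txt", ".caption", ".json"),
               json_key: str = "text") -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, expected_text) for every image that has a sidecar."""
    for pattern in image_globs:
        for image in sorted(Path(watch_dir).glob(pattern)):
            for ext in text_exts:
                sidecar = image.with_suffix(ext)
                if sidecar.is_file():
                    text = read_sidecar(sidecar, json_key)
                    if text is not None:
                        yield image, text
                        break  # first usable sidecar wins


def poll_once(watch_dir, seen: Set[Path], **kwargs) -> List[Tuple[Path, str]]:
    """Return pairs whose image has not been reported before, and remember them."""
    fresh = [(img, txt) for img, txt in find_pairs(watch_dir, **kwargs) if img not in seen]
    seen.update(img for img, _ in fresh)
    return fresh
```

A watcher would call `poll_once` in a loop, sleeping for the polling interval between calls and validating each new pair as it appears.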

    python ocr_validation.py -i screenshots/output1.png -t "Hello, world!"

Output:

    [RESULT] Similarity Metrics (0-100):
    - char_ratio: 92
    - partial_ratio: 94
    - token_sort_ratio: 88
    - token_set_ratio: 89
    ✅ Text & Image look consistent.

    python ocr_validation.py -i sample.png -t "Nike Air Shoes"

Output:

    [RESULT] Similarity Metrics (0-100):
    - char_ratio: 65
    ❌ Potential mismatch detected!
    [DIFFERENCES]
    - REPLACE | Expected: 'nike' | Found: 'nikee'
    - DELETE | Expected: 'air' | Found: ''
    [SUGGESTIONS]
    - 'air' → 'ar' (score: 80)

Monitor a folder with image/text pairs:

    python ocr_validation.py --watch --watch-dir outputs --fail-on-mismatch

Expected structure:

    outputs/
    ├── img1.png
    ├── img1.txt
    ├── img2.png
    └── img2.json

    OCR-Validator/
    │
    ├── ocr_validation.py   # Main validation script
    ├── README.md           # Documentation
    └── tests/              # Test files (optional)

- Clean images: Use `--no-preprocess` if your images are already optimized
- Poor quality: Keep preprocessing enabled for scanned or low-quality images
- Strict matching: Use threshold ≥ 90 for critical applications
- Fuzzy matching: Use threshold 70-80 for more lenient validation
- Very loose: Use threshold < 70 for experimental setups
For nested JSON structures, use `--json-key` to specify the field:

    python ocr_validation.py -i image.png --text-exts .json --json-key data.description

Use watch mode with logging for unattended batch validation:

    python ocr_validation.py --watch --watch-dir batch_folder > validation.log 2>&1

Contributions are welcome! Here are some areas for improvement:
- Add unit tests for normalization and fuzzy matching
- Support additional OCR engines (EasyOCR, PaddleOCR)
- Export results in CSV/JSON/HTML formats
- Multi-language support and custom dictionaries
- GUI interface for easier operation
- Parallel processing for batch operations
To contribute:
- Fork the repository
- Create a feature branch
- Submit a pull request with tests
This project is open source. Check the repository for license details.
Issue: TesseractNotFoundError
- Solution: Ensure Tesseract is installed and in your PATH
Issue: Low accuracy scores
- Solution: Try adjusting `--psm` values (3 for fully automatic, 6 for uniform block)
Issue: Slow processing
- Solution: Use `--no-preprocess` or increase `--interval` in watch mode
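
For the `TesseractNotFoundError` case above, a quick way to confirm whether the engine is reachable is to look it up on the PATH before running the validator. This is a small standalone sketch using the standard library only; `locate_tesseract` is a hypothetical helper and not part of the script.

```python
import shutil
from typing import Optional


def locate_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = locate_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it or set "
              "pytesseract.pytesseract.tesseract_cmd to the executable's full path")
    else:
        print(f"tesseract found at {path}")
```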
For issues, questions, or feature requests, please visit the GitHub repository.
**Made with ❤️ for the OCR community**