Multimodal-OCR is an experimental, high-performance visual reasoning and optical character recognition suite designed to accurately extract text, analyze visual content, and parse complex document structures. Built upon a diverse ecosystem of cutting-edge vision-language models—including architectures based on Qwen2.5-VL, Qwen2-VL, and Cohere's Aya-Vision—this application excels at handling dense documents, multilingual texts, and real-world scene imagery. The suite features a highly customized, interactive web interface that enables users to effortlessly upload screenshots, receipts, and pages for rapid analysis. With built-in support for fully GPU-accelerated inference via Flash Attention 2 and granular manipulation of text generation parameters, Multimodal-OCR provides researchers and developers with a powerful, streamlined environment for testing and deploying robust multimodal AI workflows.
- Multi-Model Architecture: Seamlessly switch between specialized vision-language models directly from the interface. Supported models include
Nanonets-OCR2-3B,olmOCR-7B-0725,RolmOCR-7B,Aya-Vision-8B, andQwen2-VL-OCR-2B. - Custom User Interface: Features a bespoke, responsive Gradio frontend built with custom HTML, CSS, and JavaScript. It includes a drag-and-drop media zone, real-time output streaming, and an integrated advanced settings panel.
- Granular Inference Controls: Fine-tune the AI's output by adjusting text generation parameters such as Maximum New Tokens, Temperature, Top-p, Top-k, and Repetition Penalty.
- Output Management: Built-in actions allow users to instantly copy the raw output text to their clipboard or save the generated response directly as a
.txtfile. - Flash Attention 2 Integration: Utilizes
kernels-community/flash-attn2for optimized, memory-efficient inference on compatible GPUs.
├── examples/ │ ├── 1.jpg │ ├── 2.jpg │ ├── 3.jpg │ ├── 4.jpg │ └── 5.jpg ├── app.py ├── LICENSE ├── pre-requirements.txt ├── README.md └── requirements.txt To run Multimodal-OCR locally, you need to configure a Python environment with the following dependencies. Ensure you have a compatible CUDA-enabled GPU for optimal performance.
1. Install Pre-requirements Run the following command to update pip to the required version:
pip install pip>=23.0.02. Install Core Requirements Install the necessary machine learning and UI libraries. You can place these in a requirements.txt file and run pip install -r requirements.txt.
git+https://github.com/huggingface/transformers.git@v4.57.6 git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git transformers-stream-generator huggingface_hub qwen-vl-utils sentencepiece opencv-python torch==2.8.0 torchvision matplotlib requests kernels hf_xet spaces pillow gradio av Once your environment is set up and the dependencies are installed, you can launch the application by running the main Python script:
python app.pyAfter the script initializes the interface, it will provide a local web address (usually http://127.0.0.1:7860/) which you can open in your browser to interact with the models. Note that the selected models will be downloaded and loaded into VRAM upon their first invocation.
- License: Apache License - Version 2.0
- GitHub Repository: https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR.git