Add OpenAI-Compatible Vision Model OCR Service with Streaming Support#1146

Open
TioSisai wants to merge 5 commits into pot-app:master from TioSisai:master

@TioSisai

Add OpenAI-Compatible Vision Model OCR Service with Streaming Support

Summary

This PR introduces a new OCR service that supports any OpenAI-compatible vision model API, along with comprehensive streaming capabilities and internationalization support.

Features Added

🔍 OpenAI-Compatible OCR Service

  • Universal API Support: Works with OpenAI GPT-4V, local deployments, and third-party compatible APIs
  • Flexible Configuration: Customizable base URL, API key, and model selection
  • Dual Response Modes: Support for both streaming and non-streaming responses
  • Advanced OCR Prompts: Comprehensive system prompt with LaTeX formula support and layout preservation
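
The configurable pieces above (base URL, API key, model, streaming flag) map onto a single chat-completions request. The following is a minimal sketch, not the PR's actual code: `OcrConfig`, `OcrRequest`, and `buildOcrRequest` are hypothetical names, and it assumes the endpoint follows the OpenAI chat completions format, with the image passed as a base64 data URL in an `image_url` content part.

```typescript
// Hypothetical configuration shape; field names are assumptions, not taken from the PR.
interface OcrConfig {
  baseUrl: string; // e.g. "https://api.openai.com/v1" or a local deployment
  apiKey: string;
  model: string;
  stream: boolean;
}

interface OcrRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a chat-completions request for one PNG image.
function buildOcrRequest(
  config: OcrConfig,
  base64Png: string,
  systemPrompt: string
): OcrRequest {
  // Tolerate a trailing slash in the user-supplied base URL.
  const baseUrl = config.baseUrl.replace(/\/+$/, "");
  const payload = {
    model: config.model,
    stream: config.stream,
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${base64Png}` },
          },
        ],
      },
    ],
  };
  return {
    url: `${baseUrl}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify(payload),
  };
}
```

Keeping request construction separate from the HTTP call lets the same builder serve both streaming and non-streaming modes; only the `stream` flag and the response handling differ.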

⚡ Streaming OCR Functionality

  • Real-time Text Display: Live OCR results with progressive text updates
  • Visual Feedback: Cursor indicator (_) shows active streaming state
  • Enhanced UX: Immediate feedback during OCR processing
  • Backward Compatibility: Seamless integration with existing non-streaming services
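
One way to model the cursor behaviour described above is a small state reducer. This is an illustrative sketch only: `OcrViewState`, `OcrEvent`, `applyOcrEvent`, and `displayText` are hypothetical names, and the assumption that each chunk event carries the cumulative text so far is mine, not something the PR states.

```typescript
// Hypothetical view state for the recognition text area.
interface OcrViewState {
  text: string;
  streaming: boolean;
}

// Assumption: "chunk" carries the cumulative text so far, "done" the final text.
type OcrEvent =
  | { type: "chunk"; text: string }
  | { type: "done"; text: string };

function applyOcrEvent(state: OcrViewState, event: OcrEvent): OcrViewState {
  switch (event.type) {
    case "chunk":
      return { text: event.text, streaming: true };
    case "done":
      return { text: event.text, streaming: false };
  }
}

// Appends the "_" cursor only while a stream is active.
function displayText(state: OcrViewState): string {
  return state.streaming ? state.text + "_" : state.text;
}
```

Under this model a non-streaming service simply emits a single "done" event, so it never shows the cursor and needs no special handling.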

🌍 Comprehensive Internationalization

  • Multi-language Support: Service names and UI elements translated across all supported languages
  • Global Coverage: Includes Chinese (Simplified/Traditional), English, Japanese, Korean, French, Spanish, Russian, German, Arabic, Hindi, Portuguese, Italian, and Dutch
  • Consistent Naming: Unified service identification across all locales

🎨 UI/UX Improvements

  • Modern Switch Component: Replaced dropdown with intuitive toggle for streaming control
  • Boolean Configuration: Improved type safety with proper boolean handling
  • Responsive Design: Clean and modern configuration interface

Technical Implementation

Core Components

  • src/services/recognize/openai_compatible/index.jsx: Main OCR service implementation
  • src/services/recognize/openai_compatible/Config.jsx: Configuration UI with Switch component
  • src/services/recognize/openai_compatible/info.ts: Service metadata and language definitions
  • src/window/Recognize/TextArea/index.jsx: Enhanced with streaming support

Key Features

  • SSE Streaming: Proper Server-Sent Events handling for real-time responses
  • Error Handling: Robust error management for network and parsing issues
  • Function Compatibility: Fixed parameter detection for service registration
  • Type Safety: Improved configuration with proper boolean types
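
The SSE handling and parse-error tolerance listed above can be sketched as follows. This is not the PR's implementation: `SseTextAccumulator` is a hypothetical name, and the code assumes the server emits OpenAI-style `data: {...}` lines whose text lives at `choices[0].delta.content`, terminated by `data: [DONE]`.

```typescript
// Accumulates text from an OpenAI-style SSE stream delivered in arbitrary chunks.
class SseTextAccumulator {
  private buffer = ""; // holds a trailing partial line between chunks
  text = "";
  done = false;

  // Feed one network chunk; returns the full text accumulated so far.
  push(chunk: string): string {
    this.buffer += chunk;
    const lines = this.buffer.split(/\r?\n/);
    // The last element may be an incomplete line; keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (this.done) break;
      if (!line.startsWith("data:")) continue; // skip blank lines and comments
      const data = line.slice(5).trim();
      if (data === "[DONE]") {
        this.done = true;
        break;
      }
      try {
        const event = JSON.parse(data);
        const delta = event?.choices?.[0]?.delta?.content;
        if (typeof delta === "string") this.text += delta;
      } catch {
        // Malformed JSON in one event should not abort the whole stream.
      }
    }
    return this.text;
  }
}
```

Buffering the trailing partial line matters because a network chunk can end in the middle of a JSON payload; parsing each chunk in isolation would drop or corrupt that event.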

Testing

  • ✅ Streaming OCR functionality verified
  • ✅ Non-streaming mode compatibility confirmed
  • ✅ Multi-language UI tested
  • ✅ Configuration persistence validated
  • ✅ Error handling scenarios covered

Breaking Changes

None. This is a purely additive feature that maintains full backward compatibility.

Dependencies

No new dependencies are added. The service uses the existing Tauri HTTP client and React ecosystem.

Documentation

The service appears automatically in the OCR service selection, with localized names and configuration options.


This enhancement significantly expands POT's OCR capabilities by enabling integration with the growing ecosystem of OpenAI-compatible vision models, while providing a modern streaming experience for users.

TioSisai added 5 commits July 14, 2025 17:09
  • Implement real-time text streaming display in OCR recognition window
  • Add streaming state management with cursor indicator (_)
  • Support progressive text updates during OCR processing
  • Enhance user experience with live OCR results feedback
  • Maintain backward compatibility with non-streaming OCR services

This enables users to see OCR results as they are being processed, providing immediate feedback and improved responsiveness.

  • Implement new OCR service supporting any OpenAI-compatible vision API
  • Add streaming and non-streaming response modes
  • Support custom base URL, API key, and model configuration
  • Include comprehensive system prompt for precise OCR with LaTeX support
  • Add Switch component for modern streaming toggle UI
  • Fix function parameter detection for proper service compatibility

This allows integration with various vision models (OpenAI GPT-4V, local deployments, third-party compatible APIs) for OCR functionality.

  • Add service name translations across all supported languages
  • Include localized strings for Chinese (Simplified/Traditional)
  • Add translations for English, Japanese, Korean, French, Spanish, Russian
  • Support German, Arabic, Hindi, Portuguese, Italian, and Dutch languages
  • Ensure consistent service naming across all locale files

This provides native language support for OpenAI-compatible OCR service in all application-supported languages, improving accessibility for international users.
@TioSisai
Author

resolves #977
