Automated financial anomaly detection for Paperless-ngx. Validates bank statements, invoices, and financial documents for arithmetic inconsistencies, formatting issues, and suspicious patterns. Features a web dashboard for monitoring and optional LLM enhancement for advanced analysis.
- 🧮 Balance Validation: Automatic verification of bank statement arithmetic
- 📐 Layout Analysis: Detects formatting irregularities and structural issues
- 🔍 Pattern Detection: Identifies duplicates, reversed columns, truncated totals
- 🤖 Optional LLM Enhancement: Claude/GPT integration for advanced reasoning
- 📊 Web Dashboard: Real-time monitoring with filters and statistics
- 🔄 Auto-Processing: Background polling for new documents
- 🏷️ Smart Tagging: Automatically adds anomaly tags to Paperless
- 📈 Custom Fields: Writes detection results to Paperless custom fields
- Features
- Architecture
- Use Cases
- Quick Start
- Configuration
- How It Works
- Web Dashboard
- API Reference
- Integration
- Troubleshooting
- Development
- FAQ
- License
Validates financial document math automatically:
- Balance Verification: Beginning Balance + Credits - Debits = Ending Balance
- Running Totals: Validates line-by-line balance progression
- Page Totals: Verifies subtotals match sum of transactions
- Configurable Tolerance: Customize acceptable variance (default: $0.01)
Tags Generated: anomaly:balance_mismatch

Custom Fields: balance_check_status, balance_diff_amount
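The balance rule above can be illustrated with a short sketch. This is not the project's actual implementation: the function name `check_balance` and its argument shapes are assumptions made for illustration, and only the formula, the default tolerance of 0.01, and the PASS / FAIL status values come from this README.

```python
from decimal import Decimal


def check_balance(beginning, credits, debits, ending, tolerance=Decimal("0.01")):
    """Hypothetical helper: test Beginning + Credits - Debits = Ending.

    Returns (status, diff), where diff is the absolute mismatch.
    """
    expected = beginning + sum(credits, Decimal("0")) - sum(debits, Decimal("0"))
    diff = abs(expected - ending)
    status = "PASS" if diff <= tolerance else "FAIL"
    return status, diff


# Example statement: 1000.00 + 350.00 in credits - 75.50 in debits
status, diff = check_balance(
    beginning=Decimal("1000.00"),
    credits=[Decimal("250.00"), Decimal("100.00")],
    debits=[Decimal("75.50")],
    ending=Decimal("1274.50"),
)
print(status, diff)
```

Using `Decimal` rather than `float` avoids binary rounding noise that could otherwise push a correct statement past a one-cent tolerance.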
Identifies formatting and structural issues:
- Column Alignment: Detects misaligned data columns
- Font Consistency: Identifies suspicious font variations
- Spacing Anomalies: Finds unusual spacing patterns
- Page Structure: Validates consistent page layouts
- Score-Based: Produces 0-1 layout quality score
Tags Generated: anomaly:layout_irregularity

Custom Fields: layout_score
Regex-based detection for common issues:
- Reversed Columns: Debits and credits swapped
- Duplicate Transactions: Repeated lines (copy/paste errors)
- Truncated Totals: Missing or incomplete totals
- Page Numbering Issues: Out of order or missing pages
- Date Sequence Problems: Non-chronological transactions
Tags Generated: anomaly:duplicate_lines, anomaly:reversed_columns, anomaly:truncated_total
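As a rough illustration of regex-based duplicate detection, the sketch below counts transaction lines that repeat exactly. The line shape (`MM/DD description amount`), the pattern `TRANSACTION_RE`, and the function name are assumptions for this example; the real rules live in app/detector.py and may differ.

```python
import re
from collections import Counter

# Assumed line shape for illustration: "MM/DD description amount"
TRANSACTION_RE = re.compile(r"^\s*(\d{2}/\d{2})\s+(.+?)\s+(-?\d[\d,]*\.\d{2})\s*$")


def find_duplicate_lines(ocr_text):
    """Return (date, description, amount) keys that occur more than once."""
    counts = Counter()
    for line in ocr_text.splitlines():
        match = TRANSACTION_RE.match(line)
        if match:
            date, description, amount = match.groups()
            # Normalize whitespace and case so OCR spacing noise does not hide repeats
            key = (date, " ".join(description.split()).lower(), amount)
            counts[key] += 1
    return [key for key, n in counts.items() if n > 1]


sample = """01/03 COFFEE SHOP 4.50
01/03 COFFEE SHOP 4.50
01/04 RENT PAYMENT 1,200.00
Ending balance 500.00"""
print(find_duplicate_lines(sample))
```

Legitimate repeated purchases on the same day would also be flagged by this approach, which is one reason such findings are surfaced as tags for review rather than treated as errors.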
Advanced analysis using Claude or GPT:
- Narrative Summaries: Human-readable anomaly explanations
- Context-Aware Analysis: Considers document type and patterns
- Confidence Scoring: Provides confidence levels for findings
- Evidence-Based: Only analyzes extracted data, never invents
Requirements: An Anthropic or OpenAI API key, supplied through LLM_PROVIDER and LLM_API_KEY (see Configuration)
Real-time monitoring interface:
- Document List: View all processed documents with anomaly indicators
- Filters: By anomaly type, date range, amount threshold
- Statistics: Overall detection rates and trends
- Quick Links: Direct links to Paperless documents
- Search: Find specific documents by ID or content
Access: http://localhost:8050
Automated polling system:
- Configurable Interval: Default 5 minutes, customize as needed
- State Persistence: Remembers last processed document
- Graceful Shutdown: Finishes current document before stopping
- Error Handling: Continues processing after transient failures
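The polling behavior described above (state persistence and continuing after failures) can be sketched as follows. This is a simplified stand-in, not the service's code: the state file layout, the function names, and the in-memory `documents` list replacing real Paperless API calls are all assumptions. It also assumes `modified` timestamps are ISO 8601 strings in a single timezone, so plain string comparison orders them correctly.

```python
import json
import tempfile
from pathlib import Path


def load_last_seen(state_path):
    """Return the saved 'modified' timestamp, or '' on first run."""
    if state_path.exists():
        return json.loads(state_path.read_text())["last_seen"]
    return ""


def save_last_seen(state_path, last_seen):
    state_path.write_text(json.dumps({"last_seen": last_seen}))


def poll_once(documents, state_path, process):
    """Process documents modified after the saved marker, oldest first."""
    last_seen = load_last_seen(state_path)
    new_docs = sorted(
        (doc for doc in documents if doc["modified"] > last_seen),
        key=lambda doc: doc["modified"],
    )
    for doc in new_docs:
        try:
            process(doc)
        except Exception as exc:
            # Log and move on so one bad document does not stall the cycle
            print(f"document {doc['id']} failed, continuing: {exc}")
        # Persist after every document so a restart resumes from here
        save_last_seen(state_path, doc["modified"])
    return len(new_docs)


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.json"
    docs = [
        {"id": 2, "modified": "2024-01-02T00:00:00Z"},
        {"id": 1, "modified": "2024-01-01T00:00:00Z"},
    ]
    handled = []
    poll_once(docs, state_file, lambda doc: handled.append(doc["id"]))
    poll_once(docs, state_file, lambda doc: handled.append(doc["id"]))  # nothing new
    print(handled)
```

Note that this sketch advances the marker even when processing fails, so a failed document is not retried; a production loop might instead track failures separately.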
    ┌──────────────────┐
    │  Paperless-ngx   │
    │       API        │
    └────────┬─────────┘
             │ Poll for new documents
             ▼
    ┌──────────────────┐
    │     Anomaly      │
    │     Detector     │
    │                  │
    │ 1. Fetch OCR     │
    │ 2. Infer Type    │──► Bank Statement
    │ 3. Extract Data  │    Invoice
    │ 4. Validate      │    Receipt
    └────────┬─────────┘
             │
             ├──────────────────────────────┐
             ▼                              ▼
    ┌─────────────────┐          ┌─────────────────────┐
    │ Deterministic   │          │ Optional LLM        │
    │ Checks          │          │ Analysis            │
    │                 │          │                     │
    │ - Balance math  │          │ - Narrative         │
    │ - Layout score  │          │ - Context           │
    │ - Patterns      │          │ - Confidence        │
    └────────┬────────┘          └──────────┬──────────┘
             │                              │
             └──────────────┬───────────────┘
                            ▼
                 ┌────────────────────┐
                 │  Results Storage   │
                 │ (SQLite/Postgres)  │
                 └──────────┬─────────┘
                            ▼
                 ┌────────────────────┐
                 │ Write to Paperless │
                 │ - Tags             │
                 │ - Custom Fields    │
                 └────────────────────┘

- Scenario: Managing properties in litigation/receivership
- Benefit: Automatically flag suspicious bank statements and rent rolls
- Tags: Perfect for legal discovery and audit preparation
- Scenario: Reviewing monthly bank and credit card statements
- Benefit: Catch bank errors, duplicate charges, unauthorized transactions
- Scenario: Processing vendor invoices and customer payments
- Benefit: Detect duplicate invoices, math errors, fraudulent documents
- Scenario: Reviewing documents for tampering or manipulation
- Benefit: Layout irregularities often indicate modified PDFs
- Scenario: M&A document review, loan applications
- Benefit: Automated validation of financial statements
- Docker and Docker Compose
- Running Paperless-ngx instance (v1.10.0+)
- Paperless-ngx API token
Using the published image, add the service to your Docker Compose file:

    services:
      paperless-anomaly-detector:
        image: dblagbro/paperless-anomaly-detector:latest
        container_name: paperless-anomaly-detector
        restart: unless-stopped
        environment:
          PAPERLESS_API_BASE_URL: http://paperless-web:8000
          PAPERLESS_API_TOKEN: your_token_here
          POLLING_INTERVAL: 300
          BALANCE_TOLERANCE: 0.01
        volumes:
          - ./anomaly-detector/data:/app/data
        ports:
          - "8050:8050"

Or build the image from source:

    git clone https://github.com/dblagbro/paperless-anomaly-detector.git
    cd paperless-anomaly-detector
    docker build -t paperless-anomaly-detector .

- Create environment file:

      cp .env.example .env

- Edit .env with your settings:

      PAPERLESS_API_TOKEN=your_actual_token_here
      PAPERLESS_API_BASE_URL=http://paperless-web:8000

- Start the service:

      docker compose up -d

- Verify it's running:

      docker compose logs -f paperless-anomaly-detector

- Access the dashboard: http://localhost:8050

- Trigger initial scan (optional):

      curl -X POST http://localhost:8050/api/trigger-scan
| Variable | Default | Description |
|---|---|---|
| PAPERLESS_API_BASE_URL | http://paperless-web:8000 | Paperless API endpoint |
| PAPERLESS_API_TOKEN | (required) | API authentication token |
| POLLING_INTERVAL | 300 | Seconds between polling cycles |
| BALANCE_TOLERANCE | 0.01 | Dollar tolerance for balance checks |
| LAYOUT_VARIANCE_THRESHOLD | 0.3 | Layout score threshold (0-1) |
| LLM_PROVIDER | None | anthropic or openai |
| LLM_API_KEY | None | LLM API key (if enabled) |
| LLM_MODEL | (auto) | Override model name |
| BATCH_SIZE | 10 | Documents per polling batch |
| DATABASE_URL | sqlite:///data/anomalies.db | Database connection string |
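For readers extending the service, the table above maps naturally onto a small settings loader. The sketch below is an assumption about how such loading could look, not the project's actual code: the `Settings` class and `load_settings` function are hypothetical names, while the variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    api_base_url: str
    api_token: str
    polling_interval: int
    balance_tolerance: float
    layout_variance_threshold: float
    batch_size: int
    database_url: str


def load_settings(env=os.environ):
    """Read settings from a mapping, applying the documented defaults."""
    token = env.get("PAPERLESS_API_TOKEN")
    if not token:
        raise ValueError("PAPERLESS_API_TOKEN is required")
    return Settings(
        api_base_url=env.get("PAPERLESS_API_BASE_URL", "http://paperless-web:8000"),
        api_token=token,
        polling_interval=int(env.get("POLLING_INTERVAL", "300")),
        balance_tolerance=float(env.get("BALANCE_TOLERANCE", "0.01")),
        layout_variance_threshold=float(env.get("LAYOUT_VARIANCE_THRESHOLD", "0.3")),
        batch_size=int(env.get("BATCH_SIZE", "10")),
        database_url=env.get("DATABASE_URL", "sqlite:///data/anomalies.db"),
    )


# Passing a plain dict instead of os.environ keeps the example self-contained
settings = load_settings({"PAPERLESS_API_TOKEN": "example-token", "BATCH_SIZE": "5"})
print(settings.batch_size, settings.polling_interval)
```

Failing fast on a missing token surfaces misconfiguration at startup rather than on the first polling cycle.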
Add to your environment:
    environment:
      LLM_PROVIDER: anthropic
      LLM_API_KEY: sk-ant-api03-xxx

Or for OpenAI:

    environment:
      LLM_PROVIDER: openai
      LLM_API_KEY: sk-proj-xxx
      LLM_MODEL: gpt-4-turbo-preview

The detector automatically creates these custom fields in Paperless:
- balance_check_status (Text): PASS / FAIL / NOT_APPLICABLE
- balance_diff_amount (Number): Dollar amount of mismatch
- layout_score (Number): 0-1 quality score
These are created on first run. No manual setup needed.
- Polling Phase:
  - Queries Paperless API every POLLING_INTERVAL seconds
  - Fetches documents with modified > last_seen
  - Processes in batches of BATCH_SIZE
- Content Extraction:
  - Retrieves OCR text via Paperless API
  - Extracts document metadata (title, date, tags)
  - Identifies document type (bank statement, invoice, etc.)
- Type Inference:
  - Keyword matching for common document types
  - Pattern recognition in content
  - Falls back to generic analysis if unrecognized
- Anomaly Detection:
  - Balance Validation: Extracts beginning/ending balances, credits, debits
  - Layout Analysis: Computes structural consistency score
  - Pattern Matching: Applies regex rules for common issues
  - LLM Enhancement (optional): Sends findings for analysis
- Results Storage:
  - Saves to internal database (processed_documents, anomaly_logs)
  - Includes severity, description, amounts, timestamps
- Paperless Integration:
  - Adds tags: anomaly:detected, anomaly:balance_mismatch, etc.
  - Updates custom fields with results
  - Never modifies original documents
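The type inference step in the pipeline above (keyword matching with a generic fallback) might look roughly like the sketch below. The keyword lists, the `KEYWORDS` table, and the function name are illustrative assumptions; the project's actual keywords are not documented here.

```python
# Hypothetical keyword table; the real detector may use different terms
KEYWORDS = {
    "bank_statement": ("beginning balance", "ending balance", "account number"),
    "invoice": ("invoice", "amount due", "bill to"),
    "receipt": ("receipt", "change due", "cashier"),
}


def infer_document_type(ocr_text):
    """Pick the type whose keywords occur most often, else 'generic'."""
    text = ocr_text.lower()
    scores = {
        doc_type: sum(text.count(keyword) for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    return best_type if best_score > 0 else "generic"


print(infer_document_type("Beginning Balance 100.00 ... Ending Balance 200.00"))
print(infer_document_type("Meeting notes from Tuesday"))
```

Counting keyword hits instead of stopping at the first match makes the result less sensitive to a stray word, such as "invoice" appearing in a bank statement memo line.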
| Tag | Meaning |
|---|---|
| anomaly:detected | At least one anomaly found |
| anomaly:balance_mismatch | Arithmetic inconsistency detected |
| anomaly:layout_irregularity | Formatting/structure issues |
| anomaly:duplicate_lines | Repeated transaction entries |
| anomaly:truncated_total | Missing or incomplete totals |
| anomaly:reversed_columns | Debit/credit columns swapped |
| anomaly:page_numbering | Page order issues |
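The tag scheme in the table can be expressed as a tiny mapping from detected anomaly types to tag names. The helper below is a hypothetical illustration; only the `anomaly:` prefix and the tag names come from the table.

```python
TAG_PREFIX = "anomaly:"


def tags_for(anomaly_types):
    """Return the tags for a document: the umbrella tag plus one per type."""
    unique = sorted(set(anomaly_types))
    if not unique:
        return []  # clean documents receive no anomaly tags
    return [TAG_PREFIX + "detected"] + [TAG_PREFIX + name for name in unique]


print(tags_for(["duplicate_lines", "balance_mismatch", "duplicate_lines"]))
```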
Manual Tags (Recommended):
- property:<id> - Property identifier
- role:referee or role:receiver - Your capacity
- doc_type:bank_statement, doc_type:rent_roll - Document type
- period:YYYY-MM - Time period
- Document Cards: Visual cards for each processed document
- Anomaly Indicators: Red badges for detected issues
- Quick Stats: Total documents, anomalies found, success rate
- Filters: Type, date range, amount threshold
JSON response with:

    {
      "total_documents": 1234,
      "documents_with_anomalies": 45,
      "anomaly_rate": 3.6,
      "by_type": {
        "balance_mismatch": 20,
        "layout_irregularity": 15,
        "duplicate_lines": 10
      }
    }

Query parameters:

- anomaly_type: Filter by specific anomaly
- min_amount: Minimum balance discrepancy
- max_amount: Maximum balance discrepancy
- start_date: ISO format (2024-01-01)
- end_date: ISO format (2024-12-31)
- limit: Results per page (default: 50)
- offset: Pagination offset

Example:

    curl "http://localhost:8050/api/documents?anomaly_type=balance_mismatch&min_amount=100"

Health check endpoint.

Response: {"status": "healthy"}

Overall statistics.

Response:

    {
      "total_documents": 1234,
      "documents_with_anomalies": 45,
      "anomaly_rate": 3.6,
      "by_type": {...}
    }

List processed documents.

Query Params: See Document List section

Response:

    {
      "documents": [...],
      "total": 1234,
      "limit": 50,
      "offset": 0
    }

List all anomaly logs.

Query Params: document_id, severity, resolved

Response:

    {
      "anomalies": [
        {
          "id": 123,
          "document_id": 456,
          "anomaly_type": "balance_mismatch",
          "severity": "high",
          "description": "Beginning + Credits - Debits != Ending",
          "amount": 150.00,
          "detected_at": "2024-01-15T10:30:00Z"
        }
      ]
    }

Manually trigger document polling.

Response: {"status": "scan_initiated"}
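For scripted access, the document list endpoint shown in the curl example can also be queried from Python using only the standard library. This is a minimal client sketch: the helper names are hypothetical, and the only endpoint details relied on are the `/api/documents` path and the query parameters listed in this README.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_documents_url(base_url="http://localhost:8050", **filters):
    """Build the /api/documents URL, skipping filters set to None."""
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)
    url = f"{base_url.rstrip('/')}/api/documents"
    return f"{url}?{query}" if query else url


def fetch_documents(base_url="http://localhost:8050", **filters):
    """Fetch one page of documents; requires the service to be running."""
    with urlopen(build_documents_url(base_url, **filters), timeout=10) as response:
        return json.load(response)


# Building the URL needs no network access
print(build_documents_url(anomaly_type="balance_mismatch", min_amount=100))
```

Calling `fetch_documents(anomaly_type="balance_mismatch", limit=50, offset=50)` would then request the second page of results, assuming the pagination parameters behave as described above.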
Add to your nginx.conf:
    location /paperless-anomaly-detector/ {
        proxy_pass http://localhost:8050/;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # URL rewriting for subpath
        sub_filter_once off;
        sub_filter 'href="/' 'href="/paperless-anomaly-detector/';
        sub_filter 'src="/' 'src="/paperless-anomaly-detector/';
        sub_filter 'action="/' 'action="/paperless-anomaly-detector/';
    }

Then access at: https://yourdomain.com/paperless-anomaly-detector/
Create saved searches in Paperless:
- High-Priority Anomalies: tags:anomaly:balance_mismatch AND balance_diff_amount:>100
- Recent Anomalies: tags:anomaly:detected AND created:[now-7d TO now]
- Unresolved Issues: tags:anomaly:detected AND NOT tags:reviewed
Symptoms: Dashboard shows 0 documents
Solutions:
- Verify API connectivity (the URL is quoted so the shell does not interpret the ?):

      docker exec paperless-anomaly-detector curl -H "Authorization: Token YOUR_TOKEN" \
        "http://paperless-web:8000/api/documents/?page_size=1"

- Check logs for errors:

      docker compose logs -f paperless-anomaly-detector

- Manually trigger scan:

      curl -X POST http://localhost:8050/api/trigger-scan

- Check API token permissions in Paperless
Symptoms: Too many anomalies detected
Solutions:
- Increase BALANCE_TOLERANCE:

      environment:
        BALANCE_TOLERANCE: 0.05  # $0.05 instead of $0.01

- Adjust LAYOUT_VARIANCE_THRESHOLD:

      environment:
        LAYOUT_VARIANCE_THRESHOLD: 0.5  # More lenient

- Review pattern detection rules in app/detector.py
- Use LLM enhancement for better context understanding
Symptoms: Slow processing, high CPU usage
Solutions:
- Reduce batch size:

      environment:
        BATCH_SIZE: 5

- Increase polling interval:

      environment:
        POLLING_INTERVAL: 600  # 10 minutes

- Disable LLM if not needed:

      environment:
        LLM_PROVIDER: ""

- Use PostgreSQL instead of SQLite:

      environment:
        DATABASE_URL: postgresql://user:pass@postgres:5432/anomalies
Symptoms: No LLM-enhanced analysis, errors in logs
Solutions:
- Verify API key:

      docker exec paperless-anomaly-detector printenv | grep LLM

- Test API key manually:

      curl -H "x-api-key: YOUR_KEY" https://api.anthropic.com/v1/messages

- Check rate limits in provider dashboard
- Ensure LLM_PROVIDER is set correctly
    # Install dependencies
    pip install -r requirements.txt

    # Run locally
    cd app
    python main.py

    # Install test dependencies
    pip install pytest pytest-cov

    # Run tests
    pytest tests/

    # With coverage
    pytest --cov=app tests/

- Edit app/detector.py
- Add method to AnomalyDetector class:

      def detect_my_anomaly(self, content, metadata):
          """Detect my custom anomaly."""
          findings = []
          # Your logic here
          return findings

- Update detect_all_anomalies() to call your method
- Add corresponding tag handling
- Test thoroughly
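As a worked example of the steps above, here is a sketch of a custom check for non-chronological transaction dates. It follows the `(self, content, metadata)` signature shown in the template, but everything else is an assumption: the `DateSequenceMixin` class name, the `MM/DD/YYYY` line prefix, and the shape of each finding (keys borrowed from the anomaly_logs columns) are illustrative and may not match what `AnomalyDetector` expects.

```python
import re
from datetime import datetime

# Assumed format: each transaction line starts with MM/DD/YYYY
DATE_RE = re.compile(r"^\s*(\d{2}/\d{2}/\d{4})\b")


class DateSequenceMixin:
    """Hypothetical holder for a method you would add to AnomalyDetector."""

    def detect_date_sequence(self, content, metadata):
        """Flag lines whose date is earlier than the previous dated line."""
        findings = []
        previous = None
        for line_number, line in enumerate(content.splitlines(), start=1):
            match = DATE_RE.match(line)
            if not match:
                continue
            try:
                current = datetime.strptime(match.group(1), "%m/%d/%Y")
            except ValueError:
                continue  # OCR produced an impossible date such as 13/45/2024
            if previous is not None and current < previous:
                findings.append({
                    "anomaly_type": "date_sequence",
                    "severity": "low",
                    "description": f"Line {line_number}: date goes backwards",
                })
            previous = current
        return findings


content = "01/05/2024 A 1.00\n01/03/2024 B 2.00\n01/06/2024 C 3.00"
print(DateSequenceMixin().detect_date_sequence(content, {}))
```

Statements that list transactions newest-first would trip this check on every line, so a real rule would likely need to detect the sort direction before flagging anything.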
processed_documents:

    CREATE TABLE processed_documents (
        id INTEGER PRIMARY KEY,
        paperless_doc_id INTEGER UNIQUE,
        title TEXT,
        processed_at TIMESTAMP,
        has_anomalies BOOLEAN,
        balance_status TEXT,
        balance_diff REAL,
        layout_score REAL
    );

anomaly_logs:

    CREATE TABLE anomaly_logs (
        id INTEGER PRIMARY KEY,
        document_id INTEGER REFERENCES processed_documents(id),
        anomaly_type TEXT,
        severity TEXT,
        description TEXT,
        amount REAL,
        detected_at TIMESTAMP,
        resolved BOOLEAN DEFAULT 0
    );

- Memory: 200-500MB depending on document volume
- CPU: Low during polling, spikes during processing
- Disk: SQLite database grows ~10KB per document
Typical processing times (Intel i7, 16GB RAM):
| Document Type | Pages | Processing Time |
|---|---|---|
| Bank Statement | 2 | 2-4 seconds |
| Invoice | 1 | 1-2 seconds |
| Credit Card | 5 | 5-8 seconds |
With LLM enabled, add 1-3 seconds per document
For high-volume deployments:
- Use PostgreSQL instead of SQLite
- Increase BATCH_SIZE for better throughput
- Run multiple instances with partitioned document sets
- Consider async processing with message queue
- API Token: Never logged or exposed in responses. Store in environment variable.
- Database: SQLite by default. Use PostgreSQL with encrypted connections for production.
- HTTPS: Always use NGINX reverse proxy with TLS in production.
- Access Control: Add HTTP basic auth via NGINX for additional security.
- Read-Only: Service only reads documents and writes tags/fields. Never modifies originals.
- Audit Trail: All actions logged with timestamps in application logs.
Q: Can I reprocess documents?
A: Yes. With the default SQLite database, delete the database file and restart the container: docker exec paperless-anomaly-detector rm /app/data/anomalies.db, then docker compose restart paperless-anomaly-detector.

Q: Does this work with scanned documents?
A: Yes, as long as Paperless has performed OCR. Results depend on scan and OCR quality.

Q: Can I customize which anomalies are detected?
A: Yes, edit app/detector.py to add or remove detection rules.

Q: What document types are supported?
A: Bank statements, credit card statements, invoices, and receipts. Other types can be added in app/detector.py.

Q: How accurate is the balance validation?
A: The check is deterministic arithmetic on the extracted figures, so accuracy depends on OCR quality and statement formatting. Adjust BALANCE_TOLERANCE to absorb rounding differences.

Q: Can I use this without LLM?
A: Yes. The deterministic checks work without an LLM, which is an optional enhancement.

Q: Does this modify my documents?
A: No, it only adds tags and custom fields. Original PDFs are never modified.

Q: Can I run this on multiple Paperless instances?
A: Yes. Run a separate container per instance, each with its own PAPERLESS_API_BASE_URL and PAPERLESS_API_TOKEN values.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a Pull Request
MIT License - see LICENSE file for details.
- Built for property management and financial auditing use cases
- Integrates with Paperless-ngx
- Optional LLM support via Anthropic Claude or OpenAI
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See CHANGELOG.md for version history
Perfect for property managers, accountants, auditors, and anyone who needs automated financial document validation.