Multi-tenant document extraction platform. Submit any document + a JSON schema describing what to extract → get back structured JSON with per-field confidence scores.
```bash
OPENAI_API_KEY=sk-... docker compose up --build
```

Then open http://localhost:8081 — the web UI loads with a dev API key pre-filled. Upload a document, pick a schema preset, and extract.
The stack (api, worker, postgres, redis) starts automatically. Migrations run on boot. A dev tenant is seeded with the key `di_devkey_changeme_in_production` (override via the `DEV_API_KEY` env var).
```bash
# 1. Start infrastructure
docker compose up -d postgres redis

# 2. Set environment
cp .env.example .env   # Edit .env — add your OPENAI_API_KEY
set -a && source .env && set +a

# 3. Run migrations + create dev tenant
make migrate   # requires psql installed locally
make seed      # prints your API key — save it

# 4. Start API and worker (separate terminals)
make run-api
make run-worker
```

The API server serves a frontend at `/`. In dev mode the API key is auto-filled. Steps:
- Upload a PDF, DOCX, or image (max 50 MB)
- Define a JSON Schema — or pick a preset (Invoice, Resume, Contract, Receipt, ID)
- Click Extract — the UI polls for the result and displays each field with a confidence score
A sample document is included at testdata/sample-invoice.docx.
```bash
curl -X POST http://localhost:8081/v1/extract \
  -H "Authorization: Bearer di_your_key_here" \
  -F "document=@invoice.pdf" \
  -F 'schema={
    "type": "object",
    "properties": {
      "vendor": {"type": "string"},
      "invoice_number": {"type": "string"},
      "total": {"type": "number"},
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "amount": {"type": "number"}
          }
        }
      }
    },
    "required": ["vendor", "total"]
  }'
# Response:
# {"job_id": "abc-123", "status": "pending", "poll_url": "/v1/jobs/abc-123"}
```

Poll for the result:

```bash
curl http://localhost:8081/v1/jobs/abc-123 \
  -H "Authorization: Bearer di_your_key_here"
```

List jobs:

```bash
curl "http://localhost:8081/v1/jobs?limit=20&offset=0" \
  -H "Authorization: Bearer di_your_key_here"
```

Default limit is 20, max is 100.
Register a URL to receive a POST when a job completes. The secret is generated server-side and shown once — store it to verify signatures.
```bash
# Register
curl -X POST http://localhost:8081/v1/webhooks \
  -H "Authorization: Bearer di_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/webhook"}'
# Response includes the secret — save it:
# {"id": "...", "url": "...", "secret": "abc123...", "active": true}

# Delete
curl -X DELETE http://localhost:8081/v1/webhooks/{id} \
  -H "Authorization: Bearer di_your_key_here"
```

Each delivery is a POST with:
- `Content-Type: application/json` — body is the full job object
- `X-DocPulse-Signature: sha256=<hmac>` — HMAC-SHA256 of the body using your secret
Verify the signature on your server:
```python
import hmac, hashlib

def verify(secret: str, body: bytes, header: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)
```

Failed deliveries are retried up to 5 times with exponential backoff.
```
Client → API (Go/chi) → PostgreSQL (job queue)
                            ↓
                       Worker Pool
                   ┌────────┼────────┐
                   │        │        │
                Ingest    Chunk   Extract
                   │        │        │
               PDF/OCR  Semantic  LLM Router
                DOCX    Boundary  (fast/strong)
                   │        │        │
                   └────────┼────────┘
                            ↓
            Result Assembly + Confidence Scoring
                            ↓
                 Job Complete / Webhook
```

Key decisions:
- Async-first: jobs never block HTTP connections
- `FOR UPDATE SKIP LOCKED`: safe concurrent job claiming without a separate queue
- Two-tier LLM routing: cheap model for simple schemas, strong model for complex ones + automatic escalation on validation failure
- Content-hash cache: SHA-256(document + schema) catches exact duplicates at zero cost
- Magic-byte format detection: more robust than trusting file extensions
- HMAC-signed webhooks: recipients can verify payload integrity
```
cmd/api/          — HTTP server entry point
cmd/worker/       — Job processor entry point
internal/
  api/            — HTTP handlers, routing, embedded frontend
  api/middleware/ — Auth, logging, rate limiting
  auth/           — API key generation and hashing
  config/         — Environment-based configuration
  database/       — PostgreSQL stores (jobs, tenants, webhooks)
  domain/         — Core types shared across packages
  extraction/     — Chunking engine
  ingestion/      — Format detection, text extraction (PDF/OCR/DOCX)
  jobs/           — Worker loop and job processing pipeline
  llm/            — Model routing and structured extraction
  storage/        — Object storage interface (local filesystem only)
  webhook/        — Webhook delivery with HMAC signing + retries
migrations/       — SQL schema (auto-applied on API startup)
testdata/         — Sample documents for testing
scripts/          — Dev utilities (seed tenant)
Dockerfile        — Multi-stage build: api and worker targets
```

Go 1.24 · PostgreSQL 16 · Redis 7 · OpenAI API · Docker · Fly.io
System dependencies (for text extraction):
- `poppler-utils` — `pdftotext` for native PDFs
- `tesseract-ocr` — OCR for scanned PDFs and images
- `pandoc` — DOCX to text conversion
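Before dispatching to one of these tools, the pipeline detects the format from magic bytes rather than the file extension. A sketch of that idea (the real `internal/ingestion` code may differ):

```go
package main

import (
	"bytes"
	"fmt"
)

// detectFormat sniffs well-known file signatures from the first few
// bytes: "%PDF" for PDF, "PK\x03\x04" for ZIP containers (DOCX),
// the 8-byte PNG signature, and FF D8 FF for JPEG.
func detectFormat(head []byte) string {
	switch {
	case bytes.HasPrefix(head, []byte("%PDF")):
		return "pdf"
	case bytes.HasPrefix(head, []byte("PK\x03\x04")):
		return "docx" // any ZIP container; DOCX needs a deeper check
	case bytes.HasPrefix(head, []byte("\x89PNG\r\n\x1a\n")):
		return "png"
	case bytes.HasPrefix(head, []byte{0xFF, 0xD8, 0xFF}):
		return "jpeg"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(detectFormat([]byte("%PDF-1.7 ..."))) // prints "pdf"
}
```

A renamed `.docx` that is really a PDF is still routed to `pdftotext`, which is why this is more robust than trusting extensions.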
- Storage: only local filesystem (`LocalStore`) is implemented. S3 support is stubbed but not built.
- Schema validation: validates structure (type=object, properties present, each property has a type), but does not implement the full JSON Schema specification.
- Job list pagination: `limit`/`offset` work and the response includes a `total` count, but there is no cursor-based pagination.
- Worker cache: Redis-backed with a configurable TTL (`WORKER_CACHE_TTL`, default 24h), but no LRU eviction beyond TTL.
- `make migrate`: runs `psql` directly — requires `psql` installed on your machine. When using Docker (`docker compose up`), migrations run automatically on API startup instead.
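The content-hash cache mentioned under "Key decisions" keys entries on SHA-256 over the document plus the schema. A sketch of that derivation; the key prefix and exact serialization are assumptions, not the worker's real format:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey hashes the document bytes together with the schema, so an
// identical (document, schema) pair hits the cache at zero LLM cost
// while any change to either produces a different key.
func cacheKey(document []byte, schema string) string {
	h := sha256.New()
	h.Write(document)
	h.Write([]byte(schema))
	return "extract:" + hex.EncodeToString(h.Sum(nil))
}

func main() {
	k := cacheKey([]byte("%PDF-1.7 ..."), `{"type":"object"}`)
	fmt.Println(len(k)) // "extract:" prefix + 64 hex chars
}
```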