
agxp/docpulse


DocPulse — Document Intelligence API

Multi-tenant document extraction platform. Submit any document plus a JSON Schema describing what to extract → get back structured JSON with per-field confidence scores.

Quickstart

Single command (Docker)

```bash
OPENAI_API_KEY=sk-... docker compose up --build
```

Then open http://localhost:8081 — the web UI loads with a dev API key pre-filled. Upload a document, pick a schema preset, and extract.

The stack (api, worker, postgres, redis) starts automatically. Migrations run on boot. A dev tenant is seeded with the key di_devkey_changeme_in_production (override via DEV_API_KEY env var).

Local development (without Docker for the Go services)

```bash
# 1. Start infrastructure
docker compose up -d postgres redis

# 2. Set environment
cp .env.example .env          # Edit .env — add your OPENAI_API_KEY
set -a && source .env && set +a

# 3. Run migrations + create dev tenant
make migrate                  # requires psql installed locally
make seed                     # prints your API key — save it

# 4. Start API and worker (separate terminals)
make run-api
make run-worker
```

Usage

Web UI

The API server serves a frontend at /. In dev mode the API key is auto-filled. Steps:

  1. Upload a PDF, DOCX, or image (max 50 MB)
  2. Define a JSON Schema — or pick a preset (Invoice, Resume, Contract, Receipt, ID)
  3. Click Extract — the UI polls for the result and displays each field with a confidence score

A sample document is included at testdata/sample-invoice.docx.

API

Submit an extraction job

```bash
curl -X POST http://localhost:8081/v1/extract \
  -H "Authorization: Bearer di_your_key_here" \
  -F "document=@invoice.pdf" \
  -F 'schema={
    "type": "object",
    "properties": {
      "vendor": {"type": "string"},
      "invoice_number": {"type": "string"},
      "total": {"type": "number"},
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "amount": {"type": "number"}
          }
        }
      }
    },
    "required": ["vendor", "total"]
  }'

# Response:
# {"job_id": "abc-123", "status": "pending", "poll_url": "/v1/jobs/abc-123"}
```

Poll for results

```bash
curl http://localhost:8081/v1/jobs/abc-123 \
  -H "Authorization: Bearer di_your_key_here"
```
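The submit-then-poll flow is easy to script. Below is a minimal polling sketch using only the Python standard library; the terminal statuses (`completed`, `failed`) are an assumption — the responses above only show `pending`.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8081"
TERMINAL = {"completed", "failed"}  # assumed terminal statuses


def is_terminal(status: str) -> bool:
    """True once a job will no longer change state."""
    return status in TERMINAL


def wait_for(poll_url: str, api_key: str, interval: float = 2.0) -> dict:
    """Poll the job's poll_url (e.g. /v1/jobs/abc-123) until it finishes."""
    while True:
        req = urllib.request.Request(
            BASE_URL + poll_url,
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if is_terminal(job.get("status", "")):
            return job
        time.sleep(interval)
```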

List jobs

curl "http://localhost:8081/v1/jobs?limit=20&offset=0" \ -H "Authorization: Bearer di_your_key_here"

Default limit is 20, max is 100.
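If you need every job, `limit`/`offset` can drive a simple paging loop. A sketch — the `fetch` callable is injected so the paging logic is HTTP-client-agnostic, and the response shape `{"jobs": [...], "total": N}` is an assumption (only the total count is documented above):

```python
from typing import Callable, Iterator


def iter_jobs(fetch: Callable[[int, int], dict], limit: int = 20) -> Iterator[dict]:
    """Yield every job by walking limit/offset pages until `total` is reached."""
    offset = 0
    while True:
        page = fetch(limit, offset)
        jobs = page.get("jobs", [])
        yield from jobs
        offset += len(jobs)
        if not jobs or offset >= page.get("total", 0):
            return
```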

Webhooks

Register a URL to receive a POST when a job completes. The secret is generated server-side and shown once — store it to verify signatures.

```bash
# Register
curl -X POST http://localhost:8081/v1/webhooks \
  -H "Authorization: Bearer di_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/webhook"}'

# Response includes the secret — save it:
# {"id": "...", "url": "...", "secret": "abc123...", "active": true}

# Delete
curl -X DELETE http://localhost:8081/v1/webhooks/{id} \
  -H "Authorization: Bearer di_your_key_here"
```

Each delivery is a POST with:

  • Content-Type: application/json — body is the full job object
  • X-DocPulse-Signature: sha256=<hmac> — HMAC-SHA256 of the body using your secret

Verify the signature on your server:

```python
import hmac, hashlib

def verify(secret: str, body: bytes, header: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)
```

Failed deliveries are retried up to 5 times with exponential backoff.
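A receiving endpoint can be as small as a standard-library HTTP handler that recomputes the signature before trusting the payload. A sketch — `WEBHOOK_SECRET` and the port are placeholders, not values DocPulse provides:

```python
import hashlib
import hmac
from http.server import BaseHTTPRequestHandler, HTTPServer

WEBHOOK_SECRET = "abc123..."  # placeholder — use the secret from registration


def signature_of(secret: str, body: bytes) -> str:
    """Compute the X-DocPulse-Signature value for a raw request body."""
    return "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        header = self.headers.get("X-DocPulse-Signature", "")
        if not hmac.compare_digest(signature_of(WEBHOOK_SECRET, body), header):
            self.send_response(401)  # signature mismatch — reject
            self.end_headers()
            return
        # Body is the full job object — process it here.
        self.send_response(200)
        self.end_headers()

# To serve: HTTPServer(("", 9000), WebhookHandler).serve_forever()
```

Always compare against the raw bytes you received — re-serializing the JSON before hashing will change the byte sequence and break verification.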

Architecture

```
Client → API (Go/chi) → PostgreSQL (job queue)
                             ↓
                        Worker Pool
                 ┌───────────┼───────────┐
                 │           │           │
              Ingest       Chunk      Extract
                 │           │           │
             PDF/OCR     Semantic    LLM Router
              DOCX       Boundary  (fast/strong)
                 │           │           │
                 └───────────┼───────────┘
                             ↓
            Result Assembly + Confidence Scoring
                             ↓
                  Job Complete / Webhook
```

Key decisions:

  • Async-first: jobs never block HTTP connections
  • FOR UPDATE SKIP LOCKED: safe concurrent job claiming without a separate queue
  • Two-tier LLM routing: cheap model for simple schemas, strong model for complex ones + automatic escalation on validation failure
  • Content-hash cache: SHA-256(document + schema) catches exact duplicates at zero cost
  • Magic-byte format detection: more robust than trusting file extensions
  • HMAC-signed webhooks: recipients can verify payload integrity

Project Structure

```
cmd/api/            — HTTP server entry point
cmd/worker/         — Job processor entry point
internal/
  api/              — HTTP handlers, routing, embedded frontend
  api/middleware/   — Auth, logging, rate limiting
  auth/             — API key generation and hashing
  config/           — Environment-based configuration
  database/         — PostgreSQL stores (jobs, tenants, webhooks)
  domain/           — Core types shared across packages
  extraction/       — Chunking engine
  ingestion/        — Format detection, text extraction (PDF/OCR/DOCX)
  jobs/             — Worker loop and job processing pipeline
  llm/              — Model routing and structured extraction
  storage/          — Object storage interface (local filesystem only)
  webhook/          — Webhook delivery with HMAC signing + retries
migrations/         — SQL schema (auto-applied on API startup)
testdata/           — Sample documents for testing
scripts/            — Dev utilities (seed tenant)
Dockerfile          — Multi-stage build: api and worker targets
```

Stack

Go 1.24 · PostgreSQL 16 · Redis 7 · OpenAI API · Docker · Fly.io

System dependencies (for text extraction):

  • poppler-utils — pdftotext for native PDFs
  • tesseract-ocr — OCR for scanned PDFs and images
  • pandoc — DOCX to text conversion
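Before running the worker locally, it's worth confirming these tools are on your PATH. A small illustrative helper (not part of the repo):

```python
import shutil


def missing_tools(tools=("pdftotext", "tesseract", "pandoc")):
    """Return the subset of `tools` not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]
```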

Known limitations

  • Storage: only local filesystem (LocalStore) is implemented. S3 support is stubbed but not built.
  • Schema validation: validates structure (type=object, properties present, each property has a type), but does not implement the full JSON Schema specification.
  • Job list pagination: limit/offset work and response includes a total count, but there is no cursor-based pagination.
  • Worker cache: Redis-backed with a configurable TTL (WORKER_CACHE_TTL, default 24h), but no LRU eviction beyond TTL.
  • make migrate: runs psql directly — requires psql installed on your machine. When using Docker (docker compose up), migrations run automatically on API startup instead.
