Time-Aware Synthetic Data Platform
Schema-aware synthetic data for databases, APIs, and pipelines. Realistic, relational, privacy-safe.
Quick Start • Demo • Scenarios • Workflows • Docs • Walkthrough • Contributing
Data Forge generates business-valid, cross-table, time-consistent, privacy-safe synthetic data—not hand-written fixtures, but test-ready data that respects schemas, foreign keys, business rules, and optional anomaly injection. Built for demos, UAT, integration testing, and pipeline development.
Open-source • Python backend • Next.js frontend • Local-first • Made for data engineers
| Typical approach | Data Forge |
|---|---|
| Hand-rolled CSV/JSON fixtures | Schema-driven generation — use your DDL, JSON Schema, or OpenAPI; FK and rules respected |
| One-off scripts per project | Reusable scenarios — save configs, run from wizard or Advanced, compare runs, version with scenarios |
| No pipeline realism | ETL modes — full snapshot, incremental, CDC; bronze/silver/gold; schema drift and messiness profiles |
| Guesswork on scale | Benchmark mode — scale presets, workload profiles, throughput and memory metrics |
| Isolated tables | Domain packs — 10+ packs (SaaS, e-commerce, fintech, healthcare, IoT, etc.) with event streams and benchmark relevance |
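To make the ETL-modes row concrete: CDC-style output is essentially a stream of per-row operations with an operation code and timestamp. The sketch below is a generic illustration of that idea, not Data Forge's actual record format — the `op`/`pk`/`ts` field names are assumptions:

```python
from datetime import datetime, timezone

def diff_to_cdc(before: dict, after: dict) -> list[dict]:
    """Diff two {pk: row} snapshots into CDC-style change records.

    Illustrative only: field names here are assumptions, not the
    engine's actual CDC schema.
    """
    ts = datetime.now(timezone.utc).isoformat()
    changes = []
    for pk, row in after.items():
        if pk not in before:
            changes.append({"op": "insert", "pk": pk, "ts": ts, "row": row})
        elif row != before[pk]:
            changes.append({"op": "update", "pk": pk, "ts": ts, "row": row})
    for pk in before:
        if pk not in after:
            changes.append({"op": "delete", "pk": pk, "ts": ts, "row": None})
    return changes

before = {1: {"status": "trial"}, 2: {"status": "active"}}
after = {1: {"status": "active"}, 3: {"status": "trial"}}
changes = diff_to_cdc(before, after)
print(sorted(c["op"] for c in changes))  # ['delete', 'insert', 'update']
```

A full-snapshot run would emit `after` as-is; an incremental run would emit only the new and changed rows.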
| Area | Capabilities |
|---|---|
| Schema | SQL DDL, JSON Schema, OpenAPI (table-like); YAML business rules; PK/FK resolution |
| Generation | Full snapshot, incremental, CDC; bronze/silver/gold; drift, messiness; anomaly injection |
| Export | CSV, JSON, JSONL, Parquet, SQL inserts |
| Load | SQLite, DuckDB, PostgreSQL, Snowflake, BigQuery |
| Integrations | dbt seeds, Great Expectations, Airflow DAGs, OpenAPI contract fixtures |
| Simulation | Event streams, time patterns (steady/burst/seasonal/growth), replay modes |
| Benchmark | Scale presets (small → xlarge), workload profiles, throughput/memory metrics |
| Quality | Schema validation, GE expectations, manifest reconciliation, privacy detection |
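The PK/FK resolution listed above is the property that makes cross-table data usable: child rows only reference keys that actually exist. A minimal sketch of the idea in generic Python (not Data Forge internals; table and column names are invented for illustration):

```python
import random

def generate_parents(n: int) -> list[dict]:
    # Parent rows with sequential primary keys.
    return [{"customer_id": i, "name": f"customer_{i}"} for i in range(1, n + 1)]

def generate_children(parents: list[dict], n: int, seed: int = 0) -> list[dict]:
    # Each order's customer_id is sampled from real parent PKs,
    # so referential integrity holds by construction.
    rng = random.Random(seed)
    pks = [p["customer_id"] for p in parents]
    return [{"order_id": i, "customer_id": rng.choice(pks)} for i in range(1, n + 1)]

customers = generate_parents(5)
orders = generate_children(customers, 20)
valid_pks = {c["customer_id"] for c in customers}
assert all(o["customer_id"] in valid_pks for o in orders)
```

Generating parents before children (in FK dependency order) is what lets every foreign key land on a real primary key.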
```
cd data-forge
uv sync
```

If `uv` is unavailable, use a standard virtual environment:

```
python -m venv .venv
.venv\Scripts\python -m pip install --upgrade pip
.venv\Scripts\python -m pip install -e ".[dev]"
```

Optional extras:

```
uv sync --extra warehouse   # or: .venv\Scripts\python -m pip install -e ".[warehouse]"
uv sync --extra streamlit   # or: .venv\Scripts\python -m pip install -e ".[streamlit]"
uv sync --extra all         # or: .venv\Scripts\python -m pip install -e ".[all]"
```

Generate sample data:

```
uv run data-forge generate --pack saas_billing --scale 1000 -o output -f parquet
uv run data-forge generate --pack ecommerce --scale 2000 -o output -f sql
```

Run the API and frontend:

```
# Terminal 1: API
uv run uvicorn data_forge.api.main:app --reload --port 8000

# Terminal 2: Frontend
cd frontend && npm run dev
```

Open http://localhost:3000. See docs/demo-walkthrough.md for a step-by-step walkthrough.
```
docker compose up --build
```

This starts:

- API on http://localhost:8000
- Frontend on http://localhost:3001

Optional local Postgres stack for adapter workflows:

```
docker compose --profile db up --build
```

One-command demo (no cloud credentials):

```
make demo-data   # or: ./scripts/run_demo.ps1 (Windows) / ./scripts/run_demo.sh (Linux/macOS)
```

Outputs go to demo_output/. Start the API and frontend to inspect runs and artifacts in the UI.
- Scenarios — Save configs from Advanced or the Create Wizard; load them in the wizard or Advanced; update or save-as-new; edit metadata on the scenario detail page. Import from `examples/scenarios/`.
- Runs — Every generation or benchmark creates a run (history, timeline, logs, artifacts). Compare two runs side-by-side (config, summary, benchmark, raw diff).
- Artifacts — Datasets, event streams, dbt seeds, GE suites, DAGs; filter by run or type.
- Generate a dataset — Wizard or Advanced → pick pack or custom schema (Schema Studio) → set scale/options → run. Output in Runs and Artifacts.
- Run simulation — Advanced → Pipeline Simulation → enable event streams, set density/pattern/replay → run.
- Benchmark warehouse load — Advanced → Benchmark → profile and scale preset → run. View throughput and duration on run detail.
- Compare runs — Runs → open a run → “Compare with another run”, or `/runs/compare`. Use the raw JSON diff for debugging.
- Save scenario — After configuring, “Save as scenario” (or “Update scenario” if loaded). Reuse from Scenarios or the wizard.
```
data-forge/
├── src/data_forge/        # Backend
│   ├── api/               # FastAPI app, routers, stores, middleware
│   ├── models/            # Schema, config, generation, manifest
│   ├── engine.py          # Core run_generation, export_result
│   ├── schema_ingest/     # SQL, JSON Schema, OpenAPI parsers
│   ├── rule_engine/       # YAML business rules
│   ├── generators/        # Primitives, distributions, FK resolution
│   ├── adapters/          # SQLite, DuckDB, Postgres, Snowflake, BigQuery
│   ├── exporters/         # CSV, JSON, Parquet, SQL
│   ├── domain_packs/      # Pre-built schemas and rules
│   ├── simulation/        # Event streams, time patterns
│   └── services/          # Run, scenario, lineage, retention, metrics
├── frontend/              # Next.js product UI (wizard, runs, Schema Studio)
├── tests/                 # Pytest backend tests
├── frontend/e2e/          # Playwright E2E tests
├── examples/scenarios/    # Example scenario JSONs
├── scripts/               # validate_all.*, run_demo.*
├── docs/                  # Architecture, API reference, testing, CI, security
└── .github/workflows/     # CI (backend, frontend, E2E)
```

- CLI — `data-forge generate`, `benchmark`, `validate`, `reconcile`, `packs`; full control from the shell.
- API — Start runs; list runs, artifacts, and scenarios; compare runs; preflight; benchmark. Local JSON persistence.
- UI — Wizard and Advanced config, run history and detail, scenario library, artifact browser, Schema Studio (custom schemas, validation, version diff), validation center, run comparison.
- CI — GitHub Actions: backend ruff, mypy, pytest; frontend type-check, tests, build; Playwright E2E. Local: `make validate-all`. See docs/ci-cd.md.
- API endpoints — `POST /api/runs/generate`, `GET /api/runs`, `GET /api/runs/compare`, `POST /api/benchmark`, `GET /api/scenarios`, `GET /api/artifacts`, `POST /api/preflight`.
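Against a locally running API, a run could be kicked off over HTTP with only the standard library. This is a hedged sketch: the endpoint path comes from the list above, but the payload fields (`pack`, `scale`) are assumptions — check the API reference for the real request shape:

```python
import json
from urllib import request

API = "http://localhost:8000"

def build_generate_request(pack: str, scale: int) -> request.Request:
    # Payload fields here are assumptions, not the documented schema.
    body = json.dumps({"pack": pack, "scale": scale}).encode()
    return request.Request(
        f"{API}/api/runs/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("saas_billing", 1000)
print(req.full_url, req.method)  # http://localhost:8000/api/runs/generate POST
# With the API running locally: urllib.request.urlopen(req) starts the run.
```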
- Simulation: Event streams (e.g. order lifecycle, payments), time patterns, replay modes. Packs: ecommerce, fintech, logistics, IoT, social, saas_billing.
- Benchmark: Scale presets (small ~10k → xlarge ~10M), workload profiles, throughput (rows/s), duration, memory estimates.
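The time patterns named above (steady/burst/seasonal/growth) amount to different ways of distributing event volume over a window. The shapes below are a generic illustration of that idea, not the engine's actual parameters:

```python
import math

def hourly_weights(pattern: str, hours: int = 24) -> list[float]:
    """Relative event volume per hour for a few simple pattern shapes.

    Illustrative only; the real pattern knobs may differ.
    """
    if pattern == "steady":
        w = [1.0] * hours
    elif pattern == "growth":
        w = [1.0 + 0.1 * h for h in range(hours)]  # linear ramp
    elif pattern == "seasonal":
        w = [1.0 + 0.5 * math.sin(2 * math.pi * h / hours) for h in range(hours)]
    elif pattern == "burst":
        w = [5.0 if 9 <= h < 11 else 1.0 for h in range(hours)]  # morning spike
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    total = sum(w)
    return [x / total for x in w]  # normalize so weights sum to 1

assert abs(sum(hourly_weights("burst")) - 1.0) < 1e-9
```

Multiplying the normalized weights by a target event count gives per-hour volumes; sampling timestamps within each hour then yields a stream with the chosen shape.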
```
uv run data-forge benchmark --pack saas_billing --scale 5000 --iterations 3 --output-json bench.json
```

| Adapter | Usage |
|---|---|
| SQLite | `--load sqlite --db-uri ./data.db` |
| DuckDB | `--load duckdb --db-uri ./data.duckdb` |
| PostgreSQL | `--load postgres --db-uri postgresql://user:pass@host/db` |
| Snowflake / BigQuery | Env vars or flags (see CLI help) |
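For SQLite, what a load adapter does is simple enough to sketch with the standard library. This is a generic illustration of the loading step, not Data Forge's adapter code; the table and rows are invented:

```python
import sqlite3

# Hypothetical generated rows (in practice these come from the generator).
rows = [(1, "alice@example.com"), (2, "bob@example.com")]

conn = sqlite3.connect(":memory:")  # use ./data.db for a file-backed database
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)  # 2
```

The Postgres and warehouse adapters follow the same create-then-bulk-insert pattern, just over their own drivers and connection URIs.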
dbt, Great Expectations, Airflow, Reconciliation
- dbt: `--export-dbt --dbt-dir ./dbt_project` — seeds, sources, schema tests.
- GE: `--export-ge --ge-dir ./great_expectations`; validate with `data-forge validate-ge`.
- Airflow: `--export-airflow --airflow-dir ./airflow` — DAG templates.
- Reconciliation: `data-forge reconcile --manifest manifest.json --data ./output --schema schemas/...`
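Reconciliation boils down to checking generated output against what the manifest says was produced. A toy version of the row-count check (the manifest field names here are assumptions, not Data Forge's manifest schema):

```python
import csv
import io
import json

# Hypothetical manifest shape for illustration.
manifest = json.loads('{"tables": {"customers": {"row_count": 2}}}')

def reconcile_row_counts(manifest: dict, csv_by_table: dict[str, str]) -> dict[str, bool]:
    # Compare each table's actual CSV row count against the manifest's claim.
    results = {}
    for table, expected in manifest["tables"].items():
        reader = csv.reader(io.StringIO(csv_by_table[table]))
        actual = sum(1 for _ in reader) - 1  # subtract the header row
        results[table] = actual == expected["row_count"]
    return results

data = {"customers": "customer_id,email\n1,a@x.com\n2,b@x.com\n"}
print(reconcile_row_counts(manifest, data))  # {'customers': True}
```

A real reconcile pass would also check schemas and checksums, but the pattern is the same: manifest claim versus observed artifact.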
| Goal | Command |
|---|---|
| Sample generation | uv run data-forge generate --pack saas_billing --scale 1000 -o output -f parquet |
| Benchmark | uv run data-forge benchmark --pack saas_billing --scale 5000 --iterations 3 --output-json bench.json |
| List packs | uv run data-forge packs |
| Load scenario (UI) | Scenarios → Import scenario → choose from examples/scenarios/ |
| Compare runs (UI) | Runs → “Compare with another run” or /runs/compare |
```
make validate-all   # or: scripts/validate_all.ps1 | scripts/validate_all.sh
```

Steps: backend ruff, mypy, pytest; frontend type-check, unit tests, build. E2E: `make e2e` or `cd frontend && npm run e2e`. See CONTRIBUTING.md and docs/testing.md.
| Doc | Description |
|---|---|
| Architecture | Repository structure, API surface, schema system, frontend routes, CI/E2E |
| API Reference | REST endpoints, request/response shapes, errors (413/429), lineage & manifest |
| Testing | Backend (ruff, mypy, pytest), frontend (Vitest), E2E (Playwright), validation checklist |
| CI/CD | GitHub Actions pipeline, strict gates, local parity, troubleshooting |
| Security | Schema limits, rate limiting, path safety, preview safety |
| Schema Studio | Custom schemas: form/JSON, validation, preview, version history, restore |
| Lineage & reproducibility | Run lineage, manifest, custom schema provenance |
| Demo walkthrough | Step-by-step UI walkthrough |
For the full docs index, versioning, and release checklist: docs/INDEX.md, docs/versioning.md, docs/release-checklist.md. See CHANGELOG.md for version history.
UI screenshots and demo assets: docs/screenshots/. Target filenames and checklist: docs/screenshots/SCREENSHOT-CHECKLIST.md.
We welcome contributions. CONTRIBUTING.md covers setup, full validation (make validate-all, E2E), and adding scenarios/packs/tests. CODE_OF_CONDUCT.md and SECURITY.md for community and security.
- Local-first: No cloud deployment; run API and frontend locally or in CI. Storage is file-based (or optional SQLite) for runs and scenarios.
- E2E: Local Playwright runs auto-start the API and frontend through `frontend/playwright.config.ts`; CI starts both explicitly. See docs/testing.md.
- Build: In some environments (e.g. OneDrive-synced folders), `npm run build` may fail with EPERM; run from a non-synced path or close other tools. Type-check and unit tests still validate the codebase.
- Maturity: Data Forge is open-source and actively developed; some integrations and adapters are evolving. Check docs/architecture-current-state.md and docs/INDEX.md for current scope.
MIT.
