# Data Forge — Time-Aware Synthetic Data Platform

Schema-aware synthetic data for databases, APIs, and pipelines. Realistic, relational, privacy-safe.

Quick Start · Demo · Scenarios · Workflows · Docs · Walkthrough · Contributing

Data Forge generates business-valid, cross-table, time-consistent, privacy-safe synthetic data—not hand-written fixtures, but test-ready data that respects schemas, foreign keys, business rules, and optional anomaly injection. Built for demos, UAT, integration testing, and pipeline development.

Open-source • Python backend • Next.js frontend • Local-first • Made for data engineers


## ✨ Why Data Forge is different

| Typical approach | Data Forge |
| --- | --- |
| Hand-rolled CSV/JSON fixtures | **Schema-driven generation** — use your DDL, JSON Schema, or OpenAPI; FKs and rules respected |
| One-off scripts per project | **Reusable scenarios** — save configs, run from the wizard or Advanced, compare runs, version with scenarios |
| No pipeline realism | **ETL modes** — full snapshot, incremental, CDC; bronze/silver/gold; schema drift and messiness profiles |
| Guesswork on scale | **Benchmark mode** — scale presets, workload profiles, throughput and memory metrics |
| Isolated tables | **Domain packs** — 10+ packs (SaaS, e-commerce, fintech, healthcare, IoT, etc.) with event streams and benchmark relevance |

## 📋 Feature matrix

| Area | Capabilities |
| --- | --- |
| Schema | SQL DDL, JSON Schema, OpenAPI (table-like); YAML business rules; PK/FK resolution |
| Generation | Full snapshot, incremental, CDC; bronze/silver/gold; drift, messiness; anomaly injection |
| Export | CSV, JSON, JSONL, Parquet, SQL inserts |
| Load | SQLite, DuckDB, PostgreSQL, Snowflake, BigQuery |
| Integrations | dbt seeds, Great Expectations, Airflow DAGs, OpenAPI contract fixtures |
| Simulation | Event streams, time patterns (steady/burst/seasonal/growth), replay modes |
| Benchmark | Scale presets (small → xlarge), workload profiles, throughput/memory metrics |
| Quality | Schema validation, GE expectations, manifest reconciliation, privacy detection |

## 🚀 Quick Start

### Install

```sh
cd data-forge
uv sync
```

If `uv` is unavailable, use a standard virtual environment:

```sh
python -m venv .venv
.venv\Scripts\python -m pip install --upgrade pip
.venv\Scripts\python -m pip install -e ".[dev]"
```

Optional extras:

- `uv sync --extra warehouse` or `.venv\Scripts\python -m pip install -e ".[warehouse]"`
- `uv sync --extra streamlit` or `.venv\Scripts\python -m pip install -e ".[streamlit]"`
- `uv sync --extra all` or `.venv\Scripts\python -m pip install -e ".[all]"`

### Generate from a domain pack

```sh
uv run data-forge generate --pack saas_billing --scale 1000 -o output -f parquet
uv run data-forge generate --pack ecommerce --scale 2000 -o output -f sql
```
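Once the files are exported, you can sanity-check referential integrity yourself. A minimal stdlib sketch — the `customers`/`orders` table and column names are illustrative, not guaranteed by any particular pack:

```python
import csv

def fk_violations(child_rows, fk_col, parent_rows, pk_col):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_col] for row in parent_rows}
    return [row for row in child_rows if row[fk_col] not in parent_keys]

def load_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Inline rows for illustration; swap in load_csv("output/customers.csv") etc.
customers = [{"id": "c1"}, {"id": "c2"}]
orders = [{"order_id": "o1", "customer_id": "c1"},
          {"order_id": "o2", "customer_id": "c9"}]  # dangling FK

bad = fk_violations(orders, "customer_id", customers, "id")
print(bad)  # the one order pointing at the missing customer "c9"
```

An empty result means every child row resolved against its parent table, which is what the generator's FK resolution promises.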

### Run the Product UI

```sh
# Terminal 1: API
uv run uvicorn data_forge.api.main:app --reload --port 8000

# Terminal 2: Frontend
cd frontend && npm run dev
```

Open http://localhost:3000. See `docs/demo-walkthrough.md` for a step-by-step walkthrough.

### Run with Docker Compose (reproducible local baseline)

```sh
docker compose up --build
```

This starts:

- API on http://localhost:8000
- Frontend on http://localhost:3001

Optional local Postgres stack for adapter workflows:

```sh
docker compose --profile db up --build
```

## 🎬 Demo

One-command demo (no cloud credentials):

```sh
make demo-data
# or: ./scripts/run_demo.ps1 (Windows) / ./scripts/run_demo.sh (Linux/macOS)
```

Outputs go to `demo_output/`. Start the API and frontend to inspect runs and artifacts in the UI.


## 📋 Scenarios & runs

- **Scenarios** — save configs from Advanced or the Create Wizard; load them in the wizard or Advanced; update or save as new; edit metadata on the scenario detail page. Import from `examples/scenarios/`.
- **Runs** — every generation or benchmark creates a run (history, timeline, logs, artifacts). Compare two runs side by side (config, summary, benchmark, raw diff).
- **Artifacts** — datasets, event streams, dbt seeds, GE suites, DAGs; filter by run or type.

## 🔄 Core workflows

1. **Generate a dataset** — Wizard or Advanced → pick a pack or custom schema (Schema Studio) → set scale/options → run. Output appears in Runs and Artifacts.
2. **Run a simulation** — Advanced → Pipeline Simulation → enable event streams, set density/pattern/replay → run.
3. **Benchmark a warehouse load** — Advanced → Benchmark → pick a profile and scale preset → run. View throughput and duration on the run detail page.
4. **Compare runs** — Runs → open a run → “Compare with another run”, or /runs/compare. Use the raw JSON diff for debugging.
5. **Save a scenario** — after configuring, “Save as scenario” (or “Update scenario” if one is loaded). Reuse from Scenarios or the wizard.
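The raw JSON diff used when comparing runs is easy to reproduce outside the UI. A rough stdlib sketch of diffing two nested run configs — the config field names here are invented for illustration:

```python
def dict_diff(a, b, prefix=""):
    """Flat list of (path, left, right) differences between two nested dicts."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            diffs.extend(dict_diff(va, vb, prefix=path + "."))  # recurse into nested config
        elif va != vb:
            diffs.append((path, va, vb))
    return diffs

run_a = {"pack": "saas_billing", "scale": 1000, "options": {"format": "parquet"}}
run_b = {"pack": "saas_billing", "scale": 2000, "options": {"format": "sql"}}
print(dict_diff(run_a, run_b))
# [('options.format', 'parquet', 'sql'), ('scale', 1000, 2000)]
```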

## 📁 Project structure at a glance

```
data-forge/
├── src/data_forge/        # Backend
│   ├── api/               # FastAPI app, routers, stores, middleware
│   ├── models/            # Schema, config, generation, manifest
│   ├── engine.py          # Core run_generation, export_result
│   ├── schema_ingest/     # SQL, JSON Schema, OpenAPI parsers
│   ├── rule_engine/       # YAML business rules
│   ├── generators/        # Primitives, distributions, FK resolution
│   ├── adapters/          # SQLite, DuckDB, Postgres, Snowflake, BigQuery
│   ├── exporters/         # CSV, JSON, Parquet, SQL
│   ├── domain_packs/      # Pre-built schemas and rules
│   ├── simulation/        # Event streams, time patterns
│   └── services/          # Run, scenario, lineage, retention, metrics
├── frontend/              # Next.js product UI (wizard, runs, Schema Studio)
├── tests/                 # Pytest backend tests
├── frontend/e2e/          # Playwright E2E tests
├── examples/scenarios/    # Example scenario JSONs
├── scripts/               # validate_all.*, run_demo.*
├── docs/                  # Architecture, API reference, testing, CI, security
└── .github/workflows/     # CI (backend, frontend, E2E)
```

## 🛠 Developer platform capabilities

- **CLI** — `data-forge generate`, `benchmark`, `validate`, `reconcile`, `packs`; full control from the shell.
- **API** — start runs; list runs, artifacts, and scenarios; compare runs; preflight; benchmark. Local JSON persistence. Key endpoints: `POST /api/runs/generate`, `GET /api/runs`, `GET /api/runs/compare`, `POST /api/benchmark`, `GET /api/scenarios`, `GET /api/artifacts`, `POST /api/preflight`.
- **UI** — wizard and Advanced config, run history and detail, scenario library, artifact browser, Schema Studio (custom schemas, validation, version diff), validation center, run comparison.
- **CI** — GitHub Actions: backend ruff, mypy, pytest; frontend type-check, tests, build; Playwright E2E. Local: `make validate-all`. See `docs/ci-cd.md`.
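As an illustration of driving the API from a script, here is a hedged stdlib sketch. The request body shape is an assumption — consult the API reference doc for the actual contract:

```python
import json
from urllib import request

API = "http://localhost:8000"

def build_generate_payload(pack: str, scale: int, fmt: str) -> dict:
    # Hypothetical body shape; check docs/api-reference.md for real field names.
    return {"pack": pack, "scale": scale, "format": fmt}

def start_run(payload: dict) -> dict:
    """POST the payload to /api/runs/generate and return the parsed response."""
    req = request.Request(
        f"{API}/api/runs/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:  # requires the API to be running
        return json.load(resp)

payload = build_generate_payload("saas_billing", 1000, "parquet")
# run_info = start_run(payload)  # uncomment with the API up on port 8000
```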

## ⚡ Pipeline simulation & benchmark

- **Simulation** — event streams (e.g. order lifecycle, payments), time patterns, replay modes. Packs: ecommerce, fintech, logistics, IoT, social, saas_billing.
- **Benchmark** — scale presets (small ~10k → xlarge ~10M), workload profiles, throughput (rows/s), duration, memory estimates.

```sh
uv run data-forge benchmark --pack saas_billing --scale 5000 --iterations 3 --output-json bench.json
```
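The exact shape of `bench.json` is not documented here, so treat the field names below as placeholders; the arithmetic for a rows/s summary is straightforward either way:

```python
def throughput(rows: int, seconds: float) -> float:
    """Rows per second, the headline benchmark metric."""
    return rows / seconds

# Placeholder structure standing in for a parsed bench.json.
bench = {"iterations": [
    {"rows": 5000, "seconds": 2.0},
    {"rows": 5000, "seconds": 2.5},
    {"rows": 5000, "seconds": 2.5},
]}

rates = [throughput(i["rows"], i["seconds"]) for i in bench["iterations"]]
print(round(sum(rates) / len(rates)))  # mean rows/s across iterations → 2167
```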

## 🗄 Database loading

| Adapter | Usage |
| --- | --- |
| SQLite | `--load sqlite --db-uri ./data.db` |
| DuckDB | `--load duckdb --db-uri ./data.duckdb` |
| PostgreSQL | `--load postgres --db-uri postgresql://user:pass@host/db` |
| Snowflake / BigQuery | Env vars or flags (see CLI help) |
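For SQLite specifically, the adapter's job can be approximated in a few lines of stdlib code, which is handy when debugging load issues. Table and column names are invented for the example:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")  # use ./data.db to match --db-uri above
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")

# Stand-in for an exported CSV; swap in open("output/customers.csv") instead.
exported = io.StringIO("id,name\nc1,Acme\nc2,Globex\n")
rows = [(r["id"], r["name"]) for r in csv.DictReader(exported)]
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 2
```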

## 📚 Integrations

- **dbt** — `--export-dbt --dbt-dir ./dbt_project`: seeds, sources, schema tests.
- **Great Expectations** — `--export-ge --ge-dir ./great_expectations`; validate with `data-forge validate-ge`.
- **Airflow** — `--export-airflow --airflow-dir ./airflow`: DAG templates.
- **Reconciliation** — `data-forge reconcile --manifest manifest.json --data ./output --schema schemas/...`
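Reconciliation boils down to comparing the manifest against what actually landed on disk. A toy sketch of the idea — the manifest schema here is invented, and `data-forge reconcile` is the real tool:

```python
def reconcile(manifest: dict, actual_counts: dict) -> list:
    """Report tables whose on-disk row count disagrees with the manifest."""
    problems = []
    for table, expected in manifest["row_counts"].items():
        actual = actual_counts.get(table, 0)
        if actual != expected:
            problems.append((table, expected, actual))
    return problems

manifest = {"row_counts": {"customers": 1000, "orders": 5000}}
actual = {"customers": 1000, "orders": 4990}   # e.g. a partial load failure

print(reconcile(manifest, actual))  # [('orders', 5000, 4990)]
```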

## 📋 Example commands

| Goal | Command |
| --- | --- |
| Sample generation | `uv run data-forge generate --pack saas_billing --scale 1000 -o output -f parquet` |
| Benchmark | `uv run data-forge benchmark --pack saas_billing --scale 5000 --iterations 3 --output-json bench.json` |
| List packs | `uv run data-forge packs` |
| Load scenario (UI) | Scenarios → Import scenario → choose from `examples/scenarios/` |
| Compare runs (UI) | Runs → “Compare with another run” or /runs/compare |

## 🧪 Testing and validation

```sh
make validate-all
# or: scripts/validate_all.ps1 | scripts/validate_all.sh
```

Steps: backend ruff, mypy, pytest; frontend type-check, unit tests, build. E2E: `make e2e` or `cd frontend && npm run e2e`. See `CONTRIBUTING.md` and `docs/testing.md`.


## 📖 Docs (canonical)

| Doc | Description |
| --- | --- |
| Architecture | Repository structure, API surface, schema system, frontend routes, CI/E2E |
| API Reference | REST endpoints, request/response shapes, errors (413/429), lineage & manifest |
| Testing | Backend (ruff, mypy, pytest), frontend (Vitest), E2E (Playwright), validation checklist |
| CI/CD | GitHub Actions pipeline, strict gates, local parity, troubleshooting |
| Security | Schema limits, rate limiting, path safety, preview safety |
| Schema Studio | Custom schemas: form/JSON, validation, preview, version history, restore |
| Lineage & reproducibility | Run lineage, manifest, custom schema provenance |
| Demo walkthrough | Step-by-step UI walkthrough |

For the full docs index, versioning, and release checklist, see `docs/INDEX.md`, `docs/versioning.md`, and `docs/release-checklist.md`. See `CHANGELOG.md` for version history.


## 📸 Screenshots and assets

UI screenshots and demo assets live in `docs/screenshots/`. Target filenames and checklist: `docs/screenshots/SCREENSHOT-CHECKLIST.md`.


## 🤝 Contributing

We welcome contributions. `CONTRIBUTING.md` covers setup, full validation (`make validate-all`, E2E), and adding scenarios, packs, and tests. See `CODE_OF_CONDUCT.md` and `SECURITY.md` for community and security policies.

## ⚠️ Known limitations

- **Local-first** — no cloud deployment; run the API and frontend locally or in CI. Storage is file-based (or optional SQLite) for runs and scenarios.
- **E2E** — local Playwright runs auto-start the API and frontend through `frontend/playwright.config.ts`; CI starts both explicitly. See `docs/testing.md`.
- **Build** — on some environments (e.g. OneDrive-synced folders), `npm run build` may fail with `EPERM`; run from a non-synced path or close other tools. Type-check and unit tests still validate the codebase.
- **Maturity** — Data Forge is open-source and actively developed; some integrations and adapters are still evolving. Check `docs/architecture-current-state.md` and `docs/INDEX.md` for current scope.

## 📄 License

MIT.
