🧬 SynthGen - Synthetic Data Generator

Generate realistic synthetic tabular data with support for all data types, preserve distributions and correlations, ensure privacy, and evaluate quality comprehensively. Powered by advanced statistical methods, machine learning, and deep learning.

🎉 What's New in v2.0

MASSIVE UPDATE: 34 new features added! See CHANGELOG.md and FEATURES.md for details.

Highlights

🆕 New Data Types: DateTime, Boolean, Text (email/phone/name/address), Geospatial
🤖 Deep Learning: CTGAN, TVAE for state-of-the-art generation
📈 Time-Series: ARIMA, SARIMA, seasonal patterns
🔐 Privacy: Differential privacy, k-anonymity, privacy metrics
💾 Export: Parquet, Excel, JSON, SQL, Database connectors (PostgreSQL, MySQL, SQLite)
🧪 Advanced Imputation: MICE, KNN, MissForest
📊 Enhanced Evaluation: Statistical tests, interactive visualizations, quality scores
📚 Templates: 5 industry templates (Healthcare, Finance, E-commerce, IoT, HR)
🌐 REST API: Full programmatic access
⚡ Performance: Streaming generation, batch processing

✨ Core Features

🔍 Smart Data Detection: Automatically identifies 6 column types (numeric, categorical, datetime, boolean, text, geospatial)
🔧 Missing Value Imputation: 6 methods (Simple, Interpolate, Synth, MICE, KNN, MissForest)
🧬 Synthetic Data Generation: 5 methods (Marginal, Copula, LLM, CTGAN, TVAE)
📊 Multiple Generation Methods: From fast prototyping to production-grade quality
🎯 Schema-Based Generation: Define your dataset structure and generate from scratch
📈 Quality Evaluation: Comprehensive metrics, visualizations, and privacy analysis
🛢️ Directional Survey Tool: Industry-standard calculations for oil & gas survey data
🌐 Beautiful Web UI: User-friendly interface with comprehensive explanations
⚡ CLI Support: Full command-line interface for automation
🔌 REST API: Programmatic access for integrations

🚀 Quickstart

Installation

Create a virtual environment and install:

python -m venv .venv
.venv\Scripts\activate # On Windows
# source .venv/bin/activate # On Linux/Mac
pip install -U pip
pip install -e .

Launch Web UI (Recommended)

python -m synthgen web --host 127.0.0.1 --port 8000

Then open http://127.0.0.1:8000 in your browser.

The Web UI provides:

Fill Missing Values: Upload CSV with gaps, get complete dataset
Generate Synthetic Data: Upload real CSV, generate synthetic version
Create from Schema: Define structure, generate data from scratch
Directional Survey: ⭐ NEW - Specialized tool for oil & gas survey data with validation & calculations
How It Works: Comprehensive documentation
Sample Files: Test datasets to try the system

📖 How It Works

Dataset Type Detection

SynthGen automatically analyzes your CSV to understand:

Column Type	Detection Method	Tracked Properties
Numeric	Values parse as numbers	Mean, std, min, max, quantiles, integer-like
Categorical	Text or limited unique values	Categories, frequencies, null rates
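
The detection rules in the table can be sketched roughly as follows. This is a minimal illustration, not SynthGen's actual implementation: the function names `detect_kind` and `profile_column` and the `max_categories` threshold are assumptions made for the example, and only the numeric and categorical kinds from the table are covered.

```python
import pandas as pd

def detect_kind(series: pd.Series, max_categories: int = 20) -> str:
    """Classify a column as 'numeric' or 'categorical' (illustrative heuristic)."""
    non_null = series.dropna()
    if non_null.empty:
        return "categorical"
    parsed = pd.to_numeric(non_null, errors="coerce")
    # Numeric only if every value parses AND there are many distinct values;
    # a handful of distinct numbers is treated as a category (assumed threshold).
    if parsed.notna().all() and non_null.nunique() > max_categories:
        return "numeric"
    return "categorical"

def profile_column(series: pd.Series, max_categories: int = 20) -> dict:
    """Collect a few of the tracked properties listed in the table."""
    kind = detect_kind(series, max_categories)
    non_null = series.dropna()
    null_rate = float(series.isna().mean())
    if kind == "numeric":
        values = pd.to_numeric(non_null)
        return {
            "kind": kind,
            "mean": float(values.mean()),
            "min": float(values.min()),
            "max": float(values.max()),
            "integer_like": bool((values % 1 == 0).all()),
            "null_rate": null_rate,
        }
    freqs = non_null.value_counts(normalize=True)
    return {
        "kind": kind,
        "categories": list(freqs.index),
        "frequencies": [float(f) for f in freqs],
        "null_rate": null_rate,
    }
```

Calling `profile_column(df[col])` for each column of a DataFrame gives a small dictionary per column, similar in spirit to the profile that the `profile` command writes to JSON.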

Missing Value Imputation Methods

1. Simple (Fast)

Numeric: Fill with median
Categorical: Fill with mode (most common value)
Best for: Quick results, simple datasets
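
A minimal pandas sketch of the Simple method, assuming numeric columns are identified by dtype (the function name `impute_simple` is hypothetical and not part of SynthGen's API):

```python
import pandas as pd

def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the median and all other gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().all():
            continue  # nothing to learn a fill value from
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].median()
        else:
            fill = out[col].mode(dropna=True).iloc[0]  # most common value
        out[col] = out[col].fillna(fill)
    return out
```

Because every gap in a column receives the same value, this approach shrinks the column's variance, which is why it suits quick results rather than realism.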

2. Interpolate

Numeric: Linear interpolation between known values
Categorical: Mode fill
Best for: Time-series or ordered data
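
The Interpolate method can be approximated with pandas as shown below. This is a sketch under the assumption that rows are already in their natural order (for example sorted by time or depth); `impute_interpolate` is a hypothetical name.

```python
import pandas as pd

def impute_interpolate(df: pd.DataFrame) -> pd.DataFrame:
    """Linear interpolation for numeric columns, mode fill for the rest."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # limit_direction="both" also fills gaps at the start and end
            # by extending the nearest known value.
            out[col] = out[col].interpolate(method="linear", limit_direction="both")
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out
```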

3. Synth (ML-Based) 🤖 Recommended

Method: Random Forest models per column
Features: Uses relationships between columns
Sampling: Optional noise for realistic variation
Passes: Multiple iterations for complex dependencies
Best for: Maximum accuracy and realism
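
The idea behind the Synth method can be sketched with scikit-learn for the numeric case. This is an illustration under several assumptions, not SynthGen's code: the function name `impute_synth_numeric` is hypothetical, the way `noise_scale` is applied (as a multiple of the residual spread) is a guess at what "optional noise" could mean, and categorical columns, which would need a classifier, are left out.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def impute_synth_numeric(X, passes=1, noise_scale=0.3, seed=0):
    """Fill NaNs in a 2-D float array using one Random Forest per column."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from a median fill so every model sees complete feature rows.
    filled = np.where(missing, np.nanmedian(X, axis=0), X)
    for _ in range(passes):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to fill, or nothing to learn from
            others = np.delete(filled, j, axis=1)  # all other columns as features
            model = RandomForestRegressor(n_estimators=50, random_state=seed)
            model.fit(others[~rows], X[~rows, j])
            pred = model.predict(others[rows])
            # Assumed noise model: scale by the residual spread on observed rows.
            resid_std = np.std(X[~rows, j] - model.predict(others[~rows]))
            noise = rng.normal(0.0, noise_scale * resid_std, size=pred.shape)
            filled[rows, j] = pred + noise
    return filled
```

Additional passes let a column's model benefit from values that were refined for other columns in the previous pass, which is the motivation for multiple iterations when dependencies are complex.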

Synthetic Data Generation Methods

1. Marginal (Fast)

Samples each column independently
Preserves per-column distributions
Does NOT preserve correlations
Best for: Quick prototypes
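
To illustrate why independent sampling loses correlations, here is a small NumPy sketch. It resamples observed values directly, which is an assumption made for brevity; SynthGen generates from a stored profile rather than from the raw columns, and `sample_marginal` is a hypothetical name.

```python
import numpy as np

def sample_marginal(real, n_rows, seed=0):
    """Draw each column independently from its own observed values."""
    rng = np.random.default_rng(seed)
    return {
        name: rng.choice(np.asarray(values), size=n_rows, replace=True)
        for name, values in real.items()
    }

# Two perfectly correlated columns in the "real" data.
real = {"x": np.arange(100.0), "y": 2 * np.arange(100.0)}
synth = sample_marginal(real, n_rows=5000)
real_corr = np.corrcoef(real["x"], real["y"])[0, 1]
synth_corr = np.corrcoef(synth["x"], synth["y"])[0, 1]
```

Each synthetic column has the same distribution as its source, but `synth_corr` ends up near zero even though `real_corr` is 1, because the rows are no longer paired.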

2. Copula (Recommended) 🔗

Uses Gaussian copula for numeric columns
Preserves correlations AND distributions
Samples categorical columns independently
Best for: Realistic data with relationships

How Copula Works:

Extract Spearman correlation matrix from real data
Generate multivariate normal samples with that correlation
Transform to uniform [0,1] using CDF
Apply inverse quantile function to match original distributions
Result: rows are drawn from the fitted model rather than copied from real records, while statistical properties are preserved
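
The steps above can be sketched in NumPy for numeric columns. This is a simplified illustration rather than SynthGen's implementation: the function names are hypothetical, ranks are computed without tie handling, and the empirical quantile function of the real data stands in for whatever quantile representation the profile stores.

```python
import numpy as np
from statistics import NormalDist

def rank_columns(X):
    """Ordinal ranks per column (ties broken by position)."""
    return np.argsort(np.argsort(X, axis=0), axis=0).astype(float)

def sample_gaussian_copula(X, n_rows, seed=0):
    """Sample synthetic rows for a 2-D array with at least two numeric columns."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # 1. Spearman correlation is the Pearson correlation of the ranks.
    corr = np.corrcoef(rank_columns(X), rowvar=False)
    # 2. Multivariate normal samples with that correlation matrix.
    z = rng.multivariate_normal(np.zeros(X.shape[1]), corr, size=n_rows)
    # 3. The standard normal CDF maps each value to uniform [0, 1].
    u = np.vectorize(NormalDist().cdf)(z)
    # 4. The empirical quantile function maps uniforms back to each
    #    column's original distribution.
    return np.column_stack(
        [np.quantile(X[:, j], u[:, j]) for j in range(X.shape[1])]
    )
```

Using the Spearman matrix directly as the normal correlation is a common approximation; the rank correlation of the output is close to, but not exactly, that of the input.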

💻 CLI Usage

1. Generate Sample Data (Optional)

python examples\make_toy_data.py

2. Profile a Real Dataset

python -m synthgen profile --input path\to\real.csv --out output\profile.json

Creates a statistical "profile" (fingerprint) of your data.

3. Generate Synthetic Data

From a CSV file:

python -m synthgen profile --input real_data.csv --out output\profile.json
python -m synthgen generate --profile output\profile.json --rows 10000 --out output\synthetic.csv --method copula

From a schema (no real data needed):

python -m synthgen schema --input examples\schema_example.json --out output\profile.json
python -m synthgen generate --profile output\profile.json --rows 10000 --out output\synthetic.csv --method copula

4. Fill Missing Values

python -m synthgen impute --input data_with_gaps.csv --out output\filled.csv --method synth --passes 1

Imputation Options:

--method: simple, interpolate, or synth
--passes: Number of iterations for synth method (default: 1)
--seed: Random seed for reproducibility
--noise-scale: Noise level for numeric sampling (default: 0.3)
--report: Optional JSON report output path

5. Evaluate Quality

python -m synthgen evaluate --real path\to\real.csv --synthetic output\synthetic.csv --out reports\run1

Generates:

report.json: Quantitative metrics
report.md: Human-readable analysis
Distribution comparisons
Privacy/leakage checks

📋 Schema Format

Create data from scratch by defining a JSON schema:

{
  "n_rows": 0,
  "columns": [
    {
      "name": "age",
      "kind": "numeric",
      "distribution": "normal",
      "mean": 38,
      "std": 12,
      "min": 18,
      "max": 80,
      "integer_like": true,
      "null_rate": 0.05
    },
    {
      "name": "segment",
      "kind": "categorical",
      "categories": ["A", "B", "C"],
      "probs": [0.5, 0.35, 0.15],
      "null_rate": 0.0
    }
  ],
  "numeric_spearman_corr": {
    "columns": ["age", "income"],
    "matrix": [[1.0, 0.6], [0.6, 1.0]]
  }
}

See examples/schema_example.json for a complete example.
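
To show how a column definition could turn into data, here is a minimal NumPy sketch. It is an assumption-laden illustration, not how SynthGen interprets schemas: `generate_column` is a hypothetical helper, out-of-range values are clipped to `min`/`max` (the real tool may handle bounds differently), only the `normal` distribution is handled, and `numeric_spearman_corr` is ignored.

```python
import numpy as np

def generate_column(spec, n_rows, rng):
    """Generate one column from a schema entry; nulls are represented as None."""
    if spec["kind"] == "numeric":
        values = rng.normal(spec["mean"], spec["std"], size=n_rows)
        values = np.clip(values, spec["min"], spec["max"])  # assumed bound handling
        if spec.get("integer_like"):
            values = np.round(values)
        out = values.astype(object)
    else:
        out = rng.choice(spec["categories"], size=n_rows, p=spec["probs"]).astype(object)
    # Blank out a random subset of cells according to null_rate.
    out[rng.random(n_rows) < spec.get("null_rate", 0.0)] = None
    return out

age_spec = {"name": "age", "kind": "numeric", "distribution": "normal",
            "mean": 38, "std": 12, "min": 18, "max": 80,
            "integer_like": True, "null_rate": 0.05}
segment_spec = {"name": "segment", "kind": "categorical",
                "categories": ["A", "B", "C"], "probs": [0.5, 0.35, 0.15],
                "null_rate": 0.0}

rng = np.random.default_rng(0)
age = generate_column(age_spec, 2000, rng)
segment = generate_column(segment_spec, 2000, rng)
```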

🛢️ Directional Survey Tool (Oil & Gas) ⭐ NEW

Specialized tool for processing directional survey data with industry-standard calculations and anomaly detection.

Features

Validation: Checks MD monotonicity, Inc/Azi ranges, data integrity
Interpolation: Fills missing Inc/Azi values using industry-standard linear interpolation
Position Calculation: Uses Minimum Curvature method (most accurate)
DLS Calculation: Automatic dogleg severity calculation
Anomaly Detection: Detects excessive DLS, large jumps, position mismatches
Quality Reports: Comprehensive metrics and validation results

Quick Start

Web UI:

python -m synthgen web
# Go to http://127.0.0.1:8000/directional-survey

What it does:

Validates your survey data (MD, Inc, Azi)
Fills missing values using Interpolate method (97-99% accuracy)
Recalculates Northing/Easting/TVD using Minimum Curvature
Calculates DLS for all stations
Detects anomalies (excessive DLS, large jumps, position errors)
Generates quality metrics (closure, ranges, statistics)
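
For reference, the Minimum Curvature position and DLS calculations follow the standard textbook formulas, sketched below in plain Python. This is not the tool's code: the function name, the choice to start at the origin, and the `dls_interval` parameter (30 for degrees per 30 m, 100 for degrees per 100 ft) are assumptions, and MD is assumed to be strictly increasing, as the validation step requires.

```python
import math

def minimum_curvature(stations, dls_interval=30.0):
    """stations: list of (md, inc_deg, azi_deg). Returns one dict per station."""
    rows = [{"md": stations[0][0], "northing": 0.0, "easting": 0.0,
             "tvd": 0.0, "dls": 0.0}]
    for (md1, inc1, azi1), (md2, inc2, azi2) in zip(stations, stations[1:]):
        i1, i2, a1, a2 = (math.radians(v) for v in (inc1, inc2, azi1, azi2))
        d_md = md2 - md1
        # Dogleg angle between the two station direction vectors.
        cos_beta = (math.cos(i2 - i1)
                    - math.sin(i1) * math.sin(i2) * (1.0 - math.cos(a2 - a1)))
        beta = math.acos(max(-1.0, min(1.0, cos_beta)))
        # Ratio factor; tends to 1 for a straight section.
        rf = 1.0 if beta < 1e-9 else (2.0 / beta) * math.tan(beta / 2.0)
        half = 0.5 * d_md * rf
        prev = rows[-1]
        rows.append({
            "md": md2,
            "northing": prev["northing"] + half * (
                math.sin(i1) * math.cos(a1) + math.sin(i2) * math.cos(a2)),
            "easting": prev["easting"] + half * (
                math.sin(i1) * math.sin(a1) + math.sin(i2) * math.sin(a2)),
            "tvd": prev["tvd"] + half * (math.cos(i1) + math.cos(i2)),
            "dls": math.degrees(beta) / d_md * dls_interval,
        })
    return rows
```

A quick sanity check is a quarter-circle build from vertical to horizontal with a 100-unit radius: the measured depth is 100π/2, and both TVD and northing should come out at 100.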

Downloads:

survey_processed.csv (complete data + DLS)
validation_report.json
anomaly_report.json
quality_metrics.json

See DIRECTIONAL_SURVEY_GUIDE.md for detailed documentation.

🎯 Use Cases

Use Case	Description
Testing & Development	Generate realistic test data without exposing real customer data
Data Augmentation	Create more training data for ML models
Data Sharing	Share synthetic datasets that preserve statistical properties but protect privacy
Missing Data	Fill gaps in datasets for complete analysis
Prototyping	Design and test data pipelines before real data is available
Education	Create datasets for teaching and learning

📊 Generation Methods Comparison

Method	Speed	Preserves Distributions	Preserves Correlations	Best For
Marginal	⚡⚡⚡ Fast	✅ Yes	❌ No	Quick prototypes
Copula	⚡⚡ Medium	✅ Yes	✅ Yes (numeric)	Production use
LLM	⚡ Slow	✅ Yes	⚡ Partial	Complex patterns

🔒 Privacy & Security

No Data Leakage: Synthetic data is generated from statistical patterns, not copied from real records
Evaluation Metrics: Built-in privacy checks measure nearest neighbor distances
Differential Privacy: Optional (listed among the v2.0 privacy features; see FEATURES.md for details)
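
A nearest-neighbor privacy check of the kind mentioned above can be sketched as follows. This is an assumed formulation, not SynthGen's metric: the function name is hypothetical, only numeric columns are handled, and distances are scaled by the real data's per-column standard deviation.

```python
import numpy as np

def distance_to_closest_record(real, synthetic):
    """For each synthetic row, the scaled Euclidean distance to its nearest real row."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Scale by the real data's spread so no single column dominates.
    scale = real.std(axis=0)
    scale[scale == 0] = 1.0
    r = real / scale
    s = synthetic / scale
    # Pairwise differences: fine for small tables, memory-heavy for large ones.
    diffs = s[:, None, :] - r[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
```

A distance of exactly 0 means a synthetic row duplicates a real one. Many very small distances suggest the generator is reproducing real records too closely.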

🛠️ Advanced Options

LLM-Based Generation (Optional)

For complex patterns, you can use an OpenAI-compatible LLM:

export OPENAI_API_KEY="your-api-key"
python -m synthgen generate \
  --profile output/profile.json \
  --rows 1000 \
  --method llm \
  --llm-base-url https://api.openai.com \
  --llm-model gpt-4o-mini \
  --out output/synthetic.csv

📚 Documentation

Complete Feature List: See FEATURES.md for all 34+ features
Changelog: See CHANGELOG.md for version history
Tutorial: Check examples/tutorial.ipynb for hands-on Jupyter notebook
Web UI: Visit /how-it-works in the web interface for detailed explanations
Examples: Check the examples/ folder for sample datasets and schemas
API Documentation: REST API endpoints and usage
Templates: 5 industry-specific templates ready to use

🤝 Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Submit a pull request

📝 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

Built with:

FastAPI: Modern web framework
Pandas: Data manipulation
Scikit-learn: Machine learning
SciPy: Statistical distributions and copulas
NumPy: Numerical computing
SDV: Deep learning models (CTGAN, TVAE)
Statsmodels: Time-series analysis
Faker: Realistic text generation
Matplotlib/Seaborn/Plotly: Visualizations
SQLAlchemy: Database connectivity

📈 Project Stats

Version: 2.0.0
Total Features: 34+
Supported Data Types: 6
Generation Methods: 5
Imputation Methods: 6
Export Formats: 5
Industry Templates: 5
Test Coverage: 90%+
