ErRsah/Synthetic-Data-Generator

🧬 SynthGen - Synthetic Data Generator

Generate realistic synthetic tabular data with support for all data types, preserve distributions and correlations, ensure privacy, and evaluate quality comprehensively. Powered by advanced statistical methods, machine learning, and deep learning.

🎉 What's New in v2.0

MASSIVE UPDATE: 34 new features added! See CHANGELOG.md and FEATURES.md for details.

Highlights

  • 🆕 New Data Types: DateTime, Boolean, Text (email/phone/name/address), Geospatial
  • 🤖 Deep Learning: CTGAN, TVAE for state-of-the-art generation
  • 📈 Time-Series: ARIMA, SARIMA, seasonal patterns
  • 🔐 Privacy: Differential privacy, k-anonymity, privacy metrics
  • 💾 Export: Parquet, Excel, JSON, SQL, Database connectors (PostgreSQL, MySQL, SQLite)
  • 🧪 Advanced Imputation: MICE, KNN, MissForest
  • 📊 Enhanced Evaluation: Statistical tests, interactive visualizations, quality scores
  • 📚 Templates: 5 industry templates (Healthcare, Finance, E-commerce, IoT, HR)
  • 🌐 REST API: Full programmatic access
  • ⚡ Performance: Streaming generation, batch processing

✨ Core Features

  • 🔍 Smart Data Detection: Automatically identifies 6 column types (numeric, categorical, datetime, boolean, text, geospatial)
  • 🔧 Missing Value Imputation: 6 methods (Simple, Interpolate, Synth, MICE, KNN, MissForest)
  • 🧬 Synthetic Data Generation: 5 methods (Marginal, Copula, LLM, CTGAN, TVAE)
  • 📊 Multiple Generation Methods: From fast prototyping to production-grade quality
  • 🎯 Schema-Based Generation: Define your dataset structure and generate from scratch
  • 📈 Quality Evaluation: Comprehensive metrics, visualizations, and privacy analysis
  • 🛢️ Directional Survey Tool: Industry-standard calculations for oil & gas survey data
  • 🌐 Beautiful Web UI: User-friendly interface with comprehensive explanations
  • ⚡ CLI Support: Full command-line interface for automation
  • 🔌 REST API: Programmatic access for integrations

🚀 Quickstart

Installation

  1. Create a virtual environment and install:
python -m venv .venv
.venv\Scripts\activate          # On Windows
# source .venv/bin/activate     # On Linux/Mac
pip install -U pip
pip install -e .

Launch Web UI (Recommended)

python -m synthgen web --host 127.0.0.1 --port 8000

Then open http://127.0.0.1:8000 in your browser.

The Web UI provides:

  • Fill Missing Values: Upload CSV with gaps, get complete dataset
  • Generate Synthetic Data: Upload real CSV, generate synthetic version
  • Create from Schema: Define structure, generate data from scratch
  • Directional Survey: ⭐ NEW - Specialized tool for oil & gas survey data with validation & calculations
  • How It Works: Comprehensive documentation
  • Sample Files: Test datasets to try the system

📖 How It Works

Dataset Type Detection

SynthGen automatically analyzes your CSV to understand:

| Column Type | Detection Method | Tracked Properties |
| --- | --- | --- |
| Numeric | Values parse as numbers | Mean, std, min, max, quantiles, integer-like |
| Categorical | Text or limited unique values | Categories, frequencies, null rates |
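
As an illustration of the detection rules in the table, here is a minimal standard-library sketch. The function name `profile_column` and the `max_categories` threshold are hypothetical and are not taken from SynthGen's code; the real detector may apply different rules.

```python
def profile_column(values, max_categories=10):
    """Classify one column as numeric or categorical (illustrative only).

    `max_categories` is an assumed threshold for "limited unique values".
    """
    # Treat None and blank strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    null_rate = 1 - len(present) / len(values) if values else 0.0

    # A column is a numeric candidate only if every present value parses.
    try:
        numbers = [float(v) for v in present]
    except (TypeError, ValueError):
        numbers = None

    if numbers and len(set(numbers)) > max_categories:
        return {
            "kind": "numeric",
            "min": min(numbers),
            "max": max(numbers),
            "integer_like": all(x.is_integer() for x in numbers),
            "null_rate": null_rate,
        }
    return {
        "kind": "categorical",
        "categories": sorted({str(v) for v in present}),
        "null_rate": null_rate,
    }
```

For example, `profile_column(["a", "b", "a"])` is classified as categorical, while a column of 50 distinct numeric strings is classified as numeric.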

Missing Value Imputation Methods

1. Simple (Fast)

  • Numeric: Fill with median
  • Categorical: Fill with mode (most common value)
  • Best for: Quick results, simple datasets
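
The Simple method can be sketched in a few lines of standard-library Python. The helper name `fill_simple` is hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import median, mode


def fill_simple(values, numeric):
    """Fill None entries with the median (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing to learn from; return the column unchanged.
        return list(values)
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]
```

For instance, `fill_simple([1, None, 3, 10], numeric=True)` fills the gap with the median of the observed values, 3.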

2. Interpolate

  • Numeric: Linear interpolation between known values
  • Categorical: Mode fill
  • Best for: Time-series or ordered data
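
A minimal sketch of linear interpolation for an ordered numeric column follows. The helper name is hypothetical, and leading or trailing gaps are left unfilled here because they have no neighbor on one side; how SynthGen handles those edges is not stated in this README.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of consecutive known positions.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled
```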

3. Synth (ML-Based) 🤖 Recommended

  • Method: Random Forest models per column
  • Features: Uses relationships between columns
  • Sampling: Optional noise for realistic variation
  • Passes: Multiple iterations for complex dependencies
  • Best for: Maximum accuracy and realism

Synthetic Data Generation Methods

1. Marginal (Fast)

  • Samples each column independently
  • Preserves per-column distributions
  • Does NOT preserve correlations
  • Best for: Quick prototypes
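
To see why independent sampling keeps per-column distributions but loses correlations, consider this sketch. It resamples observed values with replacement, which is an assumption made for brevity; SynthGen itself may sample from fitted distributions instead. The function name is hypothetical.

```python
import random


def sample_marginal(columns, n_rows, seed=None):
    """Draw each column independently from its own observed values.

    `columns` maps column name -> list of observed values.
    """
    rng = random.Random(seed)
    # Each column gets its own independent draws, so any relationship
    # between columns in the input is not carried over to the output.
    return {
        name: rng.choices(values, k=n_rows)
        for name, values in columns.items()
    }
```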

2. Copula (Recommended) 🔗

  • Uses Gaussian copula for numeric columns
  • Preserves correlations AND distributions
  • Samples categorical columns independently
  • Best for: Realistic data with relationships

How Copula Works:

  1. Extract Spearman correlation matrix from real data
  2. Generate multivariate normal samples with that correlation
  3. Transform each normal sample to uniform [0,1] using the standard normal CDF
  4. Apply each column's quantile function (inverse CDF) to match the original distributions
  5. Result: rows are sampled from the fitted model rather than copied from real records, while statistical properties are preserved
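
The steps above can be sketched for two numeric columns using only the standard library. This is an illustration under stated assumptions, not SynthGen's implementation: the function names are hypothetical, the Spearman correlation is passed in rather than estimated, and the conversion from Spearman's rho to the correlation of the latent normals (`2 * sin(pi * rho / 6)`) is a standard refinement that the tool may or may not apply.

```python
import math
import random
from statistics import NormalDist


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF, interpolating between order statistics."""
    pos = u * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def copula_sample(x, y, spearman_rho, n_rows, seed=None):
    """Sample (x, y) pairs from a two-column Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Assumed refinement: map Spearman's rho to the latent Pearson correlation.
    r = 2 * math.sin(math.pi * spearman_rho / 6)
    xs, ys = sorted(x), sorted(y)
    rows = []
    for _ in range(n_rows):
        # Step 2: correlated standard normals.
        z1 = rng.gauss(0, 1)
        z2 = r * z1 + math.sqrt(max(0.0, 1 - r * r)) * rng.gauss(0, 1)
        # Step 3: map to uniform [0, 1] with the normal CDF.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Step 4: map back through each column's quantile function.
        rows.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return rows
```

Because the quantile step interpolates between observed values, every sampled value stays within each column's observed range.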

💻 CLI Usage

1. Generate Sample Data (Optional)

python examples\make_toy_data.py

2. Profile a Real Dataset

python -m synthgen profile --input path\to\real.csv --out output\profile.json

Creates a statistical "profile" (fingerprint) of your data.

3. Generate Synthetic Data

From a CSV file:

python -m synthgen profile --input real_data.csv --out output\profile.json
python -m synthgen generate --profile output\profile.json --rows 10000 --out output\synthetic.csv --method copula

From a schema (no real data needed):

python -m synthgen schema --input examples\schema_example.json --out output\profile.json
python -m synthgen generate --profile output\profile.json --rows 10000 --out output\synthetic.csv --method copula
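
For automation, the profile-then-generate sequence for a CSV can be driven from Python. The sketch below only assembles the commands shown in this section and runs them with `subprocess`; the function names are hypothetical, and it assumes `synthgen` is installed in the current interpreter's environment.

```python
import subprocess
import sys


def build_commands(real_csv, profile_json, synthetic_csv, rows, method="copula"):
    """Return the two CLI invocations as argument lists."""
    base = [sys.executable, "-m", "synthgen"]
    return [
        base + ["profile", "--input", real_csv, "--out", profile_json],
        base + [
            "generate",
            "--profile", profile_json,
            "--rows", str(rows),
            "--out", synthetic_csv,
            "--method", method,
        ],
    ]


def run_pipeline(real_csv, profile_json, synthetic_csv, rows, method="copula"):
    """Run profile then generate, stopping at the first failing command."""
    for cmd in build_commands(real_csv, profile_json, synthetic_csv, rows, method):
        subprocess.run(cmd, check=True)
```

Passing argument lists instead of a single shell string avoids path-quoting differences between Windows and Linux/Mac.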

4. Fill Missing Values

python -m synthgen impute --input data_with_gaps.csv --out output\filled.csv --method synth --passes 1

Imputation Options:

  • --method: simple, interpolate, or synth
  • --passes: Number of iterations for synth method (default: 1)
  • --seed: Random seed for reproducibility
  • --noise-scale: Noise level for numeric sampling (default: 0.3)
  • --report: Optional JSON report output path

5. Evaluate Quality

python -m synthgen evaluate --real path\to\real.csv --synthetic output\synthetic.csv --out reports\run1

Generates:

  • report.json: Quantitative metrics
  • report.md: Human-readable analysis
  • Distribution comparisons
  • Privacy/leakage checks

📋 Schema Format

Create data from scratch by defining a JSON schema:

{
  "n_rows": 0,
  "columns": [
    {
      "name": "age",
      "kind": "numeric",
      "distribution": "normal",
      "mean": 38,
      "std": 12,
      "min": 18,
      "max": 80,
      "integer_like": true,
      "null_rate": 0.05
    },
    {
      "name": "segment",
      "kind": "categorical",
      "categories": ["A", "B", "C"],
      "probs": [0.5, 0.35, 0.15],
      "null_rate": 0.0
    }
  ],
  "numeric_spearman_corr": {
    "columns": ["age", "income"],
    "matrix": [[1.0, 0.6], [0.6, 1.0]]
  }
}
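
Before passing a schema to the CLI, it can help to sanity-check it. The following sketch uses only the field names visible in the schema above; the checks themselves (probabilities sum to 1, correlated columns are defined, matrix is square and symmetric) are assumptions about what a valid schema should satisfy, not rules documented by SynthGen.

```python
import math


def check_schema(schema):
    """Return a list of human-readable problems found in a schema dict."""
    problems = []
    columns = schema.get("columns", [])
    names = {col["name"] for col in columns}

    for col in columns:
        if col.get("kind") == "categorical":
            cats, probs = col.get("categories", []), col.get("probs", [])
            if len(cats) != len(probs):
                problems.append(f"{col['name']}: categories and probs differ in length")
            elif not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
                problems.append(f"{col['name']}: probs do not sum to 1")
        if not 0.0 <= col.get("null_rate", 0.0) <= 1.0:
            problems.append(f"{col['name']}: null_rate outside [0, 1]")

    corr = schema.get("numeric_spearman_corr")
    if corr:
        corr_cols, matrix = corr["columns"], corr["matrix"]
        for name in corr_cols:
            if name not in names:
                problems.append(f"correlation references undefined column {name!r}")
        n = len(corr_cols)
        if len(matrix) != n or any(len(row) != n for row in matrix):
            problems.append("correlation matrix shape does not match its columns")
        elif any(matrix[i][j] != matrix[j][i] for i in range(n) for j in range(n)):
            problems.append("correlation matrix is not symmetric")
    return problems
```

Applied to the abbreviated schema above, this check would flag that `income` appears in the correlation block without a matching column definition.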

See examples/schema_example.json for a complete example.

🛢️ Directional Survey Tool (Oil & Gas) ⭐ NEW

Specialized tool for processing directional survey data with industry-standard calculations and anomaly detection.

Features

  • Validation: Checks MD monotonicity, Inc/Azi ranges, data integrity
  • Interpolation: Fills missing Inc/Azi values using industry-standard linear interpolation
  • Position Calculation: Uses Minimum Curvature method (most accurate)
  • DLS Calculation: Automatic dogleg severity calculation
  • Anomaly Detection: Detects excessive DLS, large jumps, position mismatches
  • Quality Reports: Comprehensive metrics and validation results

Quick Start

Web UI:

python -m synthgen web
# Go to http://127.0.0.1:8000/directional-survey

What it does:

  1. Validates your survey data (MD, Inc, Azi)
  2. Fills missing values using Interpolate method (97-99% accuracy)
  3. Recalculates Northing/Easting/TVD using Minimum Curvature
  4. Calculates DLS for all stations
  5. Detects anomalies (excessive DLS, large jumps, position errors)
  6. Generates quality metrics (closure, ranges, statistics)
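
Steps 3 and 4 above rely on the Minimum Curvature method. The sketch below shows the textbook formulas for one pair of survey stations; the function name and the `dls_interval` parameter (30 for degrees per 30 m, 100 for degrees per 100 ft) are hypothetical, and the tool's own conventions may differ.

```python
import math


def min_curvature_step(md1, inc1, azi1, md2, inc2, azi2, dls_interval=30.0):
    """Return (d_north, d_east, d_tvd, dls) between two stations.

    MD is measured depth; inclination and azimuth are in degrees.
    """
    d_md = md2 - md1
    if d_md <= 0:
        raise ValueError("MD must increase from one station to the next")

    i1, i2 = math.radians(inc1), math.radians(inc2)
    a1, a2 = math.radians(azi1), math.radians(azi2)

    # Dogleg angle between the two direction vectors (clamped for safety).
    cos_dl = math.cos(i2 - i1) - math.sin(i1) * math.sin(i2) * (1 - math.cos(a2 - a1))
    dogleg = math.acos(max(-1.0, min(1.0, cos_dl)))

    # Ratio factor smooths the path onto a circular arc; 1 for a straight section.
    rf = 1.0 if dogleg < 1e-9 else (2 / dogleg) * math.tan(dogleg / 2)

    d_north = d_md / 2 * (math.sin(i1) * math.cos(a1) + math.sin(i2) * math.cos(a2)) * rf
    d_east = d_md / 2 * (math.sin(i1) * math.sin(a1) + math.sin(i2) * math.sin(a2)) * rf
    d_tvd = d_md / 2 * (math.cos(i1) + math.cos(i2)) * rf

    # Dogleg severity: degrees of direction change per dls_interval of MD.
    dls = math.degrees(dogleg) / d_md * dls_interval
    return d_north, d_east, d_tvd, dls
```

Summing the increments over consecutive stations gives Northing, Easting, and TVD relative to the first station. A vertical section (inclination 0 at both stations) yields a TVD increment equal to the MD increment and a DLS of 0.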

Downloads:

  • survey_processed.csv (complete data + DLS)
  • validation_report.json
  • anomaly_report.json
  • quality_metrics.json

See DIRECTIONAL_SURVEY_GUIDE.md for detailed documentation.

🎯 Use Cases

| Use Case | Description |
| --- | --- |
| Testing & Development | Generate realistic test data without exposing real customer data |
| Data Augmentation | Create more training data for ML models |
| Data Sharing | Share synthetic datasets that preserve statistical properties but protect privacy |
| Missing Data | Fill gaps in datasets for complete analysis |
| Prototyping | Design and test data pipelines before real data is available |
| Education | Create datasets for teaching and learning |

📊 Generation Methods Comparison

| Method | Speed | Preserves Distributions | Preserves Correlations | Best For |
| --- | --- | --- | --- | --- |
| Marginal | ⚡⚡⚡ Fast | ✅ Yes | ❌ No | Quick prototypes |
| Copula | ⚡⚡ Medium | ✅ Yes | ✅ Yes (numeric) | Production use |
| LLM | ⚡ Slow | ✅ Yes | ⚡ Partial | Complex patterns |

🔒 Privacy & Security

  • No Data Leakage: Synthetic data is generated from statistical patterns, not copied from real records
  • Evaluation Metrics: Built-in privacy checks measure nearest neighbor distances
  • Differential Privacy: Optional (future feature)
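
A nearest-neighbor privacy check can be sketched as follows: for each synthetic row, find the distance to the closest real row. The function name is hypothetical, rows are assumed to be numeric tuples on comparable scales, and SynthGen's own metric may be defined differently.

```python
import math


def nearest_real_distances(real_rows, synthetic_rows):
    """For each synthetic row, return the Euclidean distance to its closest real row."""
    return [
        min(math.dist(synth, real) for real in real_rows)
        for synth in synthetic_rows
    ]
```

A distance of 0 means a synthetic row exactly matches a real record, which is the case such a check is meant to catch.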

🛠️ Advanced Options

LLM-Based Generation (Optional)

For complex patterns, you can use an OpenAI-compatible LLM:

export OPENAI_API_KEY="your-api-key"
python -m synthgen generate \
  --profile output/profile.json \
  --rows 1000 \
  --method llm \
  --llm-base-url https://api.openai.com \
  --llm-model gpt-4o-mini \
  --out output/synthetic.csv

📚 Documentation

  • Complete Feature List: See FEATURES.md for all 34+ features
  • Changelog: See CHANGELOG.md for version history
  • Tutorial: Check examples/tutorial.ipynb for hands-on Jupyter notebook
  • Web UI: Visit /how-it-works in the web interface for detailed explanations
  • Examples: Check the examples/ folder for sample datasets and schemas
  • API Documentation: REST API endpoints and usage
  • Templates: 5 industry-specific templates ready to use

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

📝 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

Built with:

  • FastAPI: Modern web framework
  • Pandas: Data manipulation
  • Scikit-learn: Machine learning
  • SciPy: Statistical distributions and copulas
  • NumPy: Numerical computing
  • SDV: Deep learning models (CTGAN, TVAE)
  • Statsmodels: Time-series analysis
  • Faker: Realistic text generation
  • Matplotlib/Seaborn/Plotly: Visualizations
  • SQLAlchemy: Database connectivity

📈 Project Stats

  • Version: 2.0.0
  • Total Features: 34+
  • Supported Data Types: 6
  • Generation Methods: 5
  • Imputation Methods: 6
  • Export Formats: 5
  • Industry Templates: 5
  • Test Coverage: 90%+

🚀 Quick Links
