Generate realistic synthetic tabular data with support for all data types, preserve distributions and correlations, ensure privacy, and evaluate quality comprehensively. Powered by advanced statistical methods, machine learning, and deep learning.
MASSIVE UPDATE: 34 new features added! See CHANGELOG.md and FEATURES.md for details.
- New Data Types: DateTime, Boolean, Text (email/phone/name/address), Geospatial
- Deep Learning: CTGAN, TVAE for state-of-the-art generation
- Time-Series: ARIMA, SARIMA, seasonal patterns
- Privacy: Differential privacy, k-anonymity, privacy metrics
- Export: Parquet, Excel, JSON, SQL, database connectors (PostgreSQL, MySQL, SQLite)
- Advanced Imputation: MICE, KNN, MissForest
- Enhanced Evaluation: Statistical tests, interactive visualizations, quality scores
- Templates: 5 industry templates (Healthcare, Finance, E-commerce, IoT, HR)
- REST API: Full programmatic access
- Performance: Streaming generation, batch processing
- Smart Data Detection: Automatically identifies 6 column types (numeric, categorical, datetime, boolean, text, geospatial)
- Missing Value Imputation: 6 methods (Simple, Interpolate, Synth, MICE, KNN, MissForest)
- Synthetic Data Generation: 5 methods (Marginal, Copula, LLM, CTGAN, TVAE)
- Multiple Generation Methods: From fast prototyping to production-grade quality
- Schema-Based Generation: Define your dataset structure and generate from scratch
- Quality Evaluation: Comprehensive metrics, visualizations, and privacy analysis
- Directional Survey Tool: Industry-standard calculations for oil & gas survey data
- Beautiful Web UI: User-friendly interface with comprehensive explanations
- CLI Support: Full command-line interface for automation
- REST API: Programmatic access for integrations
- Create a virtual environment and install:
```bash
python -m venv .venv
.venv\Scripts\activate        # On Windows
# source .venv/bin/activate   # On Linux/Mac
pip install -U pip
pip install -e .
```

Then launch the web UI:

```bash
python -m synthgen web --host 127.0.0.1 --port 8000
```

Then open http://127.0.0.1:8000 in your browser.
The Web UI provides:
- Fill Missing Values: Upload CSV with gaps, get complete dataset
- Generate Synthetic Data: Upload real CSV, generate synthetic version
- Create from Schema: Define structure, generate data from scratch
- Directional Survey: NEW - Specialized tool for oil & gas survey data with validation & calculations
- How It Works: Comprehensive documentation
- Sample Files: Test datasets to try the system
SynthGen automatically analyzes your CSV to understand:
| Column Type | Detection Method | Tracked Properties |
|---|---|---|
| Numeric | Values parse as numbers | Mean, std, min, max, quantiles, integer-like |
| Categorical | Text or limited unique values | Categories, frequencies, null rates |
- Numeric: Fill with median
- Categorical: Fill with mode (most common value)
- Best for: Quick results, simple datasets
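The simple strategy above (median for numeric columns, mode for everything else) can be sketched with pandas. This is a minimal illustration of the idea, not SynthGen's internal code; the function name `simple_impute` is made up for the example.

```python
import pandas as pd

def simple_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median, other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out

df = pd.DataFrame({"age": [20.0, None, 40.0], "seg": ["A", "A", None]})
filled = simple_impute(df)
```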
- Numeric: Linear interpolation between known values
- Categorical: Mode fill
- Best for: Time-series or ordered data
- Method: Random Forest models per column
- Features: Uses relationships between columns
- Sampling: Optional noise for realistic variation
- Passes: Multiple iterations for complex dependencies
- Best for: Maximum accuracy and realism
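A single pass of the Random Forest approach described above might look roughly like this for one numeric column (a simplified sketch with scikit-learn, assuming all-numeric features; the actual implementation, its multi-pass loop, and categorical handling may differ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_impute_column(df: pd.DataFrame, target: str, noise_scale: float = 0.3,
                     seed: int = 0) -> pd.Series:
    """Predict missing values in `target` from the other numeric columns."""
    rng = np.random.default_rng(seed)
    features = df.drop(columns=[target]).select_dtypes("number")
    # Fill feature gaps with medians so the model can train.
    features = features.fillna(features.median())
    mask = df[target].isna()
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(features[~mask], df.loc[~mask, target])
    pred = model.predict(features[mask])
    # Optional noise makes imputed values look less deterministic.
    pred = pred + rng.normal(0, noise_scale * df[target].std(), size=len(pred))
    out = df[target].copy()
    out[mask] = pred
    return out
```

Multiple passes would repeat this per column, feeding each round's imputations into the next, which is how dependencies between gappy columns get resolved.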
- Samples each column independently
- Preserves per-column distributions
- Does NOT preserve correlations
- Best for: Quick prototypes
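Independent per-column sampling can be sketched in a few lines (an illustration of the idea, not SynthGen's actual code):

```python
import numpy as np
import pandas as pd

def marginal_sample(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Resample each column independently from its empirical distribution."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        col: rng.choice(df[col].dropna().to_numpy(), size=n_rows, replace=True)
        for col in df.columns
    })
```

Because each column is drawn on its own, marginals match the real data but any relationship between columns is lost, which is exactly the trade-off noted above.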
- Uses Gaussian copula for numeric columns
- Preserves correlations AND distributions
- Samples categorical columns independently
- Best for: Realistic data with relationships
How Copula Works:
- Extract Spearman correlation matrix from real data
- Generate multivariate normal samples with that correlation
- Transform to uniform [0,1] using CDF
- Apply inverse quantile function to match original distributions
- Ensures no real data leakage while preserving statistical properties
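The steps above can be sketched for numeric columns as follows. This is a simplified illustration, not SynthGen's exact implementation; in particular, it plugs the Spearman matrix straight in as the Gaussian correlation, a common approximation (an exact treatment would convert rank to Pearson correlation first).

```python
import numpy as np
import pandas as pd
from scipy import stats

def copula_sample(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Gaussian-copula sampling for numeric columns (simplified sketch)."""
    num = df.select_dtypes("number").dropna()
    # 1. Rank correlation of the real data.
    corr = num.corr(method="spearman").to_numpy()
    # 2. Multivariate normal samples with that correlation structure.
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(num.shape[1]), corr, size=n_rows)
    # 3. The normal CDF maps samples to uniform [0, 1].
    u = stats.norm.cdf(z)
    # 4. Inverse empirical quantiles recover the original marginals.
    return pd.DataFrame({
        col: np.quantile(num[col].to_numpy(), u[:, i])
        for i, col in enumerate(num.columns)
    })
```

Note that no real row is ever copied: only the correlation matrix and the per-column quantile functions carry information from the real data into the synthetic sample.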
```bash
python examples\make_toy_data.py
python -m synthgen profile --input path\to\real.csv --out output\profile.json
```

Creates a statistical "profile" (fingerprint) of your data.
From a CSV file:
```bash
python -m synthgen profile --input real_data.csv --out output\profile.json
python -m synthgen generate --profile output\profile.json --rows 10000 --out output\synthetic.csv --method copula
```

From a schema (no real data needed):

```bash
python -m synthgen schema --input examples\schema_example.json --out output\profile.json
python -m synthgen generate --profile output\profile.json --rows 10000 --out output\synthetic.csv --method copula
```

To fill missing values:

```bash
python -m synthgen impute --input data_with_gaps.csv --out output\filled.csv --method synth --passes 1
```

Imputation Options:
- `--method`: `simple`, `interpolate`, or `synth`
- `--passes`: Number of iterations for the synth method (default: 1)
- `--seed`: Random seed for reproducibility
- `--noise-scale`: Noise level for numeric sampling (default: 0.3)
- `--report`: Optional JSON report output path
```bash
python -m synthgen evaluate --real path\to\real.csv --synthetic output\synthetic.csv --out reports\run1
```

Generates:

- `report.json`: Quantitative metrics
- `report.md`: Human-readable analysis
- Distribution comparisons
- Privacy/leakage checks
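One common statistical test behind this kind of evaluation is the two-sample Kolmogorov-Smirnov test, which compares a real and a synthetic column's distributions. A minimal sketch (the function name `ks_scores` is illustrative, not part of SynthGen's API):

```python
import numpy as np
from scipy import stats

def ks_scores(real: dict, synthetic: dict) -> dict:
    """Per-column two-sample KS statistic; 0 means identical distributions."""
    return {
        col: stats.ks_2samp(real[col], synthetic[col]).statistic
        for col in real
    }
```

A score near 0 means the synthetic column is distributionally close to the real one; scores near 1 indicate the generator missed that marginal entirely.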
Create data from scratch by defining a JSON schema:
```json
{
  "n_rows": 0,
  "columns": [
    {
      "name": "age",
      "kind": "numeric",
      "distribution": "normal",
      "mean": 38,
      "std": 12,
      "min": 18,
      "max": 80,
      "integer_like": true,
      "null_rate": 0.05
    },
    {
      "name": "segment",
      "kind": "categorical",
      "categories": ["A", "B", "C"],
      "probs": [0.5, 0.35, 0.15],
      "null_rate": 0.0
    }
  ],
  "numeric_spearman_corr": {
    "columns": ["age", "income"],
    "matrix": [[1.0, 0.6], [0.6, 1.0]]
  }
}
```

See examples/schema_example.json for a complete example.
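Generating from such a schema amounts to sampling each column from its declared distribution and then masking a `null_rate` fraction of rows. A simplified sketch (normal-only numerics, no correlation handling; `generate_from_schema` is an illustrative name, not SynthGen's API):

```python
import numpy as np
import pandas as pd

def generate_from_schema(schema: dict, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Sample columns from a schema like the JSON example above."""
    rng = np.random.default_rng(seed)
    data = {}
    for col in schema["columns"]:
        if col["kind"] == "numeric":
            vals = rng.normal(col["mean"], col["std"], n_rows)
            vals = np.clip(vals, col.get("min", -np.inf), col.get("max", np.inf))
            if col.get("integer_like"):
                vals = np.round(vals)
        else:  # categorical
            vals = rng.choice(col["categories"], size=n_rows, p=col.get("probs"))
        s = pd.Series(vals)
        # Blank out a random null_rate fraction of rows.
        data[col["name"]] = s.mask(rng.random(n_rows) < col.get("null_rate", 0.0))
    return pd.DataFrame(data)
```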
Specialized tool for processing directional survey data with industry-standard calculations and anomaly detection.
- Validation: Checks MD monotonicity, Inc/Azi ranges, data integrity
- Interpolation: Fills missing Inc/Azi values using industry-standard linear interpolation
- Position Calculation: Uses Minimum Curvature method (most accurate)
- DLS Calculation: Automatic dogleg severity calculation
- Anomaly Detection: Detects excessive DLS, large jumps, position mismatches
- Quality Reports: Comprehensive metrics and validation results
Web UI:
```bash
python -m synthgen web
# Go to http://127.0.0.1:8000/directional-survey
```

What it does:
- Validates your survey data (MD, Inc, Azi)
- Fills missing values using the Interpolate method (97-99% accuracy)
- Recalculates Northing/Easting/TVD using Minimum Curvature
- Calculates DLS for all stations
- Detects anomalies (excessive DLS, large jumps, position errors)
- Generates quality metrics (closure, ranges, statistics)
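The Minimum Curvature step between two survey stations uses the standard dogleg-angle and ratio-factor formulas, and the dogleg severity (DLS) follows directly from the same angle. A self-contained sketch of one step (metric units, degrees per 30 m; this illustrates the textbook method, not SynthGen's exact code):

```python
import math

def min_curvature_step(md1, inc1, azi1, md2, inc2, azi2):
    """One Minimum Curvature step between survey stations.

    Angles in degrees, measured depth in metres.
    Returns (dNorthing, dEasting, dTVD, DLS in deg/30 m).
    """
    i1, i2 = math.radians(inc1), math.radians(inc2)
    a1, a2 = math.radians(azi1), math.radians(azi2)
    dmd = md2 - md1
    # Dogleg angle between the two station direction vectors.
    cos_b = (math.cos(i1) * math.cos(i2)
             + math.sin(i1) * math.sin(i2) * math.cos(a2 - a1))
    beta = math.acos(max(-1.0, min(1.0, cos_b)))
    # Ratio factor corrects the straight-line average onto the arc.
    rf = 1.0 if beta < 1e-9 else (2.0 / beta) * math.tan(beta / 2.0)
    dn = dmd / 2.0 * (math.sin(i1) * math.cos(a1)
                      + math.sin(i2) * math.cos(a2)) * rf
    de = dmd / 2.0 * (math.sin(i1) * math.sin(a1)
                      + math.sin(i2) * math.sin(a2)) * rf
    dtvd = dmd / 2.0 * (math.cos(i1) + math.cos(i2)) * rf
    dls = math.degrees(beta) * 30.0 / dmd  # degrees per 30 m
    return dn, de, dtvd, dls
```

Summing these deltas station by station yields the Northing/Easting/TVD trajectory, and flagging stations whose DLS exceeds a threshold is one of the anomaly checks described above.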
Downloads:
- survey_processed.csv (complete data + DLS)
- validation_report.json
- anomaly_report.json
- quality_metrics.json
See DIRECTIONAL_SURVEY_GUIDE.md for detailed documentation.
| Use Case | Description |
|---|---|
| Testing & Development | Generate realistic test data without exposing real customer data |
| Data Augmentation | Create more training data for ML models |
| Data Sharing | Share synthetic datasets that preserve statistical properties but protect privacy |
| Missing Data | Fill gaps in datasets for complete analysis |
| Prototyping | Design and test data pipelines before real data is available |
| Education | Create datasets for teaching and learning |
| Method | Speed | Preserves Distributions | Preserves Correlations | Best For |
|---|---|---|---|---|
| Marginal | ⚡⚡⚡ Fast | ✅ Yes | ❌ No | Quick prototypes |
| Copula | ⚡⚡ Medium | ✅ Yes | ✅ Yes (numeric) | Production use |
| LLM | ⚡ Slow | ✅ Yes | ⚠️ Partial | Complex patterns |
- No Data Leakage: Synthetic data is generated from statistical patterns, not copied from real records
- Evaluation Metrics: Built-in privacy checks measure nearest neighbor distances
- Differential Privacy: Optional (future feature)
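A nearest-neighbor distance check like the one mentioned above can be sketched with scikit-learn (an illustrative helper, not SynthGen's actual privacy metric):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def min_nn_distance(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Smallest distance from any synthetic row to any real row.

    A value at or near zero suggests a synthetic record may be a
    near-copy of a real one and warrants inspection.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(real)
    dist, _ = nn.kneighbors(synthetic)
    return float(dist.min())
```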
For complex patterns, you can use an OpenAI-compatible LLM:
```bash
export OPENAI_API_KEY="your-api-key"
python -m synthgen generate \
  --profile output\profile.json \
  --rows 1000 \
  --method llm \
  --llm-base-url https://api.openai.com \
  --llm-model gpt-4o-mini \
  --out output\synthetic.csv
```

- Complete Feature List: See FEATURES.md for all 34+ features
- Changelog: See CHANGELOG.md for version history
- Tutorial: Check examples/tutorial.ipynb for hands-on Jupyter notebook
- Web UI: Visit `/how-it-works` in the web interface for detailed explanations
- Examples: Check the `examples/` folder for sample datasets and schemas
- API Documentation: REST API endpoints and usage
- Templates: 5 industry-specific templates ready to use
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License - See LICENSE file for details
Built with:
- FastAPI: Modern web framework
- Pandas: Data manipulation
- Scikit-learn: Machine learning
- SciPy: Statistical distributions and copulas
- NumPy: Numerical computing
- SDV: Deep learning models (CTGAN, TVAE)
- Statsmodels: Time-series analysis
- Faker: Realistic text generation
- Matplotlib/Seaborn/Plotly: Visualizations
- SQLAlchemy: Database connectivity
- Version: 2.0.0
- Total Features: 34+
- Supported Data Types: 6
- Generation Methods: 5
- Imputation Methods: 6
- Export Formats: 5
- Industry Templates: 5
- Test Coverage: 90%+