DataHelm is a data engineering framework focused on:
- source ingestion orchestration
- dbt transformation workflows
- notebook-based dashboard execution
- reusable provider connectors (SharePoint, GCS, S3, BigQuery)
- optional local-LLM analytics query scaffolding
- Config-driven ingestion using YAML in `config/api/`
- Dagster orchestration for jobs, schedules, and sensors
- dbt project execution through `analytics/dbt_runner.py` and dbt configs
- Dashboard generation with Dagstermill notebooks
- Reusable handlers/connectors for multiple external providers
- Optional NL-to-SQL module (`analytics/nl_query/`) for local Ollama-based analytics workflows
The repository follows layered responsibilities:
- `handlers/`: provider-specific source connectors and API handlers
- `ingestion/`: ingestion factory + native ingestion implementations
- `analytics/`: dbt, dashboard, and optional NL-query modules
- `dagster_op/`: orchestration objects (jobs, schedules, repository)
- `config/`: all runtime configuration (api, dbt, dashboard, analytics metadata)
- `tests/`: unit tests for handlers, ingestion, analytics, and scripts
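To make the `ingestion/` factory layer concrete, here is a minimal registry-based factory sketch. The names (`register_ingestor`, `run_ingestion`, the `"clashofclans"` source key) are illustrative assumptions, not the actual DataHelm API:

```python
# Illustrative ingestion-factory sketch -- names are assumptions,
# not the actual DataHelm API.
from typing import Callable, Dict

# Registry mapping a source name to a callable that performs ingestion.
_INGESTORS: Dict[str, Callable[[dict], str]] = {}

def register_ingestor(source: str):
    """Decorator that registers an ingestion implementation for a source."""
    def wrap(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        _INGESTORS[source] = fn
        return fn
    return wrap

@register_ingestor("clashofclans")
def ingest_clashofclans(config: dict) -> str:
    # A real implementation would call the provider handler and publish rows.
    return f"ingested {config.get('source_table', 'unknown')}"

def run_ingestion(source: str, config: dict) -> str:
    """Factory entry point: dispatch to the registered implementation."""
    try:
        ingest = _INGESTORS[source]
    except KeyError:
        raise ValueError(f"no ingestor registered for source: {source}")
    return ingest(config)
```

The registry keeps native ingestion implementations decoupled from orchestration code: Dagster jobs only need a source name and its config.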
```
config/
  api/
  dbt/
  dashboard/
  analytics/
analytics/
  dbt_projects/
  notebooks/
  nl_query/
dagster_op/
handlers/
  api/
  sharepoint/
  gcs/
  s3/
  bigquery/
ingestion/
tests/
scripts/
docs/
```

Requirements:

- Python 3.12+
- PostgreSQL (reachable from local environment)
- Optional: Docker, local Ollama, dbt CLI
```
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
```

Create a `.env` file in the repository root with the required values, for example:

```
DB_HOST=${DB_HOST}
DB_PORT=${DB_PORT}
DB_USER=${DB_USER}
DB_PASSWORD=${DB_PASSWORD}
DB_NAME=${DB_NAME}
CLASHOFCLANS_API_TOKEN=${CLASHOFCLANS_API_TOKEN}
```

Start local Dagster development:

```
python scripts/run_dagster_dev.py
```

Useful option:
```
python scripts/run_dagster_dev.py --print-only
```

Configuration is split across `config/` subdirectories:

- `config/api/`: source-level extraction, publish targets, schedules, and column mapping. Example currently included: `CLASHOFCLANS_PLAYER_STATS`.
- `config/dbt/`: dbt units, selection/exclusion rules, vars, and schedules.
- `config/dashboard/`: notebook path, source table mapping, chart columns, and cadence.
- `config/analytics/`: dataset metadata for the isolated NL-to-SQL module.
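To make the shape of the YAML-driven ingestion configs described above concrete, a hypothetical `config/api/` source entry might look like the following. Every key name here is an assumption for illustration, not the actual schema:

```yaml
# Hypothetical source config sketch -- key names are illustrative only.
CLASHOFCLANS_PLAYER_STATS:
  extraction:
    endpoint: players
  publish:
    target_table: player_stats
  schedule:
    cron: "0 6 * * *"
  column_mapping:
    trophies: trophy_count
```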
The repository includes reusable connector classes under `handlers/`:

- `handlers/sharepoint/sharepoint.py` - Microsoft Graph auth + site/file access helpers
- `handlers/gcs/gcs.py` - upload/download/list/delete/signed-URL helpers
- `handlers/s3/s3.py` - upload/download/list/delete/presigned-URL helpers
- `handlers/bigquery/bigquery.py` - query, row fetch, dataframe load, schema helpers
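The GCS and S3 connectors share a similar helper surface (upload/download/list/delete). As a rough sketch of that shared shape, here is a local-filesystem stand-in; the class and method names are assumptions illustrating the pattern, not the real handler API:

```python
from pathlib import Path
from typing import List

class LocalStorageHandler:
    """Toy local-filesystem stand-in mirroring the upload/download/
    list/delete helper surface the storage connectors share
    (illustrative only, not the actual handler API)."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def upload(self, key: str, data: bytes) -> None:
        # A real GCS/S3 handler would write to a bucket object instead.
        (self.root / key).write_bytes(data)

    def download(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def list(self) -> List[str]:
        return sorted(p.name for p in self.root.iterdir() if p.is_file())

    def delete(self, key: str) -> None:
        (self.root / key).unlink()
```

Keeping the provider handlers behind a common surface like this is what lets ingestion code swap storage backends without changes.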
`analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:
- semantic catalog loader
- SQL read-only safety guard
- Ollama client wrapper
- orchestration service
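The read-only safety guard component can be illustrated with a minimal sketch. The function name and the exact rejection rules are assumptions about what such a guard typically checks, not the module's actual implementation:

```python
import re

# Statement keywords that mutate state; a read-only guard rejects them.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Return True only for single SELECT statements (illustrative sketch).

    Rejects multi-statement payloads and any statement containing a
    mutating keyword, so LLM-generated SQL can only read data.
    """
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement payloads
        return False
    if not stripped.lower().startswith("select"):
        return False
    return not _FORBIDDEN.search(stripped)
```

A guard like this is the last line of defense when SQL is produced by a model rather than a human, which is why the suite tests it separately.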
Run all tests:
```
.venv/bin/python -m pytest -q
```

The current suite covers:
- ingestion and handler behavior
- analytics factory and runner logic
- connector modules (SharePoint, GCS, S3, BigQuery)
- script behavior
- NL-query safety and service paths
- `dev`: integration branch
- `master`: release/production branch
Workflows:
- CI: tests on development and PR flows
- Docker Release: image build/publish on `master`
- Deploy Release: `workflow_run`/manual deployment orchestration
Container image is defined via Dockerfile.
Default runtime command starts the Dagster gRPC server:

```
python -m dagster api grpc -m dagster_op.repository
```

Deployment flow is workflow-based:
- production auto-path after successful Docker release
- manual staging/production dispatch path
- Contribution guide: `CONTRIBUTING.md`
- Code of conduct: `CODE_OF_CONDUCT.md`
- Security reporting: `SECURITY.md`
For complete, long-form project documentation (operations, architecture, and runbook-style details), see:
`docs/document.md`
