
DataHelm is a data engineering framework focused on the following:

- Source ingestion and orchestration
- dbt transformation workflows
- Notebook-based dashboard execution
- Reusable provider connectors (SharePoint, GCS, S3, and BigQuery)
- Optional local LLM analytics query scaffolding

![DataHelm Architecture](https://github.com/DevStrikerTech/datahelm/blob/master/docs/architecture.png?raw=true)

ingestion/
tests/
scripts/
docs/
```

## Local Setup

### Prerequisites

- Python 3.12+
- PostgreSQL (accessible from the local environment)
- Optional: Docker, local Ollama, dbt CLI

### Installation

Run the following commands to set up the local environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

### Environment Variables

Create a `.env` file in the repository root with the required values, for example:

```text
DB_HOST=${DB_HOST}
DB_PORT=${DB_PORT}
DB_USER=${DB_USER}
CLASHOFCLANS_API_TOKEN=${CLASHOFCLANS_API_TOKEN}
```

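The `${VAR}` placeholders above should be replaced with real values locally. Loading is commonly handled by a package such as `python-dotenv`; purely as an illustrative stdlib sketch of the idea (this is not the repository's loader):

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments, no quoting rules."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Keep any value already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip())
```

A real loader should also handle quoting and export prefixes, which is why `python-dotenv` is the usual choice.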
### Run Dagster Locally

To start Dagster locally, run:

```bash
python scripts/run_dagster_dev.py
```

For a quick verification without executing jobs, run:

```bash
python scripts/run_dagster_dev.py --print-only
```

## Configuration Model

### Ingestion Config (`config/api/*.yaml`)

Defines source-level extraction, publish targets, schedules, and column mapping.

Example currently included:

- `CLASHOFCLANS_PLAYER_STATS`

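The real schema is defined by the repository's config files. Purely as a hypothetical sketch of the shape a source entry could take (every key below is illustrative, not the actual schema):

```yaml
# Hypothetical shape only -- consult config/api/*.yaml for the real schema.
source: CLASHOFCLANS_PLAYER_STATS
schedule: "0 6 * * *"         # cron cadence for extraction
publish:
  target: postgres            # destination for extracted rows
  table: player_stats
columns:
  tag: player_tag             # source field -> destination column
  trophies: trophies
```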
### dbt Config (`config/dbt/projects.yaml`)

Defines dbt units, selection/exclusion rules, vars, and schedules.

### Dashboard Config (`config/dashboard/projects.yaml`)

Defines notebook path, source table mapping, chart columns, and cadence.

### Analytics Semantic Config (`config/analytics/semantic_catalog.yaml`)

Defines dataset metadata for the isolated NL-to-SQL module.
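As a hypothetical illustration of what dataset metadata for NL-to-SQL could look like (names and keys below are invented, not the repository's actual catalog format):

```yaml
# Hypothetical shape only -- see config/analytics/semantic_catalog.yaml.
datasets:
  - name: player_stats
    description: "Daily Clash of Clans player statistics"
    columns:
      - name: trophies
        description: "Current trophy count"
      - name: player_tag
        description: "Unique player identifier"
```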

## Reusable Connectors

The repository includes reusable connector classes under `handlers/`:

- `handlers/sharepoint/sharepoint.py` – Microsoft Graph auth + site/file access helpers
- `handlers/gcs/gcs.py` – Upload/download/list/delete/signed URL helpers
- `handlers/s3/s3.py` – Upload/download/list/delete/presigned URL helpers
- `handlers/bigquery/bigquery.py` – Query, row fetch, dataframe load, schema helpers
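The object-store connectors share a similar surface (upload/download/list/delete). As a hypothetical sketch of that shared pattern -- not the repository's actual classes -- an in-memory stand-in might look like:

```python
class InMemoryObjectStore:
    """Toy stand-in sketching the upload/download/list/delete surface
    that the GCS/S3 handlers expose. Method names are illustrative."""

    def __init__(self):
        self._objects = {}

    def upload(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def download(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str = "") -> list:
        # Real connectors list bucket keys; here we filter a dict.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)
```

Keeping the interface uniform is what lets ingestion code swap GCS for S3 (or a test double like this one) without changes.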

## Local LLM Analytics Module

`analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:

- Semantic catalog loader
- SQL read-only safety guard
- Ollama client wrapper
- Orchestration service
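The read-only safety guard matters because generated SQL is executed against a live database. The repository's implementation is not shown here; a minimal sketch of the idea (reject anything that is not a single SELECT-style statement) could look like:

```python
import re

# Keywords that indicate a write or DDL statement.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Very rough guard: single statement, starts with SELECT/WITH,
    and contains no write/DDL keywords. A real guard should parse the SQL."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # multiple statements
        return False
    if not re.match(r"(?i)\s*(select|with)\b", stripped):
        return False
    return not FORBIDDEN.search(stripped)
```

A production guard would use a SQL parser and a read-only database role rather than keyword matching, which is easy to evade.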

## Testing

Run all tests:

```bash
.venv/bin/python -m pytest -q
```

The current test suite includes coverage for:

- Ingestion and handler behavior
- Analytics factory and runner logic
- Connector modules (SharePoint, GCS, S3, BigQuery)
- Script behavior
- NL-query safety and service paths

## CI/CD and Branching

- `dev`: integration branch
- `master`: release/production branch

Workflows:

- **CI**: tests on development and PR flows
- **Docker Release**: image build/publish on `master`
- **Deploy Release**: workflow_run/manual deployment orchestration

## Containerization

Container image is defined via `Dockerfile`.

Default runtime command starts the Dagster gRPC server:

```bash
python -m dagster api grpc -m dagster_op.repository
```

Deployment flow is workflow-based:

- Production auto-path after successful Docker release
- Manual staging/production dispatch path

## Contributing and Governance

- Contribution guide: `CONTRIBUTING.md`
- Code of conduct: `CODE_OF_CONDUCT.md`
- Security reporting: `SECURITY.md`

## Detailed Technical Documentation

For complete, long-form project documentation (operations, architecture, and runbook-style details), see:

- `docs/document.md`