
Contents

Profile: Overview · Flagship repositories · Recent activity · Contact

Deep dive: Data Forge · Technical stack · Repository index · Online presence · Operating principles · Document notes


Overview

Senior data engineer with six years’ experience delivering cloud-native analytics platforms on AWS, GCP, Azure, and Snowflake, with depth in Apache Kafka, dbt, DuckDB, and lakehouse patterns. Focus areas include real-time and batch data pipelines, observability, data governance, and open-source tooling.

| Focus | Project | Description |
| --- | --- | --- |
| Synthetic data | Data Forge | Schema-aware, time-consistent test data for databases, APIs, and pipelines |
| Agent safety | SQLSense | MCP server for guardrailed, audited SQL execution for AI agents |
| Developer tooling | token-doctor | Local-first CLI for tokens, changelogs, deprecation windows, and calendars |
```ini
; Machine-readable summary (illustrative)
[profile]
title = senior_data_engineer
domains = streaming, batch, governance, synthetic_data, agent_safety
stores = snowflake, bigquery, duckdb, lakehouse
delivery = airflow, cicd, terraform
```
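The summary above is plain INI, so it stays machine-readable with nothing beyond the standard library. A minimal sketch (assuming Python's stdlib `configparser`, which treats `;` lines as comments by default):

```python
import configparser

# The illustrative machine-readable summary, verbatim as an INI string.
SUMMARY = """
; Machine-readable summary (illustrative)
[profile]
title = senior_data_engineer
domains = streaming, batch, governance, synthetic_data, agent_safety
stores = snowflake, bigquery, duckdb, lakehouse
delivery = airflow, cicd, terraform
"""

parser = configparser.ConfigParser()
parser.read_string(SUMMARY)

# Split the comma-separated values into lists for downstream use.
profile = {key: [item.strip() for item in value.split(",")]
           for key, value in parser["profile"].items()}
print(profile["domains"])
# → ['streaming', 'batch', 'governance', 'synthetic_data', 'agent_safety']
```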

Flagship repositories

Representative open-source work. Contributions welcome where repositories are licensed and document how to contribute.

| Project | Summary | Technologies |
| --- | --- | --- |
| OpenCompliance ESG | ESG analytics, PDF reporting, data quality | Streamlit, FastAPI, DuckDB, Python |
| Data Forge | Time-aware synthetic data; DDL and OpenAPI; CDC-style exports | Python, FastAPI, Next.js, DuckDB, warehouses |
| LLM Learning Path Generator | LLM-assisted learning paths and gap analysis | Streamlit, LangChain, DuckDB, OpenAI |
| token-doctor | Token lifecycle debugging, changelogs, sunsets, ICS; local-first | Python, SQLite, CLI |
| SQLSense | Audited, constrained SQL over MCP for software agents | Python, MCP |
| Health Analytics BI Dashboard | Healthcare KPIs and business intelligence patterns | Power BI, analytics |

Recent activity

Auto-refreshed markdown (GitHub + Medium). Expand "How this section is updated" for implementation details.

How this section is updated

A GitHub Action runs once per day and rewrites the HTML comment markers in this file. Sources: GraphQL (paginated public repos; one combined query for starred repos + authored PR search; retries on transient failures) and Medium RSS (retries). Pattern from Building a self-updating profile README for GitHub · scripts/build_readme.py · .github/workflows/update-profile-readme.yml · Dependabot.
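The marker-rewriting step can be sketched as a regex substitution between paired HTML comments. The marker name below (`RELEASES`) is hypothetical; the real names live in `scripts/build_readme.py`:

```python
import re

def replace_between_markers(readme: str, name: str, new_body: str) -> str:
    """Replace the text between <!-- name:START --> and <!-- name:END -->.

    Everything outside the markers is left untouched, so the static
    parts of the README survive each daily rewrite.
    """
    pattern = re.compile(
        rf"(<!-- {name}:START -->)(.*?)(<!-- {name}:END -->)",
        re.DOTALL,
    )
    return pattern.sub(rf"\g<1>\n{new_body}\n\g<3>", readme)

readme = "intro\n<!-- RELEASES:START -->\nold\n<!-- RELEASES:END -->\nfooter"
updated = replace_between_markers(readme, "RELEASES", "- (newest release line here)")
print(updated)
```

Keeping the markers in the committed file means a dry run with no API data still produces a valid README.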

Latest GitHub releases

Latest on Medium

Recent pull requests

One row per repository (the most recently updated PR I authored in each), so the profile stays easy to scan.

Expand pull request list

Recently starred repositories

Tools and references I am tracking (public stars). Repositories already shown under Recent pull requests are omitted here to avoid duplication.

Expand starred repositories

Recently updated public repositories


Data Forge platform model

Data Forge is a time-aware synthetic data platform: define schemas and rules (DDL, JSON Schema, OpenAPI, domain packs), select generation modes (snapshot, incremental, CDC-style, medallion layers), and export privacy-conscious, relational test data to files and warehouses, with a Next.js front end and Python / FastAPI API. The diagram below is a simplified architecture view (not all adapters are shown).

```mermaid
flowchart TB
  subgraph define[Define]
    SCH[DDL JSON Schema OpenAPI]
    PACK[Domain packs and business rules]
    SCN[Scenarios and reproducible runs]
  end
  subgraph core[Data Forge engine]
    GEN[Generate relational time-consistent rows]
    CDC[Snapshot incremental CDC style flows]
    FX[Drift messiness anomaly profiles]
  end
  subgraph ship[Export and load]
    FILES[Parquet JSON JSONL CSV SQL inserts]
    WH[DuckDB Postgres SQLite Snowflake BigQuery]
  end
  subgraph product[Product surface]
    API[FastAPI service]
    UI[Next.js app]
  end
  subgraph ecosystem[Downstream tooling]
    DBT[dbt seeds and tests]
    GE[Great Expectations]
    AF[Airflow style orchestration]
  end
  SCH --> GEN
  PACK --> GEN
  SCN --> GEN
  GEN --> CDC
  GEN --> FX
  GEN --> FILES
  FILES --> WH
  UI --> API
  API --> GEN
  WH -.-> DBT
  GEN -.-> GE
  FILES -.-> AF
```
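"Time-consistent" means generated timestamps respect causality across related tables. A minimal stdlib-only sketch (the schema and field names are hypothetical, not Data Forge's API): every order's timestamp falls after its customer's signup, and rows are then wrapped as CDC-style insert events.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(42)  # seeded, so a scenario is reproducible

def generate(n_customers: int = 3, orders_per_customer: int = 2):
    """Generate relational rows where child timestamps follow parents."""
    base = datetime(2024, 1, 1)
    customers, orders = [], []
    for cid in range(n_customers):
        signup = base + timedelta(days=rng.randint(0, 30))
        customers.append({"customer_id": cid, "signup_at": signup})
        for oid in range(orders_per_customer):
            # Orders are placed 1-720 hours AFTER signup, never before.
            placed = signup + timedelta(hours=rng.randint(1, 720))
            orders.append({"order_id": (cid, oid),
                           "customer_id": cid,
                           "placed_at": placed})
    return customers, orders

customers, orders = generate()
signup_by_id = {c["customer_id"]: c["signup_at"] for c in customers}

# The invariant that makes joins across tables historically plausible:
assert all(o["placed_at"] > signup_by_id[o["customer_id"]] for o in orders)

# CDC-style flow: each row becomes an op-coded change event,
# the shape a change-data-capture consumer would expect.
events = [{"op": "I", "table": "orders", "row": o} for o in orders]
```

The same op-code envelope extends to `U`/`D` events for incremental runs.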

Technical stack

| Domain | Technologies |
| --- | --- |
| Languages | Python, SQL, JavaScript, R, Scala |
| Cloud | Google Cloud, AWS, Microsoft Azure, Snowflake |
| Data and streaming | Apache Spark, Databricks, BigQuery, Apache Kafka, Apache Airflow, dbt, DuckDB |
| Delivery | Docker, Terraform, GitHub Actions, Kubernetes |

Repository index

Data platforms and pipelines

Applications and tooling

Archive and experiments


Online presence

| Channel | Link | Description |
| --- | --- | --- |
| Portfolio | Professional site (Vercel) | Case studies, projects, and experience |
| Writing | Medium | Data engineering and practice |

Site stack: React, Next.js, Tailwind CSS, Vercel.


Operating principles

  • Engineering discipline — Prefer composable designs, explicit contracts, and observability where operational risk warrants it.
  • Quality — Automated tests and documentation aligned to system boundaries and user-facing behavior.
  • Collaboration — Clear technical writing, constructive code review, and open-source releases when they benefit others.
  • Continuous learning — Lakehouse platforms, streaming systems, and safe data access in agent-assisted workflows.

Additional interests: Competitive swimming, strategy games, large language models and data systems literature.


Contact

Open to senior data engineering roles, consulting, technical writing, and mentoring. Based in Sydney, Australia.

Links: same row at the top of this profile (Portfolio, LinkedIn, Medium, Repositories, Email).

Optional support: Buy Me a Coffee.


Ojas Shukla · Senior Data Engineer

GFM and Mermaid only. Recent activity is plain markdown (daily workflow). Pull requests and stars sit in collapsible sections to keep the profile easy to scan.
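The collapsible sections are plain GFM `<details>`/`<summary>` blocks wrapped around marker comments; a minimal sketch (the marker name is hypothetical, the real names live in `scripts/build_readme.py`):

```markdown
<details>
<summary>Expand pull request list</summary>

<!-- PR_LIST:START -->
(generated rows land here)
<!-- PR_LIST:END -->

</details>
```

The blank line after `</summary>` matters: without it, GitHub renders the generated markdown inside as literal text.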


Document notes

Technical reference for this file: GitHub profile README for ojasshukla01/ojasshukla01. Rendered with GFM and Mermaid.

Format, manifest, rendering

| Component | Role | Reference |
| --- | --- | --- |
| Diagrams | Architecture | Mermaid on GitHub |
| Markup | Structure | GitHub Flavored Markdown |
| Activity lists | Daily workflow: batched GraphQL (stars + PR search in one call; paginated repos), retries on transient errors, stars exclude PR repos, Medium RSS with retries → marker blocks | scripts/build_readme.py, workflow; Dependabot; pattern from Simon Willison |

Local refresh: `pip install -r requirements.txt`, then `python scripts/build_readme.py --dry-run` refreshes the Medium section without a token; the GitHub sections need one. Full run (PowerShell): `$env:GITHUB_TOKEN = (gh auth token); python scripts/build_readme.py`, or use a PAT with scope to read your data via the GraphQL API. Optional: set `MEDIUM_FEED_URL` to override the Medium RSS URL.

GitHub profile: Pin up to six repositories that match your Flagship table so the grid above the README reinforces the same story.

If you adapt this layout, replace ojasshukla01 in URLs and cite upstream tools you embed.

```json
{
  "kind": "github-profile-readme",
  "repository": "ojasshukla01/ojasshukla01",
  "markdown": "GitHub-Flavored-Markdown",
  "dynamic_assets": [],
  "static_semantics": ["mermaid", "tables", "details-summary", "fenced-code-blocks"]
}
```
```mermaid
flowchart LR
  U[Reader] --> GH[GitHub GFM renderer]
  GH --> T[Tables and prose]
  GH --> M[Mermaid engine]
```
