Senior Data Engineer
Streaming and batch analytics, lakehouse architecture, data governance, synthetic data, and agent-safe SQL.
Portfolio · LinkedIn · Medium · Repositories · Email
Shown when ojasshukla01/ojasshukla01 is public. Profile README docs.
Profile: Overview · Flagship repositories · Recent activity · Contact
Deep dive: Data Forge · Technical stack · Repository index · Online presence · Operating principles · Document notes
Senior data engineer with six years’ experience delivering cloud-native analytics platforms on AWS, GCP, Azure, and Snowflake, with depth in Apache Kafka, dbt, DuckDB, and lakehouse patterns. Focus areas include real-time and batch data pipelines, observability, data governance, and open-source tooling.
| Focus | Project | Description |
|---|---|---|
| Synthetic data | Data Forge | Schema-aware, time-consistent test data for databases, APIs, and pipelines |
| Agent safety | SQLSense | MCP server for guardrailed, audited SQL execution for AI agents |
| Developer tooling | token-doctor | Local-first CLI for tokens, changelogs, deprecation windows, and calendars |
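To illustrate the deprecation-window idea behind token-doctor, a minimal sketch (hypothetical helper name, not the project's actual API) might compute the days remaining before a sunset date:

```python
from datetime import date

def days_until_sunset(sunset: date, today: date) -> int:
    """Days remaining in a deprecation window; negative once the sunset
    has passed. Hypothetical helper for illustration only -- not
    token-doctor's real API."""
    return (sunset - today).days

# Example: a token sunset announced for 2026-06-30, checked on 2026-06-01.
print(days_until_sunset(date(2026, 6, 30), date(2026, 6, 1)))  # 29
```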
```ini
; Machine-readable summary (illustrative)
[profile]
title = senior_data_engineer
domains = streaming, batch, governance, synthetic_data, agent_safety
stores = snowflake, bigquery, duckdb, lakehouse
delivery = airflow, cicd, terraform
```

Representative open-source work. Contributions welcome where repositories are licensed and document how to contribute.
| Project | Summary | Technologies |
|---|---|---|
| OpenCompliance ESG | ESG analytics, PDF reporting, data quality | Streamlit, FastAPI, DuckDB, Python |
| Data Forge | Time-aware synthetic data; DDL and OpenAPI; CDC-style exports | Python, FastAPI, Next.js, DuckDB, warehouses |
| LLM Learning Path Generator | LLM-assisted learning paths and gap analysis | Streamlit, LangChain, DuckDB, OpenAI |
| token-doctor | Token lifecycle debugging, changelogs, sunsets, ICS — local-first | Python, SQLite, CLI |
| SQLSense | Audited, constrained SQL over MCP for software agents | Python, MCP |
| Health Analytics BI Dashboard | Healthcare KPIs and business intelligence patterns | Power BI, analytics |
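To make the guardrail idea behind SQLSense concrete, here is a deliberately rough sketch (assumed logic, not the project's actual implementation) that only admits a single read-only statement:

```python
import re

# Keyword denylist for the sketch; a real guardrail would parse the SQL.
BLOCKED = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Rough guardrail sketch: one statement, must start with SELECT,
    no write/DDL keywords. Illustrative only."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # more than one statement
        return False
    if not stmt.lower().startswith("select"):
        return False
    return not BLOCKED.search(stmt)

print(is_safe_select("SELECT * FROM orders LIMIT 10"))  # True
print(is_safe_select("DROP TABLE orders"))              # False
```

A production guardrail would also enforce row limits, audit every call, and run against a read-only connection, as the project summary above implies.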
Auto-refreshed markdown (GitHub + Medium). Expand How this section is updated for implementation details.
How this section is updated
A GitHub Action runs once per day and rewrites the HTML comment markers in this file. Sources: GraphQL (paginated public repos; one combined query for starred repos + authored PR search; retries on transient failures) and Medium RSS (retries). Pattern from Building a self-updating profile README for GitHub · scripts/build_readme.py · .github/workflows/update-profile-readme.yml · Dependabot.
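The marker-rewrite step can be sketched as follows (marker naming is assumed for illustration; the real logic lives in scripts/build_readme.py):

```python
import re

def replace_block(readme: str, marker: str, new_body: str) -> str:
    """Replace the text between <!-- marker:start --> and <!-- marker:end -->
    HTML comment markers, keeping the markers themselves in place so the
    next run can find them again."""
    pattern = re.compile(
        rf"(<!-- {marker}:start -->)(.*?)(<!-- {marker}:end -->)",
        re.DOTALL,
    )
    # Use a lambda so backslashes in new_body are not treated as escapes.
    return pattern.sub(
        lambda m: f"{m.group(1)}\n{new_body}\n{m.group(3)}", readme
    )

doc = "intro\n<!-- stars:start -->\nold\n<!-- stars:end -->\noutro"
print(replace_block(doc, "stars", "- duckdb/duckdb"))
```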
- ojasshukla01/data-forge
v0.1.0 — 2026-03-15 · release
One row per repository (the most recently updated authored PR in each), so the profile stays easy to scan.
Expand pull request list
- ojasshukla01/data-forge: feat: enhance documentation and CI/CD processes — 2026-03-22 · merged
- raj8github/sqlsense: refactor: clean up imports and exception handling in various modules — 2026-03-04 · merged
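The one-row-per-repository rule could be implemented along these lines (a sketch with an assumed record shape; the real shape is defined in scripts/build_readme.py):

```python
def latest_pr_per_repo(prs):
    """Keep only the most recently updated PR per repository.
    Assumes each record is a dict with 'repo' and 'updated' (ISO date
    string), so string comparison orders dates correctly."""
    latest = {}
    for pr in prs:
        current = latest.get(pr["repo"])
        if current is None or pr["updated"] > current["updated"]:
            latest[pr["repo"]] = pr
    # Newest first, matching the rendered list
    return sorted(latest.values(), key=lambda p: p["updated"], reverse=True)

prs = [
    {"repo": "a/x", "updated": "2026-03-01", "title": "older"},
    {"repo": "a/x", "updated": "2026-03-22", "title": "newer"},
    {"repo": "b/y", "updated": "2026-03-04", "title": "only"},
]
print([p["title"] for p in latest_pr_per_repo(prs)])  # ['newer', 'only']
```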
Tools and references I am tracking (public stars). Repositories already shown under Recent pull requests are omitted here to avoid duplication.
Expand starred repositories
- duckdb/duckdb — 2026-03-03 · starred
- DataExpert-io/data-engineer-handbook — 2026-03-03 · starred
- cloudcommunity/Free-Certifications — 2026-03-03 · starred
- ojasshukla01/data-forge — 2026-03-23
- ojasshukla01/token-doctor — 2026-03-05
- ojasshukla01/sqlsense — 2026-03-04
- ojasshukla01/Torrent_automate — 2026-02-28
- ojasshukla01/data-pipeline — 2026-01-31
- ojasshukla01/llm-learning-path-generator — 2025-10-11
- ojasshukla01/hug-lite — 2025-07-15
- ojasshukla01/auto-map-au — 2025-06-13
Data Forge is a time-aware synthetic data platform: define schemas and rules (DDL, JSON Schema, OpenAPI, domain packs), select generation modes (snapshot, incremental, CDC-style, medallion layers), and export privacy-conscious, relational test data to files and warehouses, with a Next.js front end and Python / FastAPI API. The diagram below is a simplified architecture view (not all adapters are shown).
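The CDC-style mode can be pictured as diffing a base snapshot against an updated state and emitting change events (a hypothetical sketch of the idea, not Data Forge's actual output format):

```python
def cdc_events(snapshot, updated):
    """Diff two keyed row dicts into CDC-style insert/update/delete events.
    The event shape ('op', key, row) is illustrative, not Data Forge's
    real schema."""
    events = []
    for key, row in updated.items():
        if key not in snapshot:
            events.append(("insert", key, row))
        elif snapshot[key] != row:
            events.append(("update", key, row))
    for key in snapshot:
        if key not in updated:
            events.append(("delete", key, None))
    return events

before = {1: {"status": "new"}, 2: {"status": "paid"}}
after = {1: {"status": "shipped"}, 3: {"status": "new"}}
print(cdc_events(before, after))
```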
```mermaid
flowchart TB
  subgraph define[Define]
    SCH[DDL JSON Schema OpenAPI]
    PACK[Domain packs and business rules]
    SCN[Scenarios and reproducible runs]
  end
  subgraph core[Data Forge engine]
    GEN[Generate relational time-consistent rows]
    CDC[Snapshot incremental CDC style flows]
    FX[Drift messiness anomaly profiles]
  end
  subgraph ship[Export and load]
    FILES[Parquet JSON JSONL CSV SQL inserts]
    WH[DuckDB Postgres SQLite Snowflake BigQuery]
  end
  subgraph product[Product surface]
    API[FastAPI service]
    UI[Next.js app]
  end
  subgraph ecosystem[Downstream tooling]
    DBT[dbt seeds and tests]
    GE[Great Expectations]
    AF[Airflow style orchestration]
  end
  SCH --> GEN
  PACK --> GEN
  SCN --> GEN
  GEN --> CDC
  GEN --> FX
  GEN --> FILES
  FILES --> WH
  UI --> API
  API --> GEN
  WH -.-> DBT
  GEN -.-> GE
  FILES -.-> AF
```

| Domain | Technologies |
|---|---|
| Languages | Python, SQL, JavaScript, R, Scala |
| Cloud | Google Cloud, AWS, Microsoft Azure, Snowflake |
| Data and streaming | Apache Spark, Databricks, BigQuery, Apache Kafka, Apache Airflow, dbt, DuckDB |
| Delivery | Docker, Terraform, GitHub Actions, Kubernetes |
- Lakehouse360 — Ingest, transform, data quality; Streamlit, DuckDB, dbt
- Data Engineering Case Studies — Batch and streaming patterns, BigQuery, Airflow, dbt
- auto-map-au (AutoMap360) — Suburb-to-region geospatial reference (AU, NZ, IN), Streamlit QA
- data-pipeline — Pipeline reference implementation
- bharatstream-sql — SQL and analytics backend
- streaming-platform — Video stack with React
- prompt-hub — Prompt management and sharing
- ojas-portfolio — Portfolio site source
- sop_generator_app · sop-generator-frontend — Statement-of-purpose tooling
- web-bases-analysis-intrusion-detection-system — Network intrusion detection analysis
- git-activity-simulator — CLI for synthetic Git activity (demonstrations and learning)
- sql-injection — Security lab (C#)
- Torrent_automate — Automation utilities
- hug-lite — Lightweight Hugging Face–related experiment
| Channel | Link | Description |
|---|---|---|
| Portfolio | Professional site (Vercel) | Case studies, projects, and experience (adding a custom domain in Vercel shortens the visible URL in the hero) |
| Writing | Medium | Data engineering and practice |
Site stack: React, Next.js, Tailwind CSS, Vercel.
- Engineering discipline — Prefer composable designs, explicit contracts, and observability where operational risk warrants it.
- Quality — Automated tests and documentation aligned to system boundaries and user-facing behavior.
- Collaboration — Clear technical writing, constructive code review, and open-source releases when they benefit others.
- Continuous learning — Lakehouse platforms, streaming systems, and safe data access in agent-assisted workflows.
Additional interests: competitive swimming, strategy games, and literature on large language models and data systems.
Open to senior data engineering roles, consulting, technical writing, and mentoring. Based in Sydney, Australia.
Links: same row at the top of this profile (Portfolio, LinkedIn, Medium, Repositories, Email).
Optional support: Buy Me a Coffee.
Ojas Shukla · Senior Data Engineer
GFM and Mermaid only. Recent activity is plain markdown (daily workflow). Pull requests and stars sit in collapsible sections to keep the profile easy to scan.
Technical reference for this file: GitHub profile README for ojasshukla01/ojasshukla01. Rendered with GFM and Mermaid.
Format, manifest, rendering
| Component | Role | Reference |
|---|---|---|
| Diagrams | Architecture | Mermaid on GitHub |
| Markup | Structure | GitHub Flavored Markdown |
| Activity lists | Daily workflow: batched GraphQL (stars + PR search in one call; paginated repos), retries on transient errors, stars exclude PR repos, Medium RSS with retries → marker blocks | scripts/build_readme.py, workflow; Dependabot; pattern from Simon Willison |
Local refresh: `pip install -r requirements.txt`, then `python scripts/build_readme.py --dry-run` loads Medium without a token; the GitHub sections need a token. Full run (PowerShell): `$env:GITHUB_TOKEN = (gh auth token); python scripts/build_readme.py`, or use a PAT with scope to read your data via the GraphQL API. Optional: set `MEDIUM_FEED_URL` to override the Medium RSS URL.
GitHub profile: Pin up to six repositories that match your Flagship table so the grid above the README reinforces the same story.
If you adapt this layout, replace ojasshukla01 in URLs and cite upstream tools you embed.
```json
{
  "kind": "github-profile-readme",
  "repository": "ojasshukla01/ojasshukla01",
  "markdown": "GitHub-Flavored-Markdown",
  "dynamic_assets": [],
  "static_semantics": ["mermaid", "tables", "details-summary", "fenced-code-blocks"]
}
```

```mermaid
flowchart LR
  U[Reader] --> GH[GitHub GFM renderer]
  GH --> T[Tables and prose]
  GH --> M[Mermaid engine]
```