Senior Data Engineer
Streaming and batch analytics, lakehouse architecture, data governance, synthetic data, and agent-safe SQL.
Portfolio · LinkedIn · Medium · Repositories · Email
Shown when ojasshukla01/ojasshukla01 is public. Profile README docs.
Profile: Overview · Flagship repositories · Recent activity · Contact
Deep dive: Data Forge · Technical stack · Repository index · Online presence · Operating principles · Document notes
Senior data engineer with six years’ experience delivering cloud-native analytics platforms on AWS, GCP, Azure, and Snowflake, with depth in Apache Kafka, dbt, DuckDB, and lakehouse patterns. Focus areas include real-time and batch data pipelines, observability, data governance, and open-source tooling.
| Focus | Project | Description |
|---|---|---|
| Synthetic data | Data Forge | Schema-aware, time-consistent test data for databases, APIs, and pipelines |
| Agent safety | SQLSense | MCP server for guardrailed, audited SQL execution for AI agents |
| Developer tooling | token-doctor | Local-first CLI for tokens, changelogs, deprecation windows, and calendars |
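To illustrate the deprecation-window idea behind token-doctor, a minimal sketch (hypothetical helper name, not the project's actual API) might compute the days remaining before a sunset date:

```python
from datetime import date

def days_until_sunset(sunset: date, today: date) -> int:
    """Days remaining in a deprecation window; negative once the sunset
    has passed. Hypothetical helper for illustration only -- not
    token-doctor's real API."""
    return (sunset - today).days

# Example: a token sunset announced for 2026-06-30, checked on 2026-06-01.
print(days_until_sunset(date(2026, 6, 30), date(2026, 6, 1)))  # 29
```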
```ini
; Machine-readable summary (illustrative)
[profile]
title = senior_data_engineer
domains = streaming, batch, governance, synthetic_data, agent_safety
stores = snowflake, bigquery, duckdb, lakehouse
delivery = airflow, cicd, terraform
```

Representative open-source work. Contributions welcome where repositories are licensed and document how to contribute.
| Project | Summary | Technologies |
|---|---|---|
| OpenCompliance ESG | ESG analytics, PDF reporting, data quality | Streamlit, FastAPI, DuckDB, Python |
| Data Forge | Time-aware synthetic data; DDL and OpenAPI; CDC-style exports | Python, FastAPI, Next.js, DuckDB, warehouses |
| LLM Learning Path Generator | LLM-assisted learning paths and gap analysis | Streamlit, LangChain, DuckDB, OpenAI |
| token-doctor | Token lifecycle debugging, changelogs, sunsets, ICS — local-first | Python, SQLite, CLI |
| SQLSense | Audited, constrained SQL over MCP for software agents | Python, MCP |
| Health Analytics BI Dashboard | Healthcare KPIs and business intelligence patterns | Power BI, analytics |
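To make the guardrail idea behind SQLSense concrete, here is a deliberately rough sketch (assumed logic, not the project's actual implementation) that only admits a single read-only statement:

```python
import re

# Keyword denylist for the sketch; a real guardrail would parse the SQL.
BLOCKED = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Rough guardrail sketch: one statement, must start with SELECT,
    no write/DDL keywords. Illustrative only."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # more than one statement
        return False
    if not stmt.lower().startswith("select"):
        return False
    return not BLOCKED.search(stmt)

print(is_safe_select("SELECT * FROM orders LIMIT 10"))  # True
print(is_safe_select("DROP TABLE orders"))              # False
```

A production guardrail would also enforce row limits, audit every call, and run against a read-only connection, as the project summary above implies.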
Auto-refreshed markdown (GitHub + Medium). Expand How this section is updated for implementation details.
How this section is updated
A GitHub Action runs once per day and rewrites the HTML comment markers in this file. Sources: GraphQL (paginated public repos; one combined query for starred repos + authored PR search; retries on transient failures) and Medium RSS (retries). Pattern from Building a self-updating profile README for GitHub · scripts/build_readme.py · .github/workflows/update-profile-readme.yml · Dependabot.
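The marker-rewrite step can be sketched as follows (marker naming is assumed for illustration; the real logic lives in scripts/build_readme.py):

```python
import re

def replace_block(readme: str, marker: str, new_body: str) -> str:
    """Replace the text between <!-- marker:start --> and <!-- marker:end -->
    HTML comment markers, keeping the markers themselves in place so the
    next run can find them again."""
    pattern = re.compile(
        rf"(<!-- {marker}:start -->)(.*?)(<!-- {marker}:end -->)",
        re.DOTALL,
    )
    # Use a lambda so backslashes in new_body are not treated as escapes.
    return pattern.sub(
        lambda m: f"{m.group(1)}\n{new_body}\n{m.group(3)}", readme
    )

doc = "intro\n<!-- stars:start -->\nold\n<!-- stars:end -->\noutro"
print(replace_block(doc, "stars", "- duckdb/duckdb"))
```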
- ojasshukla01/data-forge
v0.1.0 — 2026-03-15 · release
One row per repository (the most recently updated authored PR in each), so the profile stays easy to scan.
Expand pull request list
- ojasshukla01/data-forge: feat: enhance documentation and CI/CD processes — 2026-03-22 · merged
- raj8github/sqlsense: refactor: clean up imports and exception handling in various modules — 2026-03-04 · merged
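The one-row-per-repository rule could be implemented along these lines (a sketch with an assumed record shape; the real shape is defined in scripts/build_readme.py):

```python
def latest_pr_per_repo(prs):
    """Keep only the most recently updated PR per repository.
    Assumes each record is a dict with 'repo' and 'updated' (ISO date
    string), so string comparison orders dates correctly."""
    latest = {}
    for pr in prs:
        current = latest.get(pr["repo"])
        if current is None or pr["updated"] > current["updated"]:
            latest[pr["repo"]] = pr
    # Newest first, matching the rendered list
    return sorted(latest.values(), key=lambda p: p["updated"], reverse=True)

prs = [
    {"repo": "a/x", "updated": "2026-03-01", "title": "older"},
    {"repo": "a/x", "updated": "2026-03-22", "title": "newer"},
    {"repo": "b/y", "updated": "2026-03-04", "title": "only"},
]
print([p["title"] for p in latest_pr_per_repo(prs)])  # ['newer', 'only']
```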
Tools and references I am tracking (public stars). Repositories already shown under Recent pull requests are omitted here to avoid duplication.
Expand starred repositories
- duckdb/duckdb — 2026-03-03 · starred
- DataExpert-io/data-engineer-handbook — 2026-03-03 · starred
- cloudcommunity/Free-Certifications — 2026-03-03 · starred
- ojasshukla01/data-forge — 2026-03-23
- ojasshukla01/token-doctor — 2026-03-05
- ojasshukla01/sqlsense — 2026-03-04
- ojasshukla01/Torrent_automate — 2026-02-28
- ojasshukla01/data-pipeline — 2026-01-31
- ojasshukla01/llm-learning-path-generator — 2025-10-11
- ojasshukla01/hug-lite — 2025-07-15
- ojasshukla01/auto-map-au — 2025-06-13
Data Forge is a time-aware synthetic data platform: define schemas and rules (DDL, JSON Schema, OpenAPI, domain packs), select generation modes (snapshot, incremental, CDC-style, medallion layers), and export privacy-conscious, relational test data to files and warehouses, with a Next.js front end and Python / FastAPI API. The diagram below is a simplified architecture view (not all adapters are shown).
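The CDC-style mode can be pictured as diffing a base snapshot against an updated state and emitting change events (a hypothetical sketch of the idea, not Data Forge's actual output format):

```python
def cdc_events(snapshot, updated):
    """Diff two keyed row dicts into CDC-style insert/update/delete events.
    The event shape ('op', key, row) is illustrative, not Data Forge's
    real schema."""
    events = []
    for key, row in updated.items():
        if key not in snapshot:
            events.append(("insert", key, row))
        elif snapshot[key] != row:
            events.append(("update", key, row))
    for key in snapshot:
        if key not in updated:
            events.append(("delete", key, None))
    return events

before = {1: {"status": "new"}, 2: {"status": "paid"}}
after = {1: {"status": "shipped"}, 3: {"status": "new"}}
print(cdc_events(before, after))
```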
```mermaid
flowchart TB
  subgraph define[Define]
    SCH[DDL JSON Schema OpenAPI]
    PACK[Domain packs and business rules]
    SCN[Scenarios and reproducible runs]
  end
  subgraph core[Data Forge engine]
    GEN[Generate relational time-consistent rows]
    CDC[Snapshot incremental CDC style flows]
    FX[Drift messiness anomaly profiles]
  end
  subgraph ship[Export and load]
    FILES[Parquet JSON JSONL CSV SQL inserts]
    WH[DuckDB Postgres SQLite Snowflake BigQuery]
  end
  subgraph product[Product surface]
    API[FastAPI service]
    UI[Next.js app]
  end
  subgraph ecosystem[Downstream tooling]
    DBT[dbt seeds and tests]
    GE[Great Expectations]
    AF[Airflow style orchestration]
  end
  SCH --> GEN
  PACK --> GEN
  SCN --> GEN
  GEN --> CDC
  GEN --> FX
  GEN --> FILES
  FILES --> WH
  UI --> API
  API --> GEN
  WH -.-> DBT
  GEN -.-> GE
  FILES -.-> AF
```

| Domain | Technologies |
|---|---|
| Languages | Python, SQL, JavaScript, R, Scala |
| Cloud | Google Cloud, AWS, Microsoft Azure, Snowflake |
| Data and streaming | Apache Spark, Databricks, BigQuery, Apache Kafka, Apache Airflow, dbt, DuckDB |
| Delivery | Docker, Terraform, GitHub Actions, Kubernetes |
- Lakehouse360 — Ingest, transform, data quality; Streamlit, DuckDB, dbt
- Data Engineering Case Studies — Batch and streaming patterns, BigQuery, Airflow, dbt
- auto-map-au (AutoMap360) — Suburb-to-region geospatial reference (AU, NZ, IN), Streamlit QA
- data-pipeline — Pipeline reference implementation
- bharatstream-sql — SQL and analytics backend
- streaming-platform — Video stack with React
- prompt-hub — Prompt management and sharing
- ojas-portfolio — Portfolio site source
- sop_generator_app · sop-generator-frontend — Statement-of-purpose tooling
- web-bases-analysis-intrusion-detection-system — Network intrusion detection analysis
- git-activity-simulator — CLI for synthetic Git activity (demonstrations and learning)
- sql-injection — Security lab (C#)
- Torrent_automate — Automation utilities
- hug-lite — Lightweight Hugging Face–related experiment
| Channel | Link | Description |
|---|---|---|
| Portfolio | Professional site (Vercel) | Case studies, projects, and experience (adding a custom domain in Vercel shortens the visible URL in the hero) |
| Writing | Medium | Data engineering and practice |
Site stack: React, Next.js, Tailwind CSS, Vercel.
- Engineering discipline — Prefer composable designs, explicit contracts, and observability where operational risk warrants it.
- Quality — Automated tests and documentation aligned to system boundaries and user-facing behavior.
- Collaboration — Clear technical writing, constructive code review, and open-source releases when they benefit others.
- Continuous learning — Lakehouse platforms, streaming systems, and safe data access in agent-assisted workflows.
Additional interests: competitive swimming, strategy games, and literature on large language models and data systems.
Open to senior data engineering roles, consulting, technical writing, and mentoring. Based in Sydney, Australia.
Links: same row at the top of this profile (Portfolio, LinkedIn, Medium, Repositories, Email).
Optional support: Buy Me a Coffee.
Ojas Shukla · Senior Data Engineer
GFM and Mermaid only. Recent activity is plain markdown (daily workflow). Pull requests and stars sit in collapsible sections to keep the profile easy to scan.
Technical reference for this file: GitHub profile README for ojasshukla01/ojasshukla01. Rendered with GFM and Mermaid.
Format, manifest, rendering
| Component | Role | Reference |
|---|---|---|
| Diagrams | Architecture | Mermaid on GitHub |
| Markup | Structure | GitHub Flavored Markdown |
| Activity lists | Daily workflow: batched GraphQL (stars + PR search in one call; paginated repos), retries on transient errors, stars exclude PR repos, Medium RSS with retries → marker blocks | scripts/build_readme.py, workflow; Dependabot; pattern from Simon Willison |
Local refresh: `pip install -r requirements.txt`, then `python scripts/build_readme.py --dry-run` loads Medium without a token; the GitHub sections need a token. Full run (PowerShell): `$env:GITHUB_TOKEN = (gh auth token); python scripts/build_readme.py`, or use a PAT with scope to read your data via the GraphQL API. Optional: set `MEDIUM_FEED_URL` to override the Medium RSS URL.
GitHub profile: Pin up to six repositories that match your Flagship table so the grid above the README reinforces the same story.
If you adapt this layout, replace ojasshukla01 in URLs and cite upstream tools you embed.
```json
{
  "kind": "github-profile-readme",
  "repository": "ojasshukla01/ojasshukla01",
  "markdown": "GitHub-Flavored-Markdown",
  "dynamic_assets": [],
  "static_semantics": ["mermaid", "tables", "details-summary", "fenced-code-blocks"]
}
```

```mermaid
flowchart LR
  U[Reader] --> GH[GitHub GFM renderer]
  GH --> T[Tables and prose]
  GH --> M[Mermaid engine]
```