Designed for software engineers crossing over into AI, this guide focuses on system architecture, deployment patterns, and operational rigor for LLMs, RAG, prompt engineering, agents, and evals.
This guide is for you if:
- You're a senior software engineer (5+ years) moving into AI/ML engineering
- You're preparing for system design interviews at AI-focused companies or big tech AI teams
- You build distributed systems and want to understand how AI components change the design
- You want to go from "I've used the OpenAI API" to "I can design and defend a production AI system"
This guide is NOT for you if:
- You're looking for ML theory or math (read Goodfellow's Deep Learning textbook instead)
- You want paper summaries without practical context
- You're a researcher who needs academic rigor over engineering pragmatism
How transformers work, tokenization, context windows, when to fine-tune vs RAG.
- 01-transformer-intuition — How transformers work, no math
- 02-tokenization — Tokens are money
- 03-attention-mechanisms — Self-attention, KV cache, Flash Attention
- 04-context-windows — Long context tradeoffs
- 05-training-pipeline — RAG vs fine-tune vs prompt (THE decision)
- 06-model-landscape — Model comparison table
- 07-small-language-models — When to use Phi/Gemma instead of GPT-4o
- 08-quantization — INT8/INT4, GGUF, running models on cheap hardware
- 09-fine-tuning — LoRA, QLoRA, when NOT to fine-tune
- 10-distillation-and-pruning — Making models cheaper
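"Tokens are money" is worth internalizing early. As a back-of-the-envelope sketch (the ~4 characters/token ratio is a common heuristic for English text, and the prices here are illustrative placeholders, not current list prices — real counting uses the model's actual tokenizer, e.g. tiktoken):

```python
# Rough token/cost estimator. The 4-chars-per-token ratio is a common rule of
# thumb for English; real tokenizers vary by model and language. Prices below
# are illustrative placeholders, not current list prices.

def estimate_tokens(text: str) -> int:
    """Approximate token count via the ~4 chars/token heuristic."""
    return max(1, len(text) // 4)

def estimate_cost(prompt: str, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one call, given per-million-token prices."""
    in_tokens = estimate_tokens(prompt)
    return (in_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

prompt = "Summarize the following document section by section. " * 100
cost = estimate_cost(prompt, output_tokens=500,
                     in_price_per_m=2.50, out_price_per_m=10.00)
```

Even this crude math makes cost conversations concrete: a 100x jump in prompt length is a 100x jump in input spend.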
Chain-of-thought (CoT), structured generation, prompt optimization, injection defense.
- 01-prompting-patterns — Zero-shot to Tree of Thought
- 02-context-engineering — The underrated skill that separates good from great
- 03-structured-generation — Instructor, JSON mode, Outlines
- 04-prompt-optimization — DSPy, meta-prompting, eval-driven
- 05-prompt-security — Injection attacks and defenses
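The core idea behind structured-generation libraries like Instructor is "parse, validate, retry on failure." A minimal stdlib-only sketch of that loop's validation half (the `Ticket` schema and allowed priorities are hypothetical; Instructor does this with Pydantic models plus automatic re-prompting):

```python
import json
from dataclasses import dataclass

# Hypothetical schema for illustration; real setups use Pydantic models.
@dataclass
class Ticket:
    title: str
    priority: str

def parse_ticket(raw: str) -> Ticket:
    """Parse model output into a typed object, failing loudly on drift.
    A raised error is the signal to re-prompt the model with the message."""
    data = json.loads(raw)  # raises on non-JSON output
    if data.get("priority") not in {"low", "medium", "high"}:
        raise ValueError(f"bad priority: {data.get('priority')!r}")
    return Ticket(title=data["title"], priority=data["priority"])

ticket = parse_ticket('{"title": "Login broken", "priority": "high"}')
```

The point is that validation errors become feedback to the model, not exceptions for your users.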
The complete RAG stack: chunking, embeddings, vector DBs, hybrid search, advanced patterns.
- 01-rag-fundamentals — What/why/when, the naive pipeline
- 02-embedding-models — MTEB, dimensions, Matryoshka
- 03-vector-indexing — HNSW vs IVF, FAISS
- 04-vector-databases — Decision matrix, cost at scale
- 05-chunking-strategies — THE key lever most teams get wrong
- 06-hybrid-search — Dense + BM25 + RRF
- 07-reranking — Cross-encoders, Cohere, two-stage
- 08-query-transformation — HyDE, multi-query, decomposition
- 09-advanced-rag-patterns — GraphRAG, Agentic RAG, Self-RAG, CRAG
- 10-multimodal-rag — ColPali, PDFs with tables and images
- 11-rag-evaluation — RAGAS, debug flowchart
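The RRF step in hybrid search is small enough to show in full. A sketch of Reciprocal Rank Fusion merging a dense ranking with a BM25 ranking (k=60 is the constant from the original RRF paper; the doc IDs are made up):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. dense retrieval + BM25) with RRF.
    Each document scores sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # ranked by embedding similarity
bm25  = ["doc_b", "doc_d", "doc_a"]   # ranked by keyword match
fused = reciprocal_rank_fusion([dense, bm25])
```

Note that RRF only needs ranks, not scores, which is why it fuses retrievers with incomparable scoring scales so cleanly.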
ReAct, tool use, MCP, LangGraph, multi-agent systems, memory.
- 01-agent-fundamentals — ReAct, perception-action loop, failure modes
- 02-tool-use-and-function-calling — OpenAI vs Claude vs Gemini formats
- 03-mcp-protocol — Full MCP, server code, security
- 04-langchain-overview — What it does well and where it falls short
- 05-langgraph-deep-dive — Stateful graphs, persistence, human-in-the-loop
- 06-dspy-framework — Compile, don't prompt
- 07-crewai-and-autogen — Honest assessment of multi-agent frameworks
- 08-llamaindex-haystack — Data frameworks vs orchestration frameworks
- 09-multi-agent-systems — When you actually need multiple agents
- 10-memory-and-state — Memory tiers, Mem0, Zep, checkpointing
- 11-agentic-patterns — Reflection, map-reduce, DAG patterns
- 12-browser-and-computer-use — Playwright, Claude Computer Use
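The agent loop underneath all of these frameworks is the same few lines: the model either requests a tool call or answers, and a hard step limit guards against the runaway-loop failure mode. A sketch with a mocked model (`fake_model` stands in for a real chat-completions call with function calling; the weather tool and message shapes are illustrative, not any provider's exact format):

```python
import json

# Illustrative tool registry; real systems describe tools via JSON schemas.
TOOLS = {
    "get_weather": lambda city: f"22C and sunny in {city}",
}

def fake_model(messages: list[dict]) -> dict:
    """Pretend LLM: requests a tool once, then answers from its result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather",
                              "arguments": json.dumps({"city": "Paris"})}}
    return {"content": "It is 22C and sunny in Paris."}

def run_agent(user_msg: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):  # hard step limit: a key failure-mode guard
        reply = fake_model(messages)
        if "tool_call" not in reply:
            return reply["content"]  # model chose to answer
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**json.loads(call["arguments"]))
        messages.append({"role": "tool", "content": result})
    return "step limit reached"

answer = run_agent("What's the weather in Paris?")
```

Everything the frameworks add — state graphs, persistence, human approval gates — wraps this loop rather than replacing it.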
How to actually measure if your system works: RAGAS, LLM-as-judge, production eval.
- 01-eval-fundamentals — Why eval is hard, the eval pipeline
- 02-retrieval-and-rag-eval — Precision@K, MRR, NDCG, RAGAS
- 03-llm-as-judge — Pointwise vs pairwise, calibration
- 04-agent-and-e2e-eval — Task completion, A/B testing, continuous eval
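The retrieval metrics above are short enough to implement directly, which is also the best way to remember what they measure. A sketch with binary relevance (the doc IDs are made up; graded-relevance NDCG generalizes the same formula):

```python
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """DCG of the top-k (binary gains) divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
```

Precision@K ignores order within the top K, MRR only cares about the first hit, and NDCG rewards putting relevant documents higher — which is why they disagree, and why you track more than one.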
Observability, guardrails, caching, inference infra, cost optimization.
- 01-observability-and-tracing — LangSmith vs Langfuse, what to log
- 02-guardrails-and-safety — Defense-in-depth, NeMo, LlamaGuard, Presidio
- 03-caching-strategies — Multi-layer caching, semantic cache
- 04-inference-infrastructure — GPU table, vLLM vs TGI, auto-scaling
- 05-drift-and-monitoring — Drift types, detection, remediation
- 06-mlops-for-llms — CI/CD, prompt versioning, blue-green
- 07-cost-optimization — Token optimization, model routing, batch
Interview framework, 5 full case studies, 30 practice problems, 60+ conceptual questions.
- 01-interview-framework — The 45-min structure. Worth the whole repo.
- 02-design-patterns-catalog — Full catalog with decision tree
- 03-architecture-templates — 6 reference architectures with cost models
- 04-case-enterprise-rag — Full worked design: enterprise knowledge base
- 05-case-code-assistant — Full worked design: GitHub Copilot-style
- 06-case-customer-support — Full worked design: support automation
- 07-case-doc-intelligence — Full worked design: document understanding
- 08-case-search-engine — Full worked design: AI-powered search
- 09-practice-problems — 30 problems with solution skeletons
- 10-conceptual-questions — 60+ questions with full conversational answers
Model pricing, glossary, cost formulas, essential papers.
- model-pricing-reference — Current pricing for all major models
- glossary — Terms defined in plain English
- cost-estimation-formulas — Spreadsheet-ready formulas
- essential-papers — The 20 papers worth reading
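The spreadsheet-ready cost formula reduces to one line of arithmetic. A sketch with illustrative numbers (the prices and traffic figures below are placeholders, not current rates):

```python
# Monthly cost model:
#   cost = requests * (in_tokens * in_price + out_tokens * out_price) / 1e6
# Prices are per million tokens; all numbers below are illustrative.

def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float,
                 days: int = 30) -> float:
    per_request = (in_tokens * in_price_per_m
                   + out_tokens * out_price_per_m) / 1_000_000
    return requests_per_day * days * per_request

# e.g. 10k requests/day, 2k-token prompts, 500-token outputs
cost = monthly_cost(10_000, 2_000, 500,
                    in_price_per_m=2.50, out_price_per_m=10.00)
```

Running this before a design review turns "RAG with long contexts" from a vibe into a dollar figure — and makes levers like caching, model routing, and prompt trimming easy to compare.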
Working implementations: RAG pipeline, LangGraph agent, MCP server, eval pipeline.
- 01-basic-rag — Minimal RAG in 100 lines
- 02-advanced-rag — Hybrid search + reranking
- 03-langgraph-agent — Stateful agent with tools
- 04-mcp-server — Working MCP server
- 05-eval-pipeline — RAGAS + LLM-as-judge
- 06-semantic-cache — Semantic caching with Redis
- 07-structured-output — Instructor + Pydantic
The guide is intentionally opinionated. If you disagree with a recommendation, open an issue with your reasoning and production evidence. PRs welcome for:
- Factual errors or outdated information (especially model specs and pricing)
- Missing failure modes from your production experience
