Certification Challenge Project — AI Engineering Bootcamp Cohort 8
An intelligent question-answering system for GDELT (Global Database of Events, Language, and Tone) documentation, powered by Retrieval-Augmented Generation.
Core Docs
- 🏗️ CLAUDE.md — canonical developer guide for this repository
- 📋 Full Deliverables — complete certification submissions
- 📊 Task Rubric — 100-point grading breakdown
- 🧠 docs/initial-architecture.md — early conceptual design (frozen)
Architecture Documentation (Auto-Generated) Located in the architecture/ directory — produced by the Claude Agent SDK Analyzer:
| File | Purpose |
|---|---|
architecture/README.md | Top-level architecture summary and system lifecycle |
architecture/docs/01_component_inventory.md | Detailed module inventory |
architecture/docs/03_data_flows.md | Ingestion → retrieval → evaluation flows |
architecture/docs/04_api_reference.md | Public API reference |
architecture/diagrams/02_architecture_diagrams.md | Layered system & runtime dependency diagrams |
🧠 Note:
docs/initial-architecture.mdcaptures the original design sketch before automation and is no longer updated.
The currentarchitecture/tree is generated automatically from the live codebase.
- Python 3.11 +
uvpackage managerOPENAI_API_KEY(required)COHERE_API_KEY(optional — reranking)
git clone https://github.com/aie8-cert-challenge/gdelt-knowledge-base.git cd cert-challenge uv venv --python 3.11 source .venv/bin/activate # Linux/WSL/Mac # .venv\Scripts\activate # Windows uv pip install -e .cp .env.example .env # Edit .env with: # OPENAI_API_KEY=your_key_here # COHERE_API_KEY=optional# Interactive LangGraph Studio UI uv add langgraph-cli[inmem] uv run langgraph dev --allow-blocking # → http://localhost:2024 # Studio: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024 # CLI evaluation python scripts/run_eval_harness.py # or make eval # Quick validation make validateThis system follows a 5-layer architecture:
| Layer | Purpose | Key Modules |
|---|---|---|
| Configuration | External services (OpenAI, Qdrant, Cohere) | src/config.py |
| Data | Ingestion + persistence (HF datasets + manifest) | src/utils/ |
| Retrieval | Multi-strategy search (naive, BM25, ensemble, rerank) | src/retrievers.py |
| Orchestration | LangGraph workflows (retrieve → generate) | src/graph.py, src/state.py |
| Execution | Scripts and LangGraph Server entrypoints | scripts/, app/graph_app.py |
Design Principles
- Factory pattern → deferred initialization of retrievers/graphs
- Singleton pattern → resource caching (
@lru_cache) - Strategy pattern → interchangeable retriever implementations
See the generated diagrams in architecture/diagrams/02_architecture_diagrams.md.
- Load 12 QA pairs (golden testset)
- Load 38 source docs from Hugging Face
- Create Qdrant vector store
- Build 4 retriever strategies
- Execute 48 RAG queries
- Evaluate with RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall)
- Persist results →
deliverables/evaluation_evidence/
Cost ≈ $5 – 6 per run / 20–30 min.
Every major stage in this project — from data ingestion to evaluation runs — is signed by machine-readable manifests to ensure traceability, data integrity, and AI assistant accountability.
Records the creation of intermediate artifacts (raw → interim → HF datasets) with:
- Environment fingerprint (
langchain,ragas,pyarrow, etc.) - Source & golden testset paths and SHA-256 hashes
- Quick schema preview of extracted columns
- Hugging Face lineage metadata linking to:
dwb2023/gdelt-rag-sources-v2dwb2023/gdelt-rag-golden-testset-v2
This ensures that every RAGAS evaluation run references a reproducible, signed dataset snapshot.
Documents the execution of each evaluation:
- Models:
gpt-4.1-mini(RAG generation),text-embedding-3-small - Retrievers: naive, BM25, ensemble, cohere_rerank
- Deterministic configuration (
temperature=0) - Metrics: Faithfulness, Answer Relevancy, Context Precision, Context Recall
- SHA links back to the ingestion manifest for complete lineage
Together, these manifests create a verifiable audit trail between:
source PDFs → testset → vector store → evaluation results
By preserving SHA-256 fingerprints and HF lineage, this mechanism “signs” the dataset state — keeping automated assistants and evaluation scripts consistent, comparable, and honest.
| Component | Technology | Purpose |
|---|---|---|
| LLM | OpenAI GPT-4.1-mini | Deterministic RAG generation |
| Embeddings | text-embedding-3-small | 1536-dim semantic vectors |
| Vector DB | Qdrant | Fast cosine search |
| Orchestration | LangGraph 0.6.7 + LangChain 0.3.19 + | Graph-based workflows |
| Evaluation | RAGAS 0.2.10 (pinned) | Stable evaluation API |
| Monitoring | LangSmith | LLM trace observability |
| Data | Hugging Face Datasets | Reproducible versioned sources |
| UI | LangGraph Studio UI | Prototype chat interface |
Source of truth:
deliverables/evaluation_evidence/RUN_MANIFEST.json(auto-generated each run).
| Retriever | Faithfulness | Answer Relevancy | Context Precision | Context Recall | Avg |
|---|---|---|---|---|---|
| Cohere Rerank | 95.8% | 94.8% | 93.1% | 96.7% | 95.1% |
| Ensemble | 93.4% | 94.6% | 87.5% | 98.8% | 93.6% |
| BM25 | 94.2% | 94.8% | 85.8% | 98.8% | 93.4% |
| Naive | 94.0% | 94.4% | 88.5% | 98.8% | 93.9% |
Note on retriever terminology: The current codebase implements 4 retrievers (naive, bm25, ensemble, cohere_rerank). In evaluation reporting and published datasets, "baseline" may appear as an alias for the "naive" retriever, which serves as the performance comparison reference using simple dense vector search.
Provenance: dataset paths & SHA-256 fingerprints are recorded in data/interim/manifest.json. Reproducibility: these numbers are written by the evaluation run into RUN_MANIFEST.json. TLDR summary: data/processed/comparative_ragas_results.csv
This project publishes 4 datasets to HuggingFace Hub for reproducibility and benchmarking.
Scientific Value: These datasets provide the first publicly available evaluation suite for GDELT-focused RAG systems, enabling reproducible benchmarking of retrieval strategies with complete evaluation transparency.
1. dwb2023/gdelt-rag-sources-v2 - 38 GDELT documentation pages
- Content: GDELT GKG 2.1 architecture docs, knowledge graph construction guides, Baltimore Bridge Collapse case study
- Format: Parquet (analytics), JSONL (human-readable), HF Datasets (fast loading)
- Schema:
page_content(1.5k-5.2k chars),metadata(author, title, page, creation_date, etc.) - Use: Populate vector stores, document chunking experiments, GDELT research
- License: Apache 2.0
2. dwb2023/gdelt-rag-golden-testset-v2 - 12 QA pairs
- Content: Synthetically generated questions (RAGAS 0.2.10), ground truth answers, reference contexts
- Topics: GDELT data formats, Translingual features (65 languages), date extraction, proximity context, emotions
- Schema:
user_input(question),reference_contexts(ground truth passages),reference(answer),synthesizer_name - Use: Benchmark RAG systems using RAGAS metrics, validate retrieval performance
- License: Apache 2.0
3. dwb2023/gdelt-rag-evaluation-inputs - 60 evaluation records
- Content: Consolidated RAGAS inputs from 5 retrieval strategies (baseline, naive, BM25, ensemble, cohere_rerank)
- Schema:
retriever,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name - Use: Benchmark new retrievers, analyze retrieval quality, reproduce certification results, debug RAG pipelines
- License: Apache 2.0
4. dwb2023/gdelt-rag-evaluation-metrics - 60 evaluation records with RAGAS scores
- Content: Detailed RAGAS evaluation results with per-question metric scores
- Schema: All evaluation-inputs fields PLUS
faithfulness,answer_relevancy,context_precision,context_recall(float64, 0-1) - Key Findings: Cohere Rerank winner (95.08% avg), Baseline (93.92% avg), Best Precision: Cohere (+4.55% vs baseline)
- Use: Performance analysis, error analysis, train retrieval models with RAGAS scores as quality labels, RAG evaluation research
- License: Apache 2.0
Why These Datasets Matter:
- Reproducibility: Complete evaluation pipeline with versioned datasets and SHA-256 checksums
- Benchmarking: Standard testset for comparing retrieval strategies across 4 RAGAS metrics
- Quality Labels: RAGAS scores serve as training labels for learning-to-rank models
- Domain-Specific: GDELT knowledge graph QA pairs rare in existing RAG datasets
- Evaluation Transparency: Full evaluation inputs + metrics for analysis and debugging
- Multi-Format: Parquet (analytics), JSONL (human-readable), HF Datasets (fast loading)
Research Applications:
- RAG Researchers: Benchmark retrieval strategies, analyze failure modes, validate hypotheses
- GDELT Analysts: Build Q&A systems, train domain-specific embeddings, extend to other GDELT resources
- Evaluation Researchers: Study RAGAS behavior, compare automatic vs human metrics, develop new methodologies
- Educators: Teach RAG best practices, demonstrate comparative analysis, illustrate data provenance
from datasets import load_dataset # Load evaluation datasets eval_ds = load_dataset("dwb2023/gdelt-rag-evaluation-datasets") # Filter by retriever cohere_evals = eval_ds['train'].filter(lambda x: x['retriever'] == 'cohere_rerank') print(f"Cohere Rerank: {len(cohere_evals)} examples") # Load detailed results with metrics results_ds = load_dataset("dwb2023/gdelt-rag-detailed-results") # Analyze performance by retriever import pandas as pd df = results_ds['train'].to_pandas() performance = df.groupby('retriever')[ ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] ].mean() print(performance)Output:
faithfulness answer_relevancy context_precision context_recall retriever baseline 0.9351 0.9335 0.9459 0.9410 bm25 0.9462 0.9583 0.9519 0.9511 cohere_rerank 0.9508 0.9321 0.9670 0.9668 ensemble 0.9424 0.9542 0.9477 0.9486 naive 0.9351 0.9335 0.9459 0.9410 If you use these datasets in your research, please cite:
@misc{branson2025gdelt-rag-datasets, author = {Branson, Don}, title = {GDELT RAG Evaluation Datasets: Benchmarking Retrieval Strategies for Knowledge Graph Q\&A}, year = {2025}, publisher = {HuggingFace}, howpublished = {\url{https://huggingface.co/dwb2023}}, note = {Datasets: gdelt-rag-sources-v2, gdelt-rag-golden-testset-v2, gdelt-rag-evaluation-inputs, gdelt-rag-evaluation-metrics} } @article{myers2025gdelt, title={Talking to GDELT Through Knowledge Graphs}, author={Myers, A. and Vargas, M. and Aksoy, S. G. and Joslyn, C. and Wilson, B. and Burke, L. and Grimes, T.}, journal={arXiv preprint arXiv:2503.07584v3}, year={2025} }Provenance Chain:
- Source: arXiv:2503.07584v3 "Talking to GDELT Through Knowledge Graphs" (PDF)
- Extraction: PyMuPDFLoader (page-level chunking)
- Testset Generation: RAGAS 0.2.10 synthetic data generation
- Evaluation: GPT-4.1-mini (LLM), text-embedding-3-small (embeddings), Cohere rerank-v3.5
- Validation: SHA-256 checksums in
data/interim/manifest.json
Quality Guarantees:
- ✅ RAGAS 0.2.10 schema validation
- ✅ SHA-256 fingerprints for data integrity
- ✅ Manifest tracking (timestamps, model versions, package versions)
- ✅ 100% validation pass rate (
make validate) - ✅ Apache 2.0 licensed (open access)
Versioning: -v2 suffix indicates second iteration after fresh ingestion. Pin to specific revision for reproducibility: load_dataset("dwb2023/gdelt-rag-sources-v2", revision="abc123")
Known Limitations:
- Domain-specific (GDELT documentation), may not generalize to other domains
- Synthetic questions (RAGAS-generated, not human-authored)
- English-only (despite GDELT's multilingual capabilities)
- Small scale (12 evaluation questions - sufficient for comparative analysis, not large-scale benchmarking)
- Model bias (RAGAS metrics computed using GPT-4, inherits model biases)
- Temporal snapshot (based on GDELT documentation as of January 2025)
Dataset Cards: See HuggingFace for complete metadata and schemas
| Path | Purpose |
|---|---|
src/ | Core modular RAG framework (config, retrievers, graph, state, utils) |
scripts/ | Executable workflows for data ingestion, evaluation, and validation |
architecture/ | Auto-generated architecture snapshots (Claude Agent SDK Analyzer) |
├── 00_README.md | System overview and lifecycle summary |
├── docs/ | Component inventory, data flows, and API reference |
└── diagrams/ | Mermaid dependency and system diagrams |
docs/ | Certification artifacts and legacy design documentation |
└── initial-architecture.md | Original hand-drawn architecture sketch (frozen, not updated) |
data/ | Complete RAG dataset lineage and provenance chain (Parquet-first) |
├── raw/ | Original GDELT PDFs |
├── interim/ | Extracted text, Hugging Face datasets, and manifest fingerprints |
│ ├── manifest.json → Ingestion provenance manifest (dataset lineage, SHA-256 hashes) | |
├── processed/ | Working data (Parquet, ZSTD compressed) - evaluation results + RUN_MANIFEST.json |
└── deliverables/ | Derived data (CSV for human review, regenerable via make deliverables) |
├── evaluation_evidence/ | Human-readable CSV files generated from data/processed/ |
│ └── RUN_MANIFEST.json → Copied from data/processed/ | |
app/ | Lightweight LangGraph API (graph_app.py) and entrypoint |
deliverables/ | High-level evaluation reports and comparative analyses |
Makefile | Task automation for environment setup, validation, and architecture snapshots |
docker-compose.yml | Local container configuration for LangGraph + Qdrant stack |
I created a Claude Agent SDK based multi-agent process to create reproducible architecture snapshots. (Still a prototype, but has already helped refine my architectural workflow)
python -m ra_orchestrators.architecture_orchestrator "GDELT architecture"Generates comprehensive repository analysis in ra_output/:
- Component inventory
- Architecture diagrams
- Data flow analysis
- API documentation
- Final synthesis
| Path | Purpose |
|---|---|
ra_agents/ | Individual agent definitions for design and architecture workflows |
ra_orchestrators/ | Multi-agent orchestration logic coordinating Claude SDK agents |
ra_tools/ | Tools that extend agent capabilities via MCP and external APIs |
ra_output/ | Generated artifacts and agentic outputs (documentation drafts, diagrams, etc.) |
| Pattern | Purpose | Example |
|---|---|---|
| Factory | Deferred initialization | create_retrievers() / build_graph() |
| Singleton | Cached resources (@lru_cache) | get_llm(), get_qdrant() |
| Strategy | Swap retrieval algorithms | 4 retrievers share .invoke() API |
Apache 2.0 — see LICENSE
Contact: Don Branson (dwb2023) – AI Engineering Bootcamp Cohort 8
This repository distinguishes between historical design artifacts and current architecture snapshots:
docs/initial-architecture.md→ conceptual blueprint (frozen)architecture/→ live system documentation (auto-generated)
This separation ensures long-term clarity and traceability of architectural evolution.
