A pluggable benchmarking framework for evaluating memory and context systems.
- 🔌 Interoperable: mix and match any provider with any benchmark
- 🧩 Bring your own benchmarks: plug in custom datasets and tasks
- ♻️ Checkpointed runs: resume from any pipeline stage (ingest → index → search → answer → evaluate)
- 🆚 Multi‑provider comparison: run the same benchmark across providers side‑by‑side
- 🧪 Judge‑agnostic: swap GPT‑4o, Claude, Gemini, etc. without code changes
- 📊 Structured reports: export run status, failures, and metrics for analysis
- 🖥️ Web UI: inspect runs, questions, and failures interactively, in real time!
## Architecture

```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Benchmarks  │   │  Providers  │   │   Judges    │
│  (LoCoMo,   │   │ (Supermem,  │   │  (GPT-4o,   │
│  LongMem..) │   │  Mem0, Zep) │   │  Claude..)  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┼─────────────────┘
                         ▼
             ┌───────────────────────┐
             │      MemoryBench      │
             └───────────┬───────────┘
                         ▼
┌────────┬──────────┬────────┬────────┬──────────┐
│ Ingest │ Indexing │ Search │ Answer │ Evaluate │
└────────┴──────────┴────────┴────────┴──────────┘
```

## Quick Start

```bash
bun install
cp .env.example .env.local   # Add your API keys
bun run src/index.ts run -p supermemory -b locomo
```

## Environment Variables

Add your keys to `.env.local`:

```
# Providers (at least one)
SUPERMEMORY_API_KEY=
MEM0_API_KEY=
ZEP_API_KEY=

# Judges (at least one)
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
```

## Commands

| Command | Description |
|---|---|
| `run` | Full pipeline: ingest → index → search → answer → evaluate → report |
| `compare` | Run benchmark across multiple providers simultaneously |
| `ingest` | Ingest benchmark data into provider |
| `search` | Run search phase only |
| `test` | Test single question |
| `status` | Check run progress |
| `list-questions` | Browse benchmark questions |
| `show-failures` | Debug failed questions |
| `serve` | Start web UI |
| `help` | Show help (`help providers`, `help models`, `help benchmarks`) |
## Options

| Flag | Description |
|---|---|
| `-p, --provider` | Memory provider (supermemory, mem0, zep) |
| `-b, --benchmark` | Benchmark (locomo, longmemeval, convomem) |
| `-j, --judge` | Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.) |
| `-r, --run-id` | Run identifier (auto-generated if omitted) |
| `-m, --answering-model` | Model for answer generation (default: gpt-4o) |
| `-l, --limit` | Limit number of questions |
| `-q, --question-id` | Specific question (for test command) |
| `--force` | Clear checkpoint and restart |

## Examples

```bash
# Full run
bun run src/index.ts run -p mem0 -b locomo

# With custom run ID
bun run src/index.ts run -p mem0 -b locomo -r my-test

# Resume existing run
bun run src/index.ts run -r my-test

# Limited questions
bun run src/index.ts run -p supermemory -b locomo -l 10

# Different models
bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash

# Compare multiple providers
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5

# Test single question
bun run src/index.ts test -r my-test -q question_42

# Debug
bun run src/index.ts status -r my-test
bun run src/index.ts show-failures -r my-test
```

## Pipeline

1. INGEST: Load benchmark sessions → Push to provider
2. INDEX: Wait for provider indexing
3. SEARCH: Query provider → Retrieve context
4. ANSWER: Build prompt → Generate answer via LLM
5. EVALUATE: Compare to ground truth → Score via judge
6. REPORT: Aggregate scores → Output accuracy + latency

Each phase checkpoints independently. Failed runs resume from the last successful point.
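To make the resume behavior concrete, here is a minimal sketch of per-phase checkpointing. It is illustrative only: the `Checkpoint` shape, the handler signature, and the exact file layout are assumptions, not MemoryBench's actual implementation.

```ts
// Illustrative sketch of phase-level checkpointing (not the project's real code).
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { join } from "node:path";

type Phase = "ingest" | "index" | "search" | "answer" | "evaluate" | "report";
const PHASES: Phase[] = ["ingest", "index", "search", "answer", "evaluate", "report"];

// Assumed checkpoint shape: which phases have finished for a given run.
interface Checkpoint {
  runId: string;
  completed: Phase[];
}

async function loadCheckpoint(runDir: string, runId: string): Promise<Checkpoint> {
  try {
    return JSON.parse(await readFile(join(runDir, "checkpoint.json"), "utf8"));
  } catch {
    return { runId, completed: [] }; // no checkpoint yet: start from scratch
  }
}

export async function runPipeline(
  runId: string,
  handlers: Record<Phase, () => Promise<void>>,
): Promise<void> {
  const runDir = join("data", "runs", runId);
  await mkdir(runDir, { recursive: true });
  const checkpoint = await loadCheckpoint(runDir, runId);

  for (const phase of PHASES) {
    if (checkpoint.completed.includes(phase)) continue; // already done: skip on resume
    await handlers[phase](); // run the phase (may throw and abort the run)
    checkpoint.completed.push(phase); // record success before moving on
    await writeFile(join(runDir, "checkpoint.json"), JSON.stringify(checkpoint, null, 2));
  }
}
```

Because the checkpoint is rewritten after every successful phase, re-invoking `run -r <id>` continues from the first phase that has not yet completed.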
## Run Persistence

Runs persist to `data/runs/{runId}/`:

- `checkpoint.json` - Run state and progress
- `results/` - Search results per question
- `report.json` - Final report

Re-running the same run ID resumes. Use `--force` to restart.
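Because the artifacts are plain files, a finished run can be inspected with a few lines of code. A minimal sketch, assuming `report.json` exposes fields such as `accuracy`, `avgLatencyMs`, and `failures` (the real schema may differ):

```ts
// Read a run's report and print a quick summary (field names are assumptions).
import { readFile } from "node:fs/promises";
import { join } from "node:path";

async function summarize(runId: string): Promise<void> {
  const reportPath = join("data", "runs", runId, "report.json");
  const report = JSON.parse(await readFile(reportPath, "utf8"));

  console.log(`run:        ${runId}`);
  console.log(`accuracy:   ${report.accuracy}`);
  console.log(`latency ms: ${report.avgLatencyMs}`);
  console.log(`failures:   ${report.failures?.length ?? 0}`);
}

await summarize(process.argv[2] ?? "my-test");
```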
## Extending

| Component | Guide |
|---|---|
| Add Provider | src/providers/README.md |
| Add Benchmark | src/benchmarks/README.md |
| Add Judge | src/judges/README.md |
| Project Structure | src/README.md |
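The guides above define the real contracts. As a rough sketch of what adding a provider involves, assuming a plugin needs ingest, indexing-wait, and search hooks (the interface and method names below are illustrative, not the actual API from src/providers/README.md):

```ts
// Hypothetical provider contract; the real one is documented in src/providers/README.md.
interface SearchResult {
  context: string[]; // retrieved passages handed to the answering model
  latencyMs: number; // search latency, reported alongside accuracy
}

interface MemoryProvider {
  name: string;
  ingest(sessions: unknown[], runId: string): Promise<void>; // push benchmark sessions
  awaitIndexing(runId: string): Promise<void>;                // block until searchable
  search(query: string, runId: string): Promise<SearchResult>;
}

// A trivial in-memory provider, handy for wiring and debugging the pipeline.
export class NaiveProvider implements MemoryProvider {
  name = "naive";
  private docs: string[] = [];

  async ingest(sessions: unknown[]): Promise<void> {
    this.docs.push(...sessions.map((s) => JSON.stringify(s)));
  }

  async awaitIndexing(): Promise<void> {
    // nothing to index: documents are already in memory
  }

  async search(query: string): Promise<SearchResult> {
    const start = Date.now();
    const needle = query.toLowerCase();
    const hits = this.docs.filter((d) => d.toLowerCase().includes(needle));
    return { context: hits.slice(0, 5), latencyMs: Date.now() - start };
  }
}
```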
## License

MIT