Skip to content

supermemoryai/memorybench

Repository files navigation

MemoryBench

A pluggable benchmarking framework for evaluating memory and context systems.

original

Features

  • 🔌 Interoperable: mix and match any provider with any benchmark
  • 🧩 Bring your own benchmarks: plug in custom datasets and tasks
  • ♻️ Checkpointed runs: resume from any pipeline stage (ingest → index → search → answer → evaluate)
  • 🆚 Multi‑provider comparison: run the same benchmark across providers side‑by‑side
  • 🧪 Judge‑agnostic: swap GPT‑4o, Claude, Gemini, etc. without code changes
  • 📊 Structured reports: export run status, failures, and metrics for analysis
  • 🖥️ Web UI: inspect runs, questions, and failures interactively, in real-time!
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Benchmarks │ │ Providers │ │ Judges │ │ (LoCoMo, │ │ (Supermem, │ │ (GPT-4o, │ │ LongMem..) │ │ Mem0, Zep) │ │ Claude..) │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────────────────┼──────────────────┘ ▼ ┌───────────────────────┐ │ MemoryBench │ └───────────┬───────────┘ ▼ ┌────────┬─────────┬────────┬──────────┬────────┐ │ Ingest │ Indexing│ Search │ Answer │Evaluate│ └────────┴─────────┴────────┴──────────┴────────┘ 

Quick Start

bun install cp .env.example .env.local # Add your API keys bun run src/index.ts run -p supermemory -b locomo

Configuration

# Providers (at least one) SUPERMEMORY_API_KEY= MEM0_API_KEY= ZEP_API_KEY= # Judges (at least one) OPENAI_API_KEY= ANTHROPIC_API_KEY= GOOGLE_API_KEY=

Commands

Command Description
run Full pipeline: ingest → index → search → answer → evaluate → report
compare Run benchmark across multiple providers simultaneously
ingest Ingest benchmark data into provider
search Run search phase only
test Test single question
status Check run progress
list-questions Browse benchmark questions
show-failures Debug failed questions
serve Start web UI
help Show help (help providers, help models, help benchmarks)

Options

-p, --provider Memory provider (supermemory, mem0, zep) -b, --benchmark Benchmark (locomo, longmemeval, convomem) -j, --judge Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.) -r, --run-id Run identifier (auto-generated if omitted) -m, --answering-model Model for answer generation (default: gpt-4o) -l, --limit Limit number of questions -q, --question-id Specific question (for test command) --force Clear checkpoint and restart 

Examples

# Full run bun run src/index.ts run -p mem0 -b locomo # With custom run ID bun run src/index.ts run -p mem0 -b locomo -r my-test # Resume existing run bun run src/index.ts run -r my-test # Limited questions bun run src/index.ts run -p supermemory -b locomo -l 10 # Different models bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash # Compare multiple providers bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5 # Test single question bun run src/index.ts test -r my-test -q question_42 # Debug bun run src/index.ts status -r my-test bun run src/index.ts show-failures -r my-test

Pipeline

1. INGEST Load benchmark sessions → Push to provider 2. INDEX Wait for provider indexing 3. SEARCH Query provider → Retrieve context 4. ANSWER Build prompt → Generate answer via LLM 5. EVALUATE Compare to ground truth → Score via judge 6. REPORT Aggregate scores → Output accuracy + latency 

Each phase checkpoints independently. Failed runs resume from last successful point.

Checkpointing

Runs persist to data/runs/{runId}/:

  • checkpoint.json - Run state and progress
  • results/ - Search results per question
  • report.json - Final report

Re-running same ID resumes. Use --force to restart.

Extending

Component Guide
Add Provider src/providers/README.md
Add Benchmark src/benchmarks/README.md
Add Judge src/judges/README.md
Project Structure src/README.md

License

MIT

About

Unified benchmark for evaluating conversational memory and RAG across multiple datasets

Topics

Resources

License

Stars

Watchers

Forks

Languages