A pluggable benchmarking framework for evaluating memory and context systems.
- 🔌 Interoperable: mix and match any provider with any benchmark
- 🧩 Bring your own benchmarks: plug in custom datasets and tasks
- ♻️ Checkpointed runs: resume from any pipeline stage (ingest → index → search → answer → evaluate)
- 🆚 Multi‑provider comparison: run the same benchmark across providers side‑by‑side
- 🧪 Judge‑agnostic: swap GPT‑4o, Claude, Gemini, etc. without code changes
- 📊 Structured reports: export run status, failures, and metrics for analysis
- 🖥️ Web UI: inspect runs, questions, and failures interactively, in real time!
## Architecture

```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Benchmarks  │   │  Providers  │   │   Judges    │
│  (LoCoMo,   │   │ (Supermem,  │   │  (GPT-4o,   │
│  LongMem..) │   │  Mem0, Zep) │   │  Claude..)  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┼─────────────────┘
                         ▼
             ┌───────────────────────┐
             │      MemoryBench      │
             └───────────┬───────────┘
                         ▼
┌────────┬──────────┬────────┬────────┬──────────┐
│ Ingest │ Indexing │ Search │ Answer │ Evaluate │
└────────┴──────────┴────────┴────────┴──────────┘
```

## Quick Start

```bash
bun install
cp .env.example .env.local   # Add your API keys
bun run src/index.ts run -p supermemory -b locomo
```

## Environment Variables

Add your keys to `.env.local`:

```
# Providers (at least one)
SUPERMEMORY_API_KEY=
MEM0_API_KEY=
ZEP_API_KEY=

# Judges (at least one)
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
```

## Commands

| Command | Description |
|---|---|
| `run` | Full pipeline: ingest → index → search → answer → evaluate → report |
| `compare` | Run benchmark across multiple providers simultaneously |
| `ingest` | Ingest benchmark data into provider |
| `search` | Run search phase only |
| `test` | Test single question |
| `status` | Check run progress |
| `list-questions` | Browse benchmark questions |
| `show-failures` | Debug failed questions |
| `serve` | Start web UI |
| `help` | Show help (`help providers`, `help models`, `help benchmarks`) |
## Options

| Flag | Description |
|---|---|
| `-p, --provider` | Memory provider (supermemory, mem0, zep) |
| `-b, --benchmark` | Benchmark (locomo, longmemeval, convomem) |
| `-j, --judge` | Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.) |
| `-r, --run-id` | Run identifier (auto-generated if omitted) |
| `-m, --answering-model` | Model for answer generation (default: gpt-4o) |
| `-l, --limit` | Limit number of questions |
| `-q, --question-id` | Specific question (for test command) |
| `--force` | Clear checkpoint and restart |

## Examples

```bash
# Full run
bun run src/index.ts run -p mem0 -b locomo

# With custom run ID
bun run src/index.ts run -p mem0 -b locomo -r my-test

# Resume existing run
bun run src/index.ts run -r my-test

# Limited questions
bun run src/index.ts run -p supermemory -b locomo -l 10

# Different models
bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash

# Compare multiple providers
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5

# Test single question
bun run src/index.ts test -r my-test -q question_42

# Debug
bun run src/index.ts status -r my-test
bun run src/index.ts show-failures -r my-test
```

## Pipeline

1. INGEST: Load benchmark sessions → Push to provider
2. INDEX: Wait for provider indexing
3. SEARCH: Query provider → Retrieve context
4. ANSWER: Build prompt → Generate answer via LLM
5. EVALUATE: Compare to ground truth → Score via judge
6. REPORT: Aggregate scores → Output accuracy + latency

Each phase checkpoints independently. Failed runs resume from the last successful point.
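To make the resume behavior concrete, here is a minimal sketch of per-phase checkpointing. It is illustrative only: the `Checkpoint` shape, the handler signature, and the exact file layout are assumptions, not MemoryBench's actual implementation.

```ts
// Illustrative sketch of phase-level checkpointing (not the project's real code).
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { join } from "node:path";

type Phase = "ingest" | "index" | "search" | "answer" | "evaluate" | "report";
const PHASES: Phase[] = ["ingest", "index", "search", "answer", "evaluate", "report"];

// Assumed checkpoint shape: which phases have finished for a given run.
interface Checkpoint {
  runId: string;
  completed: Phase[];
}

async function loadCheckpoint(runDir: string, runId: string): Promise<Checkpoint> {
  try {
    return JSON.parse(await readFile(join(runDir, "checkpoint.json"), "utf8"));
  } catch {
    return { runId, completed: [] }; // no checkpoint yet: start from scratch
  }
}

export async function runPipeline(
  runId: string,
  handlers: Record<Phase, () => Promise<void>>,
): Promise<void> {
  const runDir = join("data", "runs", runId);
  await mkdir(runDir, { recursive: true });
  const checkpoint = await loadCheckpoint(runDir, runId);

  for (const phase of PHASES) {
    if (checkpoint.completed.includes(phase)) continue; // already done: skip on resume
    await handlers[phase](); // run the phase (may throw and abort the run)
    checkpoint.completed.push(phase); // record success before moving on
    await writeFile(join(runDir, "checkpoint.json"), JSON.stringify(checkpoint, null, 2));
  }
}
```

Because the checkpoint is rewritten after every successful phase, re-invoking `run -r <id>` continues from the first phase that has not yet completed.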
## Run Persistence

Runs persist to `data/runs/{runId}/`:

- `checkpoint.json` - Run state and progress
- `results/` - Search results per question
- `report.json` - Final report

Re-running the same run ID resumes. Use `--force` to restart.
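Because the artifacts are plain files, a finished run can be inspected with a few lines of code. A minimal sketch, assuming `report.json` exposes fields such as `accuracy`, `avgLatencyMs`, and `failures` (the real schema may differ):

```ts
// Read a run's report and print a quick summary (field names are assumptions).
import { readFile } from "node:fs/promises";
import { join } from "node:path";

async function summarize(runId: string): Promise<void> {
  const reportPath = join("data", "runs", runId, "report.json");
  const report = JSON.parse(await readFile(reportPath, "utf8"));

  console.log(`run:        ${runId}`);
  console.log(`accuracy:   ${report.accuracy}`);
  console.log(`latency ms: ${report.avgLatencyMs}`);
  console.log(`failures:   ${report.failures?.length ?? 0}`);
}

await summarize(process.argv[2] ?? "my-test");
```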
## Extending

| Component | Guide |
|---|---|
| Add Provider | src/providers/README.md |
| Add Benchmark | src/benchmarks/README.md |
| Add Judge | src/judges/README.md |
| Project Structure | src/README.md |
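The guides above define the real contracts. As a rough sketch of what adding a provider involves, assuming a plugin needs ingest, indexing-wait, and search hooks (the interface and method names below are illustrative, not the actual API from src/providers/README.md):

```ts
// Hypothetical provider contract; the real one is documented in src/providers/README.md.
interface SearchResult {
  context: string[]; // retrieved passages handed to the answering model
  latencyMs: number; // search latency, reported alongside accuracy
}

interface MemoryProvider {
  name: string;
  ingest(sessions: unknown[], runId: string): Promise<void>; // push benchmark sessions
  awaitIndexing(runId: string): Promise<void>;                // block until searchable
  search(query: string, runId: string): Promise<SearchResult>;
}

// A trivial in-memory provider, handy for wiring and debugging the pipeline.
export class NaiveProvider implements MemoryProvider {
  name = "naive";
  private docs: string[] = [];

  async ingest(sessions: unknown[]): Promise<void> {
    this.docs.push(...sessions.map((s) => JSON.stringify(s)));
  }

  async awaitIndexing(): Promise<void> {
    // nothing to index: documents are already in memory
  }

  async search(query: string): Promise<SearchResult> {
    const start = Date.now();
    const needle = query.toLowerCase();
    const hits = this.docs.filter((d) => d.toLowerCase().includes(needle));
    return { context: hits.slice(0, 5), latencyMs: Date.now() - start };
  }
}
```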
## License

MIT