Name	Name	Last commit message	Last commit date
parent directory ..
datasets	datasets
expected	expected
gh-pages	gh-pages
runner	runner
sql	sql
.gitignore	.gitignore
README.md	README.md
run_cranfield.sh	run_cranfield.sh
run_memory_stress.sh	run_memory_stress.sh
run_parallel_scaling.sh	run_parallel_scaling.sh

pg_textsearch Benchmarks

Performance benchmarks for the pg_textsearch BM25 full-text search extension.

Quick Start

# Run Cranfield benchmark (quick validation, ~1400 docs) ./runner/run_benchmark.sh cranfield # Run MS MARCO benchmark (8.8M passages) - requires download ./runner/run_benchmark.sh msmarco --download --load --query # Run Wikipedia benchmark ./runner/run_benchmark.sh wikipedia --download --load --query # Run all benchmarks ./runner/run_benchmark.sh all

Datasets

Cranfield Collection (Quick Validation)

Size: 1,400 aerodynamics abstracts, 225 queries
Purpose: Quick validation of BM25 correctness and basic performance
Time: ~1-2 minutes total

MS MARCO Passage Ranking (Full Scale)

Size: 8.8 million passages, 6,980 dev queries with relevance judgments
Purpose: Large-scale performance benchmarking, search quality evaluation
Source: Microsoft MS MARCO
Download: ~2GB compressed
Time: Index build may take 30+ minutes depending on hardware

Wikipedia (Real-World Content)

Size: Configurable (10K, 100K, 1M, or full ~6M articles)
Purpose: Real-world document lengths and vocabulary
Source: Wikimedia Dumps
Time: Varies significantly by size

Benchmark Runner

The main runner script is runner/run_benchmark.sh:

./runner/run_benchmark.sh [dataset] [options] # Datasets: # msmarco - MS MARCO Passage Ranking (8.8M passages) # wikipedia - Wikipedia articles # cranfield - Cranfield collection (1,400 docs) # all - Run all benchmarks # Options: # --download - Download dataset if not present # --load - Load data and create index (drops existing) # --query - Run query benchmarks only # --report - Generate markdown report # --port PORT - Postgres port (default: 5433 for release build)

Running Benchmarks

Prerequisites

PostgreSQL with pg_textsearch installed
For Wikipedia: pip install wikiextractor
For best results, use a release build of Postgres (port 5433)

Download Data

# MS MARCO cd datasets/msmarco && ./download.sh # Wikipedia (full) cd datasets/wikipedia && ./download.sh full # Wikipedia (subset for testing) cd datasets/wikipedia && ./download.sh 100K

Load and Index

# Using psql directly psql -p 5433 -v data_dir="'$PWD/datasets/msmarco/data'" \ -f datasets/msmarco/load.sql # Or use the runner ./runner/run_benchmark.sh msmarco --load --port 5433

Run Query Benchmarks

psql -p 5433 -f datasets/msmarco/queries.sql

Metrics Collected

Index Build

Total time to build BM25 index
Memory usage during index build

Query Performance

Single query latency (p50, p95, p99)
Batch query throughput (QPS)
Query latency by query type:
- Single-word queries
- Multi-word queries
- Question-style queries
- Rare term queries

Search Quality (MS MARCO)

MRR@10 (Mean Reciprocal Rank)
Comparison with known relevance judgments

CI Integration

Benchmarks run automatically:

On PR: Cranfield only (quick validation)
Nightly: MS MARCO subset
Weekly: Full benchmark suite

See .github/workflows/benchmark.yml for configuration.

Results

Historical results are stored in results/ (gitignored).

Each run produces:

benchmark_[dataset]_[timestamp].md - Markdown report
[dataset]_load_[timestamp].log - Load phase logs
[dataset]_queries_[timestamp].log - Query benchmark logs
metrics_[timestamp].env - Machine-readable metrics

Interpreting Results

Index Build Time

Dataset	Expected Time (Release Build)
Cranfield	< 5 seconds
MS MARCO	15-60 minutes
Wikipedia (100K)	2-10 minutes
Wikipedia (full)	1-4 hours

Query Latency

For a well-tuned system with sufficient memory:

Top-10 queries: < 100ms
Batch throughput: 10-100 QPS (depending on query complexity)

Memory Requirements

Cranfield: < 64MB
MS MARCO: 1-4GB (depending on index memory limit)
Wikipedia: 512MB-8GB (depending on size)

Adding New Benchmarks

To add a new dataset:

Create directory: datasets/[name]/
Add scripts:
- download.sh - Download and prepare data
- load.sql - Load data and create index
- queries.sql - Query benchmarks
Update this README

Comparison with Other Systems

For competitive benchmarks, you can run the same queries against:

Native Postgres ts_rank (built-in full-text search)
Other PostgreSQL full-text search extensions
External systems (Elasticsearch, Meilisearch)

See datasets/[name]/queries_native.sql for native Postgres equivalents.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

pg_textsearch Benchmarks

Quick Start

Datasets

Cranfield Collection (Quick Validation)

MS MARCO Passage Ranking (Full Scale)

Wikipedia (Real-World Content)

Benchmark Runner

Running Benchmarks

Prerequisites

Download Data

Load and Index

Run Query Benchmarks

Metrics Collected

Index Build

Query Performance

Search Quality (MS MARCO)

CI Integration

Results

Interpreting Results

Index Build Time

Query Latency

Memory Requirements

Adding New Benchmarks

Comparison with Other Systems

FilesExpand file tree

benchmarks

Directory actions

More options

Directory actions

More options

Latest commit

History

benchmarks

Folders and files

parent directory

README.md

pg_textsearch Benchmarks

Quick Start

Datasets

Cranfield Collection (Quick Validation)

MS MARCO Passage Ranking (Full Scale)

Wikipedia (Real-World Content)

Benchmark Runner

Running Benchmarks

Prerequisites

Download Data

Load and Index

Run Query Benchmarks

Metrics Collected

Index Build

Query Performance

Search Quality (MS MARCO)

CI Integration

Results

Interpreting Results

Index Build Time

Query Latency

Memory Requirements

Adding New Benchmarks

Comparison with Other Systems