This directory contains benchmarks for evaluating CK's search performance against industry standards.
The SWE-bench benchmark evaluates CK's code search and retrieval capabilities using real-world GitHub issues from the SWE-bench dataset.
- Dataset: 2,294 real GitHub issues from popular Python repositories
- Task: Given an issue description, retrieve relevant files that need to be modified
- Baseline: BM25 retrieval (as used in SWE-bench evaluations)
- CK Advantage: Tests hybrid semantic + lexical search vs pure lexical search
See swe-bench/README.md for detailed setup and usage instructions.
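To make the baseline concrete, here is a minimal, self-contained sketch of BM25 ranking in pure Python. The corpus, file names, and query are hypothetical placeholders, and the scoring uses the standard BM25 formula with common defaults (k1=1.5, b=0.75); the actual SWE-bench harness and CK's hybrid search are more involved.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs (id -> token list) against a tokenized query using BM25."""
    n = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / n
    # Document frequency: how many docs contain each term
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            )
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical issue text and candidate files (tokens kept trivially simple)
corpus = {
    "auth/session.py": "session token expiry refresh login".split(),
    "db/models.py": "orm model field migration schema".split(),
    "auth/oauth.py": "oauth token provider login redirect".split(),
}
query = "login token expired after refresh".split()
print(bm25_rank(query, corpus)[0])  # → auth/session.py
```

Note that purely lexical matching misses "expired" vs "expiry" here; that vocabulary-mismatch gap is what semantic retrieval aims to close.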
Each benchmark has its own directory with:
- README.md - Detailed documentation
- run.py - Main benchmark runner script
- requirements.txt - Python dependencies
- data/ - Downloaded benchmark data (gitignored)
- results/ - Benchmark results
Benchmark results and performance comparisons are documented in each benchmark's directory.
To add a new benchmark:
- Create a new directory: benchmarks/<benchmark-name>/
- Add a README, run script, and requirements file
- Update this main README with a description
- Ensure large data files are gitignored
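The steps above can be scripted. Below is a hedged sketch that scaffolds the layout described in this README; the benchmark name "my-benchmark" and the stub file contents are placeholders, not part of any existing tooling.

```python
import tempfile
from pathlib import Path

def scaffold_benchmark(root, name):
    """Create the standard benchmark directory layout under root/benchmarks/."""
    bench = Path(root) / "benchmarks" / name
    bench.mkdir(parents=True, exist_ok=True)
    (bench / "README.md").write_text(f"# {name}\n\nTODO: detailed documentation\n")
    (bench / "run.py").write_text("#!/usr/bin/env python3\n# TODO: benchmark runner\n")
    (bench / "requirements.txt").write_text("")
    # Keep large downloaded data and generated results out of version control
    (bench / ".gitignore").write_text("data/\nresults/\n")
    return bench

# Example usage in a throwaway temp directory
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_benchmark(tmp, "my-benchmark")
    print(sorted(p.name for p in created.iterdir()))
    # → ['.gitignore', 'README.md', 'requirements.txt', 'run.py']
```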