The benchmark dataset is available on HuggingFace:
- Dataset: anhnh2002/codewikibench
- Paper: arXiv:2510.24428
The dataset contains benchmark data for 22 open-source repositories across multiple programming languages:
- JS/TS: Chart.js, marktext, puppeteer, storybook, mermaid, svelte
- Python: graphrag, rasa, OpenHands
- C: qmk_firmware, libsql, sumatrapdf, wazuh
- C++: electron, x64dbg, json
- C#: FluentValidation, git-credential-manager, ml-agents
- Java: logstash, material-components-android, trino
Each repository includes:
- metadata: Repository URL and commit ID
- docs_tree: Original documentation tree structure
- structured_docs: Parsed and structured documentation
- rubrics: Evaluation rubrics for assessing documentation quality
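The shape of a single record can be illustrated with a toy example. The field names below match the list above; the field *values* are hypothetical stand-ins (real records come from the dataset itself), and the inner structure of each JSON payload is an assumption for illustration only:

```python
import json

# Hypothetical record mirroring the dataset schema described above;
# real records come from load_dataset("anhnh2002/codewikibench").
record = {
    "repo_name": "example/repo",  # from metadata
    "commit_id": "0" * 40,        # from metadata
    # The documentation fields are stored as JSON-encoded strings:
    "docs_tree": json.dumps({"docs": ["index.md"]}),
    "structured_docs": json.dumps([{"title": "Overview", "content": "..."}]),
    "rubrics": json.dumps([{"requirement": "Explains setup", "weight": 1.0}]),
}

# Decode the JSON string fields before use.
docs_tree = json.loads(record["docs_tree"])
structured_docs = json.loads(record["structured_docs"])
rubrics = json.loads(record["rubrics"])

print(type(docs_tree), type(structured_docs), type(rubrics))
```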
Load the dataset with the HuggingFace `datasets` library:

```python
from datasets import load_dataset
import json

# Load the dataset
dataset = load_dataset("anhnh2002/codewikibench")

# Access a specific repository
repo_data = dataset['train'][0]
print(f"Repository: {repo_data['repo_name']}")
print(f"Commit: {repo_data['commit_id']}")

# Parse JSON fields
docs_tree = json.loads(repo_data['docs_tree'])
structured_docs = json.loads(repo_data['structured_docs'])
rubrics = json.loads(repo_data['rubrics'])
```

Pull docs folder from the original repository (example result):

```bash
bash ./download_github_folder.sh --github_repo_url https://github.com/All-Hands-AI/OpenHands.git --folder_path docs --commit_id <COMMIT_ID>
```

Parse official docs (example result):

```bash
python docs_parser/parse_official_docs.py --repo_name OpenHands
```

Crawl deepwiki docs (example result):

```bash
python docs_parser/crawl_deepwiki_docs.py --url https://deepwiki.com/AnhMinh-Le/OpenHands --output-dir ../data/OpenHands/deepwiki/docs
```

Parse deepwiki docs (example result):

```bash
python docs_parser/parse_generated_docs.py --input-dir ../data/OpenHands/deepwiki/docs --output-dir ../data/OpenHands/deepwiki
```

Parse codewiki docs (example result):

```bash
python docs_parser/parse_generated_docs.py --input-dir /home/anhnh/CodeWiki/output/docs/All-Hands-AI--OpenHands --output-dir ../data/OpenHands/codewiki
```

[NOTE] To evaluate any other type of documentation, you need to parse it into structured_docs.json and its backbone docs_tree.json (see parsed example).
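Before running evaluation on custom documentation, it can help to sanity-check that the parse step produced both required files. A minimal sketch, assuming only what the note above states (the two file names in one output directory); the helper name and checks are not part of the repo's tooling:

```python
import json
from pathlib import Path

def check_parsed_docs(output_dir: str) -> bool:
    """Verify a parsed-docs directory holds both required, valid JSON files."""
    out = Path(output_dir)
    for name in ("structured_docs.json", "docs_tree.json"):
        path = out / name
        if not path.is_file():
            print(f"missing: {path}")
            return False
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            print(f"invalid JSON in {path}: {exc}")
            return False
    return True
```

Usage would be e.g. `check_parsed_docs("../data/OpenHands/codewiki")` after the parse commands above.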
Generate rubrics with multiple models:

```bash
bash ./run_rubrics_pipeline.sh --repo-name OpenHands --models claude-sonnet-4,kimi-k2-instruct --visualize
```

Run evaluation with multiple models:

```bash
bash ./run_evaluation_pipeline.sh --repo-name OpenHands --reference deepwiki-agent --models kimi-k2-instruct --visualize --batch-size 8
bash ./run_evaluation_pipeline.sh --repo-name OpenHands --reference deepwiki-agent --models kimi-k2-instruct,gpt-oss-120b,gemini-2.5-flash --visualize --batch-size 4
```

```bash
# Using the complete pipeline (recommended)
bash ./run_evaluation_pipeline.sh --repo-name OpenHands --reference deepwiki --visualize

# Manual visualization of specific results
# Summary view
python judge/visualize_evaluation.py --repo-name OpenHands --reference deepwiki --format summary

# Detailed view with all requirements
python judge/visualize_evaluation.py --repo-name OpenHands --reference deepwiki --format detailed

# Show only poorly documented requirements (score < 0.5)
python judge/visualize_evaluation.py --repo-name OpenHands --reference deepwiki --format detailed --max-score 0.5

# Export to CSV for analysis
python judge/visualize_evaluation.py --repo-name OpenHands --reference deepwiki --format csv

# Export to Markdown report
python judge/visualize_evaluation.py --repo-name OpenHands --reference deepwiki --format markdown
```

```bash
# Count lines in the main branch (use the latest commit ID)
python3 count_lines_of_code.py https://github.com/All-Hands-AI/OpenHands.git HEAD

# Count lines at a specific commit
python3 count_lines_of_code.py https://github.com/All-Hands-AI/OpenHands.git a1b2c3d4e5f6

# Show detailed file-by-file breakdown
python3 count_lines_of_code.py https://github.com/All-Hands-AI/OpenHands.git 30604c40fc6e9ac914089376f41e118582954f22
```

If you use this dataset or codebase in your research, please cite:
```bibtex
@misc{hoang2025codewikievaluatingaisability,
  title={CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases},
  author={Anh Nguyen Hoang and Minh Le-Anh and Bach Le and Nghi D. Q. Bui},
  year={2025},
  eprint={2510.24428},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2510.24428},
}
```